Stopwords List Construction for Automatic Indexing of Persian Texts
Author(s):
Abstract:
The aim of this study was to identify non-conceptual words (stop words) in the Persian language and to develop a list of these words for the automatic indexing of Persian texts in the fields of psychology, educational sciences, and library and information science. The research used the content analysis method. The research population consisted of the articles in the latest issues of scientific journals in psychology, education, and library and information science published in 1385 SH (2006-2007). The findings showed that: (1) copula and auxiliary verbs, adverbs, pronouns, particles (prepositions, conjunctions and interjections), sounds, numbers and punctuation marks are among the non-conceptual words, or stop words, of the Persian language; (2) excluding punctuation marks, non-conceptual words make up 39.96 percent of the educational sciences texts, 38.57 percent of the psychology texts and 38.12 percent of the library and information science texts; (3) the high-frequency stop words in these fields are approximately the same; (4) 38.94 percent of all the words in the analyzed texts are stop words; (5) comparing the Persian list with Fox's English stop word list showed a 28.5% overlap between the two lists. The results of this study show that about 40% of the words in Persian texts can be ignored in text analysis and automatic indexing.
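The two kinds of figures reported in the abstract, the share of stop words in a text and the overlap between two lists, can be illustrated with a minimal Python sketch. It is not the study's procedure: the small `STOPWORDS` set, the regex tokenizer, and the function names `stopword_share` and `overlap_percent` are all assumptions made for illustration. In particular, the abstract does not say how the overlap with Fox's list was computed, so the sketch assumes both lists have already been mapped to comparable entries and takes the first list as the denominator.

```python
import re

# Hypothetical mini stop word list for illustration only;
# the list built in the study is not reproduced here.
STOPWORDS = {"و", "در", "به", "از", "که", "این", "را", "با", "است"}


def tokenize(text):
    """Return runs of word characters, which drops punctuation marks.

    Assumption: a simple regex tokenizer. Real Persian text often contains
    the zero-width non-joiner inside words, which this would split on.
    """
    return re.findall(r"\w+", text)


def stopword_share(text, stopwords):
    """Percentage of tokens in `text` that appear in `stopwords`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stopwords)
    return 100.0 * hits / len(tokens)


def overlap_percent(list_a, list_b):
    """Percentage of entries of `list_a` that also occur in `list_b`.

    Assumption: entries are already comparable (for example, translated
    to shared glosses) and `list_a` is the denominator.
    """
    set_a, set_b = set(list_a), set(list_b)
    if not set_a:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(set_a)


sample = "این کتاب در کتابخانه است."
print(stopword_share(sample, STOPWORDS))  # 3 of 5 tokens are stop words -> 60.0
```

Because punctuation never becomes a token here, the computed share corresponds to the "excluding punctuation marks" figures rather than to a count that treats punctuation as stop words.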
Keywords:
Language:
Persian
Published:
Library and Information Science, Volume 12, Issue 4, 2010
Page:
9
https://www.magiran.com/p633751