A Tentative Method of Tokenizing Persian Corpus based on Language Modelling

Author(s):
Article Type:
Research/Original Article (accredited ranking)
Abstract:
Digital Persian text suffers from two simple but important problems. The first concerns multi-token units, in which individual words are attached to one another. The second concerns multi-unit tokens, which result from the detachment of the elements of a single word. This paper introduces an algorithm that reduces these problems automatically and produces a standardized text. The proposed algorithm has three steps. In the first step, the multi-token units are split into individual words and the multi-unit tokens are then attached together. For this step, a core algorithm based on language modeling is introduced to split multi-token units into independent words. The core algorithm is then modified to handle the challenges it may encounter, so as to improve its performance. This step also uses a morphological analyzer to examine derivational and inflectional affixes, together with exact matching against a word list, to resolve the problem of the multi-token units. In the second step, an exact word matching strategy is used to resolve the multi-token unit problem of verbs. The third step repeats the algorithm of the first step to fix new problems introduced by the second step. The algorithm was tested on tokenizing data from the Persian Linguistic DataBase (PLDB). On the test set, it corrected 72.04% of the errors with 97.8% accuracy, while introducing new spelling errors at a rate of 0.02%.
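The abstract does not specify the language model or the matching procedure, so the following is only a minimal sketch of the two general ideas it names: splitting an attached token into its most probable word sequence, and rejoining detached pieces by exact matching against a word list. The unigram model, the toy English vocabulary, and the names `COUNTS`, `split_token`, and `join_tokens` are all assumptions made for illustration and are not taken from the paper.

```python
import math

# Hypothetical toy unigram counts (assumption); a real system would
# estimate these from a large corpus of the target language.
COUNTS = {"the": 50, "book": 10, "is": 30, "good": 12,
          "go": 8, "od": 1, "bookshelf": 3}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def log_prob(word):
    """Log probability of a known word under the toy unigram model."""
    return math.log(COUNTS[word] / TOTAL)


def split_token(token):
    """Split a token into its most probable sequence of known words.

    Uses dynamic programming over split points. Returns [token]
    unchanged when no full segmentation into known words exists.
    """
    n = len(token)
    # best[i] = (score, start of last word) for the prefix token[:i]
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = token[start:end]
            if piece in COUNTS and best[start][0] > -math.inf:
                score = best[start][0] + log_prob(piece)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return [token]
    # Walk the back-pointers from the end to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(token[start:end])
        end = start
    return words[::-1]


def join_tokens(tokens, word_list):
    """Merge adjacent tokens whose concatenation exactly matches word_list."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in word_list:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(split_token("thebookisgood"))
print(join_tokens(["book", "shelf", "is", "good"], {"bookshelf"}))
```

In this sketch the segmentation "good" is preferred over "go" + "od" because a single frequent word scores higher than two rarer ones. A Persian system would additionally need to handle the zero-width non-joiner and affix analysis, which this toy example does not attempt.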
Language:
Persian
Published:
Language and Linguistics, Volume: 14, Issue: 27, 2019
Pages:
21 to 50
magiran.com/p1998000  