Using Computational Methods for Persian Collocations Identification and Extraction

Article Type:
Research/Original Article (accredited ranking)
Abstract:
This article studies collocations in Persian. Previous research in this field has been mostly statistical and comparative. The purpose of this research is to identify collocations using a corpus-based, computational method. After reviewing the definitions of collocation given by Iranian and non-Iranian linguists, the article surveys earlier studies by Iranian and non-Iranian researchers in this field. The Persian language database is used as the corpus. Because no dictionary of Persian collocations exists, a dataset of collocations was compiled from the Advanced Learners' Persian Dictionary. A language model is trained with a Long Short-Term Memory (LSTM) network on FastText embedding vectors, and the results are evaluated using several methods. In addition, ParsBert is fine-tuned and its recall is calculated on thousand-item lists of collocations and non-collocations. Finally, a comparative analysis of collocation translation in Google Translate is conducted by translating a thousand Persian sentences into English. The examination of collocations in the LSTM language model and in ParsBert yields the following result: both models can predict collocations, but ParsBert proved the stronger model for investigating linguistic phenomena such as collocations. The comparative analysis of Google Translate's collocation translations showed three outcomes: (1) the translation was correct; (2) the translation was literal, word for word; (3) the collocation was ignored in translation.
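As an illustration of the evaluation step on the collocation and non-collocation lists, the following minimal sketch computes recall for a binary classifier. It assumes the metric reported for ParsBert is standard recall (true positives divided by all actual positives); the function name, the label encoding, and the toy data are hypothetical and are not taken from the article.

```python
def recall(gold_labels, predicted_labels, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    pairs = list(zip(gold_labels, predicted_labels))
    # Gold collocations that the model also labelled as collocations.
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    # Gold collocations that the model missed.
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# Hypothetical toy data: 1 = collocation, 0 = non-collocation.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 1, 0, 0]
print(recall(gold, pred))  # 0.75
```

In this toy run, three of the four gold collocations are recovered, so recall is 0.75; the false positive among the non-collocations does not affect recall.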
Language:
Persian
Published:
Journal of Information Processing and Management, Volume:40 Issue: 2, 2024
Pages:
577 to 604
https://www.magiran.com/p2842933  
Authors
  • Maleki Vika, Mina
    Author (2)
    M.A. in Computational Linguistics, University of Tehran (1402 SH)