Identifying Persian Words’ Senses Automatically by Utilizing the Word Embedding Method

Author(s):
Article Type:
Research/Original Article (accredited rank)
Abstract:

A word is the smallest unit of language that has both 'form' and 'meaning'. A word may have more than one meaning, and its exact meaning is determined by the context in which it appears. Collecting all the senses of every word manually is a tedious and time-consuming task. Moreover, word meanings can change over time: an existing meaning may fall out of use, or a new meaning may be added to a word. Computational methods are one approach to identifying word senses from their linguistic contexts. In this paper, we propose an algorithm that identifies the senses of Persian words automatically, without human supervision. To reach this goal, we use word embeddings in a vector space model. To build the word vectors, we use a neural-network-based algorithm that captures the context information of the words in the vectors. In the proposed model, the divisive clustering algorithm, one of the hierarchical clustering algorithms, is adopted because it fits the requirements of our research question. Two modes, Sentence-based and Context-based, are introduced to identify word senses. In the Sentence-based mode, all of the words in a sentence that contains the target word are used to build the sentence vector, whereas in the Context-based mode, only a limited number of words surrounding the target word are used. Two kinds of evaluation, internal and external, are required to assess the performance of the clustering algorithm. The silhouette score of each cluster is computed as the internal evaluation metric for both modes of the proposed model.
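The difference between the two modes can be illustrated with a minimal sketch. The abstract does not state how word vectors are combined into a sentence vector, so the averaging used here is an assumption, as are the function names, the `window` parameter, the toy two-dimensional embeddings, and the choice to leave the target word out of its own context window.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def average_vectors(tokens: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Average the embeddings of the tokens that have a vector.

    Averaging is an assumed pooling step; the abstract does not specify one.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token in the input has an embedding")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def sentence_based_vector(tokens: List[str], embeddings: Dict[str, Vector]) -> Vector:
    """Sentence-based mode: every word of the sentence contributes."""
    return average_vectors(tokens, embeddings)


def context_based_vector(
    tokens: List[str], target: str, embeddings: Dict[str, Vector], window: int = 2
) -> Vector:
    """Context-based mode: only words within `window` positions of the target.

    The target word itself is excluded here, which is an assumption.
    """
    pos = tokens.index(target)  # first occurrence of the target word
    start = max(0, pos - window)
    context = tokens[start:pos] + tokens[pos + 1:pos + 1 + window]
    return average_vectors(context, embeddings)


# Hypothetical toy data: "shir" stands in for an ambiguous target word.
TOY_EMBEDDINGS = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "shir": [0.5, 0.5],
    "c": [1.0, 1.0],
    "d": [3.0, 1.0],
}
TOY_SENTENCE = ["a", "b", "shir", "c", "d"]

sentence_vec = sentence_based_vector(TOY_SENTENCE, TOY_EMBEDDINGS)
context_vec = context_based_vector(TOY_SENTENCE, "shir", TOY_EMBEDDINGS, window=1)
```

With `window=1`, only the two immediate neighbours of the target contribute to `context_vec`, while `sentence_vec` is pulled toward distant words such as `d`. The resulting vectors, one per sentence, would then be the input to the clustering step.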
The external evaluation requires gold-standard data, for which a data set containing 20 ambiguous words and 100 sentences for each target word was developed. According to the results of the internal evaluation, the Sentence-based mode yields denser clusters than the Context-based mode, and the difference between them is statistically significant. According to the V-measure and F-measure metrics in the external evaluation, the Context-based mode outperforms the baselines by a statistically significant margin.
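For readers unfamiliar with the V-measure, the following sketch shows how it can be computed from gold sense labels and predicted cluster labels, using the standard definition as the weighted harmonic mean of homogeneity and completeness. It is not the authors' evaluation code; the function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def _entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(labels: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """Conditional entropy H(labels | given) in nats."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(
    gold: Sequence[Hashable], predicted: Sequence[Hashable], beta: float = 1.0
) -> float:
    """V-measure of a clustering against gold sense labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    h_gold = _entropy(gold)
    h_pred = _entropy(predicted)
    # Homogeneity: each cluster should contain members of a single gold sense.
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - _conditional_entropy(gold, predicted) / h_gold
    )
    # Completeness: all members of a gold sense should share one cluster.
    completeness = (
        1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, gold) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


# Hypothetical toy labels: four sentences, two gold senses.
gold_senses = ["sense1", "sense1", "sense2", "sense2"]
perfect_score = v_measure(gold_senses, [1, 1, 0, 0])
merged_score = v_measure(gold_senses, [0, 0, 0, 0])
partial_score = v_measure(gold_senses, [0, 0, 1, 0])
```

Because the score depends only on how the items are grouped, renaming the cluster identifiers does not change it, which suits an unsupervised setting where cluster numbers carry no meaning.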

Language:
Persian
Published:
Journal of Information Processing and Management, Volume:35 Issue: 1, 2019
Pages:
25 to 50
magiran.com/p2066535  