Comparing k-means clusters on parallel Persian-English corpus

Author(s):
Abstract:
This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of documents based on their content, it is expected that the answer to this question is yes. On the other hand, many differences between various languages can cause the answer to this question to be no. This research has focused on k-means that is one of the basic and popular document clustering methods. We want to know whether the clusters of aligned Persian and English texts obtained by the k-means are similar. To find an answer to this question, Mizan English-Persian Parallel Corpus was considered as benchmark. After features extraction using text mining techniques and applying the PCA dimension reduction method, the k-means clustering was performed. The morphological difference between English and Persian languages caused the larger feature vector length for Persian. So almost in all experiments, the English results were slightly richer than those in Persian. Aside from these differences, the overall behavior of Persian and English clusters was similar. These similar behaviors showed that results of k-means research on English can be expanded to Persian. Finally, there is hope that despite many differences between various languages, clustering methods may be extendable to other languages.
Language:
English
Published:
Journal of Artificial Intelligence and Data Mining, Volume:3 Issue: 2, Summer-Autumn 2015
Pages:
203 to 208
magiran.com/p1538765  
دانلود و مطالعه متن این مقاله با یکی از روشهای زیر امکان پذیر است:
اشتراک شخصی
با عضویت و پرداخت آنلاین حق اشتراک یک‌ساله به مبلغ 1,390,000ريال می‌توانید 70 عنوان مطلب دانلود کنید!
اشتراک سازمانی
به کتابخانه دانشگاه یا محل کار خود پیشنهاد کنید تا اشتراک سازمانی این پایگاه را برای دسترسی نامحدود همه کاربران به متن مطالب تهیه نمایند!
توجه!
  • حق عضویت دریافتی صرف حمایت از نشریات عضو و نگهداری، تکمیل و توسعه مگیران می‌شود.
  • پرداخت حق اشتراک و دانلود مقالات اجازه بازنشر آن در سایر رسانه‌های چاپی و دیجیتال را به کاربر نمی‌دهد.
دسترسی سراسری کاربران دانشگاه پیام نور!
اعضای هیئت علمی و دانشجویان دانشگاه پیام نور در سراسر کشور، در صورت ثبت نام با ایمیل دانشگاهی، تا پایان فروردین ماه 1403 به مقالات سایت دسترسی خواهند داشت!
In order to view content subscription is required

Personal subscription
Subscribe magiran.com for 70 € euros via PayPal and download 70 articles during a year.
Organization subscription
Please contact us to subscribe your university or library for unlimited access!