Extracting Parallel English/Persian Sentences from Comparable Corpora using Syntactic Information

Article Type:
Research/Original Article (holds an accredited ranking)
Abstract:
Parallel corpora are among the richest resources in natural language processing. They consist of translated texts in two or more languages, usually aligned at the word, clause, or sentence level. Despite their many uses in linguistic research, statistical machine translation, and cross-language information retrieval, parallel corpora remain scarce and limited in both number and quality. Accordingly, this paper introduces an automatic method for extracting parallel sentences from comparable resources that exploits syntactic information. In this method, an alignment model is trained using the syntactic information of the sentences. The highest practical accuracy of the alignment model on the test set (208 sentence pairs) was 77%, and the highest precision on the training set (830 sentence pairs) was 97.7%. Given the small size of the gold-standard corpora, n-fold cross-validation was used in all training algorithms. To attain higher precision, a new similarity search algorithm was implemented, which increased the practical accuracy on the test set from 77% to 85.15%. The final outcome of this research is an alignment toolkit and framework named "Isfahan University Parallel Corpus Framework" (IPCF), which researchers in the computational processing of Persian can use to construct standard parallel corpora.
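As an illustration of the n-fold cross-validation mentioned in the abstract, the following minimal Python sketch shows how a small set of sentence pairs could be split into folds. It is not taken from the paper or from IPCF: the function name `k_fold_splits` and the choice of 10 folds are assumptions made for this example, and only the figure of 830 pairs comes from the abstract.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, held_out_indices) for each of k folds.

    Hypothetical helper for illustration; not part of IPCF.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        held_out = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, held_out
        start += size


# 830 is the training-set size reported in the abstract; k = 10 is an assumption.
folds = list(k_fold_splits(830, 10))
```

Each sentence pair is held out exactly once, so every pair contributes to both training and evaluation, which is why the technique suits small gold-standard sets.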
Language:
Persian
Published:
Research in Linguistics, Volume:10 Issue: 2, 2018
Pages:
15 to 36
magiran.com/p1988978  