Word classification to use in Persian class-based N-gram

Abstract:
Statistical language models (SLMs) are widely used in speech recognition systems, and the N-gram language model is the most popular among them. In large-vocabulary systems, however, the training corpus is usually too small to estimate the N-gram parameters reliably, so the sparse data problem arises. Assigning words to a limited number of classes reduces the number of model parameters, so a class-based N-gram model can be trained on a corpus of moderate size. In this research, we implement several known automatic word classification methods for Persian and modify them to obtain better classification results. The first method, known as the Brown method, uses a statistical measure called "mutual information" to evaluate a word classification. The second method, presented by Martin, reduces perplexity by means of a displacement algorithm. The third method finds classes using a statistical similarity measure between words and a bottom-up algorithm. We implemented all of these methods for Persian and compared them by the perplexity of the class-based bigram models built on the resulting word classes. We then introduce two new methods that modify these known methods. In the first, the starting point of the Brown algorithm is modified, which leads to a lower perplexity on test data. The second combines the displacement algorithm with a threshold that decides whether classes are merged; it yields a lower perplexity than the original Brown method and, depending on the selected threshold, automatically determines the best number of word classes.
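As a rough illustration of the evaluation criterion mentioned above, the following Python sketch computes the perplexity of a class-based bigram model using the standard factorization p(w_i | w_{i-1}) = p(w_i | c_i) · p(c_i | c_{i-1}). It is not the authors' implementation: the function name `class_bigram_perplexity`, the toy token sequence, and the choice to score the model on its own training tokens with unsmoothed maximum-likelihood estimates are all assumptions made to keep the example small.

```python
import math
from collections import Counter


def class_bigram_perplexity(tokens, word_to_class):
    """Perplexity of an unsmoothed class-based bigram model on `tokens`.

    Assumption: the model is estimated and scored on the same tokens,
    so every probability is non-zero and no smoothing is needed.
    """
    classes = [word_to_class[w] for w in tokens]
    word_counts = Counter(tokens)
    class_counts = Counter(classes)
    class_bigrams = Counter(zip(classes, classes[1:]))
    # Count of each class in a "previous" position (all but the last token).
    prev_counts = Counter(classes[:-1])

    log_prob = 0.0
    for i in range(1, len(tokens)):
        w, c, c_prev = tokens[i], classes[i], classes[i - 1]
        p_word = word_counts[w] / class_counts[c]                   # p(w_i | c_i)
        p_class = class_bigrams[(c_prev, c)] / prev_counts[c_prev]  # p(c_i | c_{i-1})
        log_prob += math.log(p_word * p_class)

    return math.exp(-log_prob / (len(tokens) - 1))


# Hypothetical toy data: two alternating words.
tokens = "a b a b a b".split()
ppl_two_classes = class_bigram_perplexity(tokens, {"a": "A", "b": "B"})
ppl_one_class = class_bigram_perplexity(tokens, {"a": "X", "b": "X"})
```

On this toy sequence, putting each word in its own class makes the sequence fully predictable, while merging both words into one class discards the alternation; a classification method of the kind described above searches for the assignment that keeps perplexity low with few classes.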
Language:
Persian
Published:
Signal and Data Processing, Volume 4, Issue 2, 2008
Page:
37
magiran.com/p883435  