Detecting Similarity in Paraphrased Persian Texts using Semantic and Probabilistic Methods

Article Type:
Research/Original Article (accredited ranking)
Abstract:

Plagiarism detection is the process of locating instances of plagiarism within a work or document. The main component of a plagiarism detection system is its text alignment algorithm, which aims to detect paraphrased passages in a suspicious document using a small set of candidate source documents. Because text alignment algorithms are highly language-dependent, the numerous algorithms that exist for languages other than Persian cannot be employed for Persian plagiarism detection. Several text alignment algorithms exist for Persian text, but most of them can detect only exactly identical passages shared between texts. In many cases, however, plagiarism detection must find similar passages that have already been paraphrased. In this paper, we propose two new text alignment algorithms that are able to detect paraphrased texts in the Persian language. The first is a semantic algorithm that employs a dictionary to detect paraphrased sentences, and the second is a probabilistic algorithm that uses statistical information obtained from a large corpus of Persian texts to detect similar texts. Compared to the existing semantic text alignment algorithms, the proposed algorithms use different measures to check the similarity between sentences. Furthermore, the probabilistic algorithm is the first probabilistic text alignment algorithm proposed for the Persian language. Moreover, while all the existing text alignment algorithms check the similarity between any two sentences separately, the proposed algorithms also consider the similarity of neighboring sentences in the text. The implementation results indicate that the quality of both algorithms in detecting paraphrased texts is high and almost the same, while the proposed probabilistic method is more efficient than the proposed semantic algorithm in terms of computation time.
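To make the idea of dictionary-based matching with neighboring-sentence context more concrete, the following is a minimal sketch and not the paper's algorithm. The toy synonym dictionary, the Dice-like overlap measure, the function names, and the blending weight are all assumptions introduced here for illustration; the abstract does not specify the actual dictionary or similarity measures.

```python
from typing import Dict, List, Set

# Hypothetical toy synonym dictionary (assumption); a real system would
# load a Persian lexical resource instead.
SYNONYMS: Dict[str, Set[str]] = {
    "big": {"large"},
    "large": {"big"},
    "car": {"automobile"},
    "automobile": {"car"},
}


def words_match(a: str, b: str) -> bool:
    """Two words match if they are identical or listed as synonyms."""
    return a == b or b in SYNONYMS.get(a, set())


def sentence_similarity(s1: List[str], s2: List[str]) -> float:
    """Dice-like overlap: share of words in each sentence that have a
    match in the other sentence (an assumed measure, not the paper's)."""
    if not s1 or not s2:
        return 0.0
    hits1 = sum(any(words_match(w, v) for v in s2) for w in s1)
    hits2 = sum(any(words_match(v, w) for w in s1) for v in s2)
    return (hits1 + hits2) / (len(s1) + len(s2))


def contextual_similarity(
    src: List[List[str]],
    susp: List[List[str]],
    i: int,
    j: int,
    weight: float = 0.25,  # assumed share given to neighboring sentences
) -> float:
    """Blend the similarity of the sentence pair (i, j) with the average
    similarity of the adjacent pairs (i-1, j-1) and (i+1, j+1)."""
    base = sentence_similarity(src[i], susp[j])
    neighbours = [
        sentence_similarity(src[i + d], susp[j + d])
        for d in (-1, 1)
        if 0 <= i + d < len(src) and 0 <= j + d < len(susp)
    ]
    if not neighbours:
        return base
    return (1 - weight) * base + weight * sum(neighbours) / len(neighbours)


if __name__ == "__main__":
    source = [["big", "car"], ["it", "is", "fast"]]
    suspicious = [["large", "automobile"], ["it", "is", "fast"]]
    print(contextual_similarity(source, suspicious, 0, 0))
```

In this sketch, a pair of sentences whose neighbors are also similar keeps its full score, while an isolated match surrounded by unrelated sentences is scored lower. That is one plausible way to read "considering the similarity of neighboring sentences"; the paper itself may combine the scores differently.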

Language:
Persian
Published:
Journal of Information Processing and Management, Volume:34 Issue: 4, 2019
Pages:
1823 to 1848
magiran.com/p2031437  