Comparison of the performance of approaches in discovering and extracting e-book topics

Article Type:
Research/Original Article (accredited rank)
Abstract:
Keyword extraction is one of the most important tasks in text processing and analysis, because it provides a high-level, accurate summary of a text. Choosing the right extraction method is therefore important. The aim of the present study was to compare the performance of three approaches to discovering and extracting the subject keywords of e-books using text mining and machine learning techniques. The three experimental approaches were: (1) successive runs of the clustering process, improving the semantic quality of the clusters and enriching the domain-specific stop-word list; (2) use of a specialized keyword template; and (3) use of the important parts of the text to discover and extract its keywords and main topics. The statistical population comprised 1000 e-book titles in library and information science, selected according to the Library of Congress Classification system. Bibliographic information for the e-books was obtained from the Library of Congress database, and the original texts were then prepared. Topic keywords were extracted and the training data clustered using the non-negative matrix factorization algorithm under each of the three approaches. The quality and performance of the resulting subject clusters in automatically classifying the test data were compared using a support vector machine. The findings showed that the Hamming loss (0.020), that is, the error rate in classifying the test texts, was far lower in the third approach than in the other two. The F1 score (0.82), which is the harmonic mean of precision (0.87) and recall (0.78) and reflects how well the classification process labels texts with topics, was also better in the third approach than in the other two.
The results showed that the quality and semantic coherence of the subject clusters obtained from the third approach, i.e. the use of important parts of the text to discover and extract the subject, were better than those of the other two approaches. By focusing on the main parts of the data, which represent the main content and theme of the text, this approach produced more meaningful topic clusters. In addition, the keywords obtained from the topic clusters of the third approach can be applied to unspecified and unknown collections to extract the unknown thematic content of the whole collection. The third approach also outperformed the other two in terms of accuracy and readability (0.79) and classification error rate (0.020).
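As a reading aid for the two evaluation measures quoted above, the following minimal sketch shows how an F1 score follows from precision and recall, and how Hamming loss is commonly defined for multi-label topic assignment. The function names and the toy label matrices are hypothetical illustrations, not taken from the study; only the precision and recall values (0.87 and 0.78) come from the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hamming_loss(y_true, y_pred) -> float:
    """Fraction of label slots predicted wrongly, over all documents and labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


# Precision and recall as reported in the abstract for the third approach.
reported_f1 = f1_score(0.87, 0.78)

# Hypothetical toy data: 2 documents x 5 topic labels (not from the study).
toy_true = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 0]]
toy_pred = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
toy_loss = hamming_loss(toy_true, toy_pred)
```

Rounded to two decimals, `reported_f1` agrees with the F1 score of 0.82 given in the abstract. In the toy example, one of ten label slots is wrong, so a lower Hamming loss means fewer topic labels were assigned or omitted incorrectly.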
Language:
Persian
Published:
Journal of Information Processing and Management, Volume:38 Issue: 4, 2023
Pages:
1369 to 1393
magiran.com/p2597275  