Design and production of Persian news data set IHU-PersianNewsDataSet
Javadzade et al., Imam Hossein Comprehensive University

Article Type:
Research/Original Article (no accredited ranking)
Abstract:

Lack of data is a major challenge for research in natural language processing, and it is more acute for Persian, where finding a high-quality, comprehensive dataset is difficult. In addition, existing datasets suffer from problems such as limited categorizability and non-compliance with storage standards, each of which can affect a model's learning rate, its results, and the error rate in experiments. These issues led us to collect and prepare a dataset that addresses such problems and reduces errors when the data is used with different models. In this research, we designed and used a crawler to collect textual data. By crawling one news source, it collected data in five columns: title, summary, text, tag, and publication date. The text was normalized with a Persian language library for the Python programming language, stored in CSV and XML formats, and made available to fellow researchers. The dataset has 13 main tags: sports, art and media, culture, science and progress, political, foreign policy, life, family, society, education and training, international, economic, and provinces. Tasks that can be performed on this dataset include text classification, text extraction, text summarization, and title recognition. Its prominent features are its comprehensiveness, a suitable volume of data, useful and unique features, and storage in a standard format. This dataset is a product of the Language Processing Department of Imam Hossein Comprehensive University and can be downloaded and used, with respect for copyright, through the link given in the footnote on the next page.
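
The normalize-and-store step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract does not name the Persian library that was used, so a small standard-library stand-in (mapping Arabic yeh and kaf to their Persian forms and collapsing whitespace) takes its place. The field names in `FIELDS`, the XML element names, and the helper functions are all assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The five columns named in the abstract; these exact field names are assumed.
FIELDS = ["title", "summary", "text", "tag", "date"]

# Stand-in normalization table: Arabic yeh and kaf mapped to Persian forms.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def normalize(text: str) -> str:
    """Replace Arabic yeh/kaf with Persian forms and collapse whitespace."""
    return " ".join(text.translate(CHAR_MAP).split())


def to_csv(records) -> str:
    """Serialize records to CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        writer.writerow({key: normalize(rec[key]) for key in FIELDS})
    return buf.getvalue()


def to_xml(records) -> str:
    """Serialize records to XML, one <article> element per record."""
    root = ET.Element("articles")  # hypothetical element names
    for rec in records:
        item = ET.SubElement(root, "article")
        for key in FIELDS:
            ET.SubElement(item, key).text = normalize(rec[key])
    return ET.tostring(root, encoding="unicode")
```

Calling `to_csv(records)` or `to_xml(records)` on a list of dictionaries with the five keys returns the serialized text, which can then be written to a file opened with UTF-8 encoding.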

Language:
Persian
Published:
Journal of New Achievements in Electrical, Computer and Technology, Volume 2, Issue 3, 2022
Pages:
103 to 121
magiran.com/p2487409  