Document Clustering Based On Ontology and Fuzzy Approach

Abstract:

Data mining, also known as knowledge discovery in databases, is the process of discovering previously unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised grouping of similar documents. The most important steps in document clustering are how documents are represented and how the similarity between them is measured. This research aims to improve text clustering performance by introducing a new ontological representation and a similarity measure. The clustering algorithm is investigated in three aspects: the ontological representation of documents, the document similarity measure, and a fuzzy inference system for computing the final similarities. Finally, clustering is carried out by bottom-up hierarchical clustering. In the first step, each document is represented as an ontological graph according to domain knowledge. In contrast to keyword-based methods, this method is based on domain concepts and represents a document as a subgraph of the domain ontology. The concepts extracted from the document are the graph nodes, and each node is weighted by its concept frequency. The relations between the document's concepts define the graph edges, and the scope of each relation determines the edge's weight. In the second step, a new similarity measure suited to the ontological representation is presented. For each document, the main concepts, detailed concepts, and main edges are determined, and the similarity of each pair of documents is computed as three values, one for each of these factors. In the third step, a fuzzy inference system with three inputs and one output is designed. The inputs are the similarities of the main concepts, detailed concepts, and main edges of two documents, and the output is their final similarity.
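
The first three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the ontology, the rule for separating main from detailed concepts, the similarity formula, or the fuzzy rule base. Everything named here is therefore an assumption made for illustration: the toy `ONTOLOGY`, the minimum-frequency edge weight, the mean-weight threshold in `split_main`, Jaccard overlap as the per-factor similarity, and the four rules of a small zero-order Sugeno-style system in `fuzzy_similarity`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy domain ontology: concept -> directly related concepts.
ONTOLOGY = {
    "clustering": {"similarity", "document"},
    "similarity": {"clustering", "graph"},
    "document": {"clustering", "ontology"},
    "ontology": {"document", "graph"},
    "graph": {"similarity", "ontology"},
}


def build_graph(tokens):
    """Represent a document as a weighted subgraph of ONTOLOGY."""
    # Node weight = frequency of each ontology concept in the document.
    nodes = Counter(t for t in tokens if t in ONTOLOGY)
    # Edge weight (assumption): the smaller of the two endpoint frequencies.
    edges = {
        (a, b): min(nodes[a], nodes[b])
        for a, b in combinations(sorted(nodes), 2)
        if b in ONTOLOGY[a]
    }
    return nodes, edges


def split_main(weights):
    """Assumed rule: items weighing at least the mean are 'main'."""
    if not weights:
        return set(), set()
    mean = sum(weights.values()) / len(weights)
    main = {k for k, w in weights.items() if w >= mean}
    return main, set(weights) - main


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets are scored 0.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def three_similarities(doc_a, doc_b):
    """Return (main-concept, detailed-concept, main-edge) similarities."""
    nodes_a, edges_a = build_graph(doc_a)
    nodes_b, edges_b = build_graph(doc_b)
    main_a, detail_a = split_main(nodes_a)
    main_b, detail_b = split_main(nodes_b)
    main_edges_a, _ = split_main(edges_a)
    main_edges_b, _ = split_main(edges_b)
    return (
        jaccard(main_a, main_b),
        jaccard(detail_a, detail_b),
        jaccard(main_edges_a, main_edges_b),
    )


def fuzzy_similarity(s_main, s_detail, s_edge):
    """Three inputs in [0, 1], one output in [0, 1] (assumed rule base)."""
    # Memberships: "high" is the value itself, "low" is its complement.
    # Each rule is (firing strength via min-AND, crisp output).
    rules = [
        (min(s_main, s_edge), 1.0),              # main concepts and edges agree
        (min(s_main, 1.0 - s_edge), 0.6),        # only main concepts agree
        (min(1.0 - s_main, s_detail), 0.3),      # only detailed concepts agree
        (min(1.0 - s_main, 1.0 - s_detail), 0.0),  # nothing agrees
    ]
    total = sum(strength for strength, _ in rules)
    # Weighted average of rule outputs (zero-order Sugeno defuzzification).
    return sum(s * out for s, out in rules) / total if total else 0.0
```

Calling `fuzzy_similarity(*three_similarities(doc_a, doc_b))` on two token lists yields one final similarity per document pair; filling a matrix with these values gives the input to the clustering step.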
In the final step, a bottom-up hierarchical clustering algorithm clusters the documents according to the final similarity matrix. For evaluation, the proposed method was compared with the Naïve Bayes method and ontology-based algorithms. The results indicate that the proposed method improves precision, recall, F-measure, and accuracy and produces more meaningful results.
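
As a rough illustration of the final step, the sketch below performs bottom-up (agglomerative) clustering directly on a similarity matrix. The abstract names neither the linkage criterion nor the stopping rule, so average linkage and a fixed target number of clusters (`n_clusters`) are assumptions here, and the function name `agglomerative` is hypothetical.

```python
def agglomerative(sim, n_clusters):
    """Merge the most similar pair of clusters until n_clusters remain.

    sim is a symmetric list-of-lists matrix of pairwise document
    similarities. Average linkage is an assumption of this sketch.
    """
    # Start with every document in its own cluster.
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Mean similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    # Never try to merge below a single cluster.
    while len(clusters) > max(n_clusters, 1):
        # Pick the pair of clusters with the highest average similarity.
        pairs = (
            (p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        x, y = max(pairs, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        # y > x, so deleting index y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Made-up similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
print(agglomerative(sim, 2))  # [[0, 1], [2, 3]]
```

Recording the merge order instead of stopping at a fixed count would give the full dendrogram, which can then be cut at any similarity threshold.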

Keywords:

Document Clustering, Ontological Graph, Similarity Measure, Fuzzy Inference System

Language:

Persian

Published:

Journal of Information and Communication Technology, Volume 5, Issue 17, 2014

Pages:

73 to 96

magiran.com/p1385999

