Presenting a new method for mixed data clustering based on the number of similar features

Author(s):

Negin Daneshpour*

Article Type:

Research/Original Article (accredited rank)

Abstract:

Clustering groups a set of data samples according to their degree of similarity. The data to be clustered may be numerical or a mixture of numerical and non-numerical (nominal) values. Measuring similarity and distance is one of the main challenges of mixed data clustering. Related work determines the degree of similarity from the distance value alone and selects the cluster on that basis. Clustering in this way, especially for mixed data, has not produced very accurate results. This paper additionally considers the parameter "number of similar features" when calculating the degree of similarity and determining the distance. When a sample is assigned to a cluster and the distances to several cluster centers are equal or close, the number of features the sample shares with each center determines the appropriate cluster. The underlying idea is that, when several cluster centers are about equally distant from a data object, it is better to choose the center that has more features similar to that object: similarity should be spread across a larger number of features rather than concentrated in a few highly similar ones. The "number of similar features" is defined with a suitable threshold: if the distance between two feature values is less than the threshold, the two features are considered similar. Distances are calculated using the normalized numerical difference for numerical features and the Hamming distance for non-numerical features.
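The assignment rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function names, the `threshold` and `tolerance` values, the use of a plain sum as the total distance, and the tie-breaking order are all hypothetical, since the abstract gives no concrete values or formulas.

```python
from typing import List, Optional, Sequence, Union

Value = Union[float, str]


def feature_distances(a: Sequence[Value], b: Sequence[Value],
                      ranges: Sequence[Optional[float]]) -> List[float]:
    """Per-feature distances in [0, 1].

    ranges[i] is (max - min) of numeric feature i, or None if feature i
    is nominal. This encoding is an assumption made for the sketch.
    """
    if not (len(a) == len(b) == len(ranges)):
        raise ValueError("samples and ranges must have the same length")
    dists = []
    for x, y, r in zip(a, b, ranges):
        if r is None:
            # Nominal feature: Hamming distance (0 if equal, 1 otherwise).
            dists.append(0.0 if x == y else 1.0)
        elif r > 0:
            # Numeric feature: normalized absolute difference.
            dists.append(abs(x - y) / r)
        else:
            # Constant numeric feature carries no information.
            dists.append(0.0)
    return dists


def similar_feature_count(dists: Sequence[float], threshold: float) -> int:
    """Number of features whose distance is below the threshold."""
    return sum(1 for d in dists if d < threshold)


def assign(sample: Sequence[Value], centers: Sequence[Sequence[Value]],
           ranges: Sequence[Optional[float]],
           threshold: float = 0.15, tolerance: float = 0.05) -> int:
    """Index of the chosen center.

    Centers whose total distance is within `tolerance` of the smallest
    total distance count as "close"; among those, the center with the
    most similar features wins (assumed tie-breaking: smaller distance,
    then lower index).
    """
    scored = []
    for idx, center in enumerate(centers):
        dists = feature_distances(sample, center, ranges)
        scored.append((sum(dists), similar_feature_count(dists, threshold), idx))
    best = min(total for total, _, _ in scored)
    close = [s for s in scored if s[0] - best <= tolerance]
    _, _, chosen = max(close, key=lambda s: (s[1], -s[0], -s[2]))
    return chosen


if __name__ == "__main__":
    ranges = (10, 10, 10, None)          # three numeric features, one nominal
    sample = (5.0, 5.0, 5.0, "red")
    centers = [
        (7.0, 7.0, 5.0, "red"),          # total 0.4, two similar features
        (5.0, 5.0, 9.0, "red"),          # total 0.4, three similar features
    ]
    print(assign(sample, centers, ranges))
```

In the usage example both centers are equally distant from the sample, so the second one is chosen because it agrees with the sample on more features.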
As in many methods, the initial cluster centers are chosen randomly, and in subsequent iterations of the algorithm more appropriate samples are selected as the cluster centers. The algorithm is compared with 5 other algorithms on 5 datasets. Three criteria are used to examine the results: Accuracy, RI, and F-Measure. According to the test results, on the mixed and integer datasets the algorithm performs at least two percent better than two of the algorithms and one percent better than another algorithm. On another dataset, the accuracy of the proposed algorithm was equal to or about one percent better than that of the best competing algorithm. On the last dataset, the proposed algorithm ranked second among 5 algorithms. Overall, the proposed algorithm ranked first in most of the results and second in the remaining cases among the five tested algorithms.
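For readers unfamiliar with the evaluation criteria, the following minimal sketch computes one of them by pair counting. It assumes that "RI" denotes the Rand index, which the abstract does not spell out; the function name and the toy labels are illustrative and are not taken from the paper's experiments.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(true_labels: Sequence[Hashable],
               pred_labels: Sequence[Hashable]) -> float:
    """Fraction of sample pairs on which two labelings agree.

    A pair agrees when both labelings put the two samples in the same
    group, or both put them in different groups.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(true_labels)), 2))
    if not pairs:
        return 1.0  # fewer than two samples: nothing to disagree on
    agreements = sum(
        1 for i, j in pairs
        if (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Cluster ids are arbitrary, so a pure relabeling scores 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # One sample placed in the wrong cluster lowers the score.
    print(rand_index([0, 0, 1, 1], [0, 0, 0, 1]))
```

The quadratic pair loop is fine for small examples; a contingency-table formulation would be preferable for large datasets.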

Keywords:

Clustering, Mixed Data, Distance of Values, Similarity of Values, Cluster Center

Language:

Persian

Published:

Signal and Data Processing, Volume 21, Issue 1, 2024

Pages:

39 to 52

https://www.magiran.com/p2747982


Negin Daneshpour

Corresponding Author (2)

Associate Professor, Software, Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran


Accredited scientific journal

Signal and Data Processing

Technical and engineering quarterly

ISSN: 2538-4201 eISSN: 2538-421X

License holder:

Khajeh Nasir al-Din Toosi Research Institute for Advanced Technology Development

Managing director:

Dr. Javad Sheikhzadegan

Editor-in-chief:

Dr. Mohammad Hassan Ghassemian

Journal phone: 021-83857605
