Presenting a new method for mixed data clustering based on the number of similar features

Author(s):
Article Type:
Research/Original Article (accredited ranking)
Abstract:

Clustering is an operation in which a set of data samples is grouped according to their degree of similarity. The data to be clustered may be numerical or a mixture of numerical and non-numerical (nominal) values. Finding similarities and measuring distances is one of the challenges of mixed data clustering. In related work, the degree of similarity has been judged by the distance value alone, and the cluster has been selected on that basis. Clustering in this way, especially for mixed data, has not produced very accurate results. In this paper, we also take the "number of similar features" into account when calculating the degree of similarity and determining the distance. When a sample is assigned to a cluster and the distances to several clusters are equal or close, the number of features the samples have in common determines the appropriate cluster. That is, the cluster is selected using the "number of similar features" in addition to the distance. The underlying idea is that when several cluster centers are at a similar distance from a data object, it is better to choose the center that has more features similar to that object. In other words, similarity should be spread across a larger number of features rather than concentrated in a few features with very high similarity. The "number of similar features" has a specific definition and is obtained with a suitable threshold: if the distance between the values of a feature in two samples is less than the threshold, that feature is counted as a similar feature. To calculate the distance, the algorithm uses the normalized numerical difference for numerical features and the Hamming distance for non-numerical features.
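The assignment rule described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the `threshold` and `tie_eps` values, the use of a plain sum of per-feature distances, and the way "close" distances are detected are all assumptions, since the abstract does not specify them.

```python
def feature_distances(x, c, is_numeric, ranges):
    """Per-feature distances in [0, 1] between sample x and center c.

    Numeric features use the absolute difference divided by the feature's
    range; nominal features use the Hamming distance (0 if equal, else 1).
    """
    dists = []
    for xi, ci, numeric, r in zip(x, c, is_numeric, ranges):
        if numeric:
            dists.append(abs(xi - ci) / r if r > 0 else 0.0)
        else:
            dists.append(0.0 if xi == ci else 1.0)
    return dists


def assign(x, centers, is_numeric, ranges, threshold=0.15, tie_eps=0.05):
    """Return the index of the cluster chosen for sample x.

    threshold and tie_eps are hypothetical parameters: a feature counts as
    "similar" when its distance is below threshold, and two centers count
    as "close" when their total distances differ by at most tie_eps.
    """
    scored = []
    for k, c in enumerate(centers):
        d = feature_distances(x, c, is_numeric, ranges)
        total = sum(d)
        similar = sum(1 for v in d if v < threshold)
        scored.append((total, similar, k))

    best_total = min(total for total, _, _ in scored)
    # Keep only the centers whose distance is equal or close to the best one.
    close = [s for s in scored if s[0] - best_total <= tie_eps]
    # Among those, prefer more similar features, then the smaller distance.
    total, similar, k = min(close, key=lambda s: (-s[1], s[0], s[2]))
    return k


# Three numeric features (range 1.0 each) and one nominal feature.
is_numeric = [True, True, True, False]
ranges = [1.0, 1.0, 1.0, None]
x = [0.0, 0.0, 0.0, "a"]
centers = [
    [0.4, 0.0, 0.0, "a"],     # slightly nearer overall, 3 similar features
    [0.14, 0.14, 0.14, "a"],  # slightly farther overall, 4 similar features
]
chosen = assign(x, centers, is_numeric, ranges)
```

In this toy case the second center is chosen even though its total distance is marginally larger, because every one of its features falls under the threshold. Setting `tie_eps=0.0` reduces the rule to ordinary nearest-center assignment.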
As in many methods, the initial cluster centers are chosen randomly, and in subsequent iterations of the algorithm more appropriate samples are selected as the cluster centers. The algorithm is compared with 5 other algorithms on 5 datasets. Three criteria are used to evaluate the results: Accuracy, RI and F-Measure. According to the test results, on the mixed and integer datasets the algorithm performs at least two percent better than two of the algorithms and one percent better than another. On another dataset, the accuracy of the proposed algorithm was equal to, or close to one percent better than, that of the best competing algorithm. On the last dataset, the proposed algorithm was ranked second among 5 algorithms. Overall, the proposed algorithm ranked first in most of the results and second in the remaining cases among the five tested algorithms.
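As an illustration of one of the evaluation criteria, the sketch below computes a pairwise agreement score, assuming that "RI" denotes the Rand index (the abstract does not expand the abbreviation). The function name and the sample labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree.

    A pair agrees when both labelings put the two samples in the same
    group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if len(labels_true) < 2:
        raise ValueError("at least two samples are required")

    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster labels are arbitrary names, so a relabeled copy scores 1.0.
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on whether pairs are grouped together, it does not require matching cluster indices to class labels, which is convenient when the cluster numbering is arbitrary.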

Language:
Persian
Published:
Signal and Data Processing, Volume:21 Issue: 1, 2024
Pages:
39 to 52
https://www.magiran.com/p2747982  
Authors' portal
  • Daneshpour, Negin
    Corresponding Author (2)
    Associate Professor Computer Engineering, Software, Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran