Table of Contents

Journal of Information Technology Management
Volume:10 Issue: 4, Winter 2018

  • Publication date: 1397/10/13
  • Number of articles: 6
  • Yumeng Ye *, John Talburt Pages 1-11
    This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked, the evaluation of the model’s performance should take into account the transitive closure of its pairwise linking decisions, not just the pairwise classifications alone. Part of the problem is that the precision and recall measures calculated for data mining classification algorithms such as logistic regression differ from the same measures applied to entity resolution (ER) results. As a classifier, logistic regression precision and recall measure the algorithm’s pairwise decision performance. When applied to ER, precision and recall measure how accurately the set of input references was partitioned into subsets (clusters) referencing the same entity. When applied to datasets containing more than two references, ER is a two-step process. Step One is to classify pairs of records as linked or not linked. Step Two applies transitive closure to these linked pairs to find the maximally connected subsets (clusters) of equivalent references. The precision and recall of the final ER result will generally differ from the precision and recall of the pairwise classifier used to power the ER process; a minimal sketch of this two-step evaluation follows the keywords below. The experiments described in the paper were performed using a well-tested set of synthetic customer data for which the correct linking is known. The best F-measure of precision and recall for the final ER result was obtained by substantially increasing the threshold of the logistic regression pairwise classifier.
    Keywords: Entity resolution, Record linking, Machine learning, Logistic regression, Transitive closure
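    A minimal sketch of the two-step ER evaluation described in the abstract is given below. The records, the single similarity feature, and the thresholds are all hypothetical; only the structure follows the abstract: a logistic regression pairwise classifier (Step One), transitive closure over the predicted links (Step Two), and precision/recall computed over the pairs implied by the resulting clusters rather than over the classifier's raw decisions.

      # Minimal sketch (toy data, not the authors' dataset or exact pipeline).
      from itertools import combinations
      from sklearn.linear_model import LogisticRegression
      import numpy as np

      def transitive_closure(n_records, linked_pairs):
          """Union-find: merge records connected by any chain of linked pairs."""
          parent = list(range(n_records))
          def find(x):
              while parent[x] != x:
                  parent[x] = parent[parent[x]]
                  x = parent[x]
              return x
          for i, j in linked_pairs:
              parent[find(i)] = find(j)
          return [find(i) for i in range(n_records)]

      def pairwise_prf(true_ids, pred_ids):
          """Precision/recall/F over all record pairs, judged by cluster co-membership."""
          tp = fp = fn = 0
          for i, j in combinations(range(len(true_ids)), 2):
              same_true = true_ids[i] == true_ids[j]
              same_pred = pred_ids[i] == pred_ids[j]
              tp += same_true and same_pred
              fp += same_pred and not same_true
              fn += same_true and not same_pred
          p = tp / (tp + fp) if tp + fp else 0.0
          r = tp / (tp + fn) if tp + fn else 0.0
          f = 2 * p * r / (p + r) if p + r else 0.0
          return p, r, f

      # Hypothetical toy data: 6 references, known true entities, one similarity feature per pair.
      true_entity = [0, 0, 0, 1, 1, 2]
      rng = np.random.default_rng(0)
      pairs = list(combinations(range(6), 2))
      X = np.array([[0.9 - 0.7 * (true_entity[i] != true_entity[j]) + rng.normal(0, 0.1)]
                    for i, j in pairs])
      y = np.array([int(true_entity[i] == true_entity[j]) for i, j in pairs])

      clf = LogisticRegression().fit(X, y)
      for threshold in (0.5, 0.9):      # raising the threshold changes the final ER result
          probs = clf.predict_proba(X)[:, 1]
          linked = [pairs[k] for k in range(len(pairs)) if probs[k] >= threshold]
          clusters = transitive_closure(6, linked)
          print(threshold, pairwise_prf(true_entity, clusters))

    Raising the threshold prunes borderline links before transitive closure can chain clusters together, which is why the best F-measure of the final ER result can occur at a much higher threshold than the classifier's own pairwise optimum.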
  • Awaad Al Sarkhi *, John R Talburt Pages 12-26

    This paper discusses recent research on methods for estimating configuration parameters for the Matrix Comparator used for linking unstandardized or heterogeneously standardized references. The matrix comparator computes the aggregate similarity between the tokens (words) in a pair of references. The two most critical parameters of the matrix comparator for obtaining the best linking results are the value of the similarity threshold and the list of stop words to exclude from the comparison. Earlier research has shown that the standard deviation of the token frequency distribution is strongly predictive of how useful stop words will be in improving linking performance. The research results presented here demonstrate a method for using statistics of the token frequency distribution to estimate the threshold value and the stop-word selection likely to give the best linking results; a minimal sketch of this estimation approach follows the keywords below. The model was built using linear regression and validated with independent datasets.

    Keywords: Entity resolution, Record linking, Matrix comparator, Stop words, Token frequency, F-measure
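    The estimation idea can be sketched as follows. The reference strings, the summary features, and the training values pairing frequency statistics with best-performing thresholds are all hypothetical; the sketch only illustrates summarizing the token frequency distribution and fitting a linear regression to estimate a similarity threshold and a frequency-based stop-word cut-off.

      # Minimal sketch (all statistics, feature choices, and training values are hypothetical).
      from collections import Counter
      import numpy as np
      from sklearn.linear_model import LinearRegression

      references = [
          "JOHN R TALBURT 123 MAIN ST LITTLE ROCK",
          "TALBURT JOHN 123 MAIN STREET LITTLE ROCK AR",
          "MARY SMITH 45 OAK AVE CONWAY AR",
      ]

      # Token frequency distribution over the whole dataset.
      freq = Counter(tok for ref in references for tok in ref.split())
      counts = np.array(list(freq.values()), dtype=float)
      stats = np.array([[counts.mean(), counts.std(), len(counts)]])   # summary features

      # Hypothetical training data: distribution statistics of earlier datasets paired
      # with the threshold that gave the best F-measure on each of them.
      X_train = np.array([[1.2, 0.4, 500], [2.5, 1.1, 800], [3.8, 2.0, 1200]])
      y_train = np.array([0.35, 0.50, 0.65])

      threshold_model = LinearRegression().fit(X_train, y_train)
      estimated_threshold = threshold_model.predict(stats)[0]

      # A simple stop-word heuristic: tokens whose frequency is far above the mean.
      stop_words = {tok for tok, c in freq.items() if c > counts.mean() + counts.std()}
      print(round(estimated_threshold, 2), stop_words)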
  • Imelda Aritonang, Achmad Nizar Hidayanto, Nur Fitriah Ayuning Budi *, Rahmat M Samik Ibrahim, Solikin Solikin Pages 27-40

    The Central Statistics Agency (BPS) is a government institution with the authority to carry out statistical activities in the form of censuses and surveys, producing the statistical data needed by the government, the private sector, and the general public as a reference for planning, monitoring, and evaluating development results. Providing high-quality statistical data is therefore critical, because it directly affects the effectiveness of decision making. This paper aims to develop a framework for prioritizing solutions to data quality problems using the Analytic Hierarchy Process (AHP); a minimal sketch of the AHP calculation follows the keywords below. The framework is built by conducting interviews and a Focus Group Discussion (FGD) with experts to elicit the interrelationships between problems and solutions. The resulting model is then tested in a case study, namely the Central Jakarta Central Bureau of Statistics (BPS). The results of the study indicate that the proposed model can be used to formulate solutions to data quality problems in BPS.

    Keywords: Data quality, Analytical hierarchy process, AHP, Central Statistics Agency of the Republic of Indonesia
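    A minimal sketch of the underlying AHP calculation is given below. The pairwise comparison matrix is a made-up example of expert judgments over three hypothetical data-quality solutions, not data from the BPS case study; it only shows how a priority vector and a consistency ratio are derived.

      # Minimal AHP sketch (hypothetical expert judgments, not BPS data).
      import numpy as np

      # Pairwise comparison matrix on Saaty's 1-9 scale:
      # entry [i, j] is how strongly solution i is preferred over solution j.
      A = np.array([
          [1.0, 3.0, 5.0],
          [1/3, 1.0, 2.0],
          [1/5, 1/2, 1.0],
      ])

      # Priority vector: principal eigenvector of A, normalized to sum to 1.
      eigvals, eigvecs = np.linalg.eig(A)
      k = np.argmax(eigvals.real)
      priorities = np.abs(eigvecs[:, k].real)
      priorities /= priorities.sum()

      # Consistency check: CI = (lambda_max - n) / (n - 1), compared with the random index.
      n = A.shape[0]
      CI = (eigvals.real[k] - n) / (n - 1)
      RI = 0.58                      # Saaty's random index for n = 3
      print(priorities, CI / RI)     # acceptable consistency when CI/RI < 0.1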
  • Junaid Ali Reshi *, Satwinder Singh Pages 41-63
    The quest for improving software quality has given rise to various studies that focus on enhancing quality through various processes. Code smells, which are indicators of software quality, have not been studied extensively to determine their role in the prediction of defects in software. This study aims to investigate the role of code smells in the prediction of non-faulty classes. We examine four versions of the Eclipse software (3.2, 3.3, 3.6, and 3.7) for metrics and smells. Further, different code smells, derived subjectively through iPlasma, are considered in conjunction, and three efficient but subjective models are developed to detect code smells, one on each of the Random Forest, J48, and SVM machine learning algorithms; a minimal sketch of this setup follows the keywords below. These models are then used to detect the absence of defects in the four Eclipse versions. The effect of balanced and unbalanced datasets is also examined for these four versions. The results suggest that code smells can be a valuable feature in discriminating the absence of defects in software.
    Keywords: Preventive maintenance, Code smells, Machine learning, Random forest
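    A minimal sketch of the modelling setup is given below. The code-smell feature matrix and the faultiness rule are hypothetical, and J48 (a Weka implementation of C4.5) is approximated here with scikit-learn's DecisionTreeClassifier; the sketch only illustrates training the three classifiers with balanced class weights and comparing them by cross-validated F-measure.

      # Minimal sketch (hypothetical features; J48 approximated by a decision tree).
      import numpy as np
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.svm import SVC
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(1)
      # Each row is a class with binary code-smell indicators (e.g. God Class, Feature Envy, ...).
      X = rng.integers(0, 2, size=(200, 5)).astype(float)
      y = (X.sum(axis=1) <= 2).astype(int)    # toy rule: few smells -> non-faulty (target = 1)

      models = {
          "RandomForest": RandomForestClassifier(n_estimators=100, class_weight="balanced"),
          "J48-like":     DecisionTreeClassifier(class_weight="balanced"),
          "SVM":          SVC(class_weight="balanced"),
      }
      for name, model in models.items():
          scores = cross_val_score(model, X, y, cv=5, scoring="f1")
          print(name, scores.mean().round(3))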
  • Ahmad Khalilijafarabad * Pages 64-71
    Over the last 20 years, and particularly with the advent of Big Data and analytics, Data and Information Quality (DIQ) has remained a fast-growing research area. There are many views and streams in DIQ research, generally aiming at improving the effectiveness of decision making in organizations. Although many studies have aimed to clarify the role of Big Data quality for organizations, there is no comprehensive literature review that shows the main differences between traditional data quality research and Big Data quality research. This paper analyzes the papers published on Big Data quality and finds that there is almost no new mainstream of Big Data quality research. It is shown in this paper that the main concepts of data quality do not change in the Big Data context and that only some new issues have been added to this area.
    Keywords: Big data, Big data quality, Data quality, Text mining
  • Markus Helfert, Mouzhi Ge * Pages 72-83
    Despite the increasing importance of data and information quality, current research related to Big Data quality is still limited. In particular, it is unknown how to apply previous data quality models to Big Data. In this paper we review Big Data quality research from several perspectives and apply a known quality model, with its elements of conformance to specification and conformance to design, in the context of Big Data. Furthermore, we extend this model and demonstrate its utility by analyzing the impact of three Big Data characteristics, namely volume, velocity, and variety, in the context of smart cities. This paper intends to build a foundation for further empirical research to understand Big Data quality and its implications for the design and execution of smart service ecosystems.
    Keywords: Big data quality, Information quality, Smart cities, Service design, Smart services, Data quality model, Smart service ecosystem