Predicting employee turnover using tree-based ensemble learning algorithms

Article Type:
Research/Original Article (accredited journal ranking)
Abstract:

Turnover of key employees is one of the most important concerns of Human Resource Managers (HRMs), because an organization that loses its valuable staff also loses the skills and experience gained over the years; predicting employee turnover therefore helps HRMs to hire and retain permanent employees. Data mining methods are among the effective tools for this task, and many researchers have worked in this field. This study reviews recently published articles based on machine learning models that use the Kaggle Human Resource (HR) databases [1-5], in order to compare them with the proposed models.

In [9], the authors selected the 11 most important features by collecting the features common to previous articles and filtering them with feature review and selection algorithms. After converting non-numerical variables to numerical values and normalizing the data to the range [0, 1], their attrition prediction approach, based on machine, deep, and ensemble learning models, was evaluated on a large and a medium-sized simulated HR dataset and then on a small real dataset of 450 responses. Their approach achieves higher accuracy on the three datasets (0.96, 0.98, and 0.99, respectively) than previous solutions. In 2021, the authors of [3] examined the relationships between features using the Pearson correlation coefficient and selected the 11 features with the highest correlation; they then used six different machine learning algorithms, including Random Forest (RF), Logistic Regression (LR), …, to predict employee turnover, and the highest accuracy they obtained was 0.85 for RF. In [1], the authors used two IBM datasets and a database containing HR information from a regional bank in the USA to predict employee turnover. After cleaning and preprocessing the data, the performance of 10 different machine learning algorithms, such as Decision Tree (DT), RF, LR, Neural Network, …, was evaluated using ROC criteria on 10 small, medium, and large randomly selected subsets of the primary datasets. The average accuracy of the algorithms was 0.83 on the small, 0.81 on the medium, and 0.86 on the large datasets. The authors of [4] ran three main experiments on IBM Watson simulated datasets to predict employee turnover. The first experiment trained the original class-imbalanced dataset with support vector machines with several kernel functions, random forest, and K-nearest neighbour (KNN); the second used the adaptive synthetic (ADASYN) approach to overcome the class imbalance and retrained the same models on the balanced dataset. Training the ADASYN-balanced dataset with KNN (K = 3) achieved the highest performance, with an F1-score of 0.93.

In this study, the proposed turnover prediction approach is based on tree-based ensemble learning models and is evaluated on a large standard simulated HR dataset (hr_data) of 15,000 samples with 10 features and a medium-sized dataset (IBM) of 1,470 samples with 34 features. The employee turnover rate is 16.1% in IBM and 23.8% in hr_data, so both datasets are unbalanced. To balance the data, random under-sampling and its combination with random over-sampling were used, with a ratio of 0.5965 for IBM and 0.6558 for hr_data. In the preprocessing stage, features with zero variance and samples containing missing values were also removed.
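A minimal sketch of the cleaning and balancing steps just described is given below, assuming the pandas / imbalanced-learn stack. The sampler order, the interpretation of the reported ratios (0.5965 for IBM, 0.6558 for hr_data) as imbalanced-learn sampling_strategy values, and the target column names are assumptions for illustration, not the authors' code.

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

def clean_and_balance(df: pd.DataFrame, target_col: str, ratio: float, seed: int = 0):
    df = df.dropna()                          # drop samples containing missing values
    X = df.drop(columns=[target_col])
    y = df[target_col]
    X = X.loc[:, X.nunique() > 1]             # drop zero-variance (constant) features

    # Random over-sampling of the minority class followed by random
    # under-sampling of the majority class; reading the paper's ratio as an
    # imbalanced-learn sampling_strategy is an assumption.
    X, y = RandomOverSampler(sampling_strategy=ratio, random_state=seed).fit_resample(X, y)
    X, y = RandomUnderSampler(sampling_strategy=1.0, random_state=seed).fit_resample(X, y)
    return X, y

# Hypothetical usage; the target column names come from the public Kaggle/IBM
# datasets and are not stated in the abstract:
# X_bal, y_bal = clean_and_balance(hr_df,  "left",      ratio=0.6558)   # hr_data
# X_bal, y_bal = clean_and_balance(ibm_df, "Attrition", ratio=0.5965)   # IBM
```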
Categorical (non-numeric) values were then converted to binary fields, and all features were scaled to [0, 1] by data normalization. To reduce the feature dimensions of the IBM dataset, the Non-negative Matrix Factorization (NMF) technique was used (n_components = 17, max_iter = 500), initialized with the non-negative double singular value decomposition method with zeros filled based on X. After reviewing and cleaning the data, six different classification algorithms, including KNN (k = 1), RF (1,500 trees), DT, ExtraTreesClassifier (1,000 trees), and Support Vector Classifier, were trained on 70% of the data in the processing stage. The optimal hyperparameter values of the algorithms were set using the RandomizedSearchCV and GridSearchCV techniques. To investigate the effect of balancing and dimensionality reduction on model performance, experiments were run in three stages (before balancing; after balancing, before dimensionality reduction; after balancing and dimensionality reduction) on the remaining 30% of the data. The results, shown in Tables 2-4, indicate that the proposed model, which uses optimized tree-based ensemble learning algorithms together with data balancing and NMF dimensionality reduction, increases the F1-score of turnover prediction. On the hr_data dataset the best F1-score, 99.52%, was obtained by the RandomForest algorithm, and on the IBM HR dataset the best F1-score, 95.82%, was obtained by the ExtraTreesClassifier algorithm, both higher than previous research. Table 5 compares the results of previous research with this research.

Since predicting employee attrition is not enough without identifying the characteristics that affect it, the most effective features were selected after building the models and evaluating their performance, using a combined feature selection method that averages the results of the univariate feature selection method "SelectKBest" and a wrapper feature selection method, "Recursive Feature Elimination" (RFE), with four learning algorithms: RF, DT, ExtraTreesClassifier, and AdaBoost. SelectKBest combines the chi2 univariate statistical test with the selection of the K features that score highest against the target variable. In the RFE method, machine learning algorithms are used to remove the least important features through recursive training until the number of features reaches the set value (17 features in this article). The performance results of the models based on the selected features are shown in Table 6. The most effective features are "age", "daily rate", "overtime", "NumCompaniesWorked", and "monthly income".
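The processing and feature-selection stages described above can be sketched as follows, assuming the scikit-learn implementations of NMF, the tree-based ensembles, SelectKBest, and RFE. The 70/30 split handling, the use of the 'nndsvda' initializer for NMF, the vote-based way of combining the two selectors, and a 0/1-encoded target are assumptions for illustration rather than the authors' exact code.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_pipeline(X, y, feature_names, n_components=17, k=17, seed=0):
    # X is assumed to be the balanced, [0, 1]-scaled feature matrix and y a 0/1 target;
    # 70% of the data is used for training, the remaining 30% for evaluation.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)

    # NMF dimensionality reduction (n_components=17, max_iter=500); the 'nndsvda'
    # initializer is assumed to correspond to the NNDSVD-based init in the text.
    nmf = NMF(n_components=n_components, max_iter=500, init="nndsvda", random_state=seed)
    Z_tr, Z_te = nmf.fit_transform(X_tr), nmf.transform(X_te)

    # One of the tree-based ensemble learners used in the study.
    clf = ExtraTreesClassifier(n_estimators=1000, random_state=seed)
    clf.fit(Z_tr, y_tr)
    print("F1 on the held-out 30%:", f1_score(y_te, clf.predict(Z_te)))

    # Combined feature ranking on the original features: one vote from the
    # chi2-based SelectKBest plus one vote per RFE run with four learners
    # (the vote-counting combination is an assumption).
    votes = np.zeros(X.shape[1])
    votes += SelectKBest(chi2, k=k).fit(X_tr, y_tr).get_support().astype(int)
    for est in (RandomForestClassifier(random_state=seed),
                DecisionTreeClassifier(random_state=seed),
                ExtraTreesClassifier(random_state=seed),
                AdaBoostClassifier(random_state=seed)):
        votes += RFE(est, n_features_to_select=k).fit(X_tr, y_tr).get_support().astype(int)

    top = np.argsort(-votes)[:k]              # indices of the most-voted features
    return [feature_names[i] for i in top]
```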

Language:
Persian
Published:
Signal and Data Processing, Volume 20, Issue 3, 2023
Pages:
73 to 86
https://www.magiran.com/p2672073  