Classifying whether or not employees are likely to resign¶

First import all the libraries we are going to use

import pandas as pd
import numpy as np

from sklearn import  neighbors, svm
from sklearn.linear_model import LogisticRegressionCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier,\
    AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB 

from sklearn import preprocessing
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score,\
    mean_absolute_error, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, train_test_split,\
    learning_curve
from sklearn.feature_selection import RFE

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import itertools
import pickle

DIR_DATA = "data"
DIR_PROCESSED = "processed"
CV_FOLDS = 5

HR_comma_sep.csv¶

All our data is stored in this file, here's a slightly altered description of the features provided with the competition.

Features¶

Employee satisfaction level
Last evaluation
Number of projects
Average monthly hours
Time spent at the company
Whether they have had a work accident
Whether they have had a promotion in the last 5 years
Department
Salary

Target¶

Whether the employee has left

Now let's load and look at the actual data

hr = pd.read_csv(DIR_DATA + '/HR_comma_sep.csv')
hr.head()

hr.describe()

print('Percent who left: {:.2f}'.format(np.sum(hr.left) / len(hr.left) * 100))

Percent who left: 23.81

First, let's turn sales departments and salary (high/medium/low) into category types so that scikit-learn's algorithms can handle them

hr.sales = hr.sales.astype('category').cat.codes
hr.salary = hr.salary.astype('category').cat.codes
hr[["sales","salary"]].head()

Off-the-shelf Classification¶

Let's create a quick test of basic classification algorithms to see how well each performs with mostly unaltered data

def predict_left(df, clf, test_size=0.2):
    X = df.drop(['left'],1)
    y = df.left 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    
    
    clf.fit(X_train, y_train)
    
    train_pred = clf.predict(X_train)
    test_pred = clf.predict(X_test)
    print('Training Score: {:.3f}'.format(clf.score(X_train, y_train)))
    print('Testing Score: %.3f' % (clf.score(X_test, y_test)))
    
    print()

classifiers = [RandomForestClassifier(n_jobs=-1),
               RandomForestClassifier(criterion='entropy', n_jobs=-1),
               svm.SVC(),
               LogisticRegressionCV(n_jobs=-1), 
               AdaBoostClassifier(), 
               GradientBoostingClassifier(),
               neighbors.KNeighborsClassifier(n_jobs=-1),
               MultinomialNB(class_prior=[76.19, 23.81])] 

np.random.seed(0)

for i, clf in enumerate(classifiers):
    print('Classifier ', i)
    
    predict_left(hr, clf, test_size=0.4)

Classifier  0
Training Score: 0.998
Testing Score: 0.988

Classifier  1
Training Score: 0.997
Testing Score: 0.986

Classifier  2
Training Score: 0.962
Testing Score: 0.949

Classifier  3
Training Score: 0.780
Testing Score: 0.774

Classifier  4
Training Score: 0.962
Testing Score: 0.964

Classifier  5
Training Score: 0.977
Testing Score: 0.974

Classifier  6
Training Score: 0.951
Testing Score: 0.927

Classifier  7
Training Score: 0.760
Testing Score: 0.765

They all seem to do well, besides svm, with Random Forest doing best regardless of it's criterion at ~98.7% predictive accuracy.

K-Fold Cross Validation¶

Let's make sure they consistently produce that score using K-Fold Cross-Validation.

def cross_val_left(hr, clf, cv_folds=CV_FOLDS, drop=['left']):
    X = hr.drop(drop, 1)
    y = hr.left 
    scores = cross_val_score(clf, X, y, cv=cv_folds, n_jobs=-1)#, scoring='roc_auc')
    
    
    
    print('Cross val score: ', sum(scores) / cv_folds )
    print(scores)
    
    print()

classifiers = [RandomForestClassifier(n_estimators=500 ,n_jobs=-1), RandomForestClassifier(n_estimators=500, criterion='entropy', n_jobs=-1)]

for i, clf in enumerate(classifiers):
    print('Classifier ', i)
    
    cross_val_left(hr, clf)

Classifier  0
Cross val score:  0.992066333178
[ 0.99900033  0.981       0.98633333  0.99933311  0.99466489]

Classifier  1
Cross val score:  0.991866333163
[ 0.99866711  0.98133333  0.98566667  0.99899967  0.99466489]

Great, so they tend to do even better on average, with a score of 99.2% each over 5 cross-validation folds.

Learning Curve¶

Now let's see how the classifier improves with more and more training data by plotting a learning curve.

Plot code based on example on this page: https://goo.gl/pd1A8K

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
X = hr.drop(['left'],1)
y = hr.left 
train_sizes, train_scores, valid_scores = learning_curve(clf, X, y, train_sizes=np.linspace(.1, 1.0, 10), cv=5, n_jobs=-1)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(valid_scores, axis=1)
test_scores_std = np.std(valid_scores, axis=1)
train_sizes = np.linspace(.1,1.0,10)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")

plt.grid()
plt.legend(loc="best")
plt.show()

So there seems to be a huge improvement between getting 10% and 20% of the data, with mild improvements after that.

Let's zoom in on the improvements after 40% of the training data is being trained on to get a better picture.

The blue outline around the green cross-validation score is the standard deviation of the predicted scores.

train_scores_mean = np.mean(train_scores, axis=1)[3:] # 0.4 on
train_scores_std = np.std(train_scores, axis=1)[3:]
test_scores_mean = np.mean(valid_scores, axis=1)[3:]
test_scores_std = np.std(valid_scores, axis=1)[3:]
train_sizes = np.linspace(.1,1.0,10)[3:]

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")

plt.grid()
plt.legend(loc="best")
plt.show()

Feature Selection¶

Now let's do some feature selection by removing the features that least matter

np.random.seed(0)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
X = hr.drop(['left'],1)
y = hr.left 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    
clf.fit(X_train, y_train)
print(X.columns)
clf.feature_importances_

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident',
       'promotion_last_5years', 'sales', 'salary'],
      dtype='object')

array([ 0.3475117 ,  0.12315145,  0.17318094,  0.14422289,  0.18418062,
        0.00558285,  0.00098859,  0.01256505,  0.00861591])

Let's start by dropping all features with lower than 0.01 relevance.

It's possible that a feature with a low "relevance" can still matter if the magnitude is great enough, however most of these values are in the same range near zero, so that shouldn't matter.

drop = ['left', 'promotion_last_5years', 'Work_accident', 'sales', 'salary']
for i, clf in enumerate(classifiers):
    print('Classifier ', i)
    
    cross_val_left(hr, clf, drop=drop)

Classifier  0
Cross val score:  0.991532933111
[ 0.99800067  0.98133333  0.98633333  0.99833278  0.99366455]

Classifier  1
Cross val score:  0.991532910896
[ 0.99833389  0.98166667  0.98566667  0.99833278  0.99366455]

So it seems that dropping all with < 0.01 relevance seems to not affect score much:

99.20% -> 99.16%

Let's investigate further.

hr.corr()["left"]

satisfaction_level      -0.388375
last_evaluation          0.006567
number_project           0.023787
average_montly_hours     0.071287
time_spend_company       0.144822
Work_accident           -0.154622
left                     1.000000
promotion_last_5years   -0.061788
sales                    0.032105
salary                  -0.001294
Name: left, dtype: float64

Will we get the same results by looking at other metrics?

Let's see what happens when we drop everything with abs(correlation) < 0.005

drop = ['left',  'sales', 'salary']
classifiers = [RandomForestClassifier(n_estimators=500 ,n_jobs=-1), RandomForestClassifier(n_estimators=500, criterion='entropy', n_jobs=-1)]#, svm.SVC()]#xgb.XGBClassifier(n_estimators=500, nthread=-1)]# svm.SVC()]

for i, clf in enumerate(classifiers):
    print('Classifier ', i)
    
    cross_val_left(hr, clf, drop=drop)

Classifier  0
Cross val score:  0.991666355363
[ 0.99800067  0.981       0.986       0.99899967  0.99433144]

Classifier  1
Cross val score:  0.991866310918
[ 0.99833389  0.98166667  0.98633333  0.99866622  0.99433144]

The decrease in accuracy is the same as when we pulled out four features before, even though we only took out half as many.

What happens if we only take out the other two features and leave in the "least correlated"?

drop = ['left', 'promotion_last_5years', 'Work_accident']
for i, clf in enumerate(classifiers):
    print('Classifier ', i)
    
    cross_val_left(hr, clf, drop=drop)

Classifier  0
Cross val score:  0.992199666511
[ 0.99900033  0.98166667  0.98633333  0.99933311  0.99466489]

Classifier  1
Cross val score:  0.991866266489
[ 0.99900033  0.98133333  0.986       0.99899967  0.993998  ]

Hmm. So it seems removing Salary and Sales is what caused the 0.04 percentage point decrease, and still matter somewhat even though they're the least correlated.

The other two features though seem worthless and possibly counterproductive though since removing them managed to increase the RandomForestClassifier's (with Entropy) accuracy

Recursive Feature Selection¶

Recursive Feature Selection, or RFE, is a more predictable method of ranking features.

Let's see if we get the same results as before by looking at what the worst 4 features are

model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf = RFE(model, 5)

X = hr.drop(['left'], 1)
y = hr.left 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    
clf.fit(X_train, y_train)
print(clf.support_)
print(clf.ranking_)
# Gets same result as feature_importance, which makes sense

[ True  True  True  True  True False False False False]
[1 1 1 1 1 4 5 2 3]

It's the same as before, with promotion_last_5years and Work_accident being the worst features, with Salary and Sales right behind.

That worked out nicely.

But is that true if we use other algorithms with similar accuracy (that RFE is willing to work with)?

classifiers = [AdaBoostClassifier(), 
               GradientBoostingClassifier(),
               LogisticRegressionCV(),
               MultinomialNB()] 

for i, model in enumerate(classifiers):
    clf = RFE(model, 5)

    X = hr.drop(['left'], 1)
    y = hr.left 
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    
    print("Classifier %d" % i)
    clf.fit(X_train, y_train)
    
    print(clf.ranking_)
    print()
    # Gets same result as feature_importance

Classifier 0
[1 1 1 1 1 5 4 2 3]

Classifier 1
[1 1 1 1 1 3 5 4 2]

Classifier 2
[1 1 2 4 1 1 1 3 5]

Classifier 3
[1 1 2 5 3 1 1 4 1]

Both the high accuracy classifiers (AdaBoost and GradientBoost) view those same 4 as being the lest useful, in similar if not identical orders. Logistic Regression and Multinomial Naive Bayes on the other hand, with their ~77% accuracy only agreed that sales was bad.

Given that all the high accuracy classifiers agree, I think it's safe to say that promotion_last_5years and Work_accident are the least useful features

PCA¶

Now let's look at what happens when reduce the relevant features down to 2 and 3 dimensions so that we can plot them.

2-D¶

drop = ['left',  'promotion_last_5years', 'Work_accident']

y = np.array(hr.left)

X = np.array(hr.drop(drop,1))
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

Now that we've done the PCA transformation, let's see how much of the variance is explained per new feature, and overall

print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))

[ 0.99518845  0.00328358]
0.998472031391

It looks like pretty much all of the variance is being handled by the first dimension, and that we should still get high accuracy since most of the variance in the data is still there.

# Still get good accuracy, ~97%
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
y = np.array(hr.left) 

scores = cross_val_score(clf, X_pca, y, cv=CV_FOLDS, n_jobs=-1)
    
    
    
print('Cross val score: ', sum(scores) / CV_FOLDS )
print(scores)

Cross val score:  0.9572663318
[ 0.96801066  0.92066667  0.93466667  0.97965989  0.98332778]

Despite catching most of the variance, the accuracy of the classifier goes down a lot (3.5 percentage points) and there is a lot more variance in the various cross folds (92.07% <-> 98.33%)

Now let's plot our 2-Dimensional data¶

colors = itertools.cycle('rb')
target_ids = range(2)
plt.figure()
for i, c, label in zip(target_ids, colors, ["stay","left"]):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                c=c, label=label)
plt.legend()
plt.show()

It seems to have plotted them into stratified lines of predictions where the same general chunks of each line are stay or left, with a few outliers. Someone might get pretty good accuracy by just looking at the x-values on the bottom, which makes sense since most of the variance is covered by the first pca-feature that we made

3-D¶

y = np.array(hr.left)

X = np.array(hr.drop(drop,1))
pca = PCA(n_components=3).fit(X)
X_pca = pca.transform(X)

colors = itertools.cycle('rb')
target_ids = range(2)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for i, c, label in zip(target_ids, colors, ["stay","left"]):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],X_pca[y == i, 2],
                c=c, label=label)
    
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')

plt.legend()
plt.show()

/home/altock/anaconda3/envs/py35/lib/python3.5/site-packages/matplotlib/collections.py:865: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

Sicne almost all of the variance is in the first dimension, it makes sense that the 3d graph is essentially the same 2-D plane we got before.

Now let's check in on the accuracy of the 3-D version

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
y = hr.left 

scores = cross_val_score(clf, X_pca, y, cv=CV_FOLDS, n_jobs=-1)
    
    
    
print('Cross val score: ', sum(scores) / CV_FOLDS )
print(scores)

Cross val score:  0.971266776704
[ 0.97500833  0.948       0.95666667  0.98399466  0.99266422]

Accuracy seems to be halfway between the 2-D accuracy and the original accuracy at 97.1%, which makes sense.

Confusion Matrix and Classification Report¶

Now, since the data is semi-unbalanced (only 23.81% left), let's just do a quick classification report to make sure that our accuracy metric isn't giving us a false impression of the quality of this classifier.

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
X = hr.drop(drop,1)
y = hr.left 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Confusion Matrix¶

Plot code based on example on this page: https://goo.gl/3Mz7vL

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

class_names = ["Stay","Left"]

plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix')


plt.show()

As you can see, almost everything called a stay is a stay, and almost everything called left is a left, so recall and precision are both extremely high. So we definitely weren't getting a false impression.

Let's get the exact numbers below:

Classification Report¶

print(classification_report(y_test, y_pred, target_names=class_names))

             precision    recall  f1-score   support

       Stay       0.99      1.00      1.00      2306
       Left       1.00      0.98      0.99       694

avg / total       1.00      1.00      1.00      3000

Conclusion¶

99.2% is a pretty good accuracy, and it's not as if that's a bad metric since recall and precision are just as high even though the data is unbalanced. So I'm just going to leave it as is. However:

Other things to do (Feature Engineering)¶

Department level metrics. However, in my preliminary exploration in that area didn't seem to improve anything.
Feature Scaling on some of the metrics that are in completely different ranges than the rest of the features, like average_monthly_hours. Also didn't seem to improve anything on Random Forest at least.

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	Work_accident	left	promotion_last_5years
count	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000	14999.000000
mean	0.612834	0.716102	3.803054	201.050337	3.498233	0.144610	0.238083	0.021268
std	0.248631	0.171169	1.232592	49.943099	1.460136	0.351719	0.425924	0.144281
min	0.090000	0.360000	2.000000	96.000000	2.000000	0.000000	0.000000	0.000000
25%	0.440000	0.560000	3.000000	156.000000	3.000000	0.000000	0.000000	0.000000
50%	0.640000	0.720000	4.000000	200.000000	3.000000	0.000000	0.000000	0.000000
75%	0.820000	0.870000	5.000000	245.000000	4.000000	0.000000	0.000000	0.000000
max	1.000000	1.000000	7.000000	310.000000	10.000000	1.000000	1.000000	1.000000

	satisfaction_level	last_evaluation	number_project	average_montly_hours	time_spend_company	left	sales	salary
0	0.38	0.53	2	157	3	1	sales	low
1	0.80	0.86	5	262	6	1	sales	medium
2	0.11	0.88	7	272	4	1	sales	medium
3	0.72	0.87	5	223	5	1	sales	low
4	0.37	0.52	2	159	3	1	sales	low

99.2% Accuracy on Employee Resignations