Predicting income level using Random Forest

In this article, we will use a Random Forest Classifier to predict whether a person earns more than $50,000 a year based on demographic and financial data. The dataset used can be found on Kaggle.

Additionally, you can check the full script with outputs on my GitHub page (I will only show the most relevant code snippets in this article). Let’s get started!
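
For reference, the snippets below assume a setup roughly like the following (the file name is a placeholder and may differ from the one in the full script):

# Libraries used throughout the article
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             ConfusionMatrixDisplay, f1_score)
from imblearn.over_sampling import RandomOverSampler

# Loading the dataset (the file name below is a placeholder)
df = pd.read_csv('income.csv')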

Cleaning and encoding categorical variables

Our dataset contains the following variables/features:
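
The variable overview came from a notebook screenshot; it can be reproduced with something like:

# Column names, data types and non-null counts
df.info()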


Let’s explore the numerical variables first:
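
The summary table itself is not reproduced here, but it can be generated with:

# Summary statistics for the numeric columns
df.describe()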


The maximum and minimum values seem to make sense for all of them (the 99-hours-per-week maximum looks like a clear outlier, but it is still technically possible). Therefore, we will leave these numeric variables untouched.

Moving on to the categorical variables, we identified some missing data while exploring the unique values of each column. We need to replace the string ‘ ?’ with proper missing values in two of the features:

# Replacing the question marks with None values
df.loc[df.JobType == ' ?', 'JobType'] = None
df.loc[df.occupation == ' ?', 'occupation'] = None
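
A quick sanity check (not part of the original snippet) shows how many missing values this introduces:

# Number of missing values per affected column
df[['JobType', 'occupation']].isna().sum()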

As we saw when listing the variables, most of them are categorical, and only gender and SalStat (the latter being our target variable) are binary. Let’s encode these two binary variables first:

# Encoding the target variable
df.loc[df.SalStat == ' greater than 50,000', 'SalStat'] = 1
df.loc[df.SalStat == ' less than or equal to 50,000', 'SalStat'] = 0

# Encoding the gender variable
df.loc[df.gender == ' Male', 'gender'] = 1
df.loc[df.gender == ' Female', 'gender'] = 0

For the rest of the variables, we will need to create dummy features in order to include them in the Random Forest model:

# Creating the dummy columns for all of the categorical variables
# that have more than 2 possible values
dummy_cols = [c for c in df.columns if df[c].dtype != 'int64' and c not in ['gender', 'SalStat']]

dummy_df = pd.get_dummies(data=df, columns=dummy_cols)

This makes the number of features go all the way up to 103, which is a very significant increase. If we wanted to reduce that number, we could manually create broader categories for some of the variables (for education, for example, as sketched below), but we are going to leave the dataset as-is for now.
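
As an illustration only, such a grouping could look like the sketch below (the column name EdType and the category labels are assumptions and would need to match the actual values in the dataset):

# Hypothetical mapping from detailed education levels to broader groups
education_groups = {' 10th': 'No diploma', ' 11th': 'No diploma',
                    ' 12th': 'No diploma', ' HS-grad': 'High school',
                    ' Some-college': 'Some college',
                    ' Bachelors': 'Bachelors or higher',
                    ' Masters': 'Bachelors or higher',
                    ' Doctorate': 'Bachelors or higher'}

# Working on a copy so the actual pipeline is left untouched
edu_grouped = df['EdType'].replace(education_groups)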

Creating the model and optimizing it

Before splitting the dataset and training the model, we need to keep in mind that our dataset is very imbalanced:

# We have a very imbalanced dataset
df.SalStat.value_counts(normalize=True).plot(kind='bar')

More than 70% of the observations are for people who earn less than 50K a year. We will later see how this affects the performance of our model when it comes to correctly labeling the other class.

Let’s proceed to divide our dataset into features and target, as well as training and testing data:

# Dividing the features and target variables
X = dummy_df.drop('SalStat', axis='columns')
y = dummy_df['SalStat']

# Splitting the dataset in training and testing portions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Casting the y series to integer so we don't get any errors when running the
# model and predicting
y_train = y_train.astype('int')
y_test = y_test.astype('int')

We can now train our Random Forest Classifier:

# Training the Random Forest
model = RandomForestClassifier(n_jobs=-1, max_depth=19, max_features=10)
model.fit(X_train, y_train)

# Making the predictions
predictions = model.predict(X_test)

cm = confusion_matrix(y_test, predictions)

# Displaying the confusion matrix for our model:
ConfusionMatrixDisplay(cm).plot()

At first glance, we can already detect that the default model did not do a very good job of correctly classifying people who earn more than 50K. Let’s check our model scores to be more precise:

# Getting the accuracy score
acc_score = accuracy_score(y_test, predictions)

# Counting true/false positives and negatives by hand
# (prediction + label == 2 means both are 1; == 0 means both are 0)
TP = sum((predictions + y_test) == 2)
FP = sum([pred == 1 and true == 0 for pred, true in zip(predictions, y_test)])
TN = sum((predictions + y_test) == 0)
FN = sum([pred == 0 and true == 1 for pred, true in zip(predictions, y_test)])

# Proportion of predicted positives that are actually positive
# (Salary > 50k)
precision = TP / (TP + FP)

# Proportion of true positives detected by the model
recall = TP / (TP + FN)

# Calculating f1 score
from sklearn.metrics import f1_score
f1_score_ = f1_score(y_test, predictions)

f1_score_

# Creating a dict that stores the scores of our model
model_scores = {'Accuracy': acc_score,
                'Precision': precision,
                'Recall': recall,
                'F1 score': f1_score_,
                'Model': 'Default model'}

model_scores
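
An equivalent and more compact way to get these counts and scores (assuming the default [0, 1] label ordering in the confusion matrix) would be:

from sklearn.metrics import precision_score, recall_score

# Unpacking the confusion matrix directly
TN, FP, FN, TP = cm.ravel()

precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)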

Our recall metric is not very good, as only 59% of all positives (people who earn more than 50K) in the testing data have been correctly classified as positives by our model. On the other hand, 95% of all actual negatives (people who earn 50K or less) have been correctly classified. This is a consequence of our highly imbalanced dataset, which makes it much easier for the model to identify people who belong to the majority class.

There are many approaches to dealing with an imbalanced dataset, but we are going to choose a very simple one for our project: random over-sampling. It simply means that we are going to sample (with replacement) from our minority class (people who earn more than 50K) to balance the training dataset and improve the overall performance of our model:

# Oversampling the minority class to balance the training data
ros = RandomOverSampler(random_state=42)

X_ros, y_ros = ros.fit_resample(X, y)

# Checking the distribution of the target column for the new dataset
y_ros.value_counts(normalize=True)

Our dataset is now perfectly balanced between the two classes. However, since we sampled with replacement, the number of rows in our dataset has increased (all the way up to 48,566) and we now have many more duplicate rows. (Strictly speaking, the over-sampling should be applied only to the training portion of the data so that duplicated rows cannot leak into the test set; we keep the simpler whole-dataset approach here.)

# Number of duplicate rows in the feature dataset before oversampling
X.duplicated().sum()

# Output: 4,066

# A lot of duplicates introduced by over sampling the minority class
X_ros.duplicated().sum()

# Output: 20,654

Now, we need to train the model on the resampled data and evaluate its metrics in the same way we did before for our default model; a sketch of this step is shown below, and the results for the new oversampled model are discussed right after.
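
The exact code for this step is in the full script on GitHub; a minimal sketch, assuming the same hyperparameters as before and collecting the scores in a ros_model dictionary (the name used in the comparison further down), might look like this:

# Splitting the resampled dataset into training and testing portions
X_train_ros, X_test_ros, y_train_ros, y_test_ros = train_test_split(
    X_ros, y_ros, test_size=0.3, random_state=42)

# Casting the target to integer, as before
y_train_ros = y_train_ros.astype('int')
y_test_ros = y_test_ros.astype('int')

# Training a new Random Forest with the same hyperparameters
ros_rf = RandomForestClassifier(n_jobs=-1, max_depth=19, max_features=10)
ros_rf.fit(X_train_ros, y_train_ros)

ros_predictions = ros_rf.predict(X_test_ros)

# Storing the scores of the oversampled model for later comparison
ros_model = {'Accuracy': accuracy_score(y_test_ros, ros_predictions),
             'Precision': precision_score(y_test_ros, ros_predictions),
             'Recall': recall_score(y_test_ros, ros_predictions),
             'F1 score': f1_score(y_test_ros, ros_predictions),
             'Model': 'Oversampled model'}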



Our overall model performance has increased dramatically by simply oversampling the minority class and thereby balancing our dataset. With this new Random Forest Classifier, 95% of all positives in the testing data are correctly predicted. However, this also means that the model performs worse at identifying the negatives: its specificity, defined as TN / (TN + FP), drops from 95% in the previous model to 86% in this new one.

Let’s visualize the change in all four metrics to better understand the improvement:

scores_df = pd.DataFrame([model_scores, ros_model])

melted_df = pd.melt(scores_df, 
                    id_vars='Model', var_name='Score_type',
                    value_vars=scores_df.columns[:-1])

# Creating a barplot to compare the two models' scores
figure, ax = plt.subplots(figsize=(8, 8), dpi=80)

sns.barplot(data=melted_df, x='Score_type', y='value', hue='Model')

plt.ylim(0, 1)
plt.legend(loc='best')

plt.xlabel('Score type')
plt.ylabel('Value')
plt.title('Comparing the two models')

The recall and F1 score improvements are phenomenal and put our model in a very good position.

For the sake of simplicity, I have not performed cross-validation or hyperparameter tuning, as the computational resources needed quickly exceed what I have available. If those resources were available, we could further improve our model by:

  • Creating a dictionary with many different values for the most important parameters of our Random Forest Classifier, such as the number of trees, the maximum depth allowed for each tree, the minimum number of samples needed to split a node, and the maximum number of features to consider at each split.
  • Performing a grid search (exhaustive if possible, or random) over that dictionary to find which parameter combination yields the best-performing model (using a score metric of our choice).
  • Using 10-fold cross-validation to assess each candidate model, which gives us a more robust evaluation of model performance (a sketch of this workflow is shown after the list).
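
As a minimal sketch of that workflow, assuming we tune the model on the oversampled training data from the earlier snippet and optimize for the F1 score (the parameter values below are illustrative, not tuned):

from sklearn.model_selection import RandomizedSearchCV

# Candidate values for the most important parameters (illustrative only)
param_grid = {'n_estimators': [100, 300, 500],
              'max_depth': [10, 15, 20, None],
              'min_samples_split': [2, 5, 10],
              'max_features': ['sqrt', 10, 20]}

# Random search over the grid with 10-fold cross-validation, scored on F1
search = RandomizedSearchCV(RandomForestClassifier(n_jobs=-1, random_state=42),
                            param_distributions=param_grid,
                            n_iter=20, scoring='f1', cv=10, random_state=42)

search.fit(X_train_ros, y_train_ros)

search.best_params_, search.best_score_

search.best_estimator_ then holds the model refitted with the best parameter combination found.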
