When it comes to feature engineering, one technique you definitely don’t want to miss out on is target encoding! So what actually is it?
Well, at its core, target encoding is a technique for converting categorical features into numerical features. It does this by replacing each category with a value derived from the target variable.
The most common approach is to calculate the mean of the target variable for each unique combination of categories. However, this can be problematic when a category doesn’t have many entries. To handle this, we can weight each category by its number of occurrences and factor in a smoothing value to account for undersampled categories.
Mathematically you can define it as follows:
\[\begin{aligned} TE_{\text{target}}\left( \text{categories} \right) = \frac{ \left(count \left( \text{categories} \right) \times mean_{\text{target}} \left( \text{categories}\right) \right) + \left( \omega_{\text{smoothing}} \times mean_{\text{target}}\left( \text{global} \right) \right)}{count \left( \text{categories} \right) + \omega_{\text{smoothing} }} \end{aligned}\]
Where:
- $TE_{\text{target}}\left( \text{categories} \right)$ is the target encoded output
- $count \left( \text{categories}\right)$ is the count of values across the unique combination of categories
- $\omega_{\text{smoothing}}$ is a smoothing factor which is defined as some constant
- $mean_{\text{target}} \left( \text{categories}\right)$ is the mean of the target value with respect to the categories
- $mean_{\text{target}}\left( \text{global} \right)$ is the global mean of the target value across all your data (no groupings)
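As a quick sanity check, the formula above can be implemented directly in a few lines of pandas. The column name `city` and the categories here are made up purely for illustration:

```python
import pandas as pd

# Toy binary-target data: category "a" appears 3 times, "b" only once (undersampled)
df = pd.DataFrame({
    "city": ["a", "a", "a", "b"],
    "target": [1, 1, 0, 1],
})

w_smoothing = 2                      # smoothing constant, chosen by you
mean_global = df["target"].mean()    # global mean = 3/4 = 0.75

stats = df.groupby("city")["target"].agg(["mean", "count"])
te = (stats["count"] * stats["mean"] + w_smoothing * mean_global) / (stats["count"] + w_smoothing)

# "a": (3 * 2/3 + 2 * 0.75) / (3 + 2) = 3.5 / 5 = 0.7
# "b": (1 * 1   + 2 * 0.75) / (1 + 2) = 2.5 / 3 ≈ 0.833 -> pulled toward the global mean
print(te)
```

Notice how the undersampled category “b” gets pulled toward the global mean, which is exactly what the smoothing term is there for.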
Target encoding can be a very effective way to improve the performance of machine learning models, because it gives the model a direct numerical signal about the relationship between each categorical feature and the target variable.
Resources to learn more
Plenty of awesome resources exist going into detail on how you can implement this in practice; here are some noteworthy ones.
- Videos
- Articles / Posts
- Notebooks
- Docs
Caveats and potential issues
Even though target encoding has lots of benefits, there are some negatives worth being wary of:
- It can lead to overfitting.
  - This is because the model learns the relationship between the categorical features and the target variable, even when there is no causal relationship.
  - To avoid overfitting, you can use a validation set to evaluate the performance of the model, or even perform k-fold target encoding.
  - The validation set should not be used to train the model, only to evaluate it after training. If the model performs significantly better on the training set than on the validation set, it is likely overfitting.
- It can be computationally expensive.
  - This is because the target statistics (mean and count per category combination) must be computed over the entire training dataset.
- It can be sensitive to outliers.
  - This is because the mean of the target variable can be significantly affected by a small number of outliers.
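The k-fold variant mentioned above can be sketched roughly as follows: each row is encoded using statistics computed from the *other* folds only, so no row ever sees its own target value. This is a minimal illustration, not a production implementation; the function name and arguments are my own:

```python
import numpy as np
import pandas as pd

def kfold_target_encode(df, group_col, target_col, n_splits=5, w_smoothing=15, seed=0):
    """Out-of-fold target encoding: each row is encoded from statistics
    computed on the other folds only, which limits target leakage."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_splits, size=len(df))   # random fold assignment per row
    encoded = np.full(len(df), np.nan)
    for k in range(n_splits):
        fit_part = df[fold != k]                     # rows used to compute the stats
        mean_global = fit_part[target_col].mean()
        stats = fit_part.groupby(group_col)[target_col].agg(["mean", "count"])
        te = (stats["count"] * stats["mean"] + w_smoothing * mean_global) / (
            stats["count"] + w_smoothing
        )
        # Encode the held-out fold; unseen categories fall back to the global mean
        holdout = df[fold == k]
        encoded[fold == k] = holdout[group_col].map(te).fillna(mean_global).to_numpy()
    return pd.Series(encoded, index=df.index)
```

Because every row’s encoding comes from data it was not part of, the encoded column behaves much more like it will at prediction time.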
Overall, target encoding is a powerful feature engineering technique that can be used to improve the performance of machine learning models. However, it is important to be aware of the potential drawbacks of target encoding, such as overfitting and sensitivity to outliers.
Here are some additional tips for using target encoding effectively:
- Use a validation set to evaluate the performance of the model, making sure you only fit your target encoding on your training set and then transform your validation set.
  - Otherwise you would be leaking information across your data splits (and in any case you wouldn’t have a target column when making real predictions).
Here are some examples of how target encoding can be used to improve the performance of machine learning models:
- In a classification problem, target encoding can be used to improve the performance of a classifier by encoding categorical features that are related to the target variable. For example, if the target variable is “customer churn,” then target encoding could be used to encode categorical features such as “customer location” and “customer age.”
- In a regression problem, target encoding can be used to improve the performance of a regressor by encoding categorical features that are related to the target variable. For example, if the target variable is “house price,” then target encoding could be used to encode categorical features such as “house location” and “house size.”
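To make the churn example concrete, here is a small sketch on a tiny made-up dataset (`customer_location` and the values in it are placeholders; real data would have far more rows and categories):

```python
import pandas as pd

df = pd.DataFrame({
    "customer_location": ["north", "north", "south", "south", "south"],
    "churn":             [1,        0,       1,       1,       0],
})

w_smoothing = 1
mean_global = df["churn"].mean()   # 3/5 = 0.6

stats = df.groupby("customer_location")["churn"].agg(["mean", "count"])
df["TE_customer_location"] = df["customer_location"].map(
    (stats["count"] * stats["mean"] + w_smoothing * mean_global)
    / (stats["count"] + w_smoothing)
)
# north: (2 * 0.5 + 0.6) / 3 ≈ 0.533    south: (3 * 2/3 + 0.6) / 4 = 0.65
```

The classifier now receives a single numeric column summarising how churn-prone each location has historically been, instead of an arbitrary label.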
Practice example
Say you have data relating to a binary classification problem, in which case your target variable would be 0 or 1. You could group based on your columns as follows (Python pseudo-code):
import pandas as pd

# Already split datasets
df_train = pd.read_csv(file_path_to_train)
df_valid = pd.read_csv(file_path_to_valid)
df_test = pd.read_csv(file_path_to_test)

# Defining your values
w_smoothing = 15                      # smoothing constant, chosen by you
mean_global = df_train.target.mean()  # global mean, computed on train only
grouping_columns = list()             # fill with the categorical columns to group by

# Creating intermediate variables for target encoding along with the final encoded column
te = df_train.groupby(grouping_columns)['target'].agg(['mean', 'count']).reset_index()
te['TE_grouping_columns'] = ((te['mean'] * te['count']) + (w_smoothing * mean_global)) / (te['count'] + w_smoothing)
te = te.drop(columns=['mean', 'count'])  # keep only the grouping keys and the encoded column

# Merging the encoded column back onto each split (fit on train, transform the rest)
df_train = df_train.merge(te, on=grouping_columns, how='left')
df_valid = df_valid.merge(te, on=grouping_columns, how='left')
df_test = df_test.merge(te, on=grouping_columns, how='left')

# Filling null values if present in valid & test (combos unseen in train)
df_valid['TE_grouping_columns'] = df_valid['TE_grouping_columns'].fillna(mean_global)
df_test['TE_grouping_columns'] = df_test['TE_grouping_columns'].fillna(mean_global)
It’s important to note:
- $ \omega_{\text{smoothing}}$ is picked by you; there’s no hard and fast rule, but 15-25 usually works.
- At the end you fill null values in the valid and test sets, since the groupings are performed on train and unique combinations of the grouping columns might not be present in the other splits; in that case you fall back to the global mean of the train set.
Another intuitive way to picture target encoding is that it gives the model access to information it may not have encountered naturally. With a binary target encoded as above, the group mean is simply the fraction of rows in that grouping whose target is 1, so it can be read as the empirical probability of the positive class for that combination: the closer $ TE_{\text{target}}\left(\text{categories}\right)$ is to 1, the more strongly that combination is associated with the positive class.