# How to Encode Categorical Values for Multiple Columns | Scikit-Learn

Label encoding is a data preprocessing technique used in machine learning to convert categorical values into numerical form, facilitating their use in algorithms that require numerical input. In label encoding, each category is assigned a unique integer based on alphabetical ordering, frequency, or any other criteria deemed appropriate for the task at hand. For instance, in a dataset with a categorical feature having three categories—‘Red’, ‘Blue’, and ‘Green’—label encoding might convert these to 0, 1, and 2, respectively.

*Figure: an illustration of label encoding using two tables, both with columns ‘Color’, ‘Size’, and ‘Price’. On the left is the original data, with colors such as blue and green and sizes such as S, M, and XL. On the right is the label-encoded data, where for example blue has been replaced with the integer 0, green with 1, and XL with 3. The ‘Price’ column was numeric to begin with, so it is left unchanged.*
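To make this concrete, here is a minimal sketch of such a mapping using scikit-learn's `LabelEncoder` (covered in detail below), which assigns integers in sorted, i.e. alphabetical, order:

```python
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integers to the unique values in sorted order
le = LabelEncoder()
le.fit(['Red', 'Blue', 'Green'])

print(le.classes_)                              # ['Blue' 'Green' 'Red']
print(le.transform(['Red', 'Blue', 'Green']))   # [2 0 1]
```

Because the categories are sorted before labels are assigned, ‘Blue’ receives 0, ‘Green’ 1, and ‘Red’ 2, regardless of the order in which they appear in the data.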

## 1. Label Encoding Across Multiple Columns in Scikit-Learn

In the following example, we have a DataFrame object with three columns: ‘Color’, ‘Size’, and ‘Price’. The ‘Color’ and ‘Size’ columns are categorical, while the ‘Price’ column is numerical. We will perform label encoding on the ‘Color’ and ‘Size’ columns so that they can be used in machine learning. Luckily, scikit-learn comes with a handy class called `LabelEncoder` that can be used to do this.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a sample DataFrame
data = {
    'Color': ['Blue', 'Green', 'Red', 'Green', 'Red'],
    'Size': ['L', 'M', 'S', 'XL', 'M'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Initializing LabelEncoder
le = LabelEncoder()

# Applying LabelEncoder on the categorical columns
# (fit_transform refits the encoder on each column)
df['Color'] = le.fit_transform(df['Color'])
df['Size'] = le.fit_transform(df['Size'])

# Displaying the DataFrame after label encoding
print("\nDataFrame after Label Encoding:")
print(df)
```

By running this code, you should get the following output:

```
Original DataFrame:
   Color Size  Price
0   Blue    L    100
1  Green    M    150
2    Red    S    200
3  Green   XL    120
4    Red    M    180

DataFrame after Label Encoding:
   Color  Size  Price
0      0     0    100
1      1     1    150
2      2     2    200
3      1     3    120
4      2     1    180
```

As we can see, the ‘Color’ and ‘Size’ columns are transformed from categorical data to numerical labels in the following manner:

- Colors
  - All ‘Red’ values are converted to the integer 2.
  - All ‘Green’ values are converted to the integer 1.
  - All ‘Blue’ values are converted to the integer 0.
- Sizes
  - All ‘S’ values are converted to the integer 2.
  - All ‘M’ values are converted to the integer 1.
  - All ‘L’ values are converted to the integer 0.
  - All ‘XL’ values are converted to the integer 3.

We do not encode the ‘Price’ column, as it is already numerical, so it remains unchanged. The `fit_transform` method of the `LabelEncoder` fits the encoder and returns the encoded labels in one step. Note that `LabelEncoder` assigns labels deterministically, based on the sorted (alphabetical) order of the unique values in the column—which is why ‘Blue’ maps to 0 and ‘XL’, the last size alphabetically, maps to 3.

In this example, label encoding was performed on two columns, but similar syntax can of course be used to perform it on any number of columns.
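For instance, one common pattern (a sketch, not the only way) is to loop over the categorical columns and keep one fitted encoder per column, so that each column can later be decoded independently:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['Blue', 'Green', 'Red'],
    'Size': ['L', 'M', 'S'],
    'Price': [100, 150, 200]
})

# Keep one fitted encoder per column so each can be inverse-transformed later
encoders = {}
for col in ['Color', 'Size']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)

# Decoding a column back to its original categories
print(encoders['Color'].inverse_transform(df['Color']))
```

Storing the encoders in a dictionary keyed by column name avoids the pitfall of reusing a single encoder whose state only reflects the last column it was fitted on.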

As a side note, you can use slightly different syntax to achieve the exact same result in fewer lines of code, if you are only interested in the encoding part (foreshadowing…):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a sample DataFrame
data = {
    'Color': ['Blue', 'Green', 'Red', 'Green', 'Red'],
    'Size': ['L', 'M', 'S', 'XL', 'M'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Perform label encoding on the Color and Size columns
df[['Color', 'Size']] = df[['Color', 'Size']].apply(LabelEncoder().fit_transform)

# Displaying the DataFrame after label encoding
print("\nDataFrame after Label Encoding:")
print(df)
```

## 2. A Deeper Dive into the Workings of LabelEncoder

`LabelEncoder` is a utility class in scikit-learn’s preprocessing module, used to convert categorical values into numerical labels. Here is a summary of its methods:

1. `fit(y)`: This method is used for fitting the label encoder with the categorical values. The `fit` method takes one argument, `y`, which is the array-like structure of categorical values. After fitting, the unique categories are stored in an array in the `classes_` attribute.
2. `transform(y)`: After the `fit` method has been called, `transform` can be used to transform the categorical values in `y` into numerical labels. The argument `y` is the array-like structure of categorical values that you want to transform. This method returns an array of the transformed labels.
3. `fit_transform(y)`: This is a convenience method that combines the `fit` and `transform` methods into one step. It fits the label encoder with the categorical values in `y` and returns the transformed numerical labels. This method is often used when you want to fit and transform the data in one step.
4. `inverse_transform(y)`: This method is used to transform the numerical labels back into the original categorical values. The argument `y` is the array-like structure of numerical labels that you want to transform back to categorical values. This method is useful when you want to convert the predicted labels of your model back into a human-readable form.
5. `get_params(deep=True)`: This method gets the parameters of the estimator. The optional `deep` parameter, if set to `True`, will return the parameters for this estimator and contained subobjects that are estimators. This is particularly useful for getting the configuration of an estimator in a pipeline.
6. `set_params(**params)`: This method sets the parameters of the estimator. The method takes any number of keyword arguments where the keys are the parameter names and the values are the parameter values. This is useful for setting the hyperparameters of an estimator.
7. `classes_`: This is an attribute, not a method. After fitting the label encoder, this attribute stores the array of unique categories (labels).

Below is an example where the methods `fit(y)`, `transform(y)`, and `inverse_transform(y)` of `LabelEncoder` are used sequentially. In this example, we will use a simple list of colors as the data to be encoded.

```python
from sklearn.preprocessing import LabelEncoder

# Initializing the LabelEncoder
le = LabelEncoder()

# Sample data
colors = ['Red', 'Blue', 'Green', 'Red', 'Green']

# Fitting the label encoder
le.fit(colors)
print("Classes found:", le.classes_)

# Transforming the categories to numerical labels
encoded_labels = le.transform(colors)
print("Encoded Labels:", encoded_labels)

# Inverse transforming the numerical labels back to categories
decoded_labels = le.inverse_transform(encoded_labels)
print("Decoded Labels:", decoded_labels)
```

In this example, the `fit` method is used to learn the unique categories in the ‘colors’ list. The `transform` method is used to convert the categories in ‘colors’ to numerical labels. Finally, the `inverse_transform` method is used to convert the numerical labels back to the original categories. Running this code you should get the following output:

```
Classes found: ['Blue' 'Green' 'Red']
Encoded Labels: [2 0 1 2 1]
Decoded Labels: ['Red' 'Blue' 'Green' 'Red' 'Green']
```

It’s important to note that `LabelEncoder` is intended to be used on the target variable (i.e., the y in `fit(X, y)`) rather than on the input features. If you need to perform label encoding on the input features, it might be more convenient to use `OrdinalEncoder` for that purpose as it can handle multiple columns at once.
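A quick sketch of the difference: `LabelEncoder` expects a one-dimensional array and raises an error on a 2-D feature matrix, while `OrdinalEncoder` encodes all columns in one call:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

X = np.array([['Red', 'S'],
              ['Green', 'M'],
              ['Red', 'M']])

# LabelEncoder only accepts 1-D input, so a 2-D matrix raises a ValueError
try:
    LabelEncoder().fit_transform(X)
except ValueError as e:
    print("LabelEncoder failed:", e)

# OrdinalEncoder handles the whole 2-D feature matrix at once
print(OrdinalEncoder().fit_transform(X))
```

Each column is encoded independently in sorted order, so the first column maps ‘Green’ to 0 and ‘Red’ to 1, and the second maps ‘M’ to 0 and ‘S’ to 1.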

### An Important Note on Saving the LabelEncoder

There are several reasons to save your `LabelEncoder` after fitting:

• Consistent Transformations: When you are working with machine learning models, it’s crucial to apply the same transformations to the new data as were applied to the training data. If you don’t save the `LabelEncoder`, you would need to fit it again on the new data, which might result in different numerical encodings if the new data has different categories or a different distribution of categories.
• Easy Deployment: When deploying a machine learning model into production, the preprocessing steps, including label encoding, need to be applied to incoming data before making predictions. Saving the `LabelEncoder` ensures that you can easily apply the exact same transformations during the deployment phase as were applied during the model training phase.
• Reproducibility: Saving the preprocessing objects, including `LabelEncoder`, ensures that you can reproduce your results at a later time, which is important for verification and auditing purposes.
• Inverse Transformation for Interpretability: After making predictions on encoded data, you might want to convert the predictions (or some features) back to the original categorical format for interpretation or reporting purposes. Having the saved `LabelEncoder` allows you to perform this inverse transformation accurately.
• Efficiency: Fitting a `LabelEncoder` can be computationally expensive, especially with a large number of categories or a large dataset. Saving the encoder allows you to avoid the computational cost of refitting it every time you need to transform data.

Saving the `LabelEncoder`, and more generally any preprocessing transformer, is a best practice in machine learning workflows, ensuring consistency, efficiency, and reproducibility.

You can save a `LabelEncoder` (or any other scikit-learn transformer) using libraries like `joblib` or `pickle`. Here is an example of how you might save and load a `LabelEncoder` using `joblib`:

```python
import joblib
from sklearn.preprocessing import LabelEncoder

# Creating and fitting the LabelEncoder
le = LabelEncoder()
data = ['Red', 'Green', 'Blue']
le.fit(data)

# Transforming data with the original LabelEncoder
transformed_data = le.transform(data)

# Saving the LabelEncoder to a file
joblib.dump(le, 'label_encoder.joblib')

# Loading the LabelEncoder back from the file
loaded_le = joblib.load('label_encoder.joblib')

# Transforming data with the loaded LabelEncoder
loaded_transformed_data = loaded_le.transform(data)

# Verifying that the transformations are identical
transformations_equal = (transformed_data == loaded_transformed_data).all()

print("Are the transformations identical?", transformations_equal)
```

This approach ensures that you can reuse the exact same `LabelEncoder` at a later time, maintaining consistency across your data preprocessing workflow. Running this code produces the following output:

```
Are the transformations identical? True
```
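If you prefer to stay within the standard library, the same round trip works with `pickle`; here is an equivalent sketch (serializing to bytes in memory rather than to a file):

```python
import pickle
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Red', 'Green', 'Blue'])

# Serialize the fitted encoder to bytes (pickle.dump works the same with a file)
payload = pickle.dumps(le)
loaded_le = pickle.loads(payload)

# The loaded encoder produces the same labels as the original
print(loaded_le.transform(['Red', 'Green', 'Blue']))
```

As a caveat, pickled scikit-learn objects are generally only guaranteed to load correctly under the same scikit-learn version that saved them, which is another argument for pinning versions in production.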

## 3. Alternatives to LabelEncoder for Feature Encoding

As `LabelEncoder` is intended for encoding target labels rather than input features, scikit-learn provides similar classes designed specifically for feature encoding.

### Ordinal Encoding

Here are some reasons to use `OrdinalEncoder` instead of `LabelEncoder`:

1. Intended for encoding features: `LabelEncoder`, while it also encodes categories as integers, is primarily used for encoding target labels, not features.
2. Preserving Ordinal Nature: When the categorical variables have an ordinal relationship (where the order of the categories matters), `OrdinalEncoder` is more suitable. It converts the categories to integers while preserving the order.

Below is an example of how to use `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

print("Original data:\n", df)

# Initializing OrdinalEncoder with an explicit category order per feature
encoder = OrdinalEncoder(categories=[['Red', 'Green', 'Blue'], ['S', 'M', 'L', 'XL']])

# Fitting and transforming the features
encoded_data = encoder.fit_transform(df[['Color', 'Size']])

# Creating a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=['Color', 'Size'])

# Adding the 'Price' column back to the DataFrame
final_df = pd.concat([encoded_df, df['Price']], axis=1)

print("\nEncoded data:\n", final_df)
```

In this example, the `Color` and `Size` columns of the `df` DataFrame are encoded using `OrdinalEncoder`. The `categories` parameter specifies the order in which categories should be encoded for each feature. The encoded data is then put back into a DataFrame along with the ‘Price’ column. Running this code, you should get the following output:

```
Original data:
    Color Size  Price
0    Red    S    100
1  Green    M    150
2   Blue    L    200
3  Green    S    120
4    Red   XL    180

Encoded data:
    Color  Size  Price
0    0.0   0.0    100
1    1.0   1.0    150
2    2.0   2.0    200
3    1.0   0.0    120
4    0.0   3.0    180
```

After running this code, `final_df` will contain the ordinal-encoded values for ‘Color’ and ‘Size’, preserving the order specified in the `categories` parameter.
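One practical advantage of `OrdinalEncoder` worth knowing: it can be configured to tolerate categories at transform time that were never seen during fitting, via `handle_unknown='use_encoded_value'`. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Map unseen categories to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit(np.array([['Red'], ['Green'], ['Blue']]))

# 'Purple' was never seen during fit, so it becomes -1
print(enc.transform(np.array([['Green'], ['Purple']])))
```

This is useful in production, where incoming data may contain categories that did not exist in the training set; `LabelEncoder` has no equivalent option and will raise an error on unseen values.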

### One-hot encoding

While ordinal encoding is straightforward and efficient in terms of computational resources, it can introduce ordinality where none exists, potentially leading to misinterpretation by the machine learning model. This means that even though ‘Red’, ‘Blue’, and ‘Green’ are simply distinct categories without any inherent order, the model might interpret them as having an ordinal relationship, with ‘Green’ > ‘Blue’ > ‘Red’. To address this issue, one might use one-hot encoding or similar techniques that do not introduce unintended ordinal relationships.

In one-hot encoding, each unique category in the data is represented as a binary vector. For a particular data point, the vector corresponding to its category is set to 1, and all other vectors are set to 0. The number of vectors depends on the number of unique categories in the data.

For example, consider a dataset with a categorical feature ‘Color’ that has three categories: Red, Green, and Blue. Using one-hot encoding, we represent each color as a binary vector:

• Red: [1, 0, 0]
• Green: [0, 1, 0]
• Blue: [0, 0, 1]

This method eliminates any ordinal relationship that might be misinterpreted by the algorithm, as each category is equally distant from all others in the encoding space. However, it also increases the dimensionality of the dataset, which can lead to the “curse of dimensionality” in cases with a high number of categories, and it may increase the computational cost. One-hot encoding is widely used for categorical data that doesn’t have an inherent order, and it’s a critical step for many machine learning algorithms to properly understand and process categorical inputs.

Here’s an example of how to perform one-hot encoding using pandas and scikit-learn. In this example, I’ll create a pandas DataFrame and then apply one-hot encoding to a categorical column.

#### Using Pandas

```python
import pandas as pd

# Creating a sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

print("Original data:\n", df)

# Applying one-hot encoding using pandas
one_hot_encoded_df = pd.get_dummies(df, columns=['Color', 'Size'], prefix=['Color', 'Size'])

print("\nOne-hot encoded data:\n", one_hot_encoded_df)
```

In the above example, the `pd.get_dummies` function is used to convert the ‘Color’ and ‘Size’ columns into one-hot encoded vectors. The `columns` parameter specifies which columns to encode, and the `prefix` parameter assigns a prefix to the generated columns, helping to identify them easily. Running this code, you should get the following output:

```
Original data:
    Color Size  Price
0    Red    S    100
1  Green    M    150
2   Blue    L    200
3  Green    S    120
4    Red   XL    180

One-hot encoded data:
    Price  Color_Blue  Color_Green  Color_Red  Size_L  Size_M  Size_S  Size_XL
0    100       False        False       True   False   False    True    False
1    150       False         True      False   False    True   False    False
2    200        True        False      False    True   False   False    False
3    120       False         True      False   False   False    True    False
4    180       False        False       True   False   False   False     True
```
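Note that recent pandas versions emit boolean dummy columns by default, as in the output above; if you prefer 0/1 integers, you can pass `dtype=int` to `pd.get_dummies`:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# dtype=int yields 0/1 columns instead of True/False
print(pd.get_dummies(df, columns=['Color'], dtype=int))
```

Integer dummies behave identically in most models, but are often easier to read and to export to formats that lack a boolean type.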

#### Using Scikit-Learn

Using scikit-learn's `OneHotEncoder` is very similar to using `LabelEncoder`:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Creating a sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

print("Original data:\n", df)

# Initializing the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fitting the encoder and transforming the data
one_hot_encoded_array = encoder.fit_transform(df[['Color', 'Size']])

# The transformed data is an array, so we need to convert it back to a DataFrame
one_hot_encoded_df = pd.DataFrame(one_hot_encoded_array, columns=encoder.get_feature_names_out(['Color', 'Size']))

# Concatenating the one-hot encoded columns to the original DataFrame
final_df = pd.concat([df, one_hot_encoded_df], axis=1).drop(['Color', 'Size'], axis=1)

print("\nOne-hot encoded data:\n", final_df)
```

In this scikit-learn example, we first initialize the `OneHotEncoder` with `sparse_output=False` so that it returns a NumPy array instead of a sparse matrix. Note: we could use the `drop='first'` parameter of `OneHotEncoder` to apply “dummy encoding”, which drops the first category of each feature to avoid multicollinearity. We then fit the encoder and transform the ‘Color’ and ‘Size’ columns. The result is concatenated back to the original DataFrame, and the original categorical columns are dropped. Running this code should produce the following output:

```
Original data:
    Color Size  Price
0    Red    S    100
1  Green    M    150
2   Blue    L    200
3  Green    S    120
4    Red   XL    180

One-hot encoded data:
    Price  Color_Blue  Color_Green  Color_Red  Size_L  Size_M  Size_S  Size_XL
0    100         0.0          0.0        1.0     0.0     0.0     1.0      0.0
1    150         0.0          1.0        0.0     0.0     1.0     0.0      0.0
2    200         1.0          0.0        0.0     1.0     0.0     0.0      0.0
3    120         0.0          1.0        0.0     0.0     0.0     1.0      0.0
4    180         0.0          0.0        1.0     0.0     0.0     0.0      1.0
```

Both approaches yield similar results, and you can choose either based on your preferences and specific requirements.
