Label encoding is a data preprocessing technique used in machine learning to convert categorical values into numerical form, facilitating their use in algorithms that require numerical input. In label encoding, each category is assigned a unique integer based on alphabetical ordering, frequency, or any other criteria deemed appropriate for the task at hand. For instance, in a dataset with a categorical feature having three categories—‘Red’, ‘Blue’, and ‘Green’—label encoding might convert these to 0, 1, and 2, respectively.
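To make this concrete, here is a minimal sketch of the mapping (this assumes scikit-learn’s LabelEncoder, which assigns integers in alphabetical order, so Blue=0, Green=1, Red=2):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Alphabetical order: 'Blue' < 'Green' < 'Red', so the labels are 0, 1, 2
print(le.fit_transform(['Red', 'Blue', 'Green']))  # [2 0 1]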

Table of Contents:
- Label Encoding Across Multiple Columns in Scikit-Learn
- A deeper dive into the workings of LabelEncoder
- Alternatives to LabelEncoder for Feature Encoding
1. Label Encoding Across Multiple Columns in Scikit-Learn
In the following example, we have a DataFrame object with three columns: ‘Color’, ‘Size’, and ‘Price’. The ‘Color’ and ‘Size’ columns are categorical, while the ‘Price’ column is numerical. We will perform label encoding on the ‘Color’ and ‘Size’ columns so that they can be used in machine learning models. Luckily, scikit-learn comes with a handy class called LabelEncoder that can be used to do this.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a sample DataFrame
data = {
    'Color': ['Blue', 'Green', 'Red', 'Green', 'Red'],
    'Size': ['L', 'M', 'S', 'XL', 'M'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Initializing LabelEncoder
le = LabelEncoder()

# Applying LabelEncoder on categorical columns
df['Color'] = le.fit_transform(df['Color'])
df['Size'] = le.fit_transform(df['Size'])

# Displaying the DataFrame after label encoding
print("\nDataFrame after Label Encoding:")
print(df)
By running this code, you should get the following output:
Original DataFrame:
Color Size Price
0 Blue L 100
1 Green M 150
2 Red S 200
3 Green XL 120
4 Red M 180
DataFrame after Label Encoding:
Color Size Price
0 0 0 100
1 1 1 150
2 2 2 200
3 1 3 120
4 2 1 180
As we can see, the ‘Color’ and ‘Size’ columns are transformed from categorical data to numerical labels in the following manner:
- Colors
  - All ‘Blue’ values are converted to the integer 0.
  - All ‘Green’ values are converted to the integer 1.
  - All ‘Red’ values are converted to the integer 2.
- Sizes
  - All ‘L’ values are converted to the integer 0.
  - All ‘M’ values are converted to the integer 1.
  - All ‘S’ values are converted to the integer 2.
  - All ‘XL’ values are converted to the integer 3.
We do not encode the ‘Price’ column, as it is already numerical, so it remains unchanged. The fit_transform method is used to fit the label encoder and return the encoded labels. Note that LabelEncoder assigns labels based on the alphabetical order of the unique values in the column, so the integer assigned to a given category depends on which other categories are present in the data.
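As a quick illustration of this point (a hedged sketch of my own, not part of the original example), the integer assigned to ‘Red’ changes when the set of categories changes:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# With two categories, 'Red' comes second alphabetically
print(le.fit_transform(['Red', 'Blue']))           # [1 0]
# Adding 'Green' shifts 'Red' to index 2
print(le.fit_transform(['Red', 'Blue', 'Green']))  # [2 0 1]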
In this example, label encoding was performed on two columns, but similar syntax can of course be used to perform it on any number of columns.
As a side note, you can use slightly different syntax to achieve the exact same thing in fewer lines of code, if you are only interested in the encoding part (foreshadowing…):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a sample DataFrame
data = {
    'Color': ['Blue', 'Green', 'Red', 'Green', 'Red'],
    'Size': ['L', 'M', 'S', 'XL', 'M'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Perform label encoding on Color and Size columns
df[['Color', 'Size']] = df[['Color', 'Size']].apply(LabelEncoder().fit_transform)

# Displaying the DataFrame after label encoding
print("\nDataFrame after Label Encoding:")
print(df)
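One caveat with this shortcut: the fitted encoders are not retained, so you cannot call inverse_transform later. A common workaround (a sketch of my own, not from the original example) is to keep one fitted encoder per column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['Blue', 'Green', 'Red'],
    'Size': ['L', 'M', 'S'],
})

# Keep one fitted encoder per column so each mapping can be reversed later
encoders = {}
for col in ['Color', 'Size']:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df)
print(encoders['Color'].inverse_transform(df['Color']))  # ['Blue' 'Green' 'Red']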
2. A deeper dive into the workings of LabelEncoder
LabelEncoder is a utility class in scikit-learn’s preprocessing module, used to convert categorical values into numerical labels. Here is a summary of its methods:
- fit(y): This method is used for fitting the label encoder with the categorical values. The fit method takes one argument, y, which is the array-like structure of categorical values. After fitting, the unique categories are stored in an array in the classes_ attribute.
- transform(y): After the fit method has been called, transform can be used to transform the categorical values in y into numerical labels. The argument y is the array-like structure of categorical values that you want to transform. This method returns an array of the transformed labels.
- fit_transform(y): This is a convenience method that combines the fit and transform methods into one step. It fits the label encoder with the categorical values in y and returns the transformed numerical labels. This method is often used when you want to fit and transform the data in one step.
- inverse_transform(y): This method is used to transform the numerical labels back into the original categorical values. The argument y is the array-like structure of numerical labels that you want to transform back to categorical values. This method is useful when you want to convert the predicted labels of your model back into a human-readable form.
- get_params(deep=True): This method gets the parameters of the estimator. The optional deep parameter, if set to True, will return the parameters for this estimator and contained subobjects that are estimators. This is particularly useful for getting the configuration of an estimator in a pipeline.
- set_params(**params): This method sets the parameters of the estimator. The method takes any number of keyword arguments where the keys are the parameter names and the values are the parameter values. This is useful for setting the hyperparameters of an estimator.
- classes_: This is an attribute, not a method. After fitting the label encoder, this attribute stores the array of unique categories (labels).
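The main methods are demonstrated in the example below; as a quick hedged sketch of the remaining members (the printed values assume scikit-learn’s alphabetical sorting, and the empty dict reflects the fact that LabelEncoder takes no constructor parameters):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['S', 'M', 'L'])
print(le.classes_)      # ['L' 'M' 'S'] -- the sorted unique categories
print(le.get_params())  # {} -- LabelEncoder has no constructor parameters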
Below is an example where the fit(y), transform(y), and inverse_transform(y) methods of LabelEncoder are used sequentially. In this example, we will use a simple list of colors as the data to be encoded.
from sklearn.preprocessing import LabelEncoder

# Initializing the LabelEncoder
le = LabelEncoder()

# Sample data
colors = ['Red', 'Blue', 'Green', 'Red', 'Green']

# Fitting the label encoder
le.fit(colors)
print("Classes found:", le.classes_)

# Transforming the categories to numerical labels
encoded_labels = le.transform(colors)
print("Encoded Labels:", encoded_labels)

# Inverse transforming the numerical labels back to categories
decoded_labels = le.inverse_transform(encoded_labels)
print("Decoded Labels:", decoded_labels)
In this example, the fit method is used to learn the unique categories in the ‘colors’ list. The transform method is used to convert the categories in ‘colors’ to numerical labels. Finally, the inverse_transform method is used to convert the numerical labels back to the original categories. Running this code you should get the following output:
Classes found: ['Blue' 'Green' 'Red']
Encoded Labels: [2 0 1 2 1]
Decoded Labels: ['Red' 'Blue' 'Green' 'Red' 'Green']
It’s important to note that LabelEncoder is intended to be used on the target variable (i.e., the y in fit(X, y)) rather than on the input features. If you need to perform label encoding on the input features, it might be more convenient to use OrdinalEncoder for that purpose, as it can handle multiple columns at once.
An Important Note on Saving the LabelEncoder
There are several reasons to save your LabelEncoder after fitting:
- Consistent Transformations: When you are working with machine learning models, it’s crucial to apply the same transformations to new data as were applied to the training data. If you don’t save the LabelEncoder, you would need to fit it again on the new data, which can produce different numerical encodings if the new data contains a different set of categories (see the sketch after this list).
- Easy Deployment: When deploying a machine learning model into production, the preprocessing steps, including label encoding, need to be applied to incoming data before making predictions. Saving the LabelEncoder ensures that you can easily apply the exact same transformations during the deployment phase as were applied during the model training phase.
- Reproducibility: Saving the preprocessing objects, including the LabelEncoder, ensures that you can reproduce your results at a later time, which is important for verification and auditing purposes.
- Inverse Transformation for Interpretability: After making predictions on encoded data, you might want to convert the predictions (or some features) back to the original categorical format for interpretation or reporting purposes. Having the saved LabelEncoder allows you to perform this inverse transformation accurately.
- Efficiency: Fitting a LabelEncoder can be computationally expensive, especially with a large number of categories or a large dataset. Saving the encoder allows you to avoid the computational cost of refitting it every time you need to transform data.
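To make the first point concrete, here is a hedged sketch (my own illustration, not from the original article) of how refitting silently renumbers categories, and how an already-fitted encoder reacts to unseen categories instead:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['Red', 'Green', 'Blue']))  # Blue=0, Green=1, Red=2
# Refitting on new data with a different category set changes the mapping:
print(le.fit_transform(['Red', 'Green']))          # Green=0, Red=1

# An already-fitted encoder fails loudly instead of renumbering:
le.fit(['Red', 'Green', 'Blue'])
try:
    le.transform(['Yellow'])
except ValueError as err:
    print("Unseen category:", err)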
Saving the LabelEncoder, and more generally any preprocessing transformer, is a best practice in machine learning workflows, ensuring consistency, efficiency, and reproducibility.
You can save a LabelEncoder (or any other scikit-learn transformer) using libraries like joblib or pickle. Here is an example of how you might save and load a LabelEncoder using joblib:
import joblib
from sklearn.preprocessing import LabelEncoder

# Creating and fitting the LabelEncoder
le = LabelEncoder()
data = ['Red', 'Green', 'Blue']
le.fit(data)

# Transforming data with the original LabelEncoder
transformed_data = le.transform(data)

# Saving the LabelEncoder to a file
joblib.dump(le, 'label_encoder.joblib')

# Later on, loading the LabelEncoder from the file
loaded_le = joblib.load('label_encoder.joblib')

# Transforming data with the loaded LabelEncoder
loaded_transformed_data = loaded_le.transform(data)

# Verifying that transformations are identical
transformations_equal = (transformed_data == loaded_transformed_data).all()
print("Are the transformations identical?", transformations_equal)
This approach ensures that you can reuse the exact same LabelEncoder at a later time, maintaining consistency across your data preprocessing workflow. Running this code produces the following output:
Are the transformations identical? True
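Since pickle was mentioned as an alternative, here is what the equivalent might look like with the standard library’s pickle module (a sketch, assuming the same fitted encoder as above):

import pickle
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['Red', 'Green', 'Blue'])

# Saving the fitted encoder to a file
with open('label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

# Later on, loading it back
with open('label_encoder.pkl', 'rb') as f:
    loaded_le = pickle.load(f)

print(loaded_le.transform(['Red', 'Green', 'Blue']))  # [2 1 0]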
3. Alternatives to LabelEncoder for Feature Encoding
Since LabelEncoder is intended to be used on categorical labels rather than on features, scikit-learn provides other, similar classes for handling features.
Ordinal Encoding
Here are some reasons to use OrdinalEncoder instead of LabelEncoder:
- Intended for encoding features: LabelEncoder, while it also encodes categories as integers, is primarily used for encoding target labels, not features.
- Preserving Ordinal Nature: When the categorical variables have an ordinal relationship (where the order of the categories matters), OrdinalEncoder is more suitable. It converts the categories to integers while preserving the order.
Below is an example of how to use OrdinalEncoder:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)
print("Original data:\n", df)

# Initializing OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Red', 'Green', 'Blue'], ['S', 'M', 'L', 'XL']])

# Fitting and transforming the features
encoded_data = encoder.fit_transform(df[['Color', 'Size']])

# Creating a DataFrame with the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=['Color', 'Size'])

# Adding the 'Price' column back to the DataFrame
final_df = pd.concat([encoded_df, df['Price']], axis=1)
print("\nEncoded data:\n", final_df)
In this example, the ‘Color’ and ‘Size’ columns from the df DataFrame are encoded using OrdinalEncoder. The categories parameter is used to specify the order in which categories should be encoded for each feature. The encoded data is then put back into a DataFrame along with the ‘Price’ column. Running this code you should get the following output:
Original data:
Color Size Price
0 Red S 100
1 Green M 150
2 Blue L 200
3 Green S 120
4 Red XL 180
Encoded data:
Color Size Price
0 0.0 0.0 100
1 1.0 1.0 150
2 2.0 2.0 200
3 1.0 0.0 120
4 0.0 3.0 180
After running this code, final_df will contain the ordinal encoded values for ‘Color’ and ‘Size’, preserving the order specified in the categories parameter.
One-Hot Encoding
While ordinal encoding is straightforward and efficient in terms of computational resources, it can introduce ordinality where none exists, potentially leading to misinterpretation by the machine learning model. This means that even though ‘Red’, ‘Blue’, and ‘Green’ are simply distinct categories without any inherent order, the model might interpret them as having an ordinal relationship, with ‘Green’ > ‘Blue’ > ‘Red’. To address this issue, one might use one-hot encoding or similar techniques that do not introduce unintended ordinal relationships.
In one-hot encoding, each unique category in the data is represented by its own binary indicator. For a particular data point, the position corresponding to its category is set to 1, and all other positions are set to 0. The length of the resulting vector equals the number of unique categories in the data.
For example, consider a dataset with a categorical feature ‘Color’ that has three categories: Red, Green, and Blue. Using one-hot encoding, we represent each color as a binary vector:
- Red: [1, 0, 0]
- Green: [0, 1, 0]
- Blue: [0, 0, 1]
This method eliminates any ordinal relationship that might be misinterpreted by the algorithm, as each category is equally distant from all others in the encoding space. However, it also increases the dimensionality of the dataset, which can lead to the “curse of dimensionality” in cases with a high number of categories, and it may increase the computational cost. One-hot encoding is widely used for categorical data that doesn’t have an inherent order, and it’s a critical step for many machine learning algorithms to properly understand and process categorical inputs.
Here’s an example of how to perform one-hot encoding using pandas and scikit-learn. In this example, I’ll create a pandas DataFrame and then apply one-hot encoding to a categorical column.
Using Pandas
import pandas as pd

# Creating a sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)
print("Original data:\n", df)

# Applying one-hot encoding using pandas
one_hot_encoded_df = pd.get_dummies(df, columns=['Color', 'Size'], prefix=['Color', 'Size'])
print("\nOne-hot encoded data:\n", one_hot_encoded_df)
In the above example, the pd.get_dummies function is used to convert the ‘Color’ and ‘Size’ columns into one-hot encoded vectors. The columns parameter specifies which columns to encode, and the prefix parameter assigns a prefix to the generated columns, helping to identify them easily. Running this code, you should get the following output:
Original data:
Color Size Price
0 Red S 100
1 Green M 150
2 Blue L 200
3 Green S 120
4 Red XL 180
One-hot encoded data:
Price Color_Blue Color_Green Color_Red Size_L Size_M Size_S Size_XL
0 100 False False True False False True False
1 150 False True False False True False False
2 200 True False False True False False False
3 120 False True False False False True False
4 180 False False True False False False True
Using Scikit-Learn
Using the sklearn OneHotEncoder is very similar to LabelEncoder:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Creating a sample DataFrame
data = {
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'S', 'XL'],
    'Price': [100, 150, 200, 120, 180]
}
df = pd.DataFrame(data)
print("Original data:\n", df)

# Initializing the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fitting the encoder and transforming the data
one_hot_encoded_array = encoder.fit_transform(df[['Color', 'Size']])

# The transformed data is an array, so we need to convert it back to a DataFrame
one_hot_encoded_df = pd.DataFrame(one_hot_encoded_array, columns=encoder.get_feature_names_out(['Color', 'Size']))

# Concatenating the one-hot encoded columns to the original DataFrame
final_df = pd.concat([df, one_hot_encoded_df], axis=1).drop(['Color', 'Size'], axis=1)
print("\nOne-hot encoded data:\n", final_df)
In this scikit-learn example, we first initialize the OneHotEncoder with sparse_output=False so that it returns a NumPy array instead of a sparse matrix. Note: we could use the drop='first' parameter of OneHotEncoder to apply “dummy encoding”, which drops the first category of each feature to avoid multicollinearity (a short sketch of this variant follows the output below). We then fit the encoder and transform the ‘Color’ and ‘Size’ columns. The result is concatenated back to the original DataFrame, and the original categorical columns are dropped. Running this code should return the following output:
Original data:
Color Size Price
0 Red S 100
1 Green M 150
2 Blue L 200
3 Green S 120
4 Red XL 180
One-hot encoded data:
Price Color_Blue Color_Green Color_Red Size_L Size_M Size_S Size_XL
0 100 0.0 0.0 1.0 0.0 0.0 1.0 0.0
1 150 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 200 1.0 0.0 0.0 1.0 0.0 0.0 0.0
3 120 0.0 1.0 0.0 0.0 0.0 1.0 0.0
4 180 0.0 0.0 1.0 0.0 0.0 0.0 1.0
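As mentioned above, here is a quick hedged sketch of the drop='first' variant (my own illustration; the column names follow get_feature_names_out’s naming convention):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# drop='first' removes the first (alphabetical) category of each feature
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(df[['Color']])

print(encoder.get_feature_names_out(['Color']))  # ['Color_Green' 'Color_Red']
print(encoded)  # 'Blue' is now represented by a row of all zeros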
Both approaches yield similar results, and you can choose either based on your preferences and specific requirements.