How to Convert a Scikit-Learn Dataset to a Pandas DataFrame


By reading this article, you will learn the best way to convert a scikit-learn dataset to a pandas DataFrame object in Python.

If you only want the quick 10-second version, here it is:

from sklearn.datasets import load_digits
import pandas as pd

dataset = load_digits() # Load a dataset of your choosing

# Convert to DataFrame
df = pd.DataFrame(dataset.data, columns=dataset.feature_names) # Convert data matrix to a DataFrame object
df['target'] = dataset.target # Append target (or labels) to the DataFrame

Scikit-learn, often referred to as sklearn, is an open-source machine learning library in Python that provides simple and efficient tools for data mining and data analysis. Scikit-learn offers a wide variety of supervised and unsupervised learning algorithms through a consistent interface, making it a popular choice among data scientists, researchers, and students alike. Its user-friendly API, comprehensive documentation, and active community support have made it one of the go-to libraries for developing machine learning models in Python.

Pandas is an open-source Python library designed for data manipulation and analysis. It introduces two main data structures: “Series” for one-dimensional data and “DataFrame” for two-dimensional data, accommodating various data types. With capabilities like data alignment, handling missing values, merging datasets, and robust IO tools, Pandas has become a cornerstone for data scientists and analysts in Python.

In the world of Python data science, Scikit-learn stands out for machine learning, while Pandas shines for data manipulation and analysis. Converting a Scikit-learn dataset to a Pandas DataFrame lets you use the best of both worlds. With a DataFrame, you can explore the data with methods such as head() and describe(), filter and transform it by column name, pass it directly to visualization libraries, and read it in a labeled, tabular format. The DataFrame structure is also widely supported by other Python tools and simplifies feature engineering.

So, how can it be done?

Scikit-learn datasets typically come as Bunch objects, which are similar to dictionaries. You can convert these into Pandas DataFrames quite easily. Let’s use the digits dataset as an example.
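To make the dictionary comparison concrete, here is a minimal sketch of how a Bunch can be accessed. The variable names same_object, n_samples and n_features are only illustrative, and the exact set of keys may vary between datasets and scikit-learn versions.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# A Bunch supports both dictionary-style and attribute-style access
print(sorted(digits.keys()))
same_object = digits["data"] is digits.data

# The data matrix has one row per sample and one column per feature
n_samples, n_features = digits.data.shape
print(same_object, n_samples, n_features)
```

Either access style returns the same underlying array, so which one to use is mostly a matter of taste.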

Convert Data from Bunch Object to DataFrame

You can directly convert the data and target values to a DataFrame. In this example, I import the necessary libraries, load the digits dataset by calling the load_digits() function, and save the result to the digits variable. The type of this variable is sklearn.utils._bunch.Bunch, which has attributes such as data, target, frame, target_names, DESCR, and feature_names (the exact set depends on the dataset and the Scikit-learn version). In this example, we are mainly interested in the data, target, and feature_names attributes, which return the data matrix for the dataset, the target values (labels) for the dataset, and a list of column names for the data, respectively.

from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Print the first rows of the dataset to get an idea what the data looks like
print("Feature names of the digits dataset")
print(digits.feature_names)
print("\nFirst 2 rows of the dataset:\n")
print(digits.data[:2])

# Convert to DataFrame
df_data = pd.DataFrame(digits.data, columns=digits.feature_names)
df_target = pd.DataFrame(digits.target, columns=['target'])

# Combine data and target
df = pd.concat([df_data, df_target], axis=1)

# Print the first rows of the dataframe to check that the conversion worked
print("\nFirst rows of the DataFrame:\n", df.head())

Explanation of this code:

digits = load_digits():
- This line calls the load_digits function to load the ‘digits’ dataset and assigns it to the variable digits. The dataset is stored as a Bunch object, which is similar to a dictionary.
df_data = pd.DataFrame(digits.data, columns=digits.feature_names):
- This line creates a Pandas DataFrame from the data in the digits dataset. The data is taken from digits.data, and the column names are set using the feature names from the dataset.
df_target = pd.DataFrame(digits.target, columns=['target']):
- This creates another DataFrame specifically for the target values (i.e., the labels or actual digit values) of the dataset. The column in this DataFrame is named ‘target’.
df = pd.concat([df_data, df_target], axis=1):
- This combines the data and target DataFrames side by side (along axis=1, which refers to columns). The resulting df DataFrame has the data and target values in one table, where the last column will be the target values.

The provided code essentially loads the ‘digits’ dataset, prints some basic information to understand its structure, and then converts the dataset into a single Pandas DataFrame for easier manipulation and analysis.

Running this code should produce the following output:

Feature names of the digits dataset
['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6', 'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2', 'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6', 'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6', 'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2', 'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6', 'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2', 'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6', 'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2', 'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6', 'pixel_7_7']

First 2 rows of the dataset:

[[ 0.  0.  5. 13.  9.  1.  0.  0.  0.  0. 13. 15. 10. 15.  5.  0.  0.  3.
  15.  2.  0. 11.  8.  0.  0.  4. 12.  0.  0.  8.  8.  0.  0.  5.  8.  0.
   0.  9.  8.  0.  0.  4. 11.  0.  1. 12.  7.  0.  0.  2. 14.  5. 10. 12.
   0.  0.  0.  0.  6. 13. 10.  0.  0.  0.]
 [ 0.  0.  0. 12. 13.  5.  0.  0.  0.  0.  0. 11. 16.  9.  0.  0.  0.  0.
   3. 15. 16.  6.  0.  0.  0.  7. 15. 16. 16.  2.  0.  0.  0.  0.  1. 16.
  16.  3.  0.  0.  0.  0.  1. 16. 16.  6.  0.  0.  0.  0.  1. 16. 16.  6.
   0.  0.  0.  0.  0. 11. 16. 10.  0.  0.]]

First rows of the DataFrame:
    pixel_0_0  pixel_0_1  pixel_0_2  pixel_0_3  pixel_0_4  ...  pixel_7_4  pixel_7_5  pixel_7_6  pixel_7_7  target
0        0.0        0.0        5.0       13.0        9.0  ...       10.0        0.0        0.0        0.0       0
1        0.0        0.0        0.0       12.0       13.0  ...       16.0       10.0        0.0        0.0       1
2        0.0        0.0        0.0        4.0       15.0  ...       11.0       16.0        9.0        0.0       2
3        0.0        0.0        7.0       15.0       13.0  ...       13.0        9.0        0.0        0.0       3
4        0.0        0.0        0.0        1.0       11.0  ...       16.0        4.0        0.0        0.0       4

[5 rows x 65 columns]
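Beyond eyeballing the printed rows, the conversion can also be checked programmatically. The following is a small sketch that rebuilds the same DataFrame and compares it against the original arrays; the names features_match and target_match are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
df = pd.concat(
    [
        pd.DataFrame(digits.data, columns=digits.feature_names),
        pd.DataFrame(digits.target, columns=["target"]),
    ],
    axis=1,
)

# One row per sample, one column per feature plus the target column
print(df.shape)

# The feature columns and the target column should match the original arrays
features_match = np.array_equal(df[digits.feature_names].to_numpy(), digits.data)
target_match = np.array_equal(df["target"].to_numpy(), digits.target)
print(features_match, target_match)

# An example of what the DataFrame makes easy: count the samples per digit
print(df["target"].value_counts().sort_index())
```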

Simplification for the Conversion Code

The above code can be simplified by dropping the print() statements and replacing the pd.concat() call with a direct column assignment, so that only one DataFrame is created. Here is the code to do that:

from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Convert to DataFrame
df = pd.DataFrame(digits.data, columns=digits.feature_names) # This is familiar from the earlier example
df['target'] = digits.target # This appends a new column to the dataframe named target and puts the digits.target values into this column
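As an alternative sketch, the loader itself can build the DataFrame. This assumes a Scikit-learn version in which load_digits() accepts the as_frame parameter (to my knowledge, 0.23 or later) and that Pandas is installed; the names X and y are just conventional choices.

```python
from sklearn.datasets import load_digits

# as_frame=True asks scikit-learn to build the pandas objects itself
digits = load_digits(as_frame=True)

df = digits.frame  # features plus a 'target' column in a single DataFrame
X = digits.data    # features only, as a DataFrame
y = digits.target  # labels only, as a Series

print(df.shape, X.shape, y.shape)
```

The manual conversion is still useful when a loader or an older version does not offer this option, or when you want control over the column names.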

Further Simplification With No Feature Names

If you have no need for the feature names, the conversion from a Bunch object to a DataFrame can be further simplified. Pandas then labels the feature columns with integers starting from 0:

df = pd.DataFrame(digits.data)
df['target'] = digits.target
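The sketch below shows what working with such an unnamed DataFrame might look like, and how names could be attached afterwards if they turn out to be useful. The names first_pixel and df_named are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
df = pd.DataFrame(digits.data)
df["target"] = digits.target

# Without explicit names, the feature columns are labeled 0, 1, 2, ...
print(list(df.columns[:3]), df.columns[-1])

# Columns are then selected by integer label rather than by name
first_pixel = df[0]

# Names can still be attached later; the 'target' column is already
# named, so the mapping below leaves it unchanged
df_named = df.rename(columns=dict(enumerate(digits.feature_names)))
print(list(df_named.columns[:3]))
```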