
By reading this article, you will learn the best way to convert a scikit-learn dataset to a pandas DataFrame object in Python.
If you only want the quick 10-second version, here it is:
from sklearn.datasets import load_digits
import pandas as pd

dataset = load_digits()  # Load a dataset of your choosing

# Convert the data matrix to a DataFrame object
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['target'] = dataset.target  # Append target (or labels) to the DataFrame
Scikit-learn, often referred to as sklearn, is an open-source machine learning library in Python that provides simple and efficient tools for data mining and data analysis. Scikit-learn offers a wide variety of supervised and unsupervised learning algorithms through a consistent interface, making it a popular choice among data scientists, researchers, and students alike. Its user-friendly API, comprehensive documentation, and active community support have made it one of the go-to libraries for developing machine learning models in Python.
Pandas is an open-source Python library designed for data manipulation and analysis. It introduces two main data structures: “Series” for one-dimensional data and “DataFrame” for two-dimensional data, accommodating various data types. With capabilities like data alignment, handling missing values, merging datasets, and robust IO tools, Pandas has become a cornerstone for data scientists and analysts in Python.
In the world of Python data science, Scikit-learn stands out for machine learning, while Pandas shines for data manipulation and analysis. That is why you might want to convert a Scikit-learn dataset to a Pandas DataFrame: it lets you use the best of both. With a Pandas DataFrame, you get labeled columns, convenient tools for exploring and manipulating the data, straightforward integration with visualization libraries, and a more human-readable format. The DataFrame structure is also widely supported by other Python tools and simplifies feature engineering.
So, how can it be done?
Scikit-learn datasets typically come as Bunch objects, which are similar to dictionaries. You can convert these into Pandas DataFrames quite easily. Let’s use the digits dataset as an example.
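To make the "similar to dictionaries" remark concrete, here is a minimal sketch showing that a Bunch can be read both by attribute and by key. It assumes scikit-learn is installed; the variable names available_keys and same_object are only for illustration.

```python
from sklearn.datasets import load_digits

# Load the digits dataset; the return value is a Bunch object
digits = load_digits()

# A Bunch supports dictionary-style methods such as keys()
available_keys = list(digits.keys())
print(available_keys)

# The same array is reachable by attribute and by key
same_object = digits.data is digits["data"]
print(same_object)

# The data matrix has one row per image and one column per pixel
print(digits.data.shape)
```

Attribute access (digits.data) is what the rest of this article uses, but key access (digits["data"]) should behave the same way.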
Convert Data from Bunch Object to DataFrame
You can directly convert the data and target values to a DataFrame. In this example, I import the necessary libraries, load the digits dataset by calling the load_digits() function, and save it to the digits variable. The type of this variable is sklearn.utils._bunch.Bunch (in recent scikit-learn versions), and for the digits dataset it has the following attributes: data, target, frame, feature_names, target_names, images, DESCR. In this example, we are mainly interested in the data, target, and feature_names attributes, which return the data matrix for the dataset, the target values (labels) for the dataset, and a list of column names for the data, respectively.
from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Print the first rows of the dataset to get an idea what the data looks like
print("Feature names of the digits dataset")
print(digits.feature_names)
print("\nFirst 2 rows of the dataset:\n")
print(digits.data[:2])

# Convert to DataFrame
df_data = pd.DataFrame(digits.data, columns=digits.feature_names)
df_target = pd.DataFrame(digits.target, columns=['target'])

# Combine data and target
df = pd.concat([df_data, df_target], axis=1)

# Print the first rows of the dataframe to check that the conversion worked
print("\nFirst rows of the DataFrame:\n", df.head())
Explanation of this code:
- digits = load_digits(): This line calls the load_digits function to load the ‘digits’ dataset and assigns it to the variable digits. The dataset is stored as a Bunch object, which is similar to a dictionary.
- df_data = pd.DataFrame(digits.data, columns=digits.feature_names): This line creates a Pandas DataFrame from the data in the digits dataset. The data is taken from digits.data, and the column names are set using the feature names from the dataset.
- df_target = pd.DataFrame(digits.target, columns=['target']): This creates another DataFrame specifically for the target values (i.e., the labels or actual digit values) of the dataset. The column in this DataFrame is named ‘target’.
- df = pd.concat([df_data, df_target], axis=1): This combines the data and target DataFrames side by side (along axis=1, which refers to columns). The resulting df DataFrame has the data and target values in one table, where the last column holds the target values.
The provided code essentially loads the ‘digits’ dataset, prints some basic information to understand its structure, and then converts the dataset into a single Pandas DataFrame for easier manipulation and analysis.
Running this code should produce the following output:
Feature names of the digits dataset
['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6', 'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2', 'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6', 'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6', 'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2', 'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6', 'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2', 'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6', 'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2', 'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6', 'pixel_7_7']
First 2 rows of the dataset:
[[ 0. 0. 5. 13. 9. 1. 0. 0. 0. 0. 13. 15. 10. 15. 5. 0. 0. 3.
15. 2. 0. 11. 8. 0. 0. 4. 12. 0. 0. 8. 8. 0. 0. 5. 8. 0.
0. 9. 8. 0. 0. 4. 11. 0. 1. 12. 7. 0. 0. 2. 14. 5. 10. 12.
0. 0. 0. 0. 6. 13. 10. 0. 0. 0.]
[ 0. 0. 0. 12. 13. 5. 0. 0. 0. 0. 0. 11. 16. 9. 0. 0. 0. 0.
3. 15. 16. 6. 0. 0. 0. 7. 15. 16. 16. 2. 0. 0. 0. 0. 1. 16.
16. 3. 0. 0. 0. 0. 1. 16. 16. 6. 0. 0. 0. 0. 1. 16. 16. 6.
0. 0. 0. 0. 0. 11. 16. 10. 0. 0.]]
First rows of the DataFrame:
   pixel_0_0  pixel_0_1  pixel_0_2  pixel_0_3  pixel_0_4  ...  pixel_7_4  pixel_7_5  pixel_7_6  pixel_7_7  target
0        0.0        0.0        5.0       13.0        9.0  ...       10.0        0.0        0.0        0.0       0
1        0.0        0.0        0.0       12.0       13.0  ...       16.0       10.0        0.0        0.0       1
2        0.0        0.0        0.0        4.0       15.0  ...       11.0       16.0        9.0        0.0       2
3        0.0        0.0        7.0       15.0       13.0  ...       13.0        9.0        0.0        0.0       3
4        0.0        0.0        0.0        1.0       11.0  ...       16.0        4.0        0.0        0.0       4

[5 rows x 65 columns]
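Beyond eyeballing the printed rows, you may want to confirm programmatically that nothing was lost in the conversion. The following is a minimal sketch of such a check, not a required step; the names shape_ok, target_ok, and no_missing are hypothetical helpers introduced here for illustration.

```python
from sklearn.datasets import load_digits
import pandas as pd

digits = load_digits()

# Rebuild the combined DataFrame in the same way as the conversion example
df_data = pd.DataFrame(digits.data, columns=digits.feature_names)
df_target = pd.DataFrame(digits.target, columns=["target"])
df = pd.concat([df_data, df_target], axis=1)

# One column per feature plus one extra column for the target
n_rows, n_features = digits.data.shape
shape_ok = df.shape == (n_rows, n_features + 1)

# The target column should hold exactly the original labels
target_ok = (df["target"].to_numpy() == digits.target).all()

# Both frames share the same default index, so no missing values should appear
no_missing = not df.isna().any().any()

print(shape_ok, target_ok, no_missing)
```

If any of the three values is False, the two DataFrames were probably built with different indexes or lengths before being concatenated.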
Simplification for the Conversion Code
The above code can be simplified a little by dropping the print() statements and removing the pd.concat() line, and instead adding the target column directly to the DataFrame. Here is the code to do that:
from sklearn.datasets import load_digits
import pandas as pd

# Load the digits dataset
digits = load_digits()

# Convert to DataFrame (this is familiar from the earlier example)
df = pd.DataFrame(digits.data, columns=digits.feature_names)

# Append a new column named target and fill it with the digits.target values
df['target'] = digits.target
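As a possible shortcut, the frame attribute listed earlier can hold a ready-made DataFrame. The sketch below assumes a scikit-learn version whose load_digits() accepts the as_frame parameter (scikit-learn 0.23 or later) and that pandas is installed. On older versions the manual conversion above remains the way to go.

```python
from sklearn.datasets import load_digits

# Ask scikit-learn to build the pandas objects itself
digits = load_digits(as_frame=True)

# With as_frame=True, frame holds the features and the target in one DataFrame
df = digits.frame

print(type(df))
print(df.shape)
print(list(df.columns[-3:]))
```

The manual approach is still worth knowing, since it works for any arrays you have at hand, not only for loaders that offer as_frame.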
Further Simplification With No Feature Names
If you have no need for the feature names, the conversion from a Bunch object to a DataFrame can be further simplified:
df = pd.DataFrame(digits.data)
df['target'] = digits.target
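One thing to be aware of with this shorter version: without the columns argument, pandas labels the feature columns with integers starting at 0, so the DataFrame ends up with integer labels for the pixels and a string label for the target. A minimal sketch to see this (first_label, last_label, and first_pixel are illustrative names only):

```python
from sklearn.datasets import load_digits
import pandas as pd

digits = load_digits()

# No column names given, so pandas falls back to integer labels
df = pd.DataFrame(digits.data)
df["target"] = digits.target

# Feature columns are addressed by integer, the target by string
first_label = df.columns[0]
last_label = df.columns[-1]
print(first_label, last_label)

# Selecting the first pixel column uses the integer label 0
first_pixel = df[0]
print(first_pixel.head())
```

Mixed integer and string labels work, but if you plan to refer to individual pixels by name later, keeping the feature names is probably the more readable choice.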