Scikit-learn and data frames#
The sklearndf package enhances scikit-learn with advanced support for data frames.
It addresses a common issue with scikit-learn: transformers return numpy arrays as output even when the input is a data frame. To inspect a model, however, it is essential to keep track of the feature names.
sklearndf enhances scikit-learn’s estimators to:
return data frames as results of transformations, preserving feature names as the column index
add estimator properties that trace each output feature back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoders), and for pipelines that include such transformers
Using sklearndf is very simple: append DF to the end of a scikit-learn class name, and you get enhanced data frame support, as the brief sketch below illustrates.
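For instance, this minimal sketch (using a small toy data frame, not part of the housing example below) contrasts a native scikit-learn imputer with its DF-augmented sibling; the DF variant returns a data frame with the original column names preserved:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearndf.transformation import SimpleImputerDF

toy_df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

SimpleImputer(strategy="mean").fit_transform(toy_df)    # numpy array; column names are lost
SimpleImputerDF(strategy="mean").fit_transform(toy_df)  # data frame; column names are preserved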
[2]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearndf.classification import RandomForestClassifierDF
from sklearndf.pipeline import PipelineDF, RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
from sklearndf.regression.extra import LGBMRegressorDF
from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF
from sklearndf.transformation.extra import BorutaDF
We load our data:
[3]:
housing_features_df: pd.DataFrame
housing_target_sr: pd.Series
housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)
housing_features_df = housing_features_df.drop(["Id", "YrSold", "MoSold", "MSSubClass", "MiscVal"], axis=1)
The data set includes categorical features, e.g., garage types:
[4]:
housing_features_df["GarageType"].unique().tolist()
[4]:
['Attchd', 'Detchd', 'BuiltIn', 'CarPort', None, 'Basment', '2Types']
Contrasting a scikit-learn and sklearndf pipeline#
Let us build a preprocessing pipeline which:
for categorical features, fills missing values with a constant placeholder string ('<unknown>') and then one-hot encodes
for numerical features, fills missing values with the median
A scikit-learn pipeline#
We first build the preprocessing pipeline with native scikit-learn transformers. This takes only a few lines of code; however, it does not allow us to keep track of the feature names.
[5]:
preprocessing_numeric = SimpleImputer(strategy="median", add_indicator=True)
preprocessing_categorical = Pipeline(
steps=[
('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='<unknown>')),
('one-hot', OneHotEncoder(sparse=False))
]
)
preprocessing = ColumnTransformer(
transformers=[
('num', preprocessing_numeric, make_column_selector(dtype_include=np.number)),
('cat', preprocessing_categorical, make_column_selector(dtype_exclude=np.number)),
]
)
[6]:
preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)
[6]:
array([[6.500e+01, 8.450e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00],
[8.000e+01, 9.600e+03, 6.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00],
[6.800e+01, 1.125e+04, 7.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00],
...,
[6.600e+01, 9.042e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00],
[6.800e+01, 9.717e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00],
[7.500e+01, 9.937e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
0.000e+00]])
The strength of sklearndf is that it maintains scikit-learn’s conventions and expressiveness while also preserving data frames, hence keeping track of the feature names.
A sklearndf pipeline#
The convention in sklearndf is to append DF to the end of each corresponding scikit-learn class. For instance, to reproduce the above example, we write:
[7]:
preprocessing_numeric_df = SimpleImputerDF(strategy="median", add_indicator=True)
preprocessing_categorical_df = PipelineDF(
steps=[
('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),
('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
]
)
preprocessing_df = ColumnTransformerDF(
transformers=[
('num', preprocessing_numeric_df, make_column_selector(dtype_include=np.number)),
('cat', preprocessing_categorical_df, make_column_selector(dtype_exclude=np.number)),
],
)
[8]:
transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)
transformed_df.head().T
[8]:
feature_out | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
num__LotFrontage | 65.0 | 80.0 | 68.0 | 60.0 | 84.0 |
num__LotArea | 8450.0 | 9600.0 | 11250.0 | 9550.0 | 14260.0 |
num__OverallQual | 7.0 | 6.0 | 7.0 | 7.0 | 8.0 |
num__OverallCond | 5.0 | 8.0 | 5.0 | 5.0 | 5.0 |
num__YearBuilt | 2003.0 | 1976.0 | 2001.0 | 1915.0 | 2000.0 |
... | ... | ... | ... | ... | ... |
cat__SaleCondition_AdjLand | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__SaleCondition_Alloca | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__SaleCondition_Family | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__SaleCondition_Normal | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
cat__SaleCondition_Partial | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
303 rows × 5 columns
The feature_names_original_ attribute returns a series, mapping the output columns (the series’ index) to the input columns (the series’ values):
[9]:
preprocessing_df.feature_names_original_.to_frame()
[9]:
feature_out | feature_in
---|---
num__LotFrontage | LotFrontage |
num__LotArea | LotArea |
num__OverallQual | OverallQual |
num__OverallCond | OverallCond |
num__YearBuilt | YearBuilt |
... | ... |
cat__SaleCondition_AdjLand | SaleCondition |
cat__SaleCondition_Alloca | SaleCondition |
cat__SaleCondition_Family | SaleCondition |
cat__SaleCondition_Normal | SaleCondition |
cat__SaleCondition_Partial | SaleCondition |
303 rows × 1 columns
You can therefore easily select all output features generated from a given input feature:
[10]:
garage_type_derivatives = preprocessing_df.feature_names_original_ == "GarageType"
transformed_df.loc[:, garage_type_derivatives].head().T
[10]:
feature_out | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
cat__GarageType_2Types | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__GarageType_<unknown> | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__GarageType_Attchd | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 |
cat__GarageType_Basment | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__GarageType_BuiltIn | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__GarageType_CarPort | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
cat__GarageType_Detchd | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
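Since feature_names_original_ is a regular pandas series, derivatives of several input features can be selected in one go using isin. A small sketch, assuming the data set also contains a GarageCond column:

# boolean mask over output features derived from either of the two input features
garage_derivatives = preprocessing_df.feature_names_original_.isin(["GarageType", "GarageCond"])
transformed_df.loc[:, garage_derivatives].head().T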
Supervised learners#
Regressors#
As for transformers, scikit-learn regressors and classifiers have a sklearndf sibling obtained by appending DF to the class name, and the API remains the same. The result of any predict and decision function will be returned as a pandas series (single output) or data frame (class probabilities or multi-output).
For a random forest regressor we get:
[11]:
# a simplified feature set (we will use a pipeline for more sophisticated pre-processing further down)
numerical_features_df = housing_features_df.select_dtypes(np.number).fillna(0)
For demonstration purposes, we create a simple train-test split:
[12]:
df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(
numerical_features_df,
housing_target_sr,
random_state=42
)
Then we create, fit and score a DF-augmented random forest regressor:
[13]:
random_forest_regressor_df = RandomForestRegressorDF(
n_estimators=100,
max_depth=5,
random_state=42,
n_jobs=-3
)
random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)
random_forest_regressor_df.score(X=df_numerical_test, y=y_test)
[13]:
0.8638857401761126
The DF-augmented regressor returns predictions as a series (we convert the series to a data frame using method .to_frame() for better rendering in the Jupyter notebook):
[14]:
random_forest_regressor_df.predict(df_numerical_test.head()).to_frame()
[14]:
 | prediction
---|---
892 | 138678.817934 |
1105 | 305008.808241 |
413 | 133420.810780 |
522 | 171533.659061 |
1036 | 307214.384636 |
All methods and attributes of the native random forest regressor are also available in the DF-augmented regressor, e.g., .get_params() and .set_params():
[15]:
random_forest_regressor_df.get_params()
[15]:
{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': 5,
'max_features': 'auto',
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': -3,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}
[16]:
random_forest_regressor_df.set_params(max_depth=7)
[16]:
RandomForestRegressorDF(max_depth=7, n_jobs=-3, random_state=42)
The underlying scikit-learn regressor is available through the native_estimator attribute:
[17]:
random_forest_regressor_df.native_estimator
[17]:
RandomForestRegressor(max_depth=7, n_jobs=-3, random_state=42)
Property is_fitted tells whether the regressor is fitted, and – for fitted estimators – property feature_names_in_ returns the names of the ingoing features as a pandas index:
[18]:
random_forest_regressor_df.is_fitted
[18]:
True
[19]:
random_forest_regressor_df.feature_names_in_
[19]:
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'],
dtype='object', name='feature_in')
Classifiers#
Classifiers follow a similar logic.
To get started, we bin the house prices into three classes for multi-class classification: below 100k, at least 100k but below 200k, and at least 200k.
[20]:
y_classes = housing_target_sr.apply(
lambda x:
">=200k" if x >= 200000
else ">=100k" if x >= 100000
else "<100k"
).astype("category")
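The same three-way binning can also be expressed with pandas.cut; a brief, equivalent sketch (the bin edges and labels mirror the lambda above):

y_classes = pd.cut(
    housing_target_sr,
    bins=[-np.inf, 100_000, 200_000, np.inf],
    labels=["<100k", ">=100k", ">=200k"],
    right=False,  # left-inclusive bins, matching the >= comparisons above
)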
We create a train-test split:
[21]:
df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(
numerical_features_df,
y_classes,
random_state=42
)
We create, fit and score a DF-augmented classifier:
[22]:
random_forest_classifier_df = RandomForestClassifierDF(
n_estimators=100,
max_depth=5,
random_state=42,
n_jobs=-3
)
random_forest_classifier_df.fit(df_numerical_train, y_classification_train)
random_forest_classifier_df.score(df_numerical_test, y_classification_test)
[22]:
0.8767123287671232
The DF-augmented classifier returns predicted class labels as a series (once more, we invoke .to_frame() for better rendering in Jupyter):
[23]:
random_forest_classifier_df.predict(df_numerical_test.head()).to_frame()
[23]:
 | prediction
---|---
892 | >=100k |
1105 | >=200k |
413 | >=100k |
522 | >=100k |
1036 | >=200k |
Class probabilities are returned as a data frame, using class labels as the column names:
[24]:
random_forest_classifier_df.predict_proba(df_numerical_test.head())
[24]:
 | <100k | >=100k | >=200k
---|---|---|---
892 | 0.056555 | 0.905298 | 0.038148 |
1105 | 0.001141 | 0.091114 | 0.907745 |
413 | 0.104472 | 0.860927 | 0.034602 |
522 | 0.047416 | 0.814917 | 0.137667 |
1036 | 0.000812 | 0.081764 | 0.917424 |
[25]:
random_forest_classifier_df.predict_log_proba(df_numerical_test.head())
[25]:
 | <100k | >=100k | >=200k
---|---|---|---
892 | -2.872548 | -0.099491 | -3.266291 |
1105 | -6.775823 | -2.395643 | -0.096792 |
413 | -2.258840 | -0.149746 | -3.363852 |
522 | -3.048800 | -0.204669 | -1.982916 |
1036 | -7.116299 | -2.503917 | -0.086185 |
Pipeline#
We can combine the above steps to build a full predictive pipeline. sklearndf provides two useful, specialised pipeline objects for this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general scikit-learn pipeline idiom. We demonstrate RegressorPipelineDF here; a brief ClassifierPipelineDF sketch follows the example.
[26]:
pipeline_df = RegressorPipelineDF(
preprocessing=preprocessing_df,
regressor=RandomForestRegressorDF(
n_estimators=1000,
max_features=2/3,
max_depth=7,
random_state=42,
n_jobs=-3
)
)
[27]:
df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)
pipeline_df.fit(df_train, y_train)
pipeline_df.score(df_test, y_test)
[27]:
0.8883657928153087
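ClassifierPipelineDF works analogously, with a classifier as the final step. A minimal sketch, assuming the constructor mirrors RegressorPipelineDF with a classifier argument, reusing the class labels defined earlier (ClassifierPipelineDF would need to be imported in addition to the imports above):

from sklearndf.pipeline import ClassifierPipelineDF

classifier_pipeline_df = ClassifierPipelineDF(
    preprocessing=preprocessing_df,
    classifier=RandomForestClassifierDF(n_estimators=500, max_depth=7, random_state=42, n_jobs=-3)
)
classifier_pipeline_df.fit(df_train, y_classes.loc[df_train.index])
classifier_pipeline_df.score(df_test, y_classes.loc[df_test.index])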
Extras#
sklearndf also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire and follow the scikit-learn idiom. These are provided in the .extra modules:
sklearndf.regression.extra.LGBMRegressorDF
sklearndf.regression.extra.XGBRegressorDF
sklearndf.classification.extra.LGBMClassifierDF
sklearndf.classification.extra.XGBClassifierDF
sklearndf.transformation.extra.BorutaDF
In the interest of lean package dependencies, sklearndf does not require the packages lightgbm, xgboost, or boruta_py to be installed by default; however, the corresponding DF classes are only available if their associated third-party package is installed.
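Since these classes are only exposed when their backing package is available, a defensive import can be useful. A sketch (not part of the tutorial code), assuming an ImportError is raised when the backing package is missing:

try:
    from sklearndf.regression.extra import LGBMRegressorDF
except ImportError:
    LGBMRegressorDF = None  # lightgbm is not installed, so the DF class is unavailable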
LightGBM regressor#
[28]:
lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)
lgbm_df.fit(df_numerical_train, y_train)
lgbm_df.predict(df_numerical_test.head()).to_frame()
[28]:
 | prediction
---|---
892 | 139447.461334 |
1105 | 288031.725712 |
413 | 124917.506705 |
522 | 167320.406141 |
1036 | 315868.109901 |
Boruta#
Boruta is a clever feature selection method that eliminates all features whose predictive power is no better than random noise.
The sklearndf.transformation.extra.BorutaDF transformer provides easy access to this powerful method. It is built on a tree-based learner, usually a random forest.
For the random forest we rely on default parameters, but set the maximum tree depth to 5 (for Boruta, a depth between 3 and 7 is highly recommended, depending on the number of features and the expected complexity of the feature/target interactions). The number of trees is managed automatically by the Boruta feature selector (argument n_estimators="auto").
[29]:
boruta_pipeline = PipelineDF(
steps=[
('preprocess', preprocessing_df),
('boruta', BorutaDF(
estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3),
n_estimators="auto",
random_state=42,
verbose=1
)),
]
)
[30]:
boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)
Iteration: 1 / 100
Iteration: 2 / 100
...
Iteration: 99 / 100
BorutaPy finished running.
Iteration: 100 / 100
Confirmed: 16
Tentative: 0
Rejected: 286
[30]:
PipelineDF(steps=[('preprocess', ColumnTransformerDF(transformers=[('num', SimpleImputerDF(add_indicator=True, strategy='median'), <sklearn.compose._column_transformer.make_column_selector object at 0x13958f2b0>), ('cat', PipelineDF(steps=[('imputer', SimpleImputerDF(fill_value='<unknown>', missing_values=None, strategy='constant')), ('one-hot', OneHotEncoderDF(handle_unknown='ignore', sparse=False))]), <sklearn.compose._column_transformer.make_column_selector object at 0x13958fd60>)])), ('boruta', BorutaDF(estimator=RandomForestRegressor(max_depth=5, n_estimators=116, n_jobs=-3,
random_state=RandomState(MT19937) at 0x13B34BD40), n_estimators='auto', random_state=RandomState(MT19937) at 0x13B34BD40, verbose=1))])
Boruta is implemented as a scikit-learn transformer; its output features are all features that passed the Boruta test.
[31]:
boruta_pipeline.feature_names_out_.to_list()
[31]:
['num__LotFrontage',
'num__LotArea',
'num__OverallQual',
'num__YearBuilt',
'num__YearRemodAdd',
'num__MasVnrArea',
'num__BsmtFinSF1',
'num__TotalBsmtSF',
'num__1stFlrSF',
'num__2ndFlrSF',
'num__GrLivArea',
'num__FullBath',
'num__TotRmsAbvGrd',
'num__GarageCars',
'num__GarageArea',
'cat__BsmtQual_Ex']
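Because the fitted pipeline is itself a transformer, the selected features can also be obtained directly as a data frame. A short sketch:

# transform the full feature set; only the 16 confirmed features remain as columns
selected_df = boruta_pipeline.transform(housing_features_df)
selected_df.shape  # (number of samples, 16)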
sklearndf allows us to trace outgoing features back to the original features from which they were derived, using the sklearndf.TransformerDF.feature_names_original_ property. This is useful here, as we want to know which features of the original data set to select before passing them into the pipeline.
In our example, feature BsmtQual_Ex is a derivative of feature BsmtQual, obtained through one-hot encoding:
[32]:
boruta_pipeline.feature_names_original_.to_frame()
[32]:
feature_out | feature_in
---|---
num__LotFrontage | LotFrontage |
num__LotArea | LotArea |
num__OverallQual | OverallQual |
num__YearBuilt | YearBuilt |
num__YearRemodAdd | YearRemodAdd |
num__MasVnrArea | MasVnrArea |
num__BsmtFinSF1 | BsmtFinSF1 |
num__TotalBsmtSF | TotalBsmtSF |
num__1stFlrSF | 1stFlrSF |
num__2ndFlrSF | 2ndFlrSF |
num__GrLivArea | GrLivArea |
num__FullBath | FullBath |
num__TotRmsAbvGrd | TotRmsAbvGrd |
num__GarageCars | GarageCars |
num__GarageArea | GarageArea |
cat__BsmtQual_Ex | BsmtQual |
So, to obtain all features we want to select from the original data set, we determine the unique ingoing features from the original feature mapping:
[33]:
features_relevant = boruta_pipeline.feature_names_original_.unique()
features_relevant.tolist()
[33]:
['LotFrontage',
'LotArea',
'OverallQual',
'YearBuilt',
'YearRemodAdd',
'MasVnrArea',
'BsmtFinSF1',
'TotalBsmtSF',
'1stFlrSF',
'2ndFlrSF',
'GrLivArea',
'FullBath',
'TotRmsAbvGrd',
'GarageCars',
'GarageArea',
'BsmtQual']
When we compare model performance for the reduced set of relevant features with model performance for the full feature set, we notice that performance is almost the same despite a significantly lower feature count (bearing in mind that, for simplicity, we are validating on a single test set; in a real-world example we would use cross-validation for more reliable model scores).
[34]:
pipeline_df.fit(
# fit the model using the full feature set
df_train, y_train
)
score_all_features = pipeline_df.score(
# score the model using the full feature set
df_test, y_test
)
print(f"R2={score_all_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features")
R2=0.888 with model fitted on 299 features
[35]:
pipeline_df.fit(
# fit the model using the relevant feature set
df_train.loc[:, features_relevant], y_train
)
score_relevant_features = pipeline_df.score(
# score the model using the relevant feature set
df_test.loc[:, features_relevant], y_test
)
print(f"R2={score_relevant_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features")
R2=0.893 with model fitted on 22 features
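As noted above, a single train-test split gives only a rough performance estimate. Since sklearndf estimators follow the scikit-learn API, the usual cross-validation utilities can be applied directly; a sketch along these lines (assuming 5-fold cross-validation is appropriate here):

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    pipeline_df,
    housing_features_df.loc[:, features_relevant],
    housing_target_sr,
    cv=5,
)
print(f"mean R2={cv_scores.mean():.3f} (+/- {cv_scores.std():.3f}) across 5 folds")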