Scikit-learn and data frames#

The sklearndf package enhances scikit-learn for advanced support of data frames.

It addresses a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. Yet to inspect a model, it is essential to keep track of the feature names.

sklearndf enhances scikit-learn’s estimators to:

  • return data frames as results of transformations, preserving feature names as the column index

  • add estimator properties that trace each output feature back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers

Using sklearndf is very simple: append DF to the end of a scikit-learn class name, and you will get the variant with enhanced data frame support.
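
For instance, here is a minimal sketch with a small, made-up data frame (rather than the housing data introduced below), contrasting scikit-learn's SimpleImputer with its DF sibling:

import pandas as pd

from sklearn.impute import SimpleImputer
from sklearndf.transformation import SimpleImputerDF

toy_df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})

SimpleImputer(strategy="mean").fit_transform(toy_df)    # numpy array; column names are lost
SimpleImputerDF(strategy="mean").fit_transform(toy_df)  # data frame with columns 'a' and 'b' preserved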

[2]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.datasets import fetch_openml
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from sklearndf.classification import RandomForestClassifierDF
from sklearndf.pipeline import PipelineDF, RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
from sklearndf.regression.extra import LGBMRegressorDF
from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF
from sklearndf.transformation.extra import BorutaDF

We load our data:

[3]:
housing_features_df: pd.DataFrame
housing_target_sr: pd.Series

housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)
housing_features_df = housing_features_df.drop(["Id", "YrSold", "MoSold", "MSSubClass", "MiscVal"], axis=1)

The data set includes categorical features, e.g., garage types:

[4]:
housing_features_df["GarageType"].unique().tolist()
[4]:
['Attchd', 'Detchd', 'BuiltIn', 'CarPort', None, 'Basment', '2Types']

Contrasting a scikit-learn and sklearndf pipeline#

Let us build a preprocessing pipeline which:

  • for categorical features, fills missing values with the constant string '<unknown>' and then one-hot encodes

  • for numerical features, fills missing values with the median and adds missing-value indicator columns

A scikit-learn pipeline#

We first build the preprocessing pipeline with native scikit-learn transformers. This takes only a few lines of code; however, it does not allow us to keep track of feature names.

[5]:
preprocessing_numeric = SimpleImputer(strategy="median", add_indicator=True)

preprocessing_categorical = Pipeline(
    steps=[
        ('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoder(sparse=False))
    ]
)

preprocessing = ColumnTransformer(
    transformers=[
        ('num', preprocessing_numeric, make_column_selector(dtype_include=np.number)),
        ('cat', preprocessing_categorical, make_column_selector(dtype_exclude=np.number)),
    ]
)
[6]:
preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)
[6]:
array([[6.500e+01, 8.450e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [8.000e+01, 9.600e+03, 6.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [6.800e+01, 1.125e+04, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       ...,
       [6.600e+01, 9.042e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [6.800e+01, 9.717e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00],
       [7.500e+01, 9.937e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,
        0.000e+00]])

The strength of sklearndf is that it maintains scikit-learn's conventions and expressiveness while also preserving data frames, and hence keeping track of the feature names.

A sklearndf pipeline#

The convention in sklearndf is to append DF to the name of each corresponding scikit-learn class. For instance, to reproduce the above example, we write:

[7]:
preprocessing_numeric_df = SimpleImputerDF(strategy="median", add_indicator=True)

preprocessing_categorical_df = PipelineDF(
    steps=[
        ('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),
        ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown="ignore"))
    ]
)

preprocessing_df = ColumnTransformerDF(
    transformers=[
        ('num', preprocessing_numeric_df, make_column_selector(dtype_include=np.number)),
        ('cat', preprocessing_categorical_df, make_column_selector(dtype_exclude=np.number)),
    ],
)
[8]:
transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)
transformed_df.head().T
[8]:
0 1 2 3 4
feature_out
num__LotFrontage 65.0 80.0 68.0 60.0 84.0
num__LotArea 8450.0 9600.0 11250.0 9550.0 14260.0
num__OverallQual 7.0 6.0 7.0 7.0 8.0
num__OverallCond 5.0 8.0 5.0 5.0 5.0
num__YearBuilt 2003.0 1976.0 2001.0 1915.0 2000.0
... ... ... ... ... ...
cat__SaleCondition_AdjLand 0.0 0.0 0.0 0.0 0.0
cat__SaleCondition_Alloca 0.0 0.0 0.0 0.0 0.0
cat__SaleCondition_Family 0.0 0.0 0.0 0.0 0.0
cat__SaleCondition_Normal 1.0 1.0 1.0 0.0 1.0
cat__SaleCondition_Partial 0.0 0.0 0.0 0.0 0.0

303 rows × 5 columns

The feature_names_original_ attribute returns a series, mapping the output columns (the series’ index) to the input columns (the series’ values):

[9]:
preprocessing_df.feature_names_original_.to_frame()
[9]:
feature_in
feature_out
num__LotFrontage LotFrontage
num__LotArea LotArea
num__OverallQual OverallQual
num__OverallCond OverallCond
num__YearBuilt YearBuilt
... ...
cat__SaleCondition_AdjLand SaleCondition
cat__SaleCondition_Alloca SaleCondition
cat__SaleCondition_Family SaleCondition
cat__SaleCondition_Normal SaleCondition
cat__SaleCondition_Partial SaleCondition

303 rows × 1 columns

You can therefore easily select all output features generated from a given input feature:

[10]:
garage_type_derivatives = preprocessing_df.feature_names_original_ == "GarageType"

transformed_df.loc[:, garage_type_derivatives].head().T
[10]:
0 1 2 3 4
feature_out
cat__GarageType_2Types 0.0 0.0 0.0 0.0 0.0
cat__GarageType_<unknown> 0.0 0.0 0.0 0.0 0.0
cat__GarageType_Attchd 1.0 1.0 1.0 0.0 1.0
cat__GarageType_Basment 0.0 0.0 0.0 0.0 0.0
cat__GarageType_BuiltIn 0.0 0.0 0.0 0.0 0.0
cat__GarageType_CarPort 0.0 0.0 0.0 0.0 0.0
cat__GarageType_Detchd 0.0 0.0 0.0 1.0 0.0

Supervised learners#

Regressors#

As with transformers, scikit-learn regressors and classifiers have sklearndf siblings obtained by appending DF to the class name, and the API remains the same. The result of any predict or decision function is returned as a pandas series (single output) or data frame (class probabilities or multi-output).

For a random forest regressor we get:

[11]:
# a simplified set of numerical features (we will use a pipeline for more sophisticated pre-processing further down)
numerical_features_df = housing_features_df.select_dtypes(np.number).fillna(0)

For demonstration purposes, we create a simple train-test split:

[12]:
df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(
    numerical_features_df,
    housing_target_sr,
    random_state=42
)

Then we create, fit and score a DF-augmented random forest regressor:

[13]:
random_forest_regressor_df = RandomForestRegressorDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)

random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)
random_forest_regressor_df.score(X=df_numerical_test, y=y_test)
[13]:
0.8638857401761126

The DF-augmented regressor returns predictions as a series (we convert the series to a data frame using the .to_frame() method for better rendering in the Jupyter notebook):

[14]:
random_forest_regressor_df.predict(df_numerical_test.head()).to_frame()
[14]:
prediction
892 138678.817934
1105 305008.808241
413 133420.810780
522 171533.659061
1036 307214.384636

All methods and attributes of the native random forest regressor are also available in the DF-augmented regressor, e.g., .get_params() and .set_params():

[15]:
random_forest_regressor_df.get_params()
[15]:
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': 5,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': -3,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
[16]:
random_forest_regressor_df.set_params(max_depth=7)
[16]:
RandomForestRegressorDF(max_depth=7, n_jobs=-3, random_state=42)

The underlying scikit-learn regressor is available through the native_estimator attribute:

[17]:
random_forest_regressor_df.native_estimator
[17]:
RandomForestRegressor(max_depth=7, n_jobs=-3, random_state=42)

Property is_fitted tells whether the regressor is fitted, and, for fitted estimators, property feature_names_in_ returns the names of the ingoing features as a pandas index:

[18]:
random_forest_regressor_df.is_fitted
[18]:
True
[19]:
random_forest_regressor_df.feature_names_in_
[19]:
Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'],
      dtype='object', name='feature_in')

Classifiers#

Classifiers follow a similar logic.

To get started, we bin house prices into three classes for multi-class classification: below 100k, between 100k and 200k, and 200k or above.

[20]:
y_classes = housing_target_sr.apply(
    lambda x:
    ">=200k" if x >= 200000
    else ">=100k" if x >= 100000
    else "<100k"
).astype("category")

We create a train-test split:

[21]:
df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(
    numerical_features_df,
    y_classes,
    random_state=42
)

We create, fit and score a DF-augmented classifier:

[22]:
random_forest_classifier_df = RandomForestClassifierDF(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-3
)
random_forest_classifier_df.fit(df_numerical_train, y_classification_train)
random_forest_classifier_df.score(df_numerical_test, y_classification_test)
[22]:
0.8767123287671232

The DF-augmented classifier returns predicted class labels as a series (once more, we invoke .to_frame() for better rendering in Jupyter):

[23]:
random_forest_classifier_df.predict(df_numerical_test.head()).to_frame()
[23]:
prediction
892 >=100k
1105 >=200k
413 >=100k
522 >=100k
1036 >=200k

Class probabilities are returned as a data frame, using class labels as the column names:

[24]:
random_forest_classifier_df.predict_proba(df_numerical_test.head())
[24]:
<100k >=100k >=200k
892 0.056555 0.905298 0.038148
1105 0.001141 0.091114 0.907745
413 0.104472 0.860927 0.034602
522 0.047416 0.814917 0.137667
1036 0.000812 0.081764 0.917424
[25]:
random_forest_classifier_df.predict_log_proba(df_numerical_test.head())
[25]:
<100k >=100k >=200k
892 -2.872548 -0.099491 -3.266291
1105 -6.775823 -2.395643 -0.096792
413 -2.258840 -0.149746 -3.363852
522 -3.048800 -0.204669 -1.982916
1036 -7.116299 -2.503917 -0.086185

Pipeline#

We can combine the above steps to build a full predictive pipeline. sklearndf provides two useful, specialised pipeline objects for this, RegressorPipelineDF and ClassifierPipelineDF. Both implement a special two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general sklearn pipeline idiom.

[26]:
pipeline_df = RegressorPipelineDF(
    preprocessing=preprocessing_df,
    regressor=RandomForestRegressorDF(
        n_estimators=1000,
        max_features=2/3,
        max_depth=7,
        random_state=42,
        n_jobs=-3
    )
)
[27]:
df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)
pipeline_df.fit(df_train, y_train)
pipeline_df.score(df_test, y_test)
[27]:
0.8883657928153087
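
A classification pipeline can be assembled in the same way with ClassifierPipelineDF. The sketch below assumes its classifier argument mirrors the regressor argument above, and reuses the class labels y_classes defined in the classifier section:

from sklearndf.pipeline import ClassifierPipelineDF

classifier_pipeline_df = ClassifierPipelineDF(
    preprocessing=preprocessing_df,
    classifier=RandomForestClassifierDF(
        n_estimators=100,
        max_depth=5,
        random_state=42,
        n_jobs=-3
    )
)

df_train_clf, df_test_clf, y_train_clf, y_test_clf = train_test_split(
    housing_features_df, y_classes, random_state=42
)
classifier_pipeline_df.fit(df_train_clf, y_train_clf)
classifier_pipeline_df.score(df_test_clf, y_test_clf)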

Extras#

sklearndf also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire, and which follow the scikit-learn idiom. These are provided in the .extra modules:

  • sklearndf.regression.extra.LGBMRegressorDF

  • sklearndf.regression.extra.XGBRegressorDF

  • sklearndf.classification.extra.LGBMClassifierDF

  • sklearndf.classification.extra.XGBClassifierDF

  • sklearndf.transformation.extra.BorutaDF

In the interest of lean package dependencies, sklearndf does not require the lightgbm, xgboost, or boruta_py packages by default; each of these DF classes is available only if its associated third-party package is installed.
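
If your code should degrade gracefully when an optional backend is missing, a simple guard is to wrap the import (a sketch; the exact error raised for a missing backend may vary between sklearndf versions):

try:
    from sklearndf.regression.extra import XGBRegressorDF
except ImportError:
    XGBRegressorDF = None  # xgboost is not installed; skip or substitute the XGBoost model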

LightGBM regressor#

[28]:
lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)
lgbm_df.fit(df_numerical_train, y_train)
lgbm_df.predict(df_numerical_test.head()).to_frame()
[28]:
prediction
892 139447.461334
1105 288031.725712
413 124917.506705
522 167320.406141
1036 315868.109901

Boruta#

Boruta is a smart feature selection method to eliminate all features whose predictive power is not better than random noise.

The sklearndf.transformation.extra.BorutaDF transformer provides easy access to this powerful method. It builds on a tree-based learner, usually a random forest.

For the random forest, we rely on default parameters but set the maximum tree depth to 5 (for Boruta, a depth between 3 and 7 is recommended, depending on the number of features and the expected complexity of the feature/target interactions). The number of trees is managed automatically by the Boruta feature selector (argument n_estimators="auto").

[29]:
boruta_pipeline = PipelineDF(
    steps=[
        ('preprocess', preprocessing_df),
        ('boruta', BorutaDF(
            estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3),
            n_estimators="auto",
            random_state=42,
            verbose=1
        )),
    ]
)
[30]:
boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)
Iteration: 1 / 100
Iteration: 2 / 100
Iteration: 3 / 100
...
Iteration: 99 / 100


BorutaPy finished running.

Iteration:      100 / 100
Confirmed:      16
Tentative:      0
Rejected:       286
[30]:
PipelineDF(steps=[('preprocess', ColumnTransformerDF(transformers=[('num', SimpleImputerDF(add_indicator=True, strategy='median'), <sklearn.compose._column_transformer.make_column_selector object at 0x13958f2b0>), ('cat', PipelineDF(steps=[('imputer', SimpleImputerDF(fill_value='<unknown>', missing_values=None, strategy='constant')), ('one-hot', OneHotEncoderDF(handle_unknown='ignore', sparse=False))]), <sklearn.compose._column_transformer.make_column_selector object at 0x13958fd60>)])), ('boruta', BorutaDF(estimator=RandomForestRegressor(max_depth=5, n_estimators=116, n_jobs=-3,
                      random_state=RandomState(MT19937) at 0x13B34BD40), n_estimators='auto', random_state=RandomState(MT19937) at 0x13B34BD40, verbose=1))])

Boruta is implemented as a scikit-learn transformer; its output features are all features that passed the Boruta test.

[31]:
boruta_pipeline.feature_names_out_.to_list()
[31]:
['num__LotFrontage',
 'num__LotArea',
 'num__OverallQual',
 'num__YearBuilt',
 'num__YearRemodAdd',
 'num__MasVnrArea',
 'num__BsmtFinSF1',
 'num__TotalBsmtSF',
 'num__1stFlrSF',
 'num__2ndFlrSF',
 'num__GrLivArea',
 'num__FullBath',
 'num__TotRmsAbvGrd',
 'num__GarageCars',
 'num__GarageArea',
 'cat__BsmtQual_Ex']

sklearndf allows us to trace outgoing features back to the original features from which they were derived, using the sklearndf.TransformerDF.feature_names_original_ property. This is useful here because we want to know which of the original features to keep before they even enter the pipeline.

In our example, feature BsmtQual_Ex is a derivative of feature BsmtQual, obtained through one-hot encoding:

[32]:
boruta_pipeline.feature_names_original_.to_frame()
[32]:
feature_in
feature_out
num__LotFrontage LotFrontage
num__LotArea LotArea
num__OverallQual OverallQual
num__YearBuilt YearBuilt
num__YearRemodAdd YearRemodAdd
num__MasVnrArea MasVnrArea
num__BsmtFinSF1 BsmtFinSF1
num__TotalBsmtSF TotalBsmtSF
num__1stFlrSF 1stFlrSF
num__2ndFlrSF 2ndFlrSF
num__GrLivArea GrLivArea
num__FullBath FullBath
num__TotRmsAbvGrd TotRmsAbvGrd
num__GarageCars GarageCars
num__GarageArea GarageArea
cat__BsmtQual_Ex BsmtQual

So, to obtain the features we want to select from the original data set, we determine the unique ingoing features from the original feature mapping:

[33]:
features_relevant = boruta_pipeline.feature_names_original_.unique()
features_relevant.tolist()
[33]:
['LotFrontage',
 'LotArea',
 'OverallQual',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'GrLivArea',
 'FullBath',
 'TotRmsAbvGrd',
 'GarageCars',
 'GarageArea',
 'BsmtQual']

Comparing model performance for the reduced set of relevant features with model performance for the full feature set, we notice that performance is almost the same despite a significantly lower feature count (bearing in mind that, for simplicity, we are validating on a single test set; in a real-world example we would use cross-validation for more reliable model scores).

[34]:
pipeline_df.fit(
    # fit the model using the full feature set
    df_train, y_train
)

score_all_features = pipeline_df.score(
    # score the model using the full feature set
    df_test, y_test
)

print(f"R2={score_all_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features")
R2=0.888 with model fitted on 299 features
[35]:
pipeline_df.fit(
    # fit the model using the relevant feature set
    df_train.loc[:, features_relevant], y_train
)

score_relevant_features = pipeline_df.score(
    # score the model using the relevant feature set
    df_test.loc[:, features_relevant], y_test
)

print(f"R2={score_relevant_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features")
R2=0.893 with model fitted on 22 features
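
As noted above, a single train-test split gives only a rough performance estimate. Since DF pipelines follow the scikit-learn estimator contract, they can be passed directly to scikit-learn's model selection utilities; here is a minimal cross-validation sketch for the reduced feature set:

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    pipeline_df,
    housing_features_df.loc[:, features_relevant],
    housing_target_sr,
    cv=5,
    scoring="r2",
)
print(f"mean R2 across 5 folds: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")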