{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "delete_for_interactive": true, "nbsphinx": "hidden", "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "# this cell's metadata contains\n", "# \"nbsphinx\": \"hidden\" so it is hidden by nbsphinx\n", "\n", "def _ignore_warnings() -> None:\n", " # ignore irrelevant warnings that would affect the output of this tutorial notebook\n", " \n", " # ignore a useless LGBM warning\n", " import warnings\n", " warnings.filterwarnings(\"ignore\", category=UserWarning, message=r\".*Xcode_8\\.3\\.3\")\n", " warnings.filterwarnings(\"ignore\", category=FutureWarning)\n", "\n", "_ignore_warnings()\n", "\n", "del _ignore_warnings" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-26T08:31:08.171058Z", "start_time": "2019-07-26T08:31:08.167059Z" } }, "source": [ "# Scikit-learn and data frames\n" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "raw_mimetype": "text/restructuredtext" }, "source": [ "The `sklearndf` package enhances scikit-learn for advanced support of data frames.\n", "\n", "It addresses a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names.\n", "\n", "`sklearndf` enhances scikit-learn's estimators to:\n", "\n", "- return data frames as results of transformations, preserving feature names as the column index\n", "- add estimator properties that trace an output feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers\n", "\n", "Using `sklearndf` is simple: append `DF` to the end of a scikit-learn class name to get the version of that class with enhanced data frame support."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:38.950751Z", "start_time": "2019-07-30T18:53:36.816398Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.compose import ColumnTransformer, make_column_selector\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import OneHotEncoder\n", "\n", "from sklearndf.classification import RandomForestClassifierDF\n", "from sklearndf.pipeline import PipelineDF, RegressorPipelineDF\n", "from sklearndf.regression import RandomForestRegressorDF\n", "from sklearndf.regression.extra import LGBMRegressorDF\n", "from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF\n", "from sklearndf.transformation.extra import BorutaDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load our data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:38.991698Z", "start_time": "2019-07-30T18:53:38.953342Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "housing_features_df: pd.DataFrame\n", "housing_target_sr: pd.Series\n", "\n", "housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)\n", "housing_features_df = housing_features_df.drop([\"Id\", \"YrSold\", \"MoSold\", \"MSSubClass\", \"MiscVal\"], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set includes categorical features, e.g., garage types:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:39.002353Z", "start_time": "2019-07-30T18:53:38.993328Z" }, "pycharm": { 
"is_executing": false } }, "outputs": [ { "data": { "text/plain": [ "['Attchd', 'Detchd', 'BuiltIn', 'CarPort', None, 'Basment', '2Types']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "housing_features_df[\"GarageType\"].unique().tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contrasting a scikit-learn and a sklearndf pipeline" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-26T13:42:19.065036Z", "start_time": "2019-07-26T13:42:19.058033Z" } }, "source": [ "Let us build a preprocessing pipeline which:\n", "\n", "- for categorical variables, fills missing values with a constant placeholder string and then one-hot encodes them\n", "- for numerical variables, fills missing values with the median" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## A scikit-learn pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first build the preprocessing pipeline with native scikit-learn transformers.\n", "This takes only a few lines of code, but it does not allow us to keep track of feature names.
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:39.055313Z", "start_time": "2019-07-30T18:53:39.047314Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "preprocessing_numeric = SimpleImputer(strategy=\"median\", add_indicator=True)\n", "\n", "preprocessing_categorical = Pipeline(\n", " steps=[\n", " ('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='')),\n", " ('one-hot', OneHotEncoder(sparse=False))\n", " ]\n", ")\n", "\n", "preprocessing = ColumnTransformer(\n", " transformers=[\n", " ('num', preprocessing_numeric, make_column_selector(dtype_include=np.number)),\n", " ('cat', preprocessing_categorical, make_column_selector(dtype_exclude=np.number)),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:39.134309Z", "start_time": "2019-07-30T18:53:39.082355Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/plain": [ "array([[6.500e+01, 8.450e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00],\n", " [8.000e+01, 9.600e+03, 6.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00],\n", " [6.800e+01, 1.125e+04, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00],\n", " ...,\n", " [6.600e+01, 9.042e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00],\n", " [6.800e+01, 9.717e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00],\n", " [7.500e+01, 9.937e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,\n", " 0.000e+00]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" }, "raw_mimetype": "text/restructuredtext" }, "source": [ "The strength of `sklearndf` is to maintain the scikit-learn conventions and expressiveness, and to also preserve dataframes, 
hence keeping track of the feature names." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## A sklearndf pipeline" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The convention in `sklearndf` is to append `DF` to the name of each corresponding scikit-learn class.\n", "For instance, to reproduce the above example, we write:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:39.153308Z", "start_time": "2019-07-30T18:53:39.145308Z" }, "pycharm": { "is_executing": false } }, "outputs": [], "source": [ "preprocessing_numeric_df = SimpleImputerDF(strategy=\"median\", add_indicator=True)\n", "\n", "preprocessing_categorical_df = PipelineDF(\n", " steps=[\n", " ('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),\n", " ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown=\"ignore\"))\n", " ]\n", ")\n", "\n", "preprocessing_df = ColumnTransformerDF(\n", " transformers=[\n", " ('num', preprocessing_numeric_df, make_column_selector(dtype_include=np.number)),\n", " ('cat', preprocessing_categorical_df, make_column_selector(dtype_exclude=np.number)),\n", " ],\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:53:39.249358Z", "start_time": "2019-07-30T18:53:39.180306Z" }, "pycharm": { "is_executing": false } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4\n", "feature_out \n", "num__LotFrontage 65.0 80.0 68.0 60.0 84.0\n", "num__LotArea 8450.0 9600.0 11250.0 9550.0 14260.0\n", "num__OverallQual 7.0 6.0 7.0 7.0 8.0\n", "num__OverallCond 5.0 8.0 5.0 5.0 5.0\n", "num__YearBuilt 2003.0 1976.0 2001.0 1915.0 2000.0\n", "... ... ... ... ... ...\n", "cat__SaleCondition_AdjLand 0.0 0.0 0.0 0.0 0.0\n", "cat__SaleCondition_Alloca 0.0 0.0 0.0 0.0 0.0\n", "cat__SaleCondition_Family 0.0 0.0 0.0 0.0 0.0\n", "cat__SaleCondition_Normal 1.0 1.0 1.0 0.0 1.0\n", "cat__SaleCondition_Partial 0.0 0.0 0.0 0.0 0.0\n", "\n", "[303 rows x 5 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)\n", "transformed_df.head().T" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ "The `feature_names_original_` attribute returns a series, mapping the output columns (the series' index) to the input columns (the series' values):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:35.436833Z", "start_time": "2019-07-30T18:54:35.431830Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " feature_in\n", "feature_out \n", "num__LotFrontage LotFrontage\n", "num__LotArea LotArea\n", "num__OverallQual OverallQual\n", "num__OverallCond OverallCond\n", "num__YearBuilt YearBuilt\n", "... ...\n", "cat__SaleCondition_AdjLand SaleCondition\n", "cat__SaleCondition_Alloca SaleCondition\n", "cat__SaleCondition_Family SaleCondition\n", "cat__SaleCondition_Normal SaleCondition\n", "cat__SaleCondition_Partial SaleCondition\n", "\n", "[303 rows x 1 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preprocessing_df.feature_names_original_.to_frame()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "You can therefore easily select all output features generated from a given input feature:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:47.270660Z", "start_time": "2019-07-30T18:54:47.254638Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 3 4\n", "feature_out \n", "cat__GarageType_2Types 0.0 0.0 0.0 0.0 0.0\n", "cat__GarageType_<unknown> 0.0 0.0 0.0 0.0 0.0\n", "cat__GarageType_Attchd 1.0 1.0 1.0 0.0 1.0\n", "cat__GarageType_Basment 0.0 0.0 0.0 0.0 0.0\n", "cat__GarageType_BuiltIn 0.0 0.0 0.0 0.0 0.0\n", "cat__GarageType_CarPort 0.0 0.0 0.0 0.0 0.0\n", "cat__GarageType_Detchd 0.0 0.0 0.0 1.0 0.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "garage_type_derivatives = preprocessing_df.feature_names_original_ == \"GarageType\"\n", "\n", "transformed_df.loc[:, garage_type_derivatives].head().T" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Supervised learners" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-26T08:36:43.003042Z", "start_time": "2019-07-26T08:36:42.733047Z" }, "pycharm": { "name": "#%% md\n" } }, "source": [ "## Regressors" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "As for transformers, scikit-learn regressors and classifiers have a `sklearndf` sibling obtained by appending `DF` to the class name, and the API remains the same. 
The result of any predict and decision function will be returned as a `pandas` series (single output) or data frame (class probabilities or multi-output).\n", "\n", "For a random forest regressor we get:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.180964Z", "start_time": "2019-07-30T18:54:48.757981Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "# a simplified features vector (we will use a pipeline for more sophisticated pre-processing further down)\n", "numerical_features_df = housing_features_df.select_dtypes(np.number).fillna(0)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.180964Z", "start_time": "2019-07-30T18:54:48.757981Z" }, "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "For demonstration purposes, we create a simple train-test split:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.180964Z", "start_time": "2019-07-30T18:54:48.757981Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(\n", " numerical_features_df,\n", " housing_target_sr,\n", " random_state=42\n", ")" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.180964Z", "start_time": "2019-07-30T18:54:48.757981Z" }, "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "Then we create, fit and score a DF-augmented random forest regressor:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.180964Z", "start_time": "2019-07-30T18:54:48.757981Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.8638857401761126" ] }, "execution_count": 13, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "random_forest_regressor_df = RandomForestRegressorDF(\n", " n_estimators=100,\n", " max_depth=5,\n", " random_state=42,\n", " n_jobs=-3\n", ")\n", "\n", "random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)\n", "random_forest_regressor_df.score(X=df_numerical_test, y=y_test)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The DF-augmented regressor returns predictions as a series (we convert the series to a data frame using method `.to_frame()` for better rendering in the jupyter notebook):" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "tags": [], "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " prediction\n", "892 138678.817934\n", "1105 305008.808241\n", "413 133420.810780\n", "522 171533.659061\n", "1036 307214.384636" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_regressor_df.predict(df_numerical_test.head()).to_frame()" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.188965Z", "start_time": "2019-07-30T18:54:49.182965Z" }, "pycharm": { "is_executing": false, "name": "#%% md\n" } }, "source": [ "All methods and attributes of the native random forest regressor are also available in the DF-augmented regressor, e.g., `.get_params()` and `.set_params()`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.188965Z", "start_time": "2019-07-30T18:54:49.182965Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "{'bootstrap': True,\n", " 'ccp_alpha': 0.0,\n", " 'criterion': 'squared_error',\n", " 'max_depth': 5,\n", " 'max_features': 'auto',\n", " 'max_leaf_nodes': None,\n", " 'max_samples': None,\n", " 'min_impurity_decrease': 0.0,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 100,\n", " 'n_jobs': -3,\n", " 'oob_score': False,\n", " 'random_state': 42,\n", " 'verbose': 0,\n", " 'warm_start': False}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_regressor_df.get_params()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.443565Z", "start_time": "2019-07-30T18:54:49.431562Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressorDF(max_depth=7, n_jobs=-3, random_state=42)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "random_forest_regressor_df.set_params(max_depth=7)" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "The underlying scikit-learn regressor is available through the `native_estimator` attribute:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:49.895830Z", "start_time": "2019-07-30T18:54:49.889831Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(max_depth=7, n_jobs=-3, random_state=42)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_regressor_df.native_estimator" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "Property `is_fitted` tells if the regressor is fitted, and – for fitted estimators – property `feature_names_in_` returns the names of the ingoing features as a pandas index:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:50.523346Z", "start_time": "2019-07-30T18:54:50.518398Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_regressor_df.is_fitted" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:50.349489Z", "start_time": "2019-07-30T18:54:50.345466Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',\n", " 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',\n", " 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',\n", " 
'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',\n", " 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',\n", " 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',\n", " 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'],\n", " dtype='object', name='feature_in')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_regressor_df.feature_names_in_" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Classifiers\n", "\n", "Classifiers follow a similar logic." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "To get started, we bin house prices into three classes (below 100k, 100k to below 200k, and 200k or above) for multi-class classification." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:50.833368Z", "start_time": "2019-07-30T18:54:50.826362Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "y_classes = housing_target_sr.apply(\n", " lambda x:\n", " \">=200k\" if x >= 200000\n", " else \">=100k\" if x >= 100000\n", " else \"<100k\"\n", ").astype(\"category\")" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We create a train-test split:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:50.833368Z", "start_time": "2019-07-30T18:54:50.826362Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(\n", " numerical_features_df,\n", " y_classes,\n", " random_state=42\n", ")" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:50.833368Z", "start_time": "2019-07-30T18:54:50.826362Z" }, "pycharm": { "is_executing": 
false, "name": "#%% md\n" } }, "source": [ " We create, fit and score a DF-augmented classifier:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:51.432224Z", "start_time": "2019-07-30T18:54:51.341154Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.8767123287671232" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_classifier_df = RandomForestClassifierDF(\n", " n_estimators=100,\n", " max_depth=5,\n", " random_state=42,\n", " n_jobs=-3\n", ")\n", "random_forest_classifier_df.fit(df_numerical_train, y_classification_train)\n", "random_forest_classifier_df.score(df_numerical_test, y_classification_test)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "The DF-augmented classifier returns predicted class labels as a series (once more, we invoke `.to_frame()` for better rendering in Jupyter):" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " prediction\n", "892 >=100k\n", "1105 >=200k\n", "413 >=100k\n", "522 >=100k\n", "1036 >=200k" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_classifier_df.predict(df_numerical_test.head()).to_frame()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Class probabilities are returned as a data frame, using class labels as the column names:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " <100k >=100k >=200k\n", "892 0.056555 0.905298 0.038148\n", "1105 0.001141 0.091114 0.907745\n", "413 0.104472 0.860927 0.034602\n", "522 0.047416 0.814917 0.137667\n", "1036 0.000812 0.081764 0.917424" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_classifier_df.predict_proba(df_numerical_test.head())" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " <100k >=100k >=200k\n", "892 -2.872548 -0.099491 -3.266291\n", "1105 -6.775823 -2.395643 -0.096792\n", "413 -2.258840 -0.149746 -3.363852\n", "522 -3.048800 -0.204669 -1.982916\n", "1036 -7.116299 -2.503917 -0.086185" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_forest_classifier_df.predict_log_proba(df_numerical_test.head())" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Pipeline" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-07-26T08:37:37.633663Z", "start_time": "2019-07-26T08:37:37.618669Z" }, "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "We can combine the above steps to build a full predictive pipeline. `sklearndf` provides two useful, specialised pipeline objects for this, `RegressorPipelineDF` and `ClassifierPipelineDF`. Both implement a special two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general sklearn pipeline idiom." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:51.520150Z", "start_time": "2019-07-30T18:54:51.513148Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [], "source": [ "pipeline_df = RegressorPipelineDF(\n", " preprocessing=preprocessing_df,\n", " regressor=RandomForestRegressorDF(\n", " n_estimators=1000,\n", " max_features=2/3,\n", " max_depth=7,\n", " random_state=42,\n", " n_jobs=-3\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:51.874135Z", "start_time": "2019-07-30T18:54:51.522150Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "0.8883657928153087" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)\n", "pipeline_df.fit(df_train, y_train)\n", "pipeline_df.score(df_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Extras" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "`sklearndf` also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire, and which follow the scikit-learn idiom. 
These are provided in the `.extra` modules:\n", "\n", "- `sklearndf.regression.extra.LGBMRegressorDF`\n", "- `sklearndf.regression.extra.XGBRegressorDF`\n", "- `sklearndf.classification.extra.LGBMClassifierDF`\n", "- `sklearndf.classification.extra.XGBClassifierDF`\n", "- `sklearndf.transformation.extra.BorutaDF`" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "In the interest of lean package dependencies, `sklearndf` does not require packages `lightgbm`, `xgboost`, or `boruta_py` to be installed by default, but will not provide any of these DF classes if their associated third-party package is not installed." ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## LightGBM regressor" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-07-30T18:54:52.046129Z", "start_time": "2019-07-30T18:54:51.876135Z" }, "pycharm": { "is_executing": false, "name": "#%%\n" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " prediction\n", "892 139447.461334\n", "1105 288031.725712\n", "413 124917.506705\n", "522 167320.406141\n", "1036 315868.109901" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)\n", "lgbm_df.fit(df_numerical_train, y_train)\n", "lgbm_df.predict(df_numerical_test.head()).to_frame()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "## Boruta" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "[Boruta](https://www.jstatsoft.org/article/view/v036i11) is a smart feature selection method to eliminate all features whose predictive power is not better than random noise.\n", "\n", "The `sklearndf.transformation.extra.BorutaDF` transformer provides easy access to this powerful method. The basis of this is a tree-based learner, usually a random forest.\n", "\n", "For the random forest, we rely on default parameters but set the maximum tree depth to 5 (for Boruta, setting a depth between 3 and 7 is highly recommended and depends on the number of features and expected complexity of the feature/target interactions). 
The number of trees is automatically managed by the Boruta feature selector (argument ``n_estimators=\"auto\"``).\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [], "source": [ "boruta_pipeline = PipelineDF(\n", " steps=[\n", " ('preprocess', preprocessing_df),\n", " ('boruta', BorutaDF(\n", " estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3), \n", " n_estimators=\"auto\", \n", " random_state=42,\n", " verbose=1\n", " )),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iteration: 1 / 100\n", "Iteration: 2 / 100\n", "Iteration: 3 / 100\n", "Iteration: 4 / 100\n", "Iteration: 5 / 100\n", "Iteration: 6 / 100\n", "Iteration: 7 / 100\n", "Iteration: 8 / 100\n", "Iteration: 9 / 100\n", "Iteration: 10 / 100\n", "Iteration: 11 / 100\n", "Iteration: 12 / 100\n", "Iteration: 13 / 100\n", "Iteration: 14 / 100\n", "Iteration: 15 / 100\n", "Iteration: 16 / 100\n", "Iteration: 17 / 100\n", "Iteration: 18 / 100\n", "Iteration: 19 / 100\n", "Iteration: 20 / 100\n", "Iteration: 21 / 100\n", "Iteration: 22 / 100\n", "Iteration: 23 / 100\n", "Iteration: 24 / 100\n", "Iteration: 25 / 100\n", "Iteration: 26 / 100\n", "Iteration: 27 / 100\n", "Iteration: 28 / 100\n", "Iteration: 29 / 100\n", "Iteration: 30 / 100\n", "Iteration: 31 / 100\n", "Iteration: 32 / 100\n", "Iteration: 33 / 100\n", "Iteration: 34 / 100\n", "Iteration: 35 / 100\n", "Iteration: 36 / 100\n", "Iteration: 37 / 100\n", "Iteration: 38 / 100\n", "Iteration: 39 / 100\n", "Iteration: 40 / 100\n", "Iteration: 41 / 100\n", "Iteration: 42 / 100\n", "Iteration: 43 / 100\n", "Iteration: 44 / 100\n", "Iteration: 45 / 100\n", "Iteration: 46 / 100\n", "Iteration: 47 / 100\n", "Iteration: 48 / 100\n", "Iteration: 49 / 100\n", "Iteration: 50 / 100\n", "Iteration: 51 / 100\n", "Iteration: 52 / 100\n", 
"Iteration: 53 / 100\n", "Iteration: 54 / 100\n", "Iteration: 55 / 100\n", "Iteration: 56 / 100\n", "Iteration: 57 / 100\n", "Iteration: 58 / 100\n", "Iteration: 59 / 100\n", "Iteration: 60 / 100\n", "Iteration: 61 / 100\n", "Iteration: 62 / 100\n", "Iteration: 63 / 100\n", "Iteration: 64 / 100\n", "Iteration: 65 / 100\n", "Iteration: 66 / 100\n", "Iteration: 67 / 100\n", "Iteration: 68 / 100\n", "Iteration: 69 / 100\n", "Iteration: 70 / 100\n", "Iteration: 71 / 100\n", "Iteration: 72 / 100\n", "Iteration: 73 / 100\n", "Iteration: 74 / 100\n", "Iteration: 75 / 100\n", "Iteration: 76 / 100\n", "Iteration: 77 / 100\n", "Iteration: 78 / 100\n", "Iteration: 79 / 100\n", "Iteration: 80 / 100\n", "Iteration: 81 / 100\n", "Iteration: 82 / 100\n", "Iteration: 83 / 100\n", "Iteration: 84 / 100\n", "Iteration: 85 / 100\n", "Iteration: 86 / 100\n", "Iteration: 87 / 100\n", "Iteration: 88 / 100\n", "Iteration: 89 / 100\n", "Iteration: 90 / 100\n", "Iteration: 91 / 100\n", "Iteration: 92 / 100\n", "Iteration: 93 / 100\n", "Iteration: 94 / 100\n", "Iteration: 95 / 100\n", "Iteration: 96 / 100\n", "Iteration: 97 / 100\n", "Iteration: 98 / 100\n", "Iteration: 99 / 100\n", "\n", "\n", "BorutaPy finished running.\n", "\n", "Iteration: \t100 / 100\n", "Confirmed: \t16\n", "Tentative: \t0\n", "Rejected: \t286\n" ] }, { "data": { "text/plain": [ "PipelineDF(steps=[('preprocess', ColumnTransformerDF(transformers=[('num', SimpleImputerDF(add_indicator=True, strategy='median'), ), ('cat', PipelineDF(steps=[('imputer', SimpleImputerDF(fill_value='', missing_values=None, strategy='constant')), ('one-hot', OneHotEncoderDF(handle_unknown='ignore', sparse=False))]), )])), ('boruta', BorutaDF(estimator=RandomForestRegressor(max_depth=5, n_estimators=116, n_jobs=-3,\n", " random_state=RandomState(MT19937) at 0x13B34BD40), n_estimators='auto', random_state=RandomState(MT19937) at 0x13B34BD40, verbose=1))])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": 
[ "boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Boruta is implemented as a *scikit-learn* transformer; its output features are all features that passed the Boruta test." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "['num__LotFrontage',\n", " 'num__LotArea',\n", " 'num__OverallQual',\n", " 'num__YearBuilt',\n", " 'num__YearRemodAdd',\n", " 'num__MasVnrArea',\n", " 'num__BsmtFinSF1',\n", " 'num__TotalBsmtSF',\n", " 'num__1stFlrSF',\n", " 'num__2ndFlrSF',\n", " 'num__GrLivArea',\n", " 'num__FullBath',\n", " 'num__TotRmsAbvGrd',\n", " 'num__GarageCars',\n", " 'num__GarageArea',\n", " 'cat__BsmtQual_Ex']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boruta_pipeline.feature_names_out_.to_list()" ] }, { "cell_type": "markdown", "metadata": { "raw_mimetype": "text/restructuredtext", "pycharm": { "name": "#%% md\n" } }, "source": [ "`sklearndf` allows us to trace outgoing features back to the original features from which they were derived, using the `sklearndf.TransformerDF.feature_names_original_` property. This is useful here because we want to know which original features we can drop before the data enters the pipeline.\n", "\n", "In our example, feature `cat__BsmtQual_Ex` is a derivative of feature `BsmtQual`, obtained through one-hot encoding:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>feature_in</th></tr>\n",
" <tr><th>feature_out</th><th></th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>num__LotFrontage</th><td>LotFrontage</td></tr>\n",
" <tr><th>num__LotArea</th><td>LotArea</td></tr>\n",
" <tr><th>num__OverallQual</th><td>OverallQual</td></tr>\n",
" <tr><th>num__YearBuilt</th><td>YearBuilt</td></tr>\n",
" <tr><th>num__YearRemodAdd</th><td>YearRemodAdd</td></tr>\n",
" <tr><th>num__MasVnrArea</th><td>MasVnrArea</td></tr>\n",
" <tr><th>num__BsmtFinSF1</th><td>BsmtFinSF1</td></tr>\n",
" <tr><th>num__TotalBsmtSF</th><td>TotalBsmtSF</td></tr>\n",
" <tr><th>num__1stFlrSF</th><td>1stFlrSF</td></tr>\n",
" <tr><th>num__2ndFlrSF</th><td>2ndFlrSF</td></tr>\n",
" <tr><th>num__GrLivArea</th><td>GrLivArea</td></tr>\n",
" <tr><th>num__FullBath</th><td>FullBath</td></tr>\n",
" <tr><th>num__TotRmsAbvGrd</th><td>TotRmsAbvGrd</td></tr>\n",
" <tr><th>num__GarageCars</th><td>GarageCars</td></tr>\n",
" <tr><th>num__GarageArea</th><td>GarageArea</td></tr>\n",
" <tr><th>cat__BsmtQual_Ex</th><td>BsmtQual</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " feature_in\n", "feature_out \n", "num__LotFrontage LotFrontage\n", "num__LotArea LotArea\n", "num__OverallQual OverallQual\n", "num__YearBuilt YearBuilt\n", "num__YearRemodAdd YearRemodAdd\n", "num__MasVnrArea MasVnrArea\n", "num__BsmtFinSF1 BsmtFinSF1\n", "num__TotalBsmtSF TotalBsmtSF\n", "num__1stFlrSF 1stFlrSF\n", "num__2ndFlrSF 2ndFlrSF\n", "num__GrLivArea GrLivArea\n", "num__FullBath FullBath\n", "num__TotRmsAbvGrd TotRmsAbvGrd\n", "num__GarageCars GarageCars\n", "num__GarageArea GarageArea\n", "cat__BsmtQual_Ex BsmtQual" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "boruta_pipeline.feature_names_original_.to_frame()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "So, to obtain all features we want to select from the original data set, we can determine the unique relevant ingoing features from the original feature mapping:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "['LotFrontage',\n", " 'LotArea',\n", " 'OverallQual',\n", " 'YearBuilt',\n", " 'YearRemodAdd',\n", " 'MasVnrArea',\n", " 'BsmtFinSF1',\n", " 'TotalBsmtSF',\n", " '1stFlrSF',\n", " '2ndFlrSF',\n", " 'GrLivArea',\n", " 'FullBath',\n", " 'TotRmsAbvGrd',\n", " 'GarageCars',\n", " 'GarageArea',\n", " 'BsmtQual']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features_relevant = boruta_pipeline.feature_names_original_.unique()\n", "features_relevant.tolist()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "When we compare model performance for the reduced set of relevant features with model performance for the full feature set, we notice that performance is almost the same despite a significantly lower feature count (bearing in mind that, for reasons of simplicity, we are validating based on a
single test set; in a real-world example we'd want to use cross-validation for more accurate model scores)." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "pycharm": { "name": "#%%\n" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R2=0.888 with model fitted on 299 features\n" ] } ], "source": [ "pipeline_df.fit(\n", " # fit the model using the full feature set\n", " df_train, y_train\n", ")\n", "\n", "score_all_features = pipeline_df.score(\n", " # score the model using the full feature set\n", " df_test, y_test\n", ")\n", "\n", "print(f\"R2={score_all_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features\")" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "R2=0.893 with model fitted on 22 features\n" ] } ], "source": [ "pipeline_df.fit(\n", " # fit the model using the relevant feature set\n", " df_train.loc[:, features_relevant], y_train\n", ")\n", "\n", "score_relevant_features = pipeline_df.score(\n", " # score the model using the relevant feature set\n", " df_test.loc[:, features_relevant], y_test\n", ")\n", "\n", "print(f\"R2={score_relevant_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, 
"toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "165px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }