{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "delete_for_interactive": true,
    "nbsphinx": "hidden",
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# this cell's metadata contains\n",
    "# \"nbsphinx\": \"hidden\" so it is hidden by nbsphinx\n",
    "\n",
    "def _ignore_warnings() -> None:\n",
    "    # ignore irrelevant warnings that would affect the output of this tutorial notebook\n",
    "    \n",
    "    # ignore a useless LGBM warning\n",
    "    import warnings\n",
    "    warnings.filterwarnings(\"ignore\", category=UserWarning, message=r\".*Xcode_8\\.3\\.3\")\n",
    "    warnings.filterwarnings(\"ignore\", category=FutureWarning)\n",
    "\n",
    "_ignore_warnings()\n",
    "\n",
    "del _ignore_warnings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-26T08:31:08.171058Z",
     "start_time": "2019-07-26T08:31:08.167059Z"
    }
   },
   "source": [
    "# Scikit-learn and data frames\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The `sklearndf` package enhances scikit-learn with advanced support for data frames.\n",
    "\n",
    "It addresses a common issue with scikit-learn: transformers return numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names.\n",
    "\n",
    "`sklearndf` enhances scikit-learn's estimators to:\n",
    "\n",
    "- return data frames as results of transformations, preserving feature names as the column index\n",
    "- add estimator properties that trace a feature name back to its original input feature; this is especially useful for transformers that create new features (e.g., one-hot encoding), and for pipelines that include such transformers\n",
    "\n",
    "Using `sklearndf` is simple: append `DF` to the name of a scikit-learn class to get its counterpart with enhanced data frame support."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:38.950751Z",
     "start_time": "2019-07-30T18:53:36.816398Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.compose import ColumnTransformer, make_column_selector\n",
    "from sklearn.datasets import fetch_openml\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "from sklearndf.classification import RandomForestClassifierDF\n",
    "from sklearndf.pipeline import PipelineDF, RegressorPipelineDF\n",
    "from sklearndf.regression import RandomForestRegressorDF\n",
    "from sklearndf.regression.extra import LGBMRegressorDF\n",
    "from sklearndf.transformation import ColumnTransformerDF, OneHotEncoderDF, SimpleImputerDF\n",
    "from sklearndf.transformation.extra import BorutaDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load a housing data set from OpenML and drop a few columns that we will not use:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:38.991698Z",
     "start_time": "2019-07-30T18:53:38.953342Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "housing_features_df: pd.DataFrame\n",
    "housing_target_sr: pd.Series\n",
    "\n",
    "housing_features_df, housing_target_sr = fetch_openml(data_id=42165, return_X_y=True, as_frame=True)\n",
    "housing_features_df = housing_features_df.drop([\"Id\", \"YrSold\", \"MoSold\", \"MSSubClass\", \"MiscVal\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data set includes categorical features, e.g., garage types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:39.002353Z",
     "start_time": "2019-07-30T18:53:38.993328Z"
    },
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Attchd', 'Detchd', 'BuiltIn', 'CarPort', None, 'Basment', '2Types']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "housing_features_df[\"GarageType\"].unique().tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Contrasting a scikit-learn and sklearndf pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-26T13:42:19.065036Z",
     "start_time": "2019-07-26T13:42:19.058033Z"
    }
   },
   "source": [
    "Let us build a preprocessing pipeline which:\n",
    "\n",
    "- for categorical variables, fills missing values with the string `'<unknown>'` and then one-hot encodes them\n",
    "- for numerical variables, fills missing values with the median and adds indicator columns that flag the imputed values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## A scikit-learn pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We first build the preprocessing pipeline with native scikit-learn transformers.\n",
    "This takes only a few lines of code; however, it does not allow us to keep track of feature names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:39.055313Z",
     "start_time": "2019-07-30T18:53:39.047314Z"
    },
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "preprocessing_numeric = SimpleImputer(strategy=\"median\", add_indicator=True)\n",
    "\n",
    "preprocessing_categorical = Pipeline(\n",
    "    steps=[\n",
    "        ('imputer', SimpleImputer(missing_values=None, strategy='constant', fill_value='<unknown>')),\n",
    "        ('one-hot', OneHotEncoder(sparse=False, handle_unknown=\"ignore\"))\n",
    "    ]\n",
    ")\n",
    "\n",
    "preprocessing = ColumnTransformer(\n",
    "    transformers=[\n",
    "        ('num', preprocessing_numeric, make_column_selector(dtype_include=np.number)),\n",
    "        ('cat', preprocessing_categorical, make_column_selector(dtype_exclude=np.number)),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:39.134309Z",
     "start_time": "2019-07-30T18:53:39.082355Z"
    },
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[6.500e+01, 8.450e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00],\n",
       "       [8.000e+01, 9.600e+03, 6.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00],\n",
       "       [6.800e+01, 1.125e+04, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00],\n",
       "       ...,\n",
       "       [6.600e+01, 9.042e+03, 7.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00],\n",
       "       [6.800e+01, 9.717e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00],\n",
       "       [7.500e+01, 9.937e+03, 5.000e+00, ..., 0.000e+00, 1.000e+00,\n",
       "        0.000e+00]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessing.fit_transform(X=housing_features_df, y=housing_target_sr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    },
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The strength of `sklearndf` is that it maintains scikit-learn's conventions and expressiveness while preserving data frames, and hence keeps track of the feature names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## A sklearndf pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The convention in `sklearndf` is to append `DF` to the name of each corresponding scikit-learn class.\n",
    "For instance, to reproduce the above example, we write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:39.153308Z",
     "start_time": "2019-07-30T18:53:39.145308Z"
    },
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [],
   "source": [
    "preprocessing_numeric_df = SimpleImputerDF(strategy=\"median\", add_indicator=True)\n",
    "\n",
    "preprocessing_categorical_df = PipelineDF(\n",
    "    steps=[\n",
    "        ('imputer', SimpleImputerDF(missing_values=None, strategy='constant', fill_value='<unknown>')),\n",
    "        ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown=\"ignore\"))\n",
    "    ]\n",
    ")\n",
    "\n",
    "preprocessing_df = ColumnTransformerDF(\n",
    "    transformers=[\n",
    "        ('num', preprocessing_numeric_df, make_column_selector(dtype_include=np.number)),\n",
    "        ('cat', preprocessing_categorical_df, make_column_selector(dtype_exclude=np.number)),\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:53:39.249358Z",
     "start_time": "2019-07-30T18:53:39.180306Z"
    },
    "pycharm": {
     "is_executing": false
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_out</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>num__LotFrontage</th>\n",
       "      <td>65.0</td>\n",
       "      <td>80.0</td>\n",
       "      <td>68.0</td>\n",
       "      <td>60.0</td>\n",
       "      <td>84.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__LotArea</th>\n",
       "      <td>8450.0</td>\n",
       "      <td>9600.0</td>\n",
       "      <td>11250.0</td>\n",
       "      <td>9550.0</td>\n",
       "      <td>14260.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__OverallQual</th>\n",
       "      <td>7.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>7.0</td>\n",
       "      <td>8.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__OverallCond</th>\n",
       "      <td>5.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__YearBuilt</th>\n",
       "      <td>2003.0</td>\n",
       "      <td>1976.0</td>\n",
       "      <td>2001.0</td>\n",
       "      <td>1915.0</td>\n",
       "      <td>2000.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_AdjLand</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Alloca</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Family</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Normal</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Partial</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>303 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 0       1        2       3        4\n",
       "feature_out                                                         \n",
       "num__LotFrontage              65.0    80.0     68.0    60.0     84.0\n",
       "num__LotArea                8450.0  9600.0  11250.0  9550.0  14260.0\n",
       "num__OverallQual               7.0     6.0      7.0     7.0      8.0\n",
       "num__OverallCond               5.0     8.0      5.0     5.0      5.0\n",
       "num__YearBuilt              2003.0  1976.0   2001.0  1915.0   2000.0\n",
       "...                            ...     ...      ...     ...      ...\n",
       "cat__SaleCondition_AdjLand     0.0     0.0      0.0     0.0      0.0\n",
       "cat__SaleCondition_Alloca      0.0     0.0      0.0     0.0      0.0\n",
       "cat__SaleCondition_Family      0.0     0.0      0.0     0.0      0.0\n",
       "cat__SaleCondition_Normal      1.0     1.0      1.0     0.0      1.0\n",
       "cat__SaleCondition_Partial     0.0     0.0      0.0     0.0      0.0\n",
       "\n",
       "[303 rows x 5 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "transformed_df = preprocessing_df.fit_transform(X=housing_features_df, y=housing_target_sr)\n",
    "transformed_df.head().T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "The `feature_names_original_` attribute returns a series, mapping the output columns (the series' index) to the input columns (the series' values):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:35.436833Z",
     "start_time": "2019-07-30T18:54:35.431830Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_in</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_out</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>num__LotFrontage</th>\n",
       "      <td>LotFrontage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__LotArea</th>\n",
       "      <td>LotArea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__OverallQual</th>\n",
       "      <td>OverallQual</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__OverallCond</th>\n",
       "      <td>OverallCond</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__YearBuilt</th>\n",
       "      <td>YearBuilt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_AdjLand</th>\n",
       "      <td>SaleCondition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Alloca</th>\n",
       "      <td>SaleCondition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Family</th>\n",
       "      <td>SaleCondition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Normal</th>\n",
       "      <td>SaleCondition</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__SaleCondition_Partial</th>\n",
       "      <td>SaleCondition</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>303 rows × 1 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                               feature_in\n",
       "feature_out                              \n",
       "num__LotFrontage              LotFrontage\n",
       "num__LotArea                      LotArea\n",
       "num__OverallQual              OverallQual\n",
       "num__OverallCond              OverallCond\n",
       "num__YearBuilt                  YearBuilt\n",
       "...                                   ...\n",
       "cat__SaleCondition_AdjLand  SaleCondition\n",
       "cat__SaleCondition_Alloca   SaleCondition\n",
       "cat__SaleCondition_Family   SaleCondition\n",
       "cat__SaleCondition_Normal   SaleCondition\n",
       "cat__SaleCondition_Partial  SaleCondition\n",
       "\n",
       "[303 rows x 1 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preprocessing_df.feature_names_original_.to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "You can therefore easily select all output features generated from a given input feature:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:47.270660Z",
     "start_time": "2019-07-30T18:54:47.254638Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_out</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_2Types</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_&lt;unknown&gt;</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_Attchd</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_Basment</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_BuiltIn</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_CarPort</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__GarageType_Detchd</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             0    1    2    3    4\n",
       "feature_out                                       \n",
       "cat__GarageType_2Types     0.0  0.0  0.0  0.0  0.0\n",
       "cat__GarageType_<unknown>  0.0  0.0  0.0  0.0  0.0\n",
       "cat__GarageType_Attchd     1.0  1.0  1.0  0.0  1.0\n",
       "cat__GarageType_Basment    0.0  0.0  0.0  0.0  0.0\n",
       "cat__GarageType_BuiltIn    0.0  0.0  0.0  0.0  0.0\n",
       "cat__GarageType_CarPort    0.0  0.0  0.0  0.0  0.0\n",
       "cat__GarageType_Detchd     0.0  0.0  0.0  1.0  0.0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "garage_type_derivatives = preprocessing_df.feature_names_original_ == \"GarageType\"\n",
    "\n",
    "transformed_df.loc[:, garage_type_derivatives].head().T"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Supervised learners"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-26T08:36:43.003042Z",
     "start_time": "2019-07-26T08:36:42.733047Z"
    },
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Regressors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "As with transformers, scikit-learn regressors and classifiers have a `sklearndf` sibling obtained by appending `DF` to the class name, and the API remains the same. The result of any predict or decision function is returned as a `pandas` series (single output) or data frame (class probabilities or multi-output).\n",
    "\n",
    "To demonstrate this with a random forest regressor, we first create a simplified feature set from the numerical features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.180964Z",
     "start_time": "2019-07-30T18:54:48.757981Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# a simplified features vector (we will use a pipeline for more sophisticated pre-processing further down)\n",
    "numerical_features_df = housing_features_df.select_dtypes(np.number).fillna(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.180964Z",
     "start_time": "2019-07-30T18:54:48.757981Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    "For demonstration purposes, we create a simple train-test split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.180964Z",
     "start_time": "2019-07-30T18:54:48.757981Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "df_numerical_train, df_numerical_test, y_train, y_test = train_test_split(\n",
    "    numerical_features_df,\n",
    "    housing_target_sr,\n",
    "    random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.180964Z",
     "start_time": "2019-07-30T18:54:48.757981Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    "Then we create, fit and score a DF-augmented random forest regressor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.180964Z",
     "start_time": "2019-07-30T18:54:48.757981Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8638857401761126"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df = RandomForestRegressorDF(\n",
    "    n_estimators=100,\n",
    "    max_depth=5,\n",
    "    random_state=42,\n",
    "    n_jobs=-3\n",
    ")\n",
    "\n",
    "random_forest_regressor_df.fit(X=df_numerical_train, y=y_train)\n",
    "random_forest_regressor_df.score(X=df_numerical_test, y=y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The DF-augmented regressor returns predictions as a series (we convert the series to a data frame with `.to_frame()` for better rendering in the Jupyter notebook):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": [],
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prediction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>892</th>\n",
       "      <td>138678.817934</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>305008.808241</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>133420.810780</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>522</th>\n",
       "      <td>171533.659061</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1036</th>\n",
       "      <td>307214.384636</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         prediction\n",
       "892   138678.817934\n",
       "1105  305008.808241\n",
       "413   133420.810780\n",
       "522   171533.659061\n",
       "1036  307214.384636"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.predict(df_numerical_test.head()).to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.188965Z",
     "start_time": "2019-07-30T18:54:49.182965Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    "All methods and attributes of the native random forest regressor are also available in the DF-augmented regressor, e.g., `.get_params()` and `.set_params()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.188965Z",
     "start_time": "2019-07-30T18:54:49.182965Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'bootstrap': True,\n",
       " 'ccp_alpha': 0.0,\n",
       " 'criterion': 'squared_error',\n",
       " 'max_depth': 5,\n",
       " 'max_features': 'auto',\n",
       " 'max_leaf_nodes': None,\n",
       " 'max_samples': None,\n",
       " 'min_impurity_decrease': 0.0,\n",
       " 'min_samples_leaf': 1,\n",
       " 'min_samples_split': 2,\n",
       " 'min_weight_fraction_leaf': 0.0,\n",
       " 'n_estimators': 100,\n",
       " 'n_jobs': -3,\n",
       " 'oob_score': False,\n",
       " 'random_state': 42,\n",
       " 'verbose': 0,\n",
       " 'warm_start': False}"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.get_params()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.443565Z",
     "start_time": "2019-07-30T18:54:49.431562Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestRegressorDF(max_depth=7, n_jobs=-3, random_state=42)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.set_params(max_depth=7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The underlying scikit-learn regressor is available through the `native_estimator` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:49.895830Z",
     "start_time": "2019-07-30T18:54:49.889831Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestRegressor(max_depth=7, n_jobs=-3, random_state=42)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.native_estimator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Property `is_fitted` tells if the regressor is fitted, and – for fitted estimators – property `feature_names_in_` returns the names of the ingoing features as a pandas index:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:50.523346Z",
     "start_time": "2019-07-30T18:54:50.518398Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.is_fitted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:50.349489Z",
     "start_time": "2019-07-30T18:54:50.345466Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',\n",
       "       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',\n",
       "       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',\n",
       "       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',\n",
       "       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',\n",
       "       'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',\n",
       "       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea'],\n",
       "      dtype='object', name='feature_in')"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_regressor_df.feature_names_in_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Classifiers\n",
    "\n",
    "Classifiers follow a similar logic."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "To get started, we create for house prices house below 100k, below 200k, and above 200k for multi-label classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:50.833368Z",
     "start_time": "2019-07-30T18:54:50.826362Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "y_classes = housing_target_sr.apply(\n",
    "    lambda x:\n",
    "    \">=200k\" if x >= 200000\n",
    "    else \">=100k\" if x >= 100000\n",
    "    else \"<100k\"\n",
    ").astype(\"category\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We create a train-test split:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:50.833368Z",
     "start_time": "2019-07-30T18:54:50.826362Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "df_numerical_train, df_numerical_test, y_classification_train, y_classification_test = train_test_split(\n",
    "    numerical_features_df,\n",
    "    y_classes,\n",
    "    random_state=42\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:50.833368Z",
     "start_time": "2019-07-30T18:54:50.826362Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%% md\n"
    }
   },
   "source": [
    " We create, fit and score a DF-augmented classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:51.432224Z",
     "start_time": "2019-07-30T18:54:51.341154Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8767123287671232"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_classifier_df = RandomForestClassifierDF(\n",
    "    n_estimators=100,\n",
    "    max_depth=5,\n",
    "    random_state=42,\n",
    "    n_jobs=-3\n",
    ")\n",
    "random_forest_classifier_df.fit(df_numerical_train, y_classification_train)\n",
    "random_forest_classifier_df.score(df_numerical_test, y_classification_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The DF-augmented classifier returns predicted class labels as a series (once more, we invoke `.to_frame()` for better rendering in Jupyter):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prediction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>892</th>\n",
       "      <td>&gt;=100k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>&gt;=200k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>&gt;=100k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>522</th>\n",
       "      <td>&gt;=100k</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1036</th>\n",
       "      <td>&gt;=200k</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     prediction\n",
       "892      >=100k\n",
       "1105     >=200k\n",
       "413      >=100k\n",
       "522      >=100k\n",
       "1036     >=200k"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_classifier_df.predict(df_numerical_test.head()).to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Class probabilities are returned as a data frame, using class labels as the column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>&lt;100k</th>\n",
       "      <th>&gt;=100k</th>\n",
       "      <th>&gt;=200k</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>892</th>\n",
       "      <td>0.056555</td>\n",
       "      <td>0.905298</td>\n",
       "      <td>0.038148</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>0.001141</td>\n",
       "      <td>0.091114</td>\n",
       "      <td>0.907745</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>0.104472</td>\n",
       "      <td>0.860927</td>\n",
       "      <td>0.034602</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>522</th>\n",
       "      <td>0.047416</td>\n",
       "      <td>0.814917</td>\n",
       "      <td>0.137667</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1036</th>\n",
       "      <td>0.000812</td>\n",
       "      <td>0.081764</td>\n",
       "      <td>0.917424</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         <100k    >=100k    >=200k\n",
       "892   0.056555  0.905298  0.038148\n",
       "1105  0.001141  0.091114  0.907745\n",
       "413   0.104472  0.860927  0.034602\n",
       "522   0.047416  0.814917  0.137667\n",
       "1036  0.000812  0.081764  0.917424"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_classifier_df.predict_proba(df_numerical_test.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>&lt;100k</th>\n",
       "      <th>&gt;=100k</th>\n",
       "      <th>&gt;=200k</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>892</th>\n",
       "      <td>-2.872548</td>\n",
       "      <td>-0.099491</td>\n",
       "      <td>-3.266291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>-6.775823</td>\n",
       "      <td>-2.395643</td>\n",
       "      <td>-0.096792</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>-2.258840</td>\n",
       "      <td>-0.149746</td>\n",
       "      <td>-3.363852</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>522</th>\n",
       "      <td>-3.048800</td>\n",
       "      <td>-0.204669</td>\n",
       "      <td>-1.982916</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1036</th>\n",
       "      <td>-7.116299</td>\n",
       "      <td>-2.503917</td>\n",
       "      <td>-0.086185</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         <100k    >=100k    >=200k\n",
       "892  -2.872548 -0.099491 -3.266291\n",
       "1105 -6.775823 -2.395643 -0.096792\n",
       "413  -2.258840 -0.149746 -3.363852\n",
       "522  -3.048800 -0.204669 -1.982916\n",
       "1036 -7.116299 -2.503917 -0.086185"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "random_forest_classifier_df.predict_log_proba(df_numerical_test.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-26T08:37:37.633663Z",
     "start_time": "2019-07-26T08:37:37.618669Z"
    },
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We can combine the above steps to build a full predictive pipeline. `sklearndf` provides two useful, specialised pipeline objects for this, `RegressorPipelineDF` and `ClassifierPipelineDF`. Both implement a special two-step pipeline with one preprocessing step and one prediction step, while staying compatible with the general sklearn pipeline idiom."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:51.520150Z",
     "start_time": "2019-07-30T18:54:51.513148Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "pipeline_df = RegressorPipelineDF(\n",
    "    preprocessing=preprocessing_df,\n",
    "    regressor=RandomForestRegressorDF(\n",
    "        n_estimators=1000,\n",
    "        max_features=2/3,\n",
    "        max_depth=7,\n",
    "        random_state=42,\n",
    "        n_jobs=-3\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:51.874135Z",
     "start_time": "2019-07-30T18:54:51.522150Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8883657928153087"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train, df_test, y_train, y_test = train_test_split(housing_features_df, housing_target_sr, random_state=42)\n",
    "pipeline_df.fit(df_train, y_train)\n",
    "pipeline_df.score(df_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Extras"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "`sklearndf` also provides some additional estimators developed by Gamma or third parties, which are useful additions to the scikit-learn repertoire, and which follow the scikit-learn idiom. These are provided in the `.extra` modules:\n",
    "\n",
    "- `sklearndf.regression.extra.LGBMRegressorDF`\n",
    "- `sklearndf.regression.extra.XGBRegressorDF`\n",
    "- `sklearndf.classification.extra.LGBMClassifierDF`\n",
    "- `sklearndf.classification.extra.XGBClassifierDF`\n",
    "- `sklearndf.transformation.extra.BorutaDF`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "In the interest of lean package dependencies, `sklearndf` does not require packages `lightgbm`, `xgboost`, or `boruta_py` to be installed by default, but will not provide any of these DF classes if their associated third-party package is not installed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## LightGBM regressor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-07-30T18:54:52.046129Z",
     "start_time": "2019-07-30T18:54:51.876135Z"
    },
    "pycharm": {
     "is_executing": false,
     "name": "#%%\n"
    },
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>prediction</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>892</th>\n",
       "      <td>139447.461334</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1105</th>\n",
       "      <td>288031.725712</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>413</th>\n",
       "      <td>124917.506705</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>522</th>\n",
       "      <td>167320.406141</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1036</th>\n",
       "      <td>315868.109901</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         prediction\n",
       "892   139447.461334\n",
       "1105  288031.725712\n",
       "413   124917.506705\n",
       "522   167320.406141\n",
       "1036  315868.109901"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lgbm_df = LGBMRegressorDF(n_estimators=100, max_depth=8)\n",
    "lgbm_df.fit(df_numerical_train, y_train)\n",
    "lgbm_df.predict(df_numerical_test.head()).to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Boruta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "[Boruta](https://www.jstatsoft.org/article/view/v036i11) is a smart feature selection method to eliminate all features whose predictive power is not better than random noise.\n",
    "\n",
    "The `sklearndf.transformation.extra.BorutaDF` transformer provides easy access to this powerful method. The basis of this is a tree-based learner, usually a random forest.\n",
    "\n",
    "For the random forest, we rely on default parameters but set the maximum tree depth to 5 (for Boruta, setting a depth between 3 and 7 is highly recommended and depends on the number of features and expected complexity of the feature/target interactions). The number of trees is automatically managed by the Boruta feature selector (argument ``n_estimators=\"auto\"``).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "boruta_pipeline = PipelineDF(\n",
    "    steps=[\n",
    "        ('preprocess', preprocessing_df),\n",
    "        ('boruta', BorutaDF(\n",
    "            estimator=RandomForestRegressorDF(max_depth=5, n_jobs=-3), \n",
    "            n_estimators=\"auto\", \n",
    "            random_state=42,\n",
    "            verbose=1\n",
    "        )),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "scrolled": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iteration: 1 / 100\n",
      "Iteration: 2 / 100\n",
      "Iteration: 3 / 100\n",
      "Iteration: 4 / 100\n",
      "Iteration: 5 / 100\n",
      "Iteration: 6 / 100\n",
      "Iteration: 7 / 100\n",
      "Iteration: 8 / 100\n",
      "Iteration: 9 / 100\n",
      "Iteration: 10 / 100\n",
      "Iteration: 11 / 100\n",
      "Iteration: 12 / 100\n",
      "Iteration: 13 / 100\n",
      "Iteration: 14 / 100\n",
      "Iteration: 15 / 100\n",
      "Iteration: 16 / 100\n",
      "Iteration: 17 / 100\n",
      "Iteration: 18 / 100\n",
      "Iteration: 19 / 100\n",
      "Iteration: 20 / 100\n",
      "Iteration: 21 / 100\n",
      "Iteration: 22 / 100\n",
      "Iteration: 23 / 100\n",
      "Iteration: 24 / 100\n",
      "Iteration: 25 / 100\n",
      "Iteration: 26 / 100\n",
      "Iteration: 27 / 100\n",
      "Iteration: 28 / 100\n",
      "Iteration: 29 / 100\n",
      "Iteration: 30 / 100\n",
      "Iteration: 31 / 100\n",
      "Iteration: 32 / 100\n",
      "Iteration: 33 / 100\n",
      "Iteration: 34 / 100\n",
      "Iteration: 35 / 100\n",
      "Iteration: 36 / 100\n",
      "Iteration: 37 / 100\n",
      "Iteration: 38 / 100\n",
      "Iteration: 39 / 100\n",
      "Iteration: 40 / 100\n",
      "Iteration: 41 / 100\n",
      "Iteration: 42 / 100\n",
      "Iteration: 43 / 100\n",
      "Iteration: 44 / 100\n",
      "Iteration: 45 / 100\n",
      "Iteration: 46 / 100\n",
      "Iteration: 47 / 100\n",
      "Iteration: 48 / 100\n",
      "Iteration: 49 / 100\n",
      "Iteration: 50 / 100\n",
      "Iteration: 51 / 100\n",
      "Iteration: 52 / 100\n",
      "Iteration: 53 / 100\n",
      "Iteration: 54 / 100\n",
      "Iteration: 55 / 100\n",
      "Iteration: 56 / 100\n",
      "Iteration: 57 / 100\n",
      "Iteration: 58 / 100\n",
      "Iteration: 59 / 100\n",
      "Iteration: 60 / 100\n",
      "Iteration: 61 / 100\n",
      "Iteration: 62 / 100\n",
      "Iteration: 63 / 100\n",
      "Iteration: 64 / 100\n",
      "Iteration: 65 / 100\n",
      "Iteration: 66 / 100\n",
      "Iteration: 67 / 100\n",
      "Iteration: 68 / 100\n",
      "Iteration: 69 / 100\n",
      "Iteration: 70 / 100\n",
      "Iteration: 71 / 100\n",
      "Iteration: 72 / 100\n",
      "Iteration: 73 / 100\n",
      "Iteration: 74 / 100\n",
      "Iteration: 75 / 100\n",
      "Iteration: 76 / 100\n",
      "Iteration: 77 / 100\n",
      "Iteration: 78 / 100\n",
      "Iteration: 79 / 100\n",
      "Iteration: 80 / 100\n",
      "Iteration: 81 / 100\n",
      "Iteration: 82 / 100\n",
      "Iteration: 83 / 100\n",
      "Iteration: 84 / 100\n",
      "Iteration: 85 / 100\n",
      "Iteration: 86 / 100\n",
      "Iteration: 87 / 100\n",
      "Iteration: 88 / 100\n",
      "Iteration: 89 / 100\n",
      "Iteration: 90 / 100\n",
      "Iteration: 91 / 100\n",
      "Iteration: 92 / 100\n",
      "Iteration: 93 / 100\n",
      "Iteration: 94 / 100\n",
      "Iteration: 95 / 100\n",
      "Iteration: 96 / 100\n",
      "Iteration: 97 / 100\n",
      "Iteration: 98 / 100\n",
      "Iteration: 99 / 100\n",
      "\n",
      "\n",
      "BorutaPy finished running.\n",
      "\n",
      "Iteration: \t100 / 100\n",
      "Confirmed: \t16\n",
      "Tentative: \t0\n",
      "Rejected: \t286\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "PipelineDF(steps=[('preprocess', ColumnTransformerDF(transformers=[('num', SimpleImputerDF(add_indicator=True, strategy='median'), <sklearn.compose._column_transformer.make_column_selector object at 0x13958f2b0>), ('cat', PipelineDF(steps=[('imputer', SimpleImputerDF(fill_value='<unknown>', missing_values=None, strategy='constant')), ('one-hot', OneHotEncoderDF(handle_unknown='ignore', sparse=False))]), <sklearn.compose._column_transformer.make_column_selector object at 0x13958fd60>)])), ('boruta', BorutaDF(estimator=RandomForestRegressor(max_depth=5, n_estimators=116, n_jobs=-3,\n",
       "                      random_state=RandomState(MT19937) at 0x13B34BD40), n_estimators='auto', random_state=RandomState(MT19937) at 0x13B34BD40, verbose=1))])"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boruta_pipeline.fit(X=housing_features_df, y=housing_target_sr)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Boruta is implemented as a *scikit-learn* transformer; its output features are all features that passed the Boruta test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['num__LotFrontage',\n",
       " 'num__LotArea',\n",
       " 'num__OverallQual',\n",
       " 'num__YearBuilt',\n",
       " 'num__YearRemodAdd',\n",
       " 'num__MasVnrArea',\n",
       " 'num__BsmtFinSF1',\n",
       " 'num__TotalBsmtSF',\n",
       " 'num__1stFlrSF',\n",
       " 'num__2ndFlrSF',\n",
       " 'num__GrLivArea',\n",
       " 'num__FullBath',\n",
       " 'num__TotRmsAbvGrd',\n",
       " 'num__GarageCars',\n",
       " 'num__GarageArea',\n",
       " 'cat__BsmtQual_Ex']"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boruta_pipeline.feature_names_out_.to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "raw_mimetype": "text/restructuredtext",
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "`sklearndf` allows us to trace outgoing features back to the original features from which they were derived, using the `sklearndf.TransformerDF.features_original_` property. This is useful here as we want to know which features to eliminate before putting them into the pipeline.\n",
    "\n",
    "In our example, feature `BsmtQual_Ex` is a derivative of feature `BsmtQual`, obtained through one-hot encoding: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_in</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>feature_out</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>num__LotFrontage</th>\n",
       "      <td>LotFrontage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__LotArea</th>\n",
       "      <td>LotArea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__OverallQual</th>\n",
       "      <td>OverallQual</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__YearBuilt</th>\n",
       "      <td>YearBuilt</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__YearRemodAdd</th>\n",
       "      <td>YearRemodAdd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__MasVnrArea</th>\n",
       "      <td>MasVnrArea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__BsmtFinSF1</th>\n",
       "      <td>BsmtFinSF1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__TotalBsmtSF</th>\n",
       "      <td>TotalBsmtSF</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__1stFlrSF</th>\n",
       "      <td>1stFlrSF</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__2ndFlrSF</th>\n",
       "      <td>2ndFlrSF</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__GrLivArea</th>\n",
       "      <td>GrLivArea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__FullBath</th>\n",
       "      <td>FullBath</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__TotRmsAbvGrd</th>\n",
       "      <td>TotRmsAbvGrd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__GarageCars</th>\n",
       "      <td>GarageCars</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>num__GarageArea</th>\n",
       "      <td>GarageArea</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>cat__BsmtQual_Ex</th>\n",
       "      <td>BsmtQual</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     feature_in\n",
       "feature_out                    \n",
       "num__LotFrontage    LotFrontage\n",
       "num__LotArea            LotArea\n",
       "num__OverallQual    OverallQual\n",
       "num__YearBuilt        YearBuilt\n",
       "num__YearRemodAdd  YearRemodAdd\n",
       "num__MasVnrArea      MasVnrArea\n",
       "num__BsmtFinSF1      BsmtFinSF1\n",
       "num__TotalBsmtSF    TotalBsmtSF\n",
       "num__1stFlrSF          1stFlrSF\n",
       "num__2ndFlrSF          2ndFlrSF\n",
       "num__GrLivArea        GrLivArea\n",
       "num__FullBath          FullBath\n",
       "num__TotRmsAbvGrd  TotRmsAbvGrd\n",
       "num__GarageCars      GarageCars\n",
       "num__GarageArea      GarageArea\n",
       "cat__BsmtQual_Ex       BsmtQual"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "boruta_pipeline.feature_names_original_.to_frame()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "So, to obtain all features we want to select from the original data set, we can determine the unique relevant ingoing features from the original feature mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['LotFrontage',\n",
       " 'LotArea',\n",
       " 'OverallQual',\n",
       " 'YearBuilt',\n",
       " 'YearRemodAdd',\n",
       " 'MasVnrArea',\n",
       " 'BsmtFinSF1',\n",
       " 'TotalBsmtSF',\n",
       " '1stFlrSF',\n",
       " '2ndFlrSF',\n",
       " 'GrLivArea',\n",
       " 'FullBath',\n",
       " 'TotRmsAbvGrd',\n",
       " 'GarageCars',\n",
       " 'GarageArea',\n",
       " 'BsmtQual']"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "features_relevant = boruta_pipeline.feature_names_original_.unique()\n",
    "features_relevant.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "When we look at model performance for the reduced set of relevant features with model performance for the full feature set, we notice that performance is almost the same despite a significantly lower feature count (bearing in mind that, for reasons of simplicity, we are validating based on a single test set; in a real-world example we'd want to use cross-validation for more accurate model scores)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R2=0.888 with model fitted on 299 features\n"
     ]
    }
   ],
   "source": [
    "pipeline_df.fit(\n",
    "    # fit the model using the full feature set\n",
    "    df_train, y_train\n",
    ")\n",
    "\n",
    "score_all_features = pipeline_df.score(\n",
    "    # score the model using the full feature set\n",
    "    df_test, y_test\n",
    ")\n",
    "\n",
    "print(f\"R2={score_all_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "R2=0.893 with model fitted on 22 features\n"
     ]
    }
   ],
   "source": [
    "pipeline_df.fit(\n",
    "    # fit the model using the relevant feature set\n",
    "    df_train.loc[:, features_relevant], y_train\n",
    ")\n",
    "\n",
    "score_relevant_features = pipeline_df.score(\n",
    "    # score the model using the relevant feature set\n",
    "    df_test.loc[:, features_relevant], y_test\n",
    ")\n",
    "\n",
    "print(f\"R2={score_relevant_features:.3f} with model fitted on {len(pipeline_df.final_estimator.feature_names_in_)} features\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "165px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}