{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Standard Scikit-learn Classification Summary with FACET\n", "\n", "FACET is composed of the following key components:\n", "\n", "- **Model Inspection**\n", "\n", " FACET introduces a new algorithm to quantify dependencies and interactions between features in ML models. This new tool for human-explainable AI adds a new, global perspective to the observation-level explanations provided by the popular [SHAP](https://shap.readthedocs.io/en/latest/) approach. To learn more about FACET's model inspection capabilities, see the getting started example below.\n", "\n", "\n", "- **Model Simulation**\n", "\n", " FACET's model simulation algorithms use ML models for *virtual experiments* to help identify scenarios that optimise predicted outcomes. To quantify the uncertainty in simulations, FACET utilises a range of bootstrapping algorithms including stationary and stratified bootstraps. For an example of FACET’s bootstrap simulations, see the getting started example below. \n", " \n", " \n", "- **Enhanced Machine Learning Workflow** \n", "\n", " FACET offers an efficient and transparent machine learning workflow, enhancing [scikit-learn](https://scikit-learn.org/stable/index.html)'s tried and tested pipelining paradigm with new capabilities for model selection, inspection, and simulation. FACET also introduces [sklearndf](https://github.com/BCG-X-Official/sklearndf), an augmented version of *scikit-learn* with enhanced support for *pandas* dataframes that ensures end-to-end traceability of features. \n", "\n", "***\n", "\n", "**Context**\n", "\n", "In this tutorial notebook we will first build a classifier for predicting customer churn using a well-known dataset from [Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn). 
Then, using the fitted classifier, we will demonstrate how to perform typical model performance summary tasks, including:\n", "- Using the final model, re-fitted on all available data, to obtain a confusion matrix, classification report and precision-recall curve\n", "- Using the models fitted to each CV fold to create a set of summary metrics and a ROC curve, both with an assessment of error based on the cross-validation\n", "\n", "***\n", "\n", "**Tutorial outline**\n", "\n", "1. [Required imports](#Required-imports)\n", "2. [Quick data preparation](#Quick-data-preparation)\n", "3. [Preprocessing and feature selection](#Preprocessing-and-feature-selection)\n", "4. [Learner selection with FACET](#Learner-selection-with-FACET)\n", "5. [Using the final fitted model](#Using-the-final-fitted-model)\n", "6. [Using the crossfit for the best model](#Using-the-crossfit-for-the-best-model)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "delete_for_interactive": true, "nbsphinx": "hidden" }, "outputs": [], "source": [ "# this cell's metadata contains\n", "# \"nbsphinx\": \"hidden\" so it is hidden by nbsphinx\n", "\n", "\n", "# ignore irrelevant warnings that would affect the output of this tutorial notebook\n", "\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\", category=UserWarning, message=r\".*Xcode_8\\.3\\.3\")\n", "warnings.filterwarnings(\"ignore\", message=r\".*`should_run_async` will not call `transform_cell`\")\n", "warnings.filterwarnings(\"ignore\", message=r\".*`np\\..*` is a deprecated alias\")\n", "warnings.filterwarnings(\"ignore\", message=r\"Importing display from IPython.core.display is deprecated.*\")\n", "\n", "\n", "# set global options for matplotlib\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "\n", "matplotlib.rcParams[\"figure.figsize\"] = (12.0, 6.0)\n", "matplotlib.rcParams[\"figure.dpi\"] = 96" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Required imports\n", "\n", "In order to run this notebook, we will 
import not only the FACET package, but also other packages useful to solve this task. Overall, we can break down the imports into three categories: \n", "\n", "1. Common packages (pandas, matplotlib, sklearn, etc.)\n", "2. Required FACET classes (e.g., selection)\n", "3. Other BCG GAMMA packages which simplify pipelining when using FACET (sklearndf, see on [GitHub](https://github.com/BCG-X-Official/sklearndf))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Common package imports**" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from numpy import interp\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.metrics import (\n", "    classification_report,\n", "    confusion_matrix,\n", "    roc_curve,\n", "    roc_auc_score,\n", "    auc,\n", "    accuracy_score,\n", "    f1_score,\n", "    precision_score,\n", "    recall_score,\n", "    precision_recall_curve,\n", "    ConfusionMatrixDisplay,\n", "    PrecisionRecallDisplay,\n", ")\n", "from sklearn.compose import make_column_selector\n", "from sklearn.model_selection import RepeatedKFold, GridSearchCV" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**FACET imports**" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from facet.data import Sample\n", "from facet.selection import LearnerSelector, MultiEstimatorParameterSpace, ParameterSpace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**sklearndf imports**\n", "\n", "Instead of using the \"regular\" scikit-learn package, we are going to use sklearndf (see on [GitHub](https://github.com/BCG-X-Official/sklearndf)). sklearndf is an open source library designed to address a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names. 
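The following minimal sketch illustrates the issue with plain scikit-learn under its default output configuration. The tiny data frame and its column names are made up for illustration and are not part of the churn dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy data frame with named columns and one missing value
df = pd.DataFrame({'tenure': [1.0, np.nan, 3.0], 'charges': [10.0, 20.0, 30.0]})

# with default settings, a plain scikit-learn transformer returns a numpy array
imputed = SimpleImputer(strategy='median').fit_transform(df)

# the values are imputed, but the column names are no longer attached
print(type(imputed).__name__)
print(hasattr(imputed, 'columns'))
```

The imputed values are still there, but any downstream step has to keep track of which column is which by position.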
sklearndf retains all the functionality available through scikit-learn plus the feature traceability and usability associated with pandas data frames. Additionally, the names of all your favourite scikit-learn classes are the same except for `DF` appended at the end. For example, the standard scikit-learn import:\n", "\n", "`from sklearn.pipeline import Pipeline`\n", "\n", "becomes:\n", "\n", "`from sklearndf.pipeline import PipelineDF`" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "from sklearndf.pipeline import PipelineDF, ClassifierPipelineDF\n", "from sklearndf.classification import RandomForestClassifierDF\n", "from sklearndf.transformation import (\n", "    ColumnTransformerDF,\n", "    OneHotEncoderDF,\n", "    SimpleImputerDF,\n", ")\n", "from sklearndf.transformation.extra import BorutaDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Quick data preparation\n", "\n", "We start by obtaining a dataset for analysis. In this case we use the well-known Telco Customer Churn dataset from [Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn).\n", "\n", "Briefly, the dataset contains one row for each of 7043 customers and includes information on those who left within the last month (i.e., `Churn` - our target of interest, n=1869), services signed up for, account information and demographics.\n", "\n", "As this dataset has been well described and analyzed, we apply the minimum number of steps necessary to prepare the data for this tutorial. These are as follows:\n", "\n", "- drop the `customerID` column\n", "- convert `TotalCharges` to numeric type\n", "- convert `SeniorCitizen` to string type\n", "- relabel and convert `Churn` to a 0/1 target\n", "\n", "Finally, we place the dataframe in a FACET `Sample` object for easier data management. This allows us to:\n", "\n", "- Quickly access the target vs. 
features\n", "- Pass our data into sklearndf pipelines\n", "- Pass information to other FACET functions" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "churn sample data loaded with 7043 observations\n" ] } ], "source": [ "# This Kaggle dataset has been analyzed numerous times, so we skip EDA\n", "# read in the data\n", "churn_df = pd.read_csv('KAGGLE-Telco-Customer-Churn.csv')\n", "\n", "# drop customer ID\n", "churn_df = churn_df.drop(columns=['customerID'])\n", "\n", "# TotalCharges needs to be float (known to have a few missing values)\n", "churn_df.TotalCharges = pd.to_numeric(churn_df.TotalCharges, errors='coerce')\n", "\n", "# To support the preprocessing pipeline we also convert SeniorCitizen to strings (object dtype),\n", "# so that the categorical column selector (dtype_include=object) picks it up;\n", "# only tenure, MonthlyCharges and TotalCharges are numeric\n", "churn_df.SeniorCitizen = churn_df.SeniorCitizen.astype(str)\n", "\n", "# Create a new 0/1 target where 1=churn\n", "churn_df.Churn = churn_df.Churn.map(dict(Yes=1, No=0))\n", "\n", "# create sample object\n", "churn_sample = Sample(\n", "    observations=churn_df,\n", "    feature_names=churn_df.drop(columns=[\"Churn\"]).columns,\n", "    target_name=\"Churn\",\n", ")\n", "\n", "print(f\"churn sample data loaded with {len(churn_sample)} observations\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gender',\n", " 'SeniorCitizen',\n", " 'Partner',\n", " 'Dependents',\n", " 'tenure',\n", " 'PhoneService',\n", " 'MultipleLines',\n", " 'InternetService',\n", " 'OnlineSecurity',\n", " 'OnlineBackup',\n", " 'DeviceProtection',\n", " 'TechSupport',\n", " 'StreamingTV',\n", " 'StreamingMovies',\n", " 'Contract',\n", " 'PaperlessBilling',\n", " 'PaymentMethod',\n", " 'MonthlyCharges',\n", " 'TotalCharges']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# get feature names\n", 
"churn_sample.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing and feature selection\n", "\n", "Our first step is to create a minimal preprocessing pipeline which, for this dataset, needs to address the following:\n", "\n", "- Simple imputation for missing values\n", "\n", "- One-hot encoding for categorical features\n", "\n", "We will use the sklearndf wrappers for scikit-learn classes, such as `SimpleImputerDF` in place of `SimpleImputer`, `OneHotEncoderDF` in place of `OneHotEncoder`, and so on." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# for categorical features we will use the mode as the imputation value and also one-hot encode\n", "preprocessing_categorical = PipelineDF(\n", "    steps=[\n", "        (\"imputer\", SimpleImputerDF(strategy=\"most_frequent\", fill_value=\"\")),\n", "        (\"one-hot\", OneHotEncoderDF(sparse=False, handle_unknown=\"ignore\")),\n", "    ]\n", ")\n", "\n", "# for numeric features we will impute using the median\n", "preprocessing_numerical = SimpleImputerDF(strategy=\"median\")\n", "\n", "# put the pipeline together\n", "preprocessing_features = ColumnTransformerDF(\n", "    transformers=[\n", "        (\n", "            \"categorical\",\n", "            preprocessing_categorical,\n", "            make_column_selector(dtype_include=object),\n", "        ),\n", "        (\n", "            \"numerical\",\n", "            preprocessing_numerical,\n", "            make_column_selector(dtype_include=np.number),\n", "        ),\n", "    ],\n", "    verbose_feature_names_out=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we perform some initial feature selection using Boruta, a feature selection approach that has been shown to perform well. The Boruta algorithm removes features that are no more predictive than random noise. 
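As a rough illustration of that idea, the sketch below runs a single round of the shadow-feature comparison with plain scikit-learn on synthetic data. It is a simplification under stated assumptions: the full Boruta algorithm repeats such rounds many times and applies a statistical test, and this is not how `BorutaDF` is implemented internally. All data and variable names here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 500

# synthetic data (hypothetical): columns 0 and 1 drive the target, columns 2 to 4 are pure noise
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# shadow features: every column is shuffled independently, which breaks any link to y
X_shadow = rng.permuted(X, axis=0)

# fit one forest on the real and shadow features together
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=42)
forest.fit(np.hstack([X, X_shadow]), y)

importances = forest.feature_importances_
real, shadow = importances[:5], importances[5:]

# in this round, a real feature scores a hit if it beats the best shadow feature
hits = real > shadow.max()
print(hits)
```

A noise column can occasionally score a hit in a single round, which is why the full algorithm accumulates hits over many rounds before deciding.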
For further details, please see this [article](https://www.jstatsoft.org/article/view/v036i11).\n", "\n", "The `BorutaDF` transformer in our sklearndf package provides easy access to this powerful method. The approach relies on a tree-based learner, usually a random forest. For settings, a `max_depth` of between 3 and 7 is typically recommended, and here we set it to 5. However, as this depends on the number of features and the complexity of interactions, one could also explore the sensitivity of feature selection to this parameter. The number of trees is automatically managed by the Boruta feature selector argument `n_estimators=\"auto\"`.\n", "\n", "We also parallelize the random forest using `n_jobs` to accelerate the Boruta iterations; `n_jobs=-3` uses all but two CPU cores." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['InternetService', 'OnlineSecurity', 'OnlineBackup',\n", " 'DeviceProtection', 'TechSupport', 'StreamingTV',\n", " 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',\n", " 'tenure', 'MonthlyCharges', 'TotalCharges'], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create the pipeline for Boruta\n", "boruta_feature_selection = PipelineDF(\n", "    steps=[\n", "        (\"preprocessing\", preprocessing_features),\n", "        (\n", "            \"boruta\",\n", "            BorutaDF(\n", "                estimator=RandomForestClassifierDF(\n", "                    max_depth=5, n_jobs=-3, random_state=42\n", "                ),\n", "                n_estimators=\"auto\",\n", "                random_state=42,\n", "                verbose=False,\n", "            ),\n", "        ),\n", "    ]\n", ")\n", "\n", "# run feature selection using Boruta and report those selected\n", "boruta_feature_selection.fit(X=churn_sample.features, y=churn_sample.target)\n", "selected = boruta_feature_selection.feature_names_original_.unique()\n", "selected" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Learner selection with FACET\n", "\n", "FACET 
implements several additional useful wrappers which simplify comparing and tuning models:\n", "\n", "- `ParameterSpace`: allows you to pass a learner pipeline (i.e., classifier + any preprocessing) and set the hyperparameters to search over.\n", "\n", "- `LearnerSelector`: takes a parameter space and a scikit-learn searcher type (here `GridSearchCV`), tunes the hyperparameters using cross-validation, and ranks the resulting models. Parameter spaces for several learners can be combined with `MultiEstimatorParameterSpace` to compare different types of learners in a single step.\n", "\n", "For the purpose of this tutorial we will assess a random forest classifier. The hyperparameter grid is evaluated using 5-fold cross-validation repeated 10 times and scored using AUC:\n", "\n", "Random forest: with hyperparameters\n", "\n", "- max_depth: [4, 7, 10]\n", "\n", "- n_estimators: [100, 200]\n", "\n", "In the summary report, models are ranked by their mean test score; the standard deviation across the CV splits is reported alongside, so that both the average performance and its variability can be considered when selecting a classifier.\n", "\n", "If you want a list of available hyperparameters you can use `classifier_name().get_params().keys()` where `classifier_name` could be for example `RandomForestClassifierDF`, and if you want to see the default values, just use `classifier_name().get_params()`." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th colspan=\"3\">score</th><th colspan=\"2\">param</th><th colspan=\"4\">time</th></tr>\n",
"<tr><th></th><th colspan=\"3\">test</th><th colspan=\"2\">classifier</th><th colspan=\"2\">fit</th><th colspan=\"2\">score</th></tr>\n",
"<tr><th></th><th>rank</th><th>mean</th><th>std</th><th>max_depth</th><th>n_estimators</th><th>mean</th><th>std</th><th>mean</th><th>std</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>3</th><td>1</td><td>0.846467</td><td>0.010198</td><td>7</td><td>200</td><td>0.428813</td><td>0.005649</td><td>0.030035</td><td>0.002274</td></tr>\n",
"<tr><th>2</th><td>2</td><td>0.846176</td><td>0.010207</td><td>7</td><td>100</td><td>0.225023</td><td>0.002987</td><td>0.018026</td><td>0.000395</td></tr>\n",
"<tr><th>1</th><td>3</td><td>0.842705</td><td>0.010570</td><td>4</td><td>200</td><td>0.330742</td><td>0.004208</td><td>0.022932</td><td>0.000563</td></tr>\n",
"<tr><th>0</th><td>4</td><td>0.842060</td><td>0.010702</td><td>4</td><td>100</td><td>0.176152</td><td>0.002248</td><td>0.014863</td><td>0.000459</td></tr>\n",
"<tr><th>5</th><td>5</td><td>0.841639</td><td>0.010050</td><td>10</td><td>200</td><td>0.548096</td><td>0.022182</td><td>0.038036</td><td>0.003584</td></tr>\n",
"<tr><th>4</th><td>6</td><td>0.840989</td><td>0.010235</td><td>10</td><td>100</td><td>0.288122</td><td>0.019839</td><td>0.023028</td><td>0.003089</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " score param time \\\n", " test classifier fit \n", " rank mean std max_depth n_estimators mean std \n", "3 1 0.846467 0.010198 7 200 0.428813 0.005649 \n", "2 2 0.846176 0.010207 7 100 0.225023 0.002987 \n", "1 3 0.842705 0.010570 4 200 0.330742 0.004208 \n", "0 4 0.842060 0.010702 4 100 0.176152 0.002248 \n", "5 5 0.841639 0.010050 10 200 0.548096 0.022182 \n", "4 6 0.840989 0.010235 10 100 0.288122 0.019839 \n", "\n", " \n", " score \n", " mean std \n", "3 0.030035 0.002274 \n", "2 0.018026 0.000395 \n", "1 0.022932 0.000563 \n", "0 0.014863 0.000459 \n", "5 0.038036 0.003584 \n", "4 0.023028 0.003089 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# reduce sample object to selected features\n", "churn_sample_kept_features = churn_sample.keep(feature_names=selected)\n", "\n", "# Classifier pipeline composed of the feature preprocessing steps created earlier and random forest learner\n", "rforest_clf = ClassifierPipelineDF(\n", " preprocessing=preprocessing_features,\n", " classifier=RandomForestClassifierDF(random_state=42),\n", ")\n", "\n", "# set space of hyper-parameters \n", "classifier_ps = ParameterSpace(rforest_clf)\n", "classifier_ps.classifier.max_depth = [4, 7, 10]\n", "classifier_ps.classifier.n_estimators = [100, 200]\n", "\n", "# run the learner selector\n", "clf_selector = LearnerSelector(\n", " searcher_type=GridSearchCV,\n", " parameter_space=classifier_ps,\n", " cv=RepeatedKFold(n_splits=5, n_repeats=10, random_state=42),\n", " n_jobs=-3,\n", " scoring=\"roc_auc\",\n", ").fit(churn_sample_kept_features)\n", "\n", "# look at results\n", "clf_selector.summary_report()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the final fitted model\n", "\n", "As part of the `clf_selector` we can access a final model (`best_estimator_`) that represents the selected best model but re-fit using all available training data. 
With this model we can then predict either the class or the probability (score) and generate standard scikit-learn classifier performance summaries such as a classification report, confusion matrix or precision-recall curve. Note that the summaries in this section are computed on the same data that was used to fit the final model, so they are optimistic compared with the cross-validated scores reported above." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# obtain required quantities\n", "y_pred = clf_selector.best_estimator_.predict(churn_sample_kept_features.features)\n", "# predict_proba returns a data frame with one column per class; select the churn class (1)\n", "y_prob = clf_selector.best_estimator_.predict_proba(churn_sample_kept_features.features)[1]\n", "y_true = churn_sample_kept_features.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification Report\n", "\n", "The classification report from scikit-learn is often used as a summary for classifiers, especially in the case of imbalanced datasets, as it provides precision, recall and the f1-score by class along with the support (number of observations for a class). For more information on the implementation in scikit-learn please see [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "              precision    recall  f1-score   support\n", "\n", "           0       0.84      0.92      0.88      5174\n", "           1       0.70      0.53      0.61      1869\n", "\n", "    accuracy                           0.82      7043\n", "   macro avg       0.77      0.73      0.74      7043\n", "weighted avg       0.81      0.82      0.81      7043\n", "\n" ] } ], "source": [ "print(classification_report(y_true, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion Matrix\n", "\n", "The confusion matrix can be used to evaluate the accuracy of the fitted classifier by comparing the predicted class to the observed class. 
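To make the layout of the matrix concrete, here is a small sketch on made-up labels (the `_toy` variables are hypothetical and unrelated to the churn data). In scikit-learn, rows correspond to the observed class and columns to the predicted class.

```python
from sklearn.metrics import confusion_matrix

# hypothetical toy labels: three observed negatives and two observed positives
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 1, 1, 0]

# rows are observed classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)

# for a binary problem, ravel() yields the four cells in this order
tn, fp, fn, tp = cm.ravel()

# accuracy is the share of observations on the diagonal
accuracy_toy = (tn + tp) / cm.sum()
print(cm)
print(accuracy_toy)
```
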
For more information on the implementation in scikit-learn please see [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "cf_matrix = confusion_matrix(y_true, y_pred)\n", "ConfusionMatrixDisplay(cf_matrix).plot()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Precision-recall curve\n", "\n", "The precision-recall curve helps in understanding the trade-off between precision (positive predictive value) and recall (sensitivity) as we vary the threshold that is applied to the predicted probability (or score) to determine the predicted class of an observation. For more information on the implementation in scikit-learn please see [sklearn.metrics.precision_recall_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html#sklearn.metrics.precision_recall_curve)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "prec, recall, _ = precision_recall_curve(y_true, y_prob, pos_label=1)\n", "PrecisionRecallDisplay(precision=prec, recall=recall).plot()" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Panel of metrics\n", "\n", "Below we demonstrate how to use the best estimator results to obtain a set of common classification metrics: Accuracy, F1, Precision, Recall and AUC. This approach can of course be adapted to any metric and any summary thereof.\n", "\n", "For more information about classifier metrics in scikit-learn please see [classification-metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Summary of model performance')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "metrics = []\n", "\n", "# calculate metrics\n", "metrics.append(pd.Series({\n", " 'Accuracy': accuracy_score(y_true, y_pred),\n", " 'F1': f1_score(y_true, y_pred),\n", " 'Precision': precision_score(y_true, y_pred),\n", " 'Recall': recall_score(y_true, y_pred),\n", " 'AUC': roc_auc_score(y_true, y_prob)})\n", ")\n", " \n", "# collect required summaries and plot\n", "metrics_df = pd.DataFrame(metrics)\n", "fig, ax = plt.subplots()\n", "ax.bar(\n", " metrics_df.columns,\n", " metrics_df.mean(),\n", " align='center',\n", " ecolor='lime',\n", " capsize=10,\n", " color='lightgrey'\n", ")\n", "ax.set_ylabel('Metric')\n", "ax.set_title('Summary of model performance')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "384px" }, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }