Classification with FACET: Prediabetes Study#
FACET is composed of the following key components:
Model Inspection
FACET introduces a new algorithm to quantify dependencies and interactions between features in ML models. This tool for human-explainable AI adds a global perspective to the observation-level explanations provided by the popular SHAP approach. To learn more about FACET’s model inspection capabilities, see the getting started example below.
Model Simulation
FACET’s model simulation algorithms use ML models for virtual experiments to help identify scenarios that optimise predicted outcomes. To quantify the uncertainty in simulations, FACET utilises a range of bootstrapping algorithms including stationary and stratified bootstraps. For an example of FACET’s bootstrap simulations, see the getting started example below.
Enhanced Machine Learning Workflow
FACET offers an efficient and transparent machine learning workflow, enhancing scikit-learn’s tried and tested pipelining paradigm with new capabilities for model selection, inspection, and simulation. FACET also introduces sklearndf, an augmented version of scikit-learn with enhanced support for pandas dataframes that ensures end-to-end traceability of features.
Context
Prediabetes is a treatable condition that leads to many health complications and eventually type 2 diabetes. Identification of individuals at risk of prediabetes can improve early intervention and provide insights into those interventions that work best. Using a cohort of healthy (n=2847) and prediabetic (n=1509) patients derived from the NHANES 2013-14 U.S. cross-sectional survey we aim to create a classifier for prediabetes. For further details on data sources, definitions and the study cohort please see the Appendix (Data source and study cohort).
Utilizing FACET, we will do the following:
create a pipeline to identify a well-performing classifier.
perform model inspection and simulation to gain understanding and insight into key factors predictive of prediabetes.
Tutorial outline
Required imports#
To run this notebook, we will import not only the FACET package but also other packages needed for this task. Overall, the imports fall into three categories:
Common packages (pandas, matplotlib, etc.)
Required FACET classes (inspection, selection, validation, simulation, etc.)
Other BCG GAMMA packages which simplify pipelining (sklearndf, see on GitHub) and support visualization (pytools, see on GitHub) when using FACET
Common package imports
[2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import shap
import seaborn as sns
from sklearn.compose import make_column_selector
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
FACET imports
[3]:
from facet.data import Sample
from facet.inspection import LearnerInspector
from facet.selection import LearnerSelector, ParameterSpace
from facet.validation import BootstrapCV
from facet.data.partition import ContinuousRangePartitioner
from facet.simulation import UnivariateProbabilitySimulator
from facet.simulation.viz import SimulationDrawer
sklearndf imports
Instead of using the “regular” scikit-learn package, we are going to use sklearndf (see on GitHub). sklearndf is an open source library designed to address a common issue with scikit-learn: the outputs of transformers are numpy arrays, even when the input is a data frame. However, to inspect a model it is essential to keep track of the feature names. sklearndf retains all the functionality available through scikit-learn plus the feature traceability and usability associated with pandas data frames. Additionally, the names of all your favourite scikit-learn classes are the same, just with DF appended. For example, the standard scikit-learn import:
from sklearn.pipeline import Pipeline
becomes:
from sklearndf.pipeline import PipelineDF
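As a minimal illustration of the difference (a standalone sketch with a small made-up data frame, not part of the tutorial pipeline), a sklearndf transformer returns a pandas data frame with the original column names preserved:
import pandas as pd
from sklearndf.transformation import SimpleImputerDF
# a tiny frame with one missing value
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, 6.0]})
# the DF transformer keeps the pandas structure and the column names
imputed = SimpleImputerDF(strategy="median").fit_transform(df)
print(type(imputed))          # a pandas data frame rather than a numpy array
print(list(imputed.columns))  # ['a', 'b']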
[4]:
from sklearndf.pipeline import PipelineDF, ClassifierPipelineDF
from sklearndf.classification import RandomForestClassifierDF
from sklearndf.classification.extra import LGBMClassifierDF
from sklearndf.transformation import (
ColumnTransformerDF,
OneHotEncoderDF,
SimpleImputerDF,
)
from sklearndf.transformation.extra import BorutaDF
pytools imports
pytools (see on GitHub) is an open source library containing general machine learning and visualization utilities, some of which are useful for visualising the advanced model inspection capabilities of FACET.
[5]:
from pytools.viz.dendrogram import DendrogramDrawer
from pytools.viz.matrix import MatrixDrawer
Preprocessing and initial feature selection#
First we need to load our prediabetes data and create a simple preprocessing pipeline. For those interested, some initial EDA can be found in the Appendix (Exploratory Data Analysis).
[6]:
# load the prepared data frame
prediab_df = pd.read_csv("pre_diab_nhanes.csv")
# create a couple of new interesting features
prediab_df["SBP_to_DBP"] = prediab_df["Average_SBP"] / prediab_df["Average_DBP"]
prediab_df["Waist_to_hgt"] = (
prediab_df["Waist_Circumference"] / prediab_df["Standing_Height"]
)
# make clear based on dtypes these two features are categorical
prediab_df["General_health"] = prediab_df["General_health"].astype("object")
prediab_df["Healthy_diet"] = prediab_df["Healthy_diet"].astype("object")
# have a look
prediab_df.head().T
[6]:
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Age | 73.0 | 56.0 | 61.0 | 56.0 | 65.0 |
Gender | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 |
Waist_Circumference | NaN | 123.1 | 110.8 | 85.5 | 93.7 |
Weight | 52.0 | 105.0 | 93.4 | 61.8 | 65.3 |
Standing_Height | 162.4 | 158.7 | 161.8 | 152.8 | 172.4 |
BMI | 19.7 | 41.7 | 35.7 | 26.5 | 22.0 |
Average_SBP | 137.333333 | 157.333333 | 122.666667 | 122.0 | 141.333333 |
Average_DBP | 86.666667 | 82.0 | 80.666667 | 72.666667 | 77.333333 |
HDL_Cholesterol | 85.0 | 38.0 | 58.0 | 59.0 | 79.0 |
Total_Cholesterol | 201.0 | 226.0 | 168.0 | 278.0 | 173.0 |
High_BP | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
Sleep_hours | 9.0 | 5.0 | 9.0 | 6.0 | 7.0 |
Trouble_sleeping | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Sleep_disorder | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
Told_overweight | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
General_health | 5.0 | 5.0 | 3.0 | 3.0 | 3.0 |
Family_hist_diab | 0 | 0 | 0 | 0 | 0 |
Feel_at_risk_diab | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
Vigorous_work_activity | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |
Moderate_work_activity | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Walk_or_bicycle | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Vigorous_rec_activity | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Moderate_rec_activity | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Tried_weight_loss_past_year | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
Healthy_diet | 2.0 | 5.0 | 4.0 | 2.0 | 3.0 |
WBC_count | 6.6 | 9.4 | 5.2 | 9.5 | 6.3 |
RBC_count | 4.72 | 4.93 | 4.66 | 4.43 | 4.35 |
Hematocrit | 43.8 | 41.5 | 39.8 | 41.4 | 43.1 |
Triglycerides | 88.0 | 327.0 | 68.0 | 262.0 | 39.0 |
Uric_acid | 4.2 | 9.1 | 5.1 | 3.5 | 6.3 |
Osmolality | 290.0 | 287.0 | 281.0 | 277.0 | 281.0 |
Sodium | 142.0 | 143.0 | 140.0 | 139.0 | 140.0 |
Potassium | 4.1 | 3.3 | 3.9 | 4.0 | 4.8 |
Gamma_glutamyl_transferase | 31.0 | 22.0 | 17.0 | 21.0 | 24.0 |
Calcium | 10.0 | 9.3 | 9.9 | 9.5 | 9.5 |
Alanine_aminotransferase | 28.0 | 16.0 | 21.0 | 24.0 | 20.0 |
Aspartate_aminotransferase | 36.0 | 24.0 | 20.0 | 23.0 | 29.0 |
Pre_diab | 1 | 0 | 1 | 0 | 0 |
SBP_to_DBP | 1.584615 | 1.918699 | 1.520661 | 1.678899 | 1.827586 |
Waist_to_hgt | NaN | 0.775677 | 0.684796 | 0.559555 | 0.543503 |
[7]:
# to ensure a quick run we will use a random sample of 1000 observations
prediab_df = prediab_df.sample(n=1000, random_state=42)
For easier data management we will create a sample object using FACET’s Sample class, which allows us to:
Quickly access the target vs. features
Pass our data into sklearndf pipelines
Pass information to other FACET functions
[8]:
# create a FACET sample object
prediab = Sample(
observations=prediab_df,
feature_names=prediab_df.drop(columns=["Pre_diab"]).columns,
target_name="Pre_diab",
)
Next we create a minimal preprocessing pipeline which, based on our initial EDA (Exploratory Data Analysis), needs to address the following:
Simple imputation for missing values in both continuous and categorical features
One-hot encoding for categorical features
We will use the sklearndf wrappers for scikit-learn classes, such as SimpleImputerDF in place of SimpleImputer, OneHotEncoderDF in place of OneHotEncoder, and so on.
[9]:
# for categorical features we will use the mode as the imputation value and also one-hot encode
preprocessing_categorical = PipelineDF(
steps=[
("imputer", SimpleImputerDF(strategy="most_frequent", fill_value="<na>")),
("one-hot", OneHotEncoderDF(sparse=False, handle_unknown="ignore")),
]
)
# for numeric features we will impute using the median
preprocessing_numerical = SimpleImputerDF(strategy="median")
# put the pipeline together
preprocessing_features = ColumnTransformerDF(
transformers=[
(
"categorical",
preprocessing_categorical,
make_column_selector(dtype_include=object),
),
(
"numerical",
preprocessing_numerical,
make_column_selector(dtype_include=np.number),
),
],
verbose_feature_names_out=False,
)
Next, we perform some initial feature selection using Boruta, a recent approach shown to have quite good performance. The Boruta algorithm removes features that are no more predictive than random noise. If you are interested further, please see this article.
The BorutaDF transformer in our sklearndf package provides easy access to this powerful method. The approach relies on a tree-based learner, usually a random forest. For settings, a max_depth of between 3 and 7 is typically recommended; here we use a setting of 5. However, as this depends on the number of features and the complexity of interactions, one could also explore the sensitivity of feature selection to this parameter. The number of trees is managed automatically by the Boruta feature selector via the argument n_estimators="auto". We also parallelize the random forest using n_jobs to accelerate the Boruta iterations.
[10]:
# create the pipeline for Boruta
boruta_feature_selection = PipelineDF(
steps=[
("preprocessing", preprocessing_features),
(
"boruta",
BorutaDF(
estimator=RandomForestClassifierDF(
max_depth=5, n_jobs=-3, random_state=42
),
n_estimators="auto",
random_state=42,
verbose=False,
),
),
]
)
# run feature selection using Boruta and report those selected
boruta_feature_selection.fit(X=prediab.features, y=prediab.target)
selected = boruta_feature_selection.feature_names_original_.unique()
selected
[10]:
array(['Age', 'Waist_Circumference', 'BMI', 'Average_SBP', 'RBC_count',
'Hematocrit', 'Uric_acid', 'Gamma_glutamyl_transferase',
'SBP_to_DBP', 'Waist_to_hgt'], dtype=object)
Boruta identified 10 features (out of a potential 47) that we will retain in our FACET sample object for classification. Note that this feature selection process could be included in a general preprocessing pipeline; however, due to the computation involved, we have used Boruta here as an initial one-off processing step to narrow down the features for our classifier development.
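If you did want to fold Boruta into the model pipeline itself (at the cost of re-running the selection within every cross-validation fold), a hypothetical sketch reusing the classes imported above might look like this:
# hypothetical: Boruta feature selection as a pipeline step rather than a one-off
clf_with_boruta = ClassifierPipelineDF(
    preprocessing=PipelineDF(
        steps=[
            ("preprocessing", preprocessing_features),
            (
                "boruta",
                BorutaDF(
                    estimator=RandomForestClassifierDF(max_depth=5, random_state=42),
                    n_estimators="auto",
                    random_state=42,
                ),
            ),
        ]
    ),
    classifier=RandomForestClassifierDF(random_state=42),
)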
[11]:
# update FACET sample object to only those features Boruta identified as useful
prediab_initial_features = prediab.keep(feature_names=selected)
Selecting a learner using FACET#
FACET implements several additional useful wrappers which further simplify comparing and tuning a larger number of models and configurations:
ParameterSpace: allows you to pass a learner pipeline (i.e., classifier + any preprocessing) and a set of hyperparameters
LearnerSelector: one or more ParameterSpaces can be passed into this class - this allows tuning hyperparameters across different types of learners in a single step and ranks the resulting models accordingly
The following learners and hyperparameter ranges will be assessed using 10 repeated 5-fold cross-validation:
Random forest: with hyperparameters
max_depth: [4..7]
min_samples_leaf: [8..19]; smaller ints are more frequent (zipfian distribution)
n_estimators: [21..400]; smaller ints are more frequent (zipfian distribution)
Light gradient boosting: with hyperparameters
max_depth: [4..7]
min_child_samples: [8..19]; smaller ints are more frequent (zipfian distribution)
n_estimators: [21..400]; smaller ints are more frequent (zipfian distribution)
Note that if you want to see a list of hyperparameter names, you can use <ClassifierClass>().get_params().keys(), where <ClassifierClass> could be, for example, RandomForestClassifierDF; to see the default values, just use <ClassifierClass>().get_params().
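For example, for the random forest classifier:
# list the tunable hyperparameter names for the random forest classifier
print(RandomForestClassifierDF().get_params().keys())
# inspect the corresponding default values
print(RandomForestClassifierDF().get_params())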
Finally, for this exercise we will use the area under the ROC curve (AUC) as the performance metric for scoring and ranking our classifiers.
First, we specify the classifiers we want to train using ClassifierPipelineDF from sklearndf. Note that here we also include the feature preprocessing steps we created earlier.
[12]:
# random forest learner
rforest_clf = ClassifierPipelineDF(
preprocessing=preprocessing_features,
classifier=RandomForestClassifierDF(random_state=42),
)
# light gradient boosting learner
lgbm_clf = ClassifierPipelineDF(
preprocessing=preprocessing_features,
classifier=LGBMClassifierDF(random_state=42)
)
Then we create a ParameterSpace for each classifier and specify the set of hyperparameters to search for each one. In contrast to the standard sklearn workflow, setting a wrong hyperparameter name in this approach raises an exception immediately, since assigning an attribute comes with a proper check.
[13]:
rforest_ps = ParameterSpace(rforest_clf)
# random ints 4 <= x <= 7
rforest_ps.classifier.max_depth = stats.randint(4, 8)
# random ints 8 <= x <= 19; smaller ints are more frequent (zipfian distribution)
rforest_ps.classifier.min_samples_leaf = stats.zipfian(a=1, n=12, loc=7)
# random ints 21 <= x <= 400; smaller ints are more frequent (zipfian distribution)
rforest_ps.classifier.n_estimators = stats.zipfian(a=1/2, n=380, loc=20)
lgbm_ps = ParameterSpace(lgbm_clf)
# random ints 4 <= x <= 7
lgbm_ps.classifier.max_depth = stats.randint(4, 8)
# random ints 8 <= x <= 19; smaller ints are more frequent (zipfian distribution)
lgbm_ps.classifier.min_child_samples = stats.zipfian(a=1, n=12, loc=7)
# random ints 21 <= x <= 400; smaller ints are more frequent (zipfian distribution)
lgbm_ps.classifier.n_estimators = stats.zipfian(a=1/2, n=380, loc=20)
lgbm_ps.classifier.subsample = stats.uniform(0.8, 0.2)
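To illustrate the validation mentioned above (a hypothetical check, catching the exception generically), assigning a misspelled hyperparameter name fails immediately rather than at fit time:
# hypothetical: a misspelled hyperparameter name is rejected on assignment
try:
    rforest_ps.classifier.max_depht = stats.randint(4, 8)  # note the typo
except Exception as error:
    print(error)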
We now fit a LearnerSelector using the parameter spaces defined above, running a random search with 5-fold cross-validation repeated 10 times on our selected set of features from Boruta (the LearnerSelector supports any searcher implementing scikit-learn's BaseSearchCV).
[14]:
clf_selector = LearnerSelector(
searcher_type=RandomizedSearchCV,
parameter_space=[rforest_ps, lgbm_ps],
cv=RepeatedKFold(n_splits=5, n_repeats=10, random_state=42),
n_jobs=-3,
scoring="roc_auc",
random_state=42,
).fit(prediab_initial_features)
We can see how each model scored using the summary_report() method of the LearnerSelector.
[15]:
# let's look at performance for the top ranked classifiers
clf_selector.summary_report()
[15]:
score | candidate | param | time | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
test | - | classifier | fit | score | |||||||||
rank | mean | std | - | max_depth | min_samples_leaf | n_estimators | min_child_samples | subsample | mean | std | mean | std | |
2 | 1 | 0.725129 | 0.033742 | RandomForestClassifierDF | 6 | 15 | 164 | NaN | NaN | 0.135929 | 0.002942 | 0.008452 | 0.000312 |
9 | 2 | 0.724997 | 0.032834 | RandomForestClassifierDF | 5 | 18 | 376 | NaN | NaN | 0.304241 | 0.017114 | 0.016190 | 0.000974 |
0 | 3 | 0.724842 | 0.032747 | RandomForestClassifierDF | 7 | 18 | 230 | NaN | NaN | 0.194192 | 0.011951 | 0.011218 | 0.000494 |
7 | 4 | 0.723124 | 0.033567 | RandomForestClassifierDF | 5 | 8 | 128 | NaN | NaN | 0.107344 | 0.005442 | 0.007033 | 0.000470 |
1 | 5 | 0.718529 | 0.034161 | RandomForestClassifierDF | 4 | 8 | 33 | NaN | NaN | 0.028277 | 0.000811 | 0.003479 | 0.000201 |
5 | 6 | 0.710667 | 0.031914 | LGBMClassifierDF | 4 | NaN | 58 | 9 | 0.922371 | 0.011555 | 0.000390 | 0.002386 | 0.000164 |
4 | 7 | 0.709203 | 0.027709 | LGBMClassifierDF | 5 | NaN | 37 | 8 | 0.860848 | 0.011761 | 0.000773 | 0.002461 | 0.000338 |
6 | 8 | 0.698365 | 0.030459 | LGBMClassifierDF | 7 | NaN | 78 | 8 | 0.891214 | 0.023414 | 0.000938 | 0.002875 | 0.000229 |
8 | 9 | 0.692535 | 0.029564 | LGBMClassifierDF | 4 | NaN | 167 | 8 | 0.834105 | 0.023739 | 0.001152 | 0.002916 | 0.000286 |
3 | 10 | 0.685510 | 0.028177 | LGBMClassifierDF | 6 | NaN | 379 | 8 | 0.966489 | 0.078497 | 0.002283 | 0.004991 | 0.000202 |
Based on our LearnerSelector, we have selected a random forest classifier that achieved a mean ROC AUC of 0.725 with a standard deviation of 0.034.
Using FACET for advanced model inspection#
The SHAP approach has become the standard method for model inspection. SHAP values are used to explain the additive contribution of each feature to the prediction for each observation (i.e., explain individual predictions).
The FACET LearnerInspector
computes SHAP values for each observation using the best model identified by the LearnerSelector
. The FACET LearnerInspector
then provides advanced model inspection through new SHAP-based summary metrics for understanding pairwise feature redundancy and synergy. Redundancy and synergy are calculated using a new algorithm to understand model predictions from a global perspective to complement local SHAP.
The definitions of synergy and redundancy are as follows:
Synergy
The degree to which the model combines information from one feature with another to predict the target. For example, let’s assume we are predicting cardiovascular health using age and gender and the fitted model includes a complex interaction between them. This means these two features are synergistic for predicting cardiovascular health. Further, both features are important to the model and removing either one would significantly impact performance. Let’s assume age brings more information to the joint contribution than gender. This asymmetric contribution means the synergy for (age, gender) is less than the synergy for (gender, age). To think about it another way, imagine the prediction is a coordinate you are trying to reach. From your starting point, age gets you much closer to this point than gender, however, you need both to get there. Synergy reflects the fact that gender gets more help from age (higher synergy from the perspective of gender) than age does from gender (lower synergy from the perspective of age) to reach the prediction. This leads to an important point: synergy is a naturally asymmetric property of the global information two interacting features contribute to the model predictions. Synergy is expressed as a percentage ranging from 0% (full autonomy) to 100% (full synergy).
Redundancy
The degree to which a feature in a model duplicates the information of a second feature to predict the target. For example, let’s assume we had house size and number of bedrooms for predicting house price. These features capture similar information as the more bedrooms the larger the house and likely a higher price on average. The redundancy for (number of bedrooms, house size) will be greater than the redundancy for (house size, number of bedrooms). This is because house size “knows” more of what number of bedrooms does for predicting house price than vice-versa. Hence, there is greater redundancy from the perspective of number of bedrooms. Another way to think about it is removing house size will be more detrimental to model performance than removing number of bedrooms, as house size can better compensate for the absence of number of bedrooms. This also implies that house size would be a more important feature than number of bedrooms in the model. The important point here is that like synergy, redundancy is a naturally asymmetric property of the global information feature pairs have for predicting an outcome. Redundancy is expressed as a percentage ranging from 0% (full uniqueness) to 100% (full redundancy).
Note that both cases can apply at the same time, so a feature pair can use some information synergistically and some information redundantly.
The FACET LearnerInspector can calculate all of this with a single method call, but also offers methods to access the intermediate results of each step. A lightweight visualization framework is available to render the results in different styles.
SHAP values from the LearnerInspector can also be used with the SHAP package plotting functions for sample and observation level SHAP visualizations, such as SHAP distribution plots, dependency plots, force plots and waterfall plots.
[16]:
# run inspector
clf_inspector = LearnerInspector(
pipeline=clf_selector.best_estimator_,
n_jobs=-3,
verbose=False,
).fit(prediab_initial_features)
Feature importance#
Feature importance can be measured in many ways. Here we utilize the FACET implementation based on the LearnerInspector, where each feature is ranked according to the mean SHAP value for that feature. This plot is paired with a standard SHAP distribution plot for the features, to see whether there is any directional tendency in the associations.
[17]:
# FACET feature importance
f_importance = clf_inspector.feature_importance()
plt.subplot(1, 2, 1)
f_importance.sort_values().plot.barh()
# get some info for standard SHAP model inspection
shap_data = clf_inspector.shap_plot_data()
# standard SHAP summary plot using the shap package
plt.subplot(1, 2, 2)
shap.summary_plot(shap_values=shap_data.shap_values, features=shap_data.features, show=False, plot_size=(16.0, 8.0))
plt.tight_layout()
Based on the feature importances, we can see the top five features are age, RBC count, waist to height ratio, average systolic blood pressure and waist circumference. Inspection of the SHAP value distributions does not provide any indication of a general direction of association for any feature.
Synergy#
[18]:
# synergy heatmap
synergy_matrix = clf_inspector.feature_synergy_matrix()
MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Feature synergies")
To interpret pairs of features in the synergy matrix, rows represent the primary feature (“perspective from”), and the columns represent associated features. Column and row widths indicate relative feature importance.
For example, for (SBP_to_DBP, RBC_count), from the perspective of the primary feature SBP_to_DBP we find that 9.5% of its contribution to predicting prediabetes risk is enabled by context from the associated feature RBC_count, whereas RBC_count’s dependence on context from SBP_to_DBP is negligible.
Another interesting observation is that, looking at the column for Age from the perspective of other features such as BMI, RBC_count and SBP_to_DBP, we see values ranging up to 15%, suggesting that the contributions of BMI, RBC_count and SBP_to_DBP to predicting prediabetes are partly enabled by Age. By contrast, Age is a strong independent feature: it has the highest feature importance, as indicated by its column and row width in the matrix.
We can inspect synergistic pairs of features more deeply using the SHAP dependence plots offered by the SHAP package. The LearnerInspector provides all data required for creating these and other plots through the method shap_plot_data():
[19]:
spd = clf_inspector.shap_plot_data()
shap.dependence_plot("SBP_to_DBP", spd.shap_values, spd.features, interaction_index="Age")
shap.dependence_plot("BMI", spd.shap_values, spd.features, interaction_index="Age")
Redundancy#
[20]:
# redundancy heatmap
redundancy_matrix = clf_inspector.feature_redundancy_matrix()
MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Feature redundancies")
As with synergy, the matrix row is the primary feature (“perspective from”) in the row-column feature pair. Let’s take Hematocrit and RBC_count as our features of interest. We can see that, from the perspective of Hematocrit, 35% of the information is duplicated with RBC_count to predict prediabetes, and vice versa.
A second interesting and perhaps expected attribute of the heatmap is the apparent clustering of BMI, Waist Circumference and Waist/Height. Intuitively it makes sense that these features would have varying degrees of redundancy among them, given they are physically related.
Feature clustering#
As detailed above, redundancy and synergy for a feature pair are calculated from the “perspective” of one of the features in the pair, and so yield two distinct values per pair. However, a symmetric version can also be computed, which not only provides a simplified perspective but also allows the use of (1 - metric) as a feature distance. With this distance, hierarchical single-linkage clustering is applied to create a dendrogram visualization. This helps to identify groups of features with low mutual distance which work “in tandem” to predict the outcome. Such information can then be used either to reduce clusters of highly redundant features to a subset, or to highlight clusters of highly synergistic features that should always be considered together.
For this example, let’s apply clustering to redundancy to see how the apparent grouping observed in the heatmap appears in the dendrogram. Ideally, we want features to start clustering as close to the right-hand side of the dendrogram as possible; this implies that all features in the model contribute uniquely to our predictions.
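To illustrate the idea behind this distance (a sketch of the concept only, using made-up redundancy values rather than FACET's internal implementation), a symmetric redundancy matrix can be turned into a single-linkage dendrogram with scipy:
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
# hypothetical symmetric redundancy values for three features (range 0 to 1)
labels = ["BMI", "Waist_Circumference", "Waist_to_hgt"]
redundancy = np.array([
    [1.00, 0.55, 0.60],
    [0.55, 1.00, 0.70],
    [0.60, 0.70, 1.00],
])
# highly redundant features should be "close", so use 1 - redundancy as the distance
distance = 1.0 - redundancy
np.fill_diagonal(distance, 0.0)
# single-linkage hierarchical clustering on the condensed distance matrix
linkage_matrix = linkage(squareform(distance), method="single")
dendrogram(linkage_matrix, labels=labels, orientation="right")
plt.show()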
[21]:
# redundancy dendrogram
dd_redundancy = clf_inspector.feature_redundancy_linkage()
DendrogramDrawer().draw(title="Redundancy linkage", data=dd_redundancy)
The dendrogram shows that Waist Circumference and Waist/Height cluster together first, and then cluster with BMI. This is a much clearer representation of the cluster of redundant features we observed in the heatmap. Considering BMI, Waist Circumference and Waist/Height:
What might we infer from synergy, redundancy and redundancy feature clustering?
BMI, Waist Circumference and Waist/Height form a small cluster of redundant features. This seems reasonable: Waist Circumference is included in the calculation of Waist/Height
We might expect BMI to capture similar information about excess body mass as higher Waist Circumference and Waist/Height
We saw little synergy between features. We might have expected a priori to find some interesting synergies between diet, exercise, sleep and body composition. Of course, the model needs to identify these relationships in the data for them to be reflected in the synergy metric(s).
What action(s) might we take?
Given the redundancy that appears between BMI, Waist Circumference and Waist/Height, we could look to eliminate one or two of these features from the model.
For convenience when working in a non-notebook environment, all of the Drawers provided by the pytools package also support a style="text" argument.
[22]:
DendrogramDrawer(style="text").draw(title="Redundancy linkage", data=dd_redundancy)
============================== Redundancy linkage ==============================
BMI 4% ----------------------\
Waist_to_hgt 11% ------------------\___/---------------------------------\
Waist_Circumfer 7% ------------------/ |
Gamma_glutamyl_ 7% -------------------------------------------------------\\
Uric_acid 8% -----------------------------------------------------\ ||__
RBC_count 12% ---------------------------------------\_____________/-/|
Hematocrit 6% ---------------------------------------/ |
Age 30% -------------------------------------------------------\|
Average_SBP 10% ------------------------------------------------------\//
SBP_to_DBP 6% ------------------------------------------------------/
Removing redundant features#
Recall the redundancy dendrogram above, where we saw a clear cluster of redundant features: Waist/Height, BMI, and Waist Circumference. This view helps us to:
assess whether the features of the model are unique, i.e. not redundant with other features
decide which features to discard, combine, or modify to increase the uniqueness of important features in the model
Before we proceed to look at SHAP values for individual predictions and perform a univariate simulation, let’s eliminate two partially redundant features: we will keep the Waist/Height ratio and drop BMI and Waist Circumference.
[23]:
# drop redundant features from our FACET sample object
prediab_no_redundant_feat = prediab_initial_features.drop(
feature_names=["BMI", "Waist_Circumference"]
)
[24]:
clf_selector = LearnerSelector(
searcher_type=RandomizedSearchCV,
parameter_space=[rforest_ps, lgbm_ps],
cv=RepeatedKFold(n_splits=5, n_repeats=10, random_state=42),
n_jobs=-3,
scoring="roc_auc",
random_state=42,
).fit(prediab_no_redundant_feat)
clf_selector.summary_report()
[24]:
score | candidate | param | time | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
test | - | classifier | fit | score | |||||||||
rank | mean | std | - | max_depth | min_samples_leaf | n_estimators | min_child_samples | subsample | mean | std | mean | std | |
0 | 1 | 0.730649 | 0.033252 | RandomForestClassifierDF | 7 | 18 | 230 | NaN | NaN | 0.163255 | 0.011057 | 0.011816 | 0.005582 |
9 | 2 | 0.729762 | 0.033654 | RandomForestClassifierDF | 5 | 18 | 376 | NaN | NaN | 0.267629 | 0.019498 | 0.017950 | 0.007206 |
2 | 3 | 0.729191 | 0.032447 | RandomForestClassifierDF | 6 | 15 | 164 | NaN | NaN | 0.116502 | 0.003290 | 0.008389 | 0.000300 |
7 | 4 | 0.728881 | 0.033787 | RandomForestClassifierDF | 5 | 8 | 128 | NaN | NaN | 0.097372 | 0.010087 | 0.008520 | 0.003782 |
1 | 5 | 0.723997 | 0.034597 | RandomForestClassifierDF | 4 | 8 | 33 | NaN | NaN | 0.025100 | 0.000913 | 0.003405 | 0.000173 |
5 | 6 | 0.717630 | 0.029876 | LGBMClassifierDF | 4 | NaN | 58 | 9 | 0.922371 | 0.010459 | 0.000463 | 0.002328 | 0.000127 |
4 | 7 | 0.714231 | 0.031225 | LGBMClassifierDF | 5 | NaN | 37 | 8 | 0.860848 | 0.010239 | 0.000386 | 0.002252 | 0.000156 |
6 | 8 | 0.700294 | 0.031576 | LGBMClassifierDF | 7 | NaN | 78 | 8 | 0.891214 | 0.020512 | 0.000915 | 0.002845 | 0.000243 |
8 | 9 | 0.696527 | 0.032738 | LGBMClassifierDF | 4 | NaN | 167 | 8 | 0.834105 | 0.022742 | 0.004056 | 0.003137 | 0.001046 |
3 | 10 | 0.684973 | 0.032143 | LGBMClassifierDF | 6 | NaN | 379 | 8 | 0.966489 | 0.067832 | 0.001911 | 0.004945 | 0.000251 |
[25]:
# run inspector
inspector_no_redun = LearnerInspector(
pipeline=clf_selector.best_estimator_,
n_jobs=-3,
verbose=False,
).fit(prediab_no_redundant_feat)
[26]:
# redundancy dendrogram
dd_redundancy = inspector_no_redun.feature_redundancy_linkage()
DendrogramDrawer().draw(title="HD set features", data=dd_redundancy)
Now, with the removal of BMI and Waist Circumference, we can see that the feature clustering starts much further to the right.
We can also check the best ranked model after removing redundant features.
[27]:
clf_selector.best_estimator_.classifier
[27]:
RandomForestClassifierDF(max_depth=7, min_samples_leaf=18, n_estimators=230, random_state=42)
FACET univariate simulator: the impact of waist to height ratio#
Another advantage of FACET is the ability to quickly set up and run a univariate simulation. Simulation enables us to gain insight into what value(s) of the waist to height ratio might minimize the likelihood of prediabetes.
As the basis for the simulation, we divide the feature into relevant partitions:
We use FACET’s ContinuousRangePartitioner to split the range of observed values of waist to height ratio into intervals of equal size. Each partition is represented by the central value of that partition.
For each partition, the simulator creates an artificial copy of the original sample, assuming the variable to be simulated has the same value across all observations - the value representing the partition. Using the best estimator acquired from the selector, the simulator then re-predicts all targets using the model trained on the full sample and determines the average value of the target variable resulting from this.
The FACET SimulationDrawer allows us to visualise the result, both in a matplotlib and a plain-text style.
[28]:
# set up and run a simulation
sim_feature = "Waist_to_hgt"
waist_to_hgt_simulator = UnivariateProbabilitySimulator(
model=clf_selector.best_estimator_,
sample=prediab_no_redundant_feat,
n_jobs=-1,
)
waist_to_hgt_partitions = ContinuousRangePartitioner()
waist_to_hgt_simulation = waist_to_hgt_simulator.simulate_feature(
feature_name=sim_feature,
partitioner=waist_to_hgt_partitions
)
[29]:
# visualize the results
SimulationDrawer().draw(data=waist_to_hgt_simulation, title=sim_feature)
As we can see the simulation shows that higher waist to height ratios are associated with an increased risk of prediabetes. We could also suggest that keeping a person’s waist to height ratio below 0.52 may reduce the likelihood of prediabetes from around 34% to 30%.
[30]:
# can also get a print out of simulation results
SimulationDrawer("text").draw(data=waist_to_hgt_simulation, title=sim_feature)
================================= Waist_to_hgt =================================
probability(1):
Baseline = 0.342
Partition 2.5th percentile Mean 97.5th percentile
========= ================ ========= =================
0.4 0.289 0.297 0.306
0.45 0.287 0.296 0.305
0.5 0.286 0.295 0.304
0.55 0.295 0.304 0.313
0.6 0.352 0.361 0.37
0.65 0.366 0.375 0.383
0.7 0.401 0.409 0.418
0.75 0.409 0.417 0.426
0.8 0.416 0.424 0.432
Observed frequencies:
Partition Frequency
========= =========
0.4 22
0.45 118
0.5 159
0.55 214
0.6 186
0.65 127
0.7 69
0.75 46
0.8 18
Summary#
With the capabilities offered by FACET we were able to:
Identify a learner with performance comparable to models in the literature.
Utilize the advanced SHAP value capabilities (synergy and redundancy) to identify additional features that could be removed (i.e., BMI and waist circumference removed in favour of waist to height ratio) and to check whether any features had strong synergistic effects - which they did not.
Simulate the effect of changes in waist to height ratio on the likelihood of being prediabetic.
What can you do next?#
There are several next/alternative steps that could be taken:
Utilize methods to deal with class imbalance and see if it improves the model.
Add more features! The NHANES data is a treasure trove of information.
Retain diabetic patients and convert the task into a multi-class learning problem.
What would happen if we applied survey weights when constructing a learner?
Further investigation of feature engineering. One could also look at different sets of measurements such as the bio-profile and perform dimension reduction first via PCA or some other method.
Try other learners such as SVC, LDA, Elastic-Net, or a CNN.
More sophisticated imputation for missing values: the assumption of MAR might not hold, as those with worse health (and thus more at risk of prediabetes) may be more likely not to disclose poor health characteristics. Methods enabled by IterativeImputer, or even KNNImputer, could be used (see the sketch below). Feature engineering could also be done post-imputation in the pipeline, so that derived values such as ratios remain consistent.
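As a sketch of the last point, using plain scikit-learn classes for illustration (an assumption-laden example, not part of the tutorial; sklearndf provides DF counterparts of most scikit-learn transformers for use inside pipelines):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
# model-based iterative imputation of the numeric features
numeric_cols = prediab_df.select_dtypes(include="number").columns
imputed_numeric = pd.DataFrame(
    IterativeImputer(random_state=42).fit_transform(prediab_df[numeric_cols]),
    columns=numeric_cols,
    index=prediab_df.index,
)
# a nearest-neighbour alternative
knn = KNNImputer(n_neighbors=5)
# derived features such as ratios would then be recomputed after imputation
imputed_numeric["Waist_to_hgt"] = (
    imputed_numeric["Waist_Circumference"] / imputed_numeric["Standing_Height"]
)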
Appendix#
Data source and study cohort#
Instrument | Data File Name (File) | NHANES Field | Description | Dataset name | Type
---|---|---|---|---|---
Demographics | Demographic Variables, Sample Weights (DEMO_H) | RIDAGEYR | Age in years at screening | Age | Numeric
Demographics | Demographic Variables, Sample Weights (DEMO_H) | RIAGENDR | Gender | Gender | Categorical
Examination | Body Measures (BMX_H) | BMXWT | Weight (kg) | Weight | Numeric
Examination | Body Measures (BMX_H) | BMXHT | Standing Height (cm) | Standing_Height | Numeric
Examination | Body Measures (BMX_H) | BMXWAIST | Waist Circumference (cm) | Waist_Circumference | Numeric
Examination | Body Measures (BMX_H) | BMXBMI | Body Mass Index (kg/m^2) | BMI | Numeric
Examination | Blood Pressure (BPX_H) | BPXSY1 to 4 | Systolic: Blood pres mm Hg | Average_SBP | Numeric
Examination | Blood Pressure (BPX_H) | BPXDI1 to 4 | Diastolic: Blood pres mm Hg | Average_DBP | Numeric
Questionnaire | Blood Pressure & Cholesterol (BPQ_H) | BPQ020 | Ever told you had high blood pressure | High_BP | Categorical
Questionnaire | Diet Behavior & Nutrition (DBQ_H) | DBQ700 | How healthy is the diet | Healthy_diet | Categorical
Questionnaire | Diabetes (DIQ_H) | DIQ175A | Family history | Family_hist_diab | Categorical
Questionnaire | Diabetes (DIQ_H) | DIQ172 | Feel could be at risk for diabetes | Feel_at_risk_diab | Categorical
Questionnaire | Current Health Status (HSQ_H) | HSD010 | General health condition | General_health | Categorical
Questionnaire | Medical Conditions (MCQ_H) | MCQ080 | Doctor ever said you were overweight | Told_overweight | Categorical
Questionnaire | Physical Activity (PAQ_H) | PAQ605 | Vigorous work activity | Vigorous_work_activity | Categorical
Questionnaire | Physical Activity (PAQ_H) | PAQ620 | Moderate work activity | Moderate_work_activity | Categorical
Questionnaire | Physical Activity (PAQ_H) | PAQ635 | Walk or bicycle | Walk_or_bicycle | Categorical
Questionnaire | Physical Activity (PAQ_H) | PAQ650 | Vigorous recreational activities | Vigorous_rec_activity | Categorical
Questionnaire | Physical Activity (PAQ_H) | PAQ665 | Moderate recreational activities | Moderate_rec_activity | Categorical
Questionnaire | Sleep Disorders (SLQ_H) | SLD010H | How much sleep do you get (hours)? | Sleep_hours | Numeric
Questionnaire | Sleep Disorders (SLQ_H) | SLQ050 | Ever told doctor had trouble sleeping? | Trouble_sleeping | Categorical
Questionnaire | Sleep Disorders (SLQ_H) | SLQ060 | Ever told by doctor have sleep disorder? | Sleep_disorder | Categorical
Questionnaire | Weight History (WHQ_H) | WHQ070 | Tried to lose weight in past year | Tried_weight_loss_past_year | Categorical
Laboratory | Cholesterol HDL (HDL_H) | LBDHDD | Direct HDL-Cholesterol (mg/dL) | HDL_Cholesterol | Numeric
Laboratory | Cholesterol Total (TCHOL_H) | LBXTC | Total Cholesterol (mg/dL) | Total_Cholesterol | Numeric
Laboratory | Complete Blood Count (CBC_H) | LBXWBCSI | White blood cell count (1000 cells/uL) | WBC_count | Numeric
Laboratory | Complete Blood Count (CBC_H) | LBXRBCSI | Red blood cell count (million cells/uL) | RBC_count | Numeric
Laboratory | Complete Blood Count (CBC_H) | LBXHCT | Hematocrit (%) | Hematocrit | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSTR | Triglycerides (mg/dL) | Triglycerides | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSUA | Uric acid (mg/dL) | Uric_acid | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSOSSI | Osmolality (mmol/Kg) | Osmolality | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSNASI | Sodium (mmol/L) | Sodium | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSKSI | Potassium (mmol/L) | Potassium | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSGTSI | Gamma glutamyl transferase (U/L) | Gamma_glutamyl_transferase | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSCA | Total calcium (mg/dL) | Calcium | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSATSI | Alanine aminotransferase ALT (IU/L) | Alanine_aminotransferase | Numeric
Laboratory | Biochemistry Profile (BIOPRO_H) | LBXSASSI | Aspartate aminotransferase AST (IU/L) | Aspartate_aminotransferase | Numeric
Exploratory Data Analysis (EDA)#
Let’s begin by doing some brief exploratory data analysis to assess the impact features might have on the likelihood that someone is prediabetic, and also to determine what will need to be addressed in a preprocessing pipeline.
[31]:
# load the prepared data frame
prediab_eda = pd.read_csv("pre_diab_nhanes.csv")
prediab_eda.head()
[31]:
Age | Gender | Waist_Circumference | Weight | Standing_Height | BMI | Average_SBP | Average_DBP | HDL_Cholesterol | Total_Cholesterol | ... | Triglycerides | Uric_acid | Osmolality | Sodium | Potassium | Gamma_glutamyl_transferase | Calcium | Alanine_aminotransferase | Aspartate_aminotransferase | Pre_diab | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 73.0 | 2.0 | NaN | 52.0 | 162.4 | 19.7 | 137.333333 | 86.666667 | 85.0 | 201.0 | ... | 88.0 | 4.2 | 290.0 | 142.0 | 4.1 | 31.0 | 10.0 | 28.0 | 36.0 | 1 |
1 | 56.0 | 1.0 | 123.1 | 105.0 | 158.7 | 41.7 | 157.333333 | 82.000000 | 38.0 | 226.0 | ... | 327.0 | 9.1 | 287.0 | 143.0 | 3.3 | 22.0 | 9.3 | 16.0 | 24.0 | 0 |
2 | 61.0 | 2.0 | 110.8 | 93.4 | 161.8 | 35.7 | 122.666667 | 80.666667 | 58.0 | 168.0 | ... | 68.0 | 5.1 | 281.0 | 140.0 | 3.9 | 17.0 | 9.9 | 21.0 | 20.0 | 1 |
3 | 56.0 | 2.0 | 85.5 | 61.8 | 152.8 | 26.5 | 122.000000 | 72.666667 | 59.0 | 278.0 | ... | 262.0 | 3.5 | 277.0 | 139.0 | 4.0 | 21.0 | 9.5 | 24.0 | 23.0 | 0 |
4 | 65.0 | 1.0 | 93.7 | 65.3 | 172.4 | 22.0 | 141.333333 | 77.333333 | 79.0 | 173.0 | ... | 39.0 | 6.3 | 281.0 | 140.0 | 4.8 | 24.0 | 9.5 | 20.0 | 29.0 | 0 |
5 rows × 38 columns
We might also consider some rudimentary feature engineering, such as the ratio of waist circumference to height or the ratio of systolic to diastolic blood pressure. Let’s create these two features.
[32]:
prediab_eda["SBP_to_DBP"] = prediab_eda["Average_SBP"] / prediab_eda["Average_DBP"]
prediab_eda["Waist_to_hgt"] = (
prediab_eda["Waist_Circumference"] / prediab_eda["Standing_Height"]
)
prediab_eda.head()
[32]:
Age | Gender | Waist_Circumference | Weight | Standing_Height | BMI | Average_SBP | Average_DBP | HDL_Cholesterol | Total_Cholesterol | ... | Osmolality | Sodium | Potassium | Gamma_glutamyl_transferase | Calcium | Alanine_aminotransferase | Aspartate_aminotransferase | Pre_diab | SBP_to_DBP | Waist_to_hgt | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 73.0 | 2.0 | NaN | 52.0 | 162.4 | 19.7 | 137.333333 | 86.666667 | 85.0 | 201.0 | ... | 290.0 | 142.0 | 4.1 | 31.0 | 10.0 | 28.0 | 36.0 | 1 | 1.584615 | NaN |
1 | 56.0 | 1.0 | 123.1 | 105.0 | 158.7 | 41.7 | 157.333333 | 82.000000 | 38.0 | 226.0 | ... | 287.0 | 143.0 | 3.3 | 22.0 | 9.3 | 16.0 | 24.0 | 0 | 1.918699 | 0.775677 |
2 | 61.0 | 2.0 | 110.8 | 93.4 | 161.8 | 35.7 | 122.666667 | 80.666667 | 58.0 | 168.0 | ... | 281.0 | 140.0 | 3.9 | 17.0 | 9.9 | 21.0 | 20.0 | 1 | 1.520661 | 0.684796 |
3 | 56.0 | 2.0 | 85.5 | 61.8 | 152.8 | 26.5 | 122.000000 | 72.666667 | 59.0 | 278.0 | ... | 277.0 | 139.0 | 4.0 | 21.0 | 9.5 | 24.0 | 23.0 | 0 | 1.678899 | 0.559555 |
4 | 65.0 | 1.0 | 93.7 | 65.3 | 172.4 | 22.0 | 141.333333 | 77.333333 | 79.0 | 173.0 | ... | 281.0 | 140.0 | 4.8 | 24.0 | 9.5 | 20.0 | 29.0 | 0 | 1.827586 | 0.543503 |
5 rows × 40 columns
[33]:
# first a quick look at features overall
prediab_eda.describe().T
[33]:
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Age | 4356.0 | 47.044536 | 17.188703 | 20.000000 | 33.000000 | 45.000000 | 61.000000 | 80.000000 |
Gender | 4356.0 | 1.521350 | 0.499601 | 1.000000 | 1.000000 | 2.000000 | 2.000000 | 2.000000 |
Waist_Circumference | 4158.0 | 97.167268 | 15.799909 | 55.500000 | 85.900000 | 95.500000 | 106.500000 | 177.900000 |
Weight | 4321.0 | 79.939968 | 21.135093 | 32.300000 | 65.100000 | 76.900000 | 91.100000 | 222.600000 |
Standing_Height | 4318.0 | 167.369701 | 10.158315 | 136.300000 | 160.000000 | 167.100000 | 174.500000 | 199.400000 |
BMI | 4314.0 | 28.446639 | 6.793577 | 14.100000 | 23.700000 | 27.300000 | 31.800000 | 82.900000 |
Average_SBP | 4229.0 | 121.893671 | 17.271841 | 64.666667 | 110.000000 | 118.666667 | 131.333333 | 228.000000 |
Average_DBP | 4222.0 | 70.280515 | 11.039221 | 22.666667 | 63.333333 | 70.666667 | 77.333333 | 128.000000 |
HDL_Cholesterol | 4318.0 | 53.723020 | 16.240125 | 10.000000 | 42.000000 | 51.000000 | 63.000000 | 173.000000 |
Total_Cholesterol | 4318.0 | 190.363131 | 39.631288 | 69.000000 | 163.000000 | 188.000000 | 214.000000 | 525.000000 |
High_BP | 4355.0 | 0.320321 | 0.466654 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Sleep_hours | 4352.0 | 6.882812 | 1.389642 | 2.000000 | 6.000000 | 7.000000 | 8.000000 | 12.000000 |
Trouble_sleeping | 4354.0 | 0.249426 | 0.432730 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Sleep_disorder | 4346.0 | 0.085826 | 0.280139 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Told_overweight | 4354.0 | 0.310978 | 0.462947 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
General_health | 4016.0 | 2.739293 | 0.952422 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 5.000000 |
Family_hist_diab | 4356.0 | 0.219238 | 0.413778 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Feel_at_risk_diab | 4311.0 | 0.288564 | 0.453147 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Vigorous_work_activity | 4355.0 | 0.192652 | 0.394427 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Moderate_work_activity | 4354.0 | 0.338769 | 0.473346 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Walk_or_bicycle | 4356.0 | 0.254821 | 0.435811 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Vigorous_rec_activity | 4356.0 | 0.246097 | 0.430785 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Moderate_rec_activity | 4356.0 | 0.432277 | 0.495449 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Tried_weight_loss_past_year | 3857.0 | 0.363495 | 0.481068 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
Healthy_diet | 4356.0 | 2.931359 | 0.995567 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 5.000000 |
WBC_count | 4356.0 | 7.238522 | 2.313441 | 2.500000 | 5.700000 | 6.900000 | 8.400000 | 55.700000 |
RBC_count | 4356.0 | 4.663620 | 0.482255 | 1.670000 | 4.347500 | 4.640000 | 4.970000 | 8.300000 |
Hematocrit | 4356.0 | 41.543251 | 4.032086 | 17.900000 | 38.900000 | 41.500000 | 44.300000 | 55.900000 |
Triglycerides | 4306.0 | 143.406410 | 105.113638 | 20.000000 | 76.000000 | 113.000000 | 176.000000 | 1213.000000 |
Uric_acid | 4308.0 | 5.387303 | 1.384542 | 0.700000 | 4.400000 | 5.300000 | 6.300000 | 13.300000 |
Osmolality | 4309.0 | 279.061963 | 4.825951 | 237.000000 | 276.000000 | 279.000000 | 282.000000 | 323.000000 |
Sodium | 4309.0 | 139.882571 | 2.205999 | 119.000000 | 139.000000 | 140.000000 | 141.000000 | 154.000000 |
Potassium | 4308.0 | 4.002948 | 0.347091 | 2.800000 | 3.800000 | 4.000000 | 4.200000 | 5.800000 |
Gamma_glutamyl_transferase | 4308.0 | 27.459146 | 46.812967 | 5.000000 | 13.000000 | 18.000000 | 28.000000 | 1510.000000 |
Calcium | 4289.0 | 9.451131 | 0.363228 | 7.600000 | 9.200000 | 9.400000 | 9.700000 | 14.800000 |
Alanine_aminotransferase | 4307.0 | 24.831437 | 19.528176 | 6.000000 | 16.000000 | 20.000000 | 27.000000 | 536.000000 |
Aspartate_aminotransferase | 4307.0 | 25.300906 | 19.612550 | 9.000000 | 19.000000 | 22.000000 | 27.000000 | 882.000000 |
Pre_diab | 4356.0 | 0.346419 | 0.475883 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
SBP_to_DBP | 4222.0 | 1.765659 | 0.331493 | 1.272727 | 1.550097 | 1.684448 | 1.880256 | 5.470588 |
Waist_to_hgt | 4152.0 | 0.581131 | 0.095154 | 0.364412 | 0.514059 | 0.571075 | 0.635716 | 1.063995 |
[34]:
# missingness
miss_count = prediab_eda.isna().sum()
miss_pct = miss_count[miss_count > 0] / len(prediab_eda)
miss_pct.sort_values().plot.barh()
[34]:
<AxesSubplot:>
[35]:
# those variables that are complete
miss_count[miss_count == 0] / len(prediab_eda)
[35]:
Age 0.0
Gender 0.0
Family_hist_diab 0.0
Walk_or_bicycle 0.0
Vigorous_rec_activity 0.0
Moderate_rec_activity 0.0
Healthy_diet 0.0
WBC_count 0.0
RBC_count 0.0
Hematocrit 0.0
Pre_diab 0.0
dtype: float64
[36]:
# view correlations with a heatmap
df_cor = prediab_eda.corr()
sns.heatmap(df_cor, xticklabels=df_cor.columns, yticklabels=df_cor.columns)
[36]:
<AxesSubplot:>
[37]:
# let's do a table comparing features by the target
from tableone import TableOne
TableOne(
prediab_eda,
columns=prediab_eda.columns.drop("Pre_diab").to_list(),
categorical=[
"Gender",
"High_BP",
"Trouble_sleeping",
"Sleep_disorder",
"Told_overweight",
"General_health",
"Family_hist_diab",
"Feel_at_risk_diab",
"Vigorous_work_activity",
"Moderate_work_activity",
"Walk_or_bicycle",
"Vigorous_rec_activity",
"Moderate_rec_activity",
"Tried_weight_loss_past_year",
"Healthy_diet",
],
groupby="Pre_diab",
pval=True,
remarks=False,
overall=False,
)
[37]:
Grouped by Pre_diab | |||||
---|---|---|---|---|---|
Missing | 0 | 1 | P-Value | ||
n | 2847 | 1509 | |||
Age, mean (SD) | 0 | 43.4 (16.4) | 54.0 (16.4) | <0.001 | |
Gender, n (%) | 1.0 | 0 | 1330 (46.7) | 755 (50.0) | 0.040 |
2.0 | 1517 (53.3) | 754 (50.0) | |||
Waist_Circumference, mean (SD) | 198 | 94.7 (15.3) | 101.9 (15.7) | <0.001 | |
... | ... | ... | ... | ... | ... |
Calcium, mean (SD) | 67 | 9.5 (0.4) | 9.4 (0.4) | 0.129 | |
Alanine_aminotransferase, mean (SD) | 49 | 24.0 (19.1) | 26.4 (20.2) | <0.001 | |
Aspartate_aminotransferase, mean (SD) | 49 | 24.7 (14.1) | 26.5 (27.1) | 0.018 | |
SBP_to_DBP, mean (SD) | 134 | 1.7 (0.3) | 1.8 (0.4) | <0.001 | |
Waist_to_hgt, mean (SD) | 204 | 0.6 (0.1) | 0.6 (0.1) | <0.001 |
61 rows × 4 columns
[38]:
# KDE plots by prediabetes status as well for those continuous features
distn_vars = [
"Age",
"Waist_Circumference",
"Weight",
"Standing_Height",
"BMI",
"Average_SBP",
"Average_DBP",
"HDL_Cholesterol",
"Total_Cholesterol",
"Sleep_hours",
"WBC_count",
"RBC_count",
"Hematocrit",
"Triglycerides",
"Uric_acid",
"Osmolality",
"Sodium",
"Potassium",
"Gamma_glutamyl_transferase",
"Calcium",
"Alanine_aminotransferase",
"Aspartate_aminotransferase",
"SBP_to_DBP",
"Waist_to_hgt",
]
df_kde = pd.melt(prediab_eda[distn_vars + ["Pre_diab"]], "Pre_diab", distn_vars)
g = sns.FacetGrid(
df_kde, col="variable", hue="Pre_diab", col_wrap=5, sharex=False, sharey=False
)
g.map(sns.kdeplot, "value", fill=True)
plt.show()
Quick EDA summary:
Missingness
Our target and 10 features were complete.
The other 29 features had levels of missingness ranging from 0.02% to 11.5%, and 22 of the 29 were <3%.
Most extreme cases of missingness were for tried weight loss in the past year (11.5%) and general health (7.8%).
Correlations
There is a wide range of correlation among features, where we can see that for example RBC count, haematocrit and standing height are all moderately negatively correlated with gender (i.e., females are shorter), and that body measurements for weight, height, BMI, and waist circumference are all strongly positively correlated.
Associations
Some features already appear not to be strongly associated based on univariate tests, such as: average DBP, trouble sleeping, vigorous work activity, moderate work activity, tried weight loss in the past year, healthy diet, WBC count, and calcium levels.
Features associated with an increased risk of prediabetes include: older age, being male, increased waist circumference, decreased standing height, increased weight, increased BMI, increased average SBP, lower HDL cholesterol, increased total cholesterol, having high BP, increased sleep hours, having a sleep disorder, being told you are overweight, poorer general health, family history of diabetes, feeling at risk for diabetes, reduced walking or cycling, less vigorous or moderate recreational activity, higher RBC count, higher haematocrit, increased triglycerides, increased uric acid, higher osmolality, increased sodium, increased potassium, increased gamma glutamyl transferase, increased alanine aminotransferase, increased aspartate aminotransferase, increased SBP to DBP ratio, increased waist to height ratio.
Distributions of numeric features
Many of the continuous features appear to be non-normal or multi-modal, with potential outliers. This can be seen in the plots, where many biomarkers have positively skewed distributions and, in the case of sodium, some multi-modality which could also be related to measurement.