pytools.data.sim_data

pytools.data.sim_data(n=100, intercept=-5, two_way_coef=None, linear_vars=10, linear_var_coef=None, noise_vars=0, corr_vars=0, corr_type='AR1', corr_value=0, surg_err=0.05, bin_var_p=0, bin_coef=0, outcome='classification', regression_err=None, seed_val=4763546)

Simulate data for classification or regression. The simulated features include two linear features with an interaction between them, plus further non-linear and linear features.

Noise variables, correlated variables that are not predictive, and surrogate features derived from predictive features are also added.

This function is largely a direct translation of the twoClassSim function from the R package caret; the options for an ordinal outcome and for mis-labelling the binary outcome were omitted. Full credit for the approach used to simulate binary classification data goes to the authors and contributors of caret [Kuhn, M. (2008). Caret package. Journal of Statistical Software, 28(5).].

Key modifications compared to the R implementation:

  1. The ordinal outcome option has not been translated

  2. Mis-labelling of the binary outcome has not been translated

  3. A linear feature that is a copy of another feature used in the linear predictor, with a small amount of noise added, has been introduced to allow the study of variable surrogacy

  4. Option for a binary predictor and surrogate has been added

  5. Toggle option for regression versus classification has been added

  6. Arguments for the coefficients of the primary predictors of interest have been added
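The surrogate feature described in modification 3 can be pictured with a minimal NumPy sketch. This is not the code used by sim_data: the names `linear_1` and `surrogate` are hypothetical, and it assumes that surg_err acts as the standard deviation of Gaussian noise added to a standard-normal feature.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

n = 1000
surg_err = 0.05  # assumed here to be a noise standard deviation

# A predictive linear feature and a surrogate built as a noisy copy of it.
linear_1 = rng.normal(size=n)
surrogate = linear_1 + rng.normal(scale=surg_err, size=n)

# The surrogate carries almost the same information as the original feature.
corr = np.corrcoef(linear_1, surrogate)[0, 1]
print(f"correlation between feature and surrogate: {corr:.4f}")
```

With a small noise level the two columns are nearly collinear, which is the situation a variable-importance or feature-selection method has to cope with when surrogacy is being studied.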

Parameters
  • n (int) – number of observations

  • intercept (float) – value for the intercept which can be modified to generate class imbalance

  • two_way_coef (Optional[tuple[float, float, float]]) – tuple of three coefficients: two linear terms and an interaction effect

  • linear_vars (int) – number of linear features

  • linear_var_coef (Optional[Sequence[float]]) – an optional list of coefficients for linear features if the default is not desired

  • noise_vars (int) – number of unrelated independent noise features (do not contribute to the linear predictor)

  • corr_vars (int) – number of unrelated correlated noise features (do not contribute to the linear predictor)

  • corr_type (str) – type of correlation (exchangeable or auto-regressive) for correlated noise features

  • corr_value (float) – correlation for correlated noise features

  • surg_err (float) – degree of noise added to the copy of the first linear predictor that serves as its surrogate feature

  • bin_var_p (float) – prevalence for a binary feature to include in linear predictor

  • bin_coef (float) – coefficient for the impact of binary feature on linear predictor

  • outcome (str) – can be either classification for a binary outcome or regression for a continuous outcome

  • regression_err (Optional[float]) – the error to be used in simulating a regression outcome

  • seed_val (int) – set a seed for reproducibility
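The two correlation structures named for corr_type can be sketched as follows. This is an illustration only, not the function's implementation: the helper `corr_matrix` is hypothetical, and the label `"exch"` for the exchangeable structure is an assumption (only the `'AR1'` default appears in the signature above).

```python
import numpy as np

def corr_matrix(p, rho, corr_type="AR1"):
    """Build a p x p correlation matrix (hypothetical helper for illustration)."""
    idx = np.arange(p)
    if corr_type == "AR1":
        # Auto-regressive: correlation decays as rho ** |i - j|.
        return float(rho) ** np.abs(idx[:, None] - idx[None, :])
    if corr_type == "exch":  # assumed label for the exchangeable structure
        # Exchangeable: every pair of distinct features shares the same correlation.
        m = np.full((p, p), float(rho))
        np.fill_diagonal(m, 1.0)
        return m
    raise ValueError(f"unknown corr_type: {corr_type!r}")

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
cov = corr_matrix(4, 0.6, "AR1")

# Draw correlated noise features that play no part in the linear predictor.
corr_noise = rng.multivariate_normal(np.zeros(4), cov, size=200)
print(cov.round(3))
print(corr_noise.shape)
```

Under the auto-regressive structure, neighbouring noise features are more strongly correlated than distant ones; under the exchangeable structure, all pairs are equally correlated.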

Return type

DataFrame

Returns

data frame containing the simulated features and the target for classification or regression
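To see why the intercept can be used to generate class imbalance, and how the outcome toggle might behave, consider the rough sketch below. It assumes a logistic link for the classification outcome and additive Gaussian error for the regression outcome; `simulate_outcome`, the coefficients, and the feature matrix are hypothetical and do not reproduce what sim_data does internally.

```python
import numpy as np

def simulate_outcome(lp, outcome="classification", regression_err=1.0, rng=None):
    """Turn a linear predictor into a target (hypothetical helper for illustration)."""
    rng = np.random.default_rng() if rng is None else rng
    if outcome == "classification":
        # Assumed logistic link: probability of the positive class.
        prob = 1.0 / (1.0 + np.exp(-lp))
        return (rng.uniform(size=lp.shape) < prob).astype(int)
    if outcome == "regression":
        # Assumed additive Gaussian error around the linear predictor.
        return lp + rng.normal(scale=regression_err, size=lp.shape)
    raise ValueError(f"unknown outcome: {outcome!r}")

rng = np.random.default_rng(1)  # arbitrary seed for this sketch
x = rng.normal(size=(5000, 3))
coefs = np.array([1.0, -0.5, 0.25])  # illustrative coefficients only

prevalence = {}
for intercept in (-5.0, 0.0):
    lp = intercept + x @ coefs
    y = simulate_outcome(lp, rng=rng)
    prevalence[intercept] = y.mean()
    print(f"intercept={intercept:+.1f}  positive-class prevalence={y.mean():.3f}")

# The same linear predictor can instead yield a continuous target.
y_reg = simulate_outcome(lp, outcome="regression", regression_err=0.5, rng=rng)
```

A strongly negative intercept pushes most probabilities towards zero, so the positive class becomes rare; an intercept near zero gives roughly balanced classes when the features are centred.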