Benchmark for categorical data

In this tutorial, we show how to use Qolmat to define imputation methods that handle mixed-type data. We benchmark these methods on the Titanic dataset, which contains passenger features as well as whether each passenger survived the accident.

from sklearn.pipeline import Pipeline
from sklearn import utils as sku

from qolmat.benchmark import comparator, missing_patterns
from qolmat.imputations import imputers, preprocessing
from qolmat.imputations.imputers import ImputerRegressor
from qolmat.utils import data

seed = 1234
rng = sku.check_random_state(seed)

1. Titanic dataset

We load the data and keep only the explanatory variables, dropping the Survived target.

df = data.get_data("Titanic")
df = df.drop(columns=["Survived"])
print("Dataset shape:", df.shape)
df.head()
Dataset shape: (892, 6)
      Sex   Age  SibSp  Parch     Fare Embarked
0    male  22.0    1.0    0.0   7.2500        S
1  female  38.0    1.0    0.0  71.2833        C
2  female  26.0    0.0    0.0   7.9250        S
3  female  35.0    1.0    0.0  53.1000        S
4    male  35.0    0.0    0.0   8.0500        S


2. Mixed type imputation methods

Qolmat supports three approaches to impute mixed-type data. The first approach is a simple column-by-column imputation by the mean, the median or the most frequent value.

imputer_simple = imputers.ImputerSimple()
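To make the idea of column-by-column imputation concrete, here is a minimal pandas sketch. It is not Qolmat's implementation: the helper name impute_column_wise, the toy values, and the choice of the median for numerical columns and the mode for the others are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy mixed-type frame (illustrative values, not the Titanic data).
df_toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Sex": ["male", "female", None, "female"],
    }
)


def impute_column_wise(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numerical columns with their median, other columns with their mode."""
    df_imputed = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fill_value = df[col].median()
        else:
            # mode() ignores missing values and returns the most frequent ones
            fill_value = df[col].mode(dropna=True).iloc[0]
        df_imputed[col] = df[col].fillna(fill_value)
    return df_imputed


df_filled = impute_column_wise(df_toy)
print(df_filled)
```

Each column is treated in isolation, so this kind of baseline ignores any relationship between features, which is what the two other approaches try to exploit.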

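The second approach relies on the class WrapperTransformer, which wraps a numerical imputation method (e.g. RPCA) in a preprocessing transformer. The transformer exposes fit_transform and inverse_transform methods: it embeds the mixed-type data into a numerical space, the imputer fills the holes in that space, and the result is mapped back to the original types.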

cols_num = df.select_dtypes(include="number").columns
cols_cat = df.select_dtypes(exclude="number").columns
imputer_rpca = imputers.ImputerRpcaNoisy(random_state=rng)
ohe = preprocessing.OneHotEncoderProjector(
    handle_unknown="ignore",
    handle_missing="return_nan",
    use_cat_names=True,
    cols=cols_cat,
)
bt = preprocessing.BinTransformer(cols=cols_num)
wrapper = Pipeline(steps=[("OneHotEncoder", ohe), ("BinTransformer", bt)])
imputer_wrap_rpca = preprocessing.WrapperTransformer(imputer_rpca, wrapper)
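The embed, impute, invert cycle can be sketched on a single categorical column. This is only an illustration under stated assumptions: the helpers one_hot_with_nan and project_to_category are hypothetical names, a column-mean fill stands in for RPCA, and the inverse step simply picks the category with the largest imputed score.

```python
import numpy as np
import pandas as pd


def one_hot_with_nan(series: pd.Series) -> pd.DataFrame:
    """One-hot encode a column, keeping missing entries as all-NaN rows."""
    dummies = pd.get_dummies(series, dtype=float)
    dummies.loc[series.isna(), :] = np.nan
    return dummies


def project_to_category(embedded: pd.DataFrame) -> pd.Series:
    """Map each (possibly fractional) one-hot row back to its largest entry."""
    return embedded.idxmax(axis=1)


embarked = pd.Series(["S", "C", None, "S", "Q"], name="Embarked")

# 1. Embed: the categorical column becomes a numerical matrix with holes.
embedded = one_hot_with_nan(embarked)

# 2. Impute in the numerical space (column means as a stand-in for RPCA).
embedded_imputed = embedded.fillna(embedded.mean())

# 3. Invert: project the imputed rows back onto valid categories.
embarked_imputed = project_to_category(embedded_imputed)
print(embarked_imputed.tolist())
```

Keeping the missing rows as NaN in the embedding matters: if they were encoded as all zeros, the numerical imputer would see them as observed values rather than holes to fill.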

The third approach uses ImputerRegressor, which iteratively imputes each column using the other ones. The function make_robust_MixteHGB provides an underlying model able to:
- address both numerical targets (regression) and categorical targets (classification)
- manage categorical features through one-hot encoding
- manage missing features (native to the HistGradientBoosting)

pipestimator = preprocessing.make_robust_MixteHGB(avoid_new=True)
imputer_hgb = ImputerRegressor(estimator=pipestimator, handler_nan="none", random_state=rng)
imputer_wrap_hgb = preprocessing.WrapperTransformer(imputer_hgb, bt)

3. Mixed type model selection

Let us now compare these three approaches by measuring their ability to impute uniformly distributed holes.

dict_imputers = {
    "Simple": imputer_simple,
    "HGB": imputer_wrap_hgb,
    "RPCA": imputer_wrap_rpca,
}
cols_to_impute = df.columns
ratio_masked = 0.1
generator_holes = missing_patterns.UniformHoleGenerator(
    n_splits=2,
    subset=cols_to_impute,
    ratio_masked=ratio_masked,
    sample_proportional=False,
    random_state=rng
)
metrics = ["rmse", "accuracy"]

comparison = comparator.Comparator(
    dict_imputers,
    generator_holes=generator_holes,
    metrics=metrics,
    max_evals=2,
)
results = comparison.compare(df)
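Conceptually, such a benchmark hides some observed cells, imputes them, and scores the imputation on the hidden cells only. The sketch below reproduces this loop with pandas on synthetic data; the helper make_uniform_mask is hypothetical, and drawing each cell independently with probability ratio_masked is an assumption that may differ from how UniformHoleGenerator selects its holes.

```python
import numpy as np
import pandas as pd


def make_uniform_mask(
    df: pd.DataFrame, ratio_masked: float, rng: np.random.Generator
) -> pd.DataFrame:
    """Boolean frame, True where an observed cell is hidden for evaluation."""
    draws = rng.random(df.shape)
    mask = pd.DataFrame(draws < ratio_masked, index=df.index, columns=df.columns)
    # Only cells that are actually observed can be scored afterwards.
    return mask & df.notna()


rng_toy = np.random.default_rng(1234)
df_toy = pd.DataFrame(
    {
        "Age": rng_toy.normal(30, 10, size=500),
        "Fare": rng_toy.gamma(2.0, 15.0, size=500),
    }
)

mask = make_uniform_mask(df_toy, ratio_masked=0.1, rng=rng_toy)
df_corrupted = df_toy.mask(mask)  # hidden cells become NaN
df_imputed = df_corrupted.fillna(df_corrupted.median())  # baseline imputer

# RMSE per column, computed on the hidden cells only.
squared_errors = ((df_toy - df_imputed) ** 2).where(mask)
rmse = np.sqrt(squared_errors.mean())
print(rmse)
```

Restricting the score to the hidden cells is essential: the untouched cells are reproduced exactly by any imputer and would otherwise dilute the error.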

On numerical variables, the imputation based on the HistGradientBoosting (HGB) model achieves the lowest Root Mean Squared Error (RMSE) on Age, Parch and SibSp, while RPCA is slightly better on Fare.

results.loc["rmse"].style.highlight_min(color="lightgreen", axis=1)
          Simple        HGB       RPCA
Age    15.092098  14.083647  14.226850
Fare   53.723829  47.885061  45.872676
Parch   0.701723   0.543583   0.577267
SibSp   0.999796   0.816497   0.890087


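The HGB imputation method reaches the best accuracy on Embarked, whereas the simple imputer is marginally ahead on Sex. Since accuracy counts exact matches, it is mostly informative for the categorical columns (Sex, Embarked) and close to zero for continuous ones such as Age and Fare.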

results.loc["accuracy"].style.highlight_max(color="lightgreen", axis=1)
            Simple       HGB      RPCA
Age       0.011111  0.033333  0.022222
Embarked  0.744444  0.861111  0.744444
Fare      0.005556  0.005556  0.016667
Parch     0.783333  0.788889  0.766667
Sex       0.694444  0.683333  0.672222
SibSp     0.716667  0.722222  0.611111
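The very low accuracy on Age and Fare is expected if accuracy is read as the share of hidden cells imputed with exactly the true value. The toy sketch below, with a hypothetical helper accuracy_on_mask and made-up values, shows how a constant imputation can be right half of the time on a categorical column and never on a continuous one.

```python
import pandas as pd


def accuracy_on_mask(
    df_true: pd.DataFrame, df_imputed: pd.DataFrame, mask: pd.DataFrame
) -> pd.Series:
    """Share of hidden cells, per column, whose imputed value equals the truth."""
    hits = (df_true == df_imputed) & mask
    return hits.sum() / mask.sum()


df_true = pd.DataFrame(
    {
        "Embarked": ["S", "C", "S", "Q", "S", "S"],
        "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    }
)

# Hide the first four rows of both columns.
mask = pd.DataFrame(False, index=df_true.index, columns=df_true.columns)
mask.iloc[:4] = True

# Constant imputations: most frequent port, and an arbitrary central age.
df_imputed = df_true.copy()
df_imputed.loc[mask["Embarked"], "Embarked"] = "S"
df_imputed.loc[mask["Age"], "Age"] = 30.5

accuracy = accuracy_on_mask(df_true, df_imputed, mask)
print(accuracy)
```

For continuous variables, an error-based metric such as the RMSE is therefore the more relevant criterion.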


Total running time of the script: (0 minutes 43.050 seconds)
