.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/tutorials/plot_tuto_categorical.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_tutorials_plot_tuto_categorical.py:

==============================
Benchmark for categorical data
==============================

In this tutorial, we show how to use Qolmat to define imputation methods that handle mixed type data.
We benchmark these methods on the Titanic data set, which contains passenger features as well as whether each passenger survived the accident.

.. GENERATED FROM PYTHON SOURCE LINES 9-21

.. code-block:: Python

    from sklearn.pipeline import Pipeline
    from sklearn import utils as sku

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers, preprocessing
    from qolmat.imputations.imputers import ImputerRegressor
    from qolmat.utils import data

    # Shared random state, reused by every imputer and by the hole generator
    seed = 1234
    rng = sku.check_random_state(seed)

.. GENERATED FROM PYTHON SOURCE LINES 22-25

1. Titanic dataset
---------------------------------------------------------------
We load the data and keep only the explanatory variables.

.. GENERATED FROM PYTHON SOURCE LINES 25-30

.. code-block:: Python

    df = data.get_data("Titanic")
    # Drop the target so that only explanatory variables remain
    df = df.drop(columns=["Survived"])
    print("Dataset shape:", df.shape)
    df.head()

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Dataset shape: (892, 6)

==  ======  ====  =====  =====  =======  ========
..  Sex     Age   SibSp  Parch  Fare     Embarked
==  ======  ====  =====  =====  =======  ========
0   male    22.0  1.0    0.0    7.2500   S
1   female  38.0  1.0    0.0    71.2833  C
2   female  26.0  0.0    0.0    7.9250   S
3   female  35.0  1.0    0.0    53.1000  S
4   male    35.0  0.0    0.0    8.0500   S
==  ======  ====  =====  =====  =======  ========


.. GENERATED FROM PYTHON SOURCE LINES 31-36

2. Mixed type imputation methods
---------------------------------------------------------------
Qolmat supports three approaches to impute mixed type data.
The first approach is a simple imputation by the mean, the median or the most frequent value, column by column.

.. GENERATED FROM PYTHON SOURCE LINES 36-39

.. code-block:: Python

    imputer_simple = imputers.ImputerSimple()

.. GENERATED FROM PYTHON SOURCE LINES 40-43

The second approach relies on the class WrapperTransformer, which wraps a numerical imputation method (e.g. RPCA) in a preprocessing transformer.
This transformer exposes fit_transform and inverse_transform methods and provides a numerical embedding of the data.

.. GENERATED FROM PYTHON SOURCE LINES 43-57

.. code-block:: Python

    # Numerical and categorical column names
    cols_num = df.select_dtypes(include="number").columns
    cols_cat = df.select_dtypes(exclude="number").columns
    imputer_rpca = imputers.ImputerRpcaNoisy(random_state=rng)
    ohe = preprocessing.OneHotEncoderProjector(
        handle_unknown="ignore",
        handle_missing="return_nan",
        use_cat_names=True,
        cols=cols_cat,
    )
    bt = preprocessing.BinTransformer(cols=cols_num)
    wrapper = Pipeline(steps=[("OneHotEncoder", ohe), ("BinTransformer", bt)])

    imputer_wrap_rpca = preprocessing.WrapperTransformer(imputer_rpca, wrapper)

.. GENERATED FROM PYTHON SOURCE LINES 58-63

The third approach uses ImputerRegressor, which iteratively imputes each column using the other ones.
The function make_robust_MixteHGB provides an underlying model able to:

- address both numerical targets (regression) and categorical targets (classification)
- manage categorical features through one-hot encoding
- manage missing features (native to the HistGradientBoosting)

.. GENERATED FROM PYTHON SOURCE LINES 63-68

.. code-block:: Python

    pipestimator = preprocessing.make_robust_MixteHGB(avoid_new=True)
    imputer_hgb = ImputerRegressor(estimator=pipestimator, handler_nan="none", random_state=rng)
    imputer_wrap_hgb = preprocessing.WrapperTransformer(imputer_hgb, bt)
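The wrapping idea behind the second approach can be illustrated without Qolmat. The sketch below is only a toy model of the pattern "embed, impute numerically, map back": the helper names (``one_hot_encode``, ``one_hot_decode``, ``impute_column_means``, ``wrapped_impute``) are hypothetical and are not part of the Qolmat API, and a column-mean imputer stands in for RPCA.

```python
from statistics import mean


def one_hot_encode(values, categories):
    """Embed categories as 0/1 vectors; a missing value (None) stays missing."""
    return [
        [None] * len(categories)
        if v is None
        else [1.0 if v == c else 0.0 for c in categories]
        for v in values
    ]


def one_hot_decode(rows, categories):
    """Inverse transform: pick the category with the largest score."""
    return [
        categories[max(range(len(categories)), key=row.__getitem__)]
        for row in rows
    ]


def impute_column_means(rows):
    """Stand-in numerical imputer (assumption): fill None with the column mean."""
    n_cols = len(rows[0])
    means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)] for r in rows]


def wrapped_impute(values, categories):
    """Embed, impute in the numerical space, then map back to categories."""
    embedded = one_hot_encode(values, categories)
    imputed = impute_column_means(embedded)
    return one_hot_decode(imputed, categories)


# Toy column with one missing entry
embarked = ["S", "C", "S", None, "S", "Q"]
print(wrapped_impute(embarked, ["C", "Q", "S"]))
# ['S', 'C', 'S', 'S', 'S', 'Q']
```

The missing entry is imputed in the embedded space as the vector (0.2, 0.2, 0.6), and the inverse transform projects it back to the closest category, here "S".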
.. GENERATED FROM PYTHON SOURCE LINES 69-73

3. Mixed type model selection
---------------------------------------------------------------
Let us now compare these three approaches by measuring their ability to impute uniformly distributed holes.

.. GENERATED FROM PYTHON SOURCE LINES 73-98

.. code-block:: Python

    dict_imputers = {
        "Simple": imputer_simple,
        "HGB": imputer_wrap_hgb,
        "RPCA": imputer_wrap_rpca,
    }
    cols_to_impute = df.columns
    # Proportion of entries hidden by the hole generator
    ratio_masked = 0.1
    generator_holes = missing_patterns.UniformHoleGenerator(
        n_splits=2,
        subset=cols_to_impute,
        ratio_masked=ratio_masked,
        sample_proportional=False,
        random_state=rng,
    )
    metrics = ["rmse", "accuracy"]

    comparison = comparator.Comparator(
        dict_imputers,
        generator_holes=generator_holes,
        metrics=metrics,
        max_evals=2,
    )
    results = comparison.compare(df)

.. GENERATED FROM PYTHON SOURCE LINES 99-101

On the numerical variables, the imputation based on the HistGradientBoosting (HGB) model achieves the lowest Root Mean Squared Error (RMSE) on three of the four columns (Age, Parch and SibSp), while RPCA does better on Fare.

.. GENERATED FROM PYTHON SOURCE LINES 101-103

.. code-block:: Python

    results.loc["rmse"].style.highlight_min(color="lightgreen", axis=1)

=====  =========  =========  =========
..     Simple     HGB        RPCA
=====  =========  =========  =========
Age    15.092098  14.083647  14.226850
Fare   53.723829  47.885061  45.872676
Parch  0.701723   0.543583   0.577267
SibSp  0.999796   0.816497   0.890087
=====  =========  =========  =========


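The principle of this benchmark can be sketched in a few lines of plain Python: hide a fraction of the observed entries, impute them, and measure the error on the hidden entries only. This is a simplified illustration and not Qolmat's implementation. The function names are hypothetical, a mean imputer is used as a stand-in, and the values are toy data.

```python
import math
import random
from statistics import mean


def make_uniform_holes(values, ratio_masked, rng):
    """Draw uniformly at random the indices of observed entries to hide."""
    observed = [i for i, v in enumerate(values) if v is not None]
    n_masked = max(1, round(ratio_masked * len(observed)))
    return set(rng.sample(observed, n_masked))


def impute_mean(values):
    """Stand-in imputer (assumption): fill None with the mean of observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def masked_rmse(values, ratio_masked, rng):
    """Hide some observed entries, impute them, and score on the hidden ones."""
    holes = make_uniform_holes(values, ratio_masked, rng)
    corrupted = [None if i in holes else v for i, v in enumerate(values)]
    imputed = impute_mean(corrupted)
    squared_errors = [(imputed[i] - values[i]) ** 2 for i in holes]
    return math.sqrt(mean(squared_errors))


rng = random.Random(1234)
# Toy numerical column with one genuinely missing value
ages = [22.0, 38.0, 26.0, 35.0, 35.0, None, 54.0, 2.0, 27.0, 14.0, 4.0]
score = masked_rmse(ages, ratio_masked=0.1, rng=rng)
print(f"RMSE on masked cells: {score:.3f}")
```

Only the artificially hidden entries contribute to the score, since the true value of a genuinely missing entry is unknown.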
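.. GENERATED FROM PYTHON SOURCE LINES 104-105

The HGB imputation method generally reaches a better accuracy on the categorical data: it is clearly ahead on Embarked, while the simple imputer is marginally better on Sex.
Accuracy stays close to zero on the continuous columns Age and Fare, which is expected if the metric counts exact matches only.

.. GENERATED FROM PYTHON SOURCE LINES 105-106

.. code-block:: Python

    results.loc["accuracy"].style.highlight_max(color="lightgreen", axis=1)

========  ========  ========  ========
..        Simple    HGB       RPCA
========  ========  ========  ========
Age       0.011111  0.033333  0.022222
Embarked  0.744444  0.861111  0.744444
Fare      0.005556  0.005556  0.016667
Parch     0.783333  0.788889  0.766667
Sex       0.694444  0.683333  0.672222
SibSp     0.716667  0.722222  0.611111
========  ========  ========  ========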


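To see why accuracy is informative for categorical columns but not for continuous ones, consider the minimal sketch below. It assumes that accuracy is the share of imputed entries that exactly equal the hidden true values; Qolmat's own metric may differ in its details. The function name and the imputed values are illustrative only.

```python
def accuracy(true_values, imputed_values):
    """Share of imputed entries that exactly match the hidden true values."""
    matches = [t == p for t, p in zip(true_values, imputed_values)]
    return sum(matches) / len(matches)


# Categorical column: imputing the most frequent value is often right
true_embarked = ["S", "C", "S", "S"]
imputed_embarked = ["S", "S", "S", "S"]

# Continuous column: an imputed value almost never matches exactly
true_fare = [7.25, 71.2833, 7.925, 53.1]
imputed_fare = [34.9, 34.9, 34.9, 34.9]

print(accuracy(true_embarked, imputed_embarked))  # 0.75
print(accuracy(true_fare, imputed_fare))          # 0.0
```

For continuous columns such as Age and Fare, the RMSE is therefore the more meaningful of the two metrics.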
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 43.050 seconds)