.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/tutorials/plot_tuto_mean_median.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_tutorials_plot_tuto_mean_median.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_tutorials_plot_tuto_mean_median.py:


========================================================================================
Comparison of basic imputers
========================================================================================

In this tutorial, we show how to use the Qolmat comparator
(:class:`~qolmat.benchmark.comparator.Comparator`) to choose the better of
two of the simplest imputation methods: mean or median
(:class:`~qolmat.imputations.imputers.ImputerSimple`).
The dataset used is the numerical `superconduct` dataset, which
contains information on 21263 superconductors.
We generate holes uniformly at random via
:class:`~qolmat.benchmark.missing_patterns.UniformHoleGenerator`.

.. GENERATED FROM PYTHON SOURCE LINES 14-27

.. code-block:: Python


    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import utils as sku

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers
    from qolmat.utils import data, plot

    seed = 1234
    rng = sku.check_random_state(seed)

.. GENERATED FROM PYTHON SOURCE LINES 28-38

1. Data
---------------------------------------------------------------
The data contains information on 21263 superconductors.
Originally, the first 81 columns contain extracted features and
the 82nd column contains the critical temperature, which is
used as the target variable.
The data contains no missing values, so for the purpose of this
tutorial we corrupt it with the :func:`qolmat.utils.data.add_holes`
function, so that every column has missing values.

..
   GENERATED FROM PYTHON SOURCE LINES 38-43

.. code-block:: Python

    df = data.add_holes(
        data.get_data("Superconductor"), ratio_masked=0.2, mean_size=120, random_state=rng
    )

.. GENERATED FROM PYTHON SOURCE LINES 44-46

The dataset contains 82 columns. For simplicity, we keep only five of them.

.. GENERATED FROM PYTHON SOURCE LINES 46-57

.. code-block:: Python

    columns = [
        "criticaltemp",
        "mean_atomic_mass",
        "mean_FusionHeat",
        "mean_ThermalConductivity",
        "mean_Valence",
    ]
    df = df[columns]
    cols_to_impute = df.columns

.. GENERATED FROM PYTHON SOURCE LINES 58-61

Let's take a look at the missing data.
In this plot, a white (resp. black) box represents
a missing (resp. observed) value.

.. GENERATED FROM PYTHON SOURCE LINES 61-71

.. code-block:: Python

    plt.figure(figsize=(15, 4))
    plt.imshow(
        df.notna().values.T, aspect="auto", cmap="binary", interpolation="none"
    )
    plt.yticks(range(len(df.columns)), df.columns)
    plt.xlabel("Samples", fontsize=12)
    plt.grid(False)
    plt.show()

.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_001.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 72-79

2. Imputation
---------------------------------------------------------------
This part is devoted to the imputation methods.
In this tutorial, we focus only on mean and median imputation.
In order to use the comparator, we have to define a dictionary of
imputers, a way to generate holes (additional missing values on
which the imputers will be evaluated) and a list of metrics.

.. GENERATED FROM PYTHON SOURCE LINES 79-86

.. code-block:: Python

    imputer_mean = imputers.ImputerSimple(strategy="mean")
    imputer_median = imputers.ImputerSimple(strategy="median")
    dict_imputers = {"mean": imputer_mean, "median": imputer_median}

    metrics = ["mae", "wmape", "kl_columnwise"]

..
   GENERATED FROM PYTHON SOURCE LINES 87-96

Concretely, the comparator takes as input a dictionary of imputers
(those defined above), a generator of holes specifying the type of
holes to create, and a list of metrics; the dataframe to impute is
passed to its ``compare`` method.
In this example, we have chosen the uniform hole generator.
Here we ask it to mask 10% of the data (``ratio_masked=0.1``)
and to create the missing values in the columns to impute
(``subset=cols_to_impute``):

.. GENERATED FROM PYTHON SOURCE LINES 96-120

.. code-block:: Python

    generator_holes = missing_patterns.UniformHoleGenerator(
        n_splits=2, subset=cols_to_impute, ratio_masked=0.1, random_state=rng
    )
    df_mask = generator_holes.generate_mask(df)
    df_mask = np.invert(df_mask).astype("int")
    df_tot = df.copy()
    df_tot[df.notna()] = 0
    df_tot[df.isna()] = 2
    df_tot += df_mask
    colorsList = [(1, 0, 0), (0, 0, 0), (1, 1, 1)]
    custom_cmap = matplotlib.colors.ListedColormap(colorsList)
    plt.figure(figsize=(15, 4))
    plt.imshow(
        df_tot.values.T, aspect="auto", cmap=custom_cmap, interpolation="none"
    )
    plt.yticks(range(len(df_tot.columns)), df_tot.columns)
    plt.xlabel("Samples", fontsize=12)
    plt.grid(False)
    plt.show()

.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_002.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 121-123

Now that we've seen how hole generation behaves,
we can use it in the comparator.

.. GENERATED FROM PYTHON SOURCE LINES 123-131

.. code-block:: Python

    comparison = comparator.Comparator(
        dict_imputers,
        generator_holes=generator_holes,
        metrics=metrics,
        max_evals=5,
    )

.. GENERATED FROM PYTHON SOURCE LINES 132-136

The results show that median imputation gives lower reconstruction
errors than mean imputation for every column except
`mean_atomic_mass`, where the mean is marginally better on MAE and
WMAPE and the two methods are tied on the column-wise KL divergence.

..
   GENERATED FROM PYTHON SOURCE LINES 136-140

.. code-block:: Python

    results = comparison.compare(df)
    results.style.highlight_min(color="lightsteelblue", axis=1)

============= ======================== ========= =========
metric        column                   mean      median
============= ======================== ========= =========
kl_columnwise criticaltemp             29.699655 28.770511
kl_columnwise mean_FusionHeat          28.283599 13.688591
kl_columnwise mean_ThermalConductivity 25.868046 25.545719
kl_columnwise mean_Valence             32.290397 28.979781
kl_columnwise mean_atomic_mass         24.258423 24.258423
mae           criticaltemp             29.505923 27.669155
mae           mean_FusionHeat          8.112540  6.944263
mae           mean_ThermalConductivity 29.626863 29.174941
mae           mean_Valence             0.825231  0.775617
mae           mean_atomic_mass         20.671858 20.683861
wmape         criticaltemp             0.855155  0.801891
wmape         mean_FusionHeat          0.568664  0.486700
wmape         mean_ThermalConductivity 0.331483  0.326426
wmape         mean_Valence             0.262622  0.246824
wmape         mean_atomic_mass         0.232943  0.233077
============= ======================== ========= =========

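
To make the scores in the table above more concrete, here is a minimal,
self-contained sketch of the evaluation idea: hide some known values,
impute them, and score the imputations on the hidden entries only.
It is not Qolmat's implementation. The toy log-normal column, the
``score`` helper and the WMAPE formula used here (sum of absolute errors
divided by the sum of absolute true values) are assumptions made for
illustration.

.. code-block:: Python

    import numpy as np
    import pandas as pd

    toy_rng = np.random.default_rng(0)

    # Hypothetical skewed toy column (not the superconductor data).
    values = pd.Series(toy_rng.lognormal(mean=3.0, sigma=1.0, size=1000))

    # Hide roughly 10% of the entries; they play the role of generated holes.
    mask = pd.Series(toy_rng.random(values.size) < 0.1, index=values.index)
    observed = values.mask(mask)


    def score(imputed, truth, mask):
        """Return MAE and WMAPE computed on the masked entries only."""
        abs_err = (imputed[mask] - truth[mask]).abs()
        return {
            "mae": abs_err.mean(),
            # Assumed definition: total absolute error over total absolute truth.
            "wmape": abs_err.sum() / truth[mask].abs().sum(),
        }


    # Fill the holes with the mean or the median of the observed values
    # (both pandas reductions skip NaN by default).
    scores = pd.DataFrame(
        {
            name: score(observed.fillna(getattr(observed, name)()), values, mask)
            for name in ("mean", "median")
        }
    )
    print(scores)

The resulting dataframe has one row per metric and one column per
strategy, the same orientation as the comparator output, so the lower
value in each row identifies the better strategy for that metric.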


.. GENERATED FROM PYTHON SOURCE LINES 141-142

Let's visualize this dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 142-152

.. code-block:: Python

    n_metrics = len(metrics)
    fig = plt.figure(figsize=(14, 3 * n_metrics))
    for i, metric in enumerate(metrics):
        fig.add_subplot(n_metrics, 1, i + 1)
        plot.multibar(results.loc[metric], decimals=2)
        plt.ylabel(metric)
    plt.show()

.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_003.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_003.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 153-157

Finally, let's take a look at the imputations.
Whichever method is used, the imputations are relatively poor,
since every hole in a column is filled with the same constant.
Other imputation methods are therefore needed (see folder `imputations`).

.. GENERATED FROM PYTHON SOURCE LINES 157-173

.. code-block:: Python

    dfs_imputed = {
        name: imp.fit_transform(df) for name, imp in dict_imputers.items()
    }

    for col in cols_to_impute:
        fig, ax = plt.subplots(figsize=(10, 3))
        values_orig = df[col]

        plt.plot(values_orig[15000:], ".", color="black", label="original")

        for ind, (name, model) in enumerate(list(dict_imputers.items())):
            values_imp = dfs_imputed[name][col].copy()
            values_imp[values_orig.notna()] = np.nan
            plt.plot(values_imp[15000:], ".", label=name, alpha=1)
        plt.ylabel(col, fontsize=16)
        plt.legend()
        plt.show()

.. rst-class:: sphx-glr-horizontal

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_004.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_004.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_005.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_005.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_006.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_006.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_007.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_007.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_008.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_008.png
         :class: sphx-glr-multi-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.819 seconds)

.. _sphx_glr_download_examples_tutorials_plot_tuto_mean_median.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_tuto_mean_median.ipynb <plot_tuto_mean_median.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_tuto_mean_median.py <plot_tuto_mean_median.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_tuto_mean_median.zip <plot_tuto_mean_median.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_