.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/tutorials/plot_tuto_mean_median.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_tutorials_plot_tuto_mean_median.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_tutorials_plot_tuto_mean_median.py:

========================================================================================
Comparison of basic imputers
========================================================================================

In this tutorial, we show how to use the Qolmat comparator
(:class:`~qolmat.benchmark.comparator.Comparator`) to choose
the better of two of the simplest imputation methods: mean or median
(:class:`~qolmat.imputations.imputers.ImputerSimple`).
The dataset used is the numerical `superconduct` dataset, which
contains information on 21263 superconductors.
We generate holes uniformly at random via
:class:`~qolmat.benchmark.missing_patterns.UniformHoleGenerator`.

.. GENERATED FROM PYTHON SOURCE LINES 14-27

.. code-block:: Python


    import matplotlib
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import utils as sku

    from qolmat.benchmark import comparator, missing_patterns
    from qolmat.imputations import imputers
    from qolmat.utils import data, plot

    seed = 1234
    rng = sku.check_random_state(seed)


.. GENERATED FROM PYTHON SOURCE LINES 28-38

1. Data
---------------------------------------------------------------
The data contains information on 21263 superconductors.
Originally, the first 81 columns contain extracted features and
the 82nd column contains the critical temperature, which is used as the
target variable.
The data does not contain missing values,
so for the purpose of this tutorial
we corrupt the data with the :func:`qolmat.utils.data.add_holes` function.
In this way, each column has missing values.

.. GENERATED FROM PYTHON SOURCE LINES 38-43

.. code-block:: Python


    df = data.add_holes(
        data.get_data("Superconductor"), ratio_masked=0.2, mean_size=120, random_state=rng
    )


.. GENERATED FROM PYTHON SOURCE LINES 44-46

The dataset contains 82 columns. For simplicity,
we only consider five of them.

.. GENERATED FROM PYTHON SOURCE LINES 46-57

.. code-block:: Python


    columns = [
        "criticaltemp",
        "mean_atomic_mass",
        "mean_FusionHeat",
        "mean_ThermalConductivity",
        "mean_Valence",
    ]
    df = df[columns]
    cols_to_impute = df.columns


.. GENERATED FROM PYTHON SOURCE LINES 58-61

Let's take a look at the missing data.
In this plot, a white (resp. black) box represents
a missing (resp. observed) value.

.. GENERATED FROM PYTHON SOURCE LINES 61-71

.. code-block:: Python


    plt.figure(figsize=(15, 4))
    plt.imshow(
        df.notna().values.T, aspect="auto", cmap="binary", interpolation="none"
    )
    plt.yticks(range(len(df.columns)), df.columns)
    plt.xlabel("Samples", fontsize=12)
    plt.grid(False)
    plt.show()


.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_001.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 72-79

2. Imputation
---------------------------------------------------------------
This part is devoted to the imputation methods.
In this tutorial, we only focus on mean and median imputation.
In order to use the comparator, we have to define a dictionary of imputers,
a way to generate holes (additional missing values on which the
imputers will be evaluated) and a list of metrics.
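
As a rough illustration of what the two strategies do, the sketch below
fills the missing entries of a toy column with NumPy only.
The values are made up and the code does not use Qolmat;
it only shows how the mean and the median of the observed values
differ when the column contains an outlier.

.. code-block:: Python

    import numpy as np

    # Toy column with one large outlier and two missing entries (made-up values).
    x = np.array([1.0, 2.0, np.nan, 3.0, 100.0, np.nan])

    # Statistics computed on the observed values only.
    fill_mean = np.nanmean(x)  # (1 + 2 + 3 + 100) / 4 = 26.5
    fill_median = np.nanmedian(x)  # (2 + 3) / 2 = 2.5

    # Replace every missing entry by the chosen constant.
    x_mean = np.where(np.isnan(x), fill_mean, x)
    x_median = np.where(np.isnan(x), fill_median, x)

    print(x_mean)
    print(x_median)

The outlier pulls the mean far from the bulk of the data, while the median
stays close to the typical values.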

.. GENERATED FROM PYTHON SOURCE LINES 79-86

.. code-block:: Python


    imputer_mean = imputers.ImputerSimple(strategy="mean")
    imputer_median = imputers.ImputerSimple(strategy="median")
    dict_imputers = {"mean": imputer_mean, "median": imputer_median}

    metrics = ["mae", "wmape", "kl_columnwise"]


.. GENERATED FROM PYTHON SOURCE LINES 87-96

Concretely, the comparator needs a dataframe to impute,
a dictionary of imputers (those defined above)
and a hole generator specifying the type of holes to create.
The generator itself is given the proportion of values to mask
and the names of the columns in which to create holes.
In this example, we have chosen the uniform hole generator.
Here, we ask it to mask 10% of the data (``ratio_masked=0.1``)
in the columns to impute (``subset=cols_to_impute``):

.. GENERATED FROM PYTHON SOURCE LINES 96-120

.. code-block:: Python


    generator_holes = missing_patterns.UniformHoleGenerator(
        n_splits=2, subset=cols_to_impute, ratio_masked=0.1, random_state=rng
    )
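    df_mask = generator_holes.generate_mask(df)
    # After inversion: 0 where a hole is created, 1 elsewhere.
    df_mask = np.invert(df_mask).astype("int")

    # Encode each cell: 0 = created hole, 1 = observed, 3 = originally missing.
    df_tot = df.copy()
    df_tot[df.notna()] = 0
    df_tot[df.isna()] = 2
    df_tot += df_mask

    # Colors for the lowest, middle and highest codes: red, black, white.
    colorsList = [(1, 0, 0), (0, 0, 0), (1, 1, 1)]
    custom_cmap = matplotlib.colors.ListedColormap(colorsList)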

    plt.figure(figsize=(15, 4))
    plt.imshow(
        df_tot.values.T, aspect="auto", cmap=custom_cmap, interpolation="none"
    )
    plt.yticks(range(len(df_tot.columns)), df_tot.columns)
    plt.xlabel("Samples", fontsize=12)
    plt.grid(False)
    plt.show()


.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_002.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_002.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 121-123

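In this plot, a red box represents a hole created by the generator,
a black box an observed value and a white box a value
that was already missing.
Now that we've seen how hole generation behaves,
we can use it in the comparator.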
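
The principle behind such a comparison can be sketched with NumPy alone.
This is a simplified illustration under our own assumptions
(synthetic data, a single column, a hypothetical helper named
``score_constant_imputers``), not Qolmat's actual implementation:
some observed values are hidden, each imputer fills them,
and the error is measured only where the true value is known.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic, skewed column standing in for a real feature (made-up data).
    x_true = rng.exponential(scale=10.0, size=1000)

    # Values that are genuinely missing: nothing can be scored on them.
    x = x_true.copy()
    x[rng.random(x.size) < 0.2] = np.nan


    def score_constant_imputers(x, ratio_masked, rng):
        """Hide some observed values, impute them, return the MAE per imputer."""
        idx_observed = np.flatnonzero(~np.isnan(x))
        n_holes = int(ratio_masked * idx_observed.size)
        idx_holes = rng.choice(idx_observed, size=n_holes, replace=False)

        x_corrupted = x.copy()
        x_corrupted[idx_holes] = np.nan

        # Each imputer is fitted on the corrupted column only.
        fills = {
            "mean": np.nanmean(x_corrupted),
            "median": np.nanmedian(x_corrupted),
        }
        # The error is computed on the artificial holes, where the truth is known.
        return {
            name: np.mean(np.abs(x[idx_holes] - fill))
            for name, fill in fills.items()
        }


    # Average over two independent masks, in the spirit of ``n_splits=2``.
    scores = [
        score_constant_imputers(x, ratio_masked=0.1, rng=rng) for _ in range(2)
    ]
    mae = {name: np.mean([s[name] for s in scores]) for name in ("mean", "median")}
    print(mae)

On a skewed column the median typically obtains the lower MAE,
because the median is the constant that minimises the mean absolute deviation.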

.. GENERATED FROM PYTHON SOURCE LINES 123-131

.. code-block:: Python


    comparison = comparator.Comparator(
        dict_imputers,
        generator_holes=generator_holes,
        metrics=metrics,
        max_evals=5,
    )


.. GENERATED FROM PYTHON SOURCE LINES 132-136

On the basis of the results, we can see that imputation by
the median provides lower reconstruction errors
than those obtained by imputation by the mean,
except for ``mean_atomic_mass``, where the mean is marginally better
for MAE and WMAPE and both imputers tie on ``kl_columnwise``.
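
To help read the MAE and WMAPE values, here is a minimal sketch of both metrics,
assuming their usual definitions (mean absolute error, and sum of absolute
errors divided by the sum of absolute true values).
The functions ``mae`` and ``wmape`` below are illustrative helpers written
for this sketch; Qolmat's own implementation may differ in its details.

.. code-block:: Python

    import numpy as np


    def mae(y_true, y_pred):
        """Mean absolute error, expressed in the unit of the column."""
        return np.mean(np.abs(y_true - y_pred))


    def wmape(y_true, y_pred):
        """Sum of absolute errors divided by the sum of absolute true values."""
        return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))


    y_true = np.array([10.0, 20.0, 30.0, 40.0])
    y_pred = np.full_like(y_true, 25.0)  # a constant imputation

    print(mae(y_true, y_pred))  # 10.0
    print(wmape(y_true, y_pred))  # 0.4

    # Rescaling the column changes the MAE but not the WMAPE.
    print(mae(100 * y_true, 100 * y_pred))  # 1000.0
    print(wmape(100 * y_true, 100 * y_pred))  # 0.4

This is why MAE values are hard to compare between columns with different
scales, whereas WMAPE values are comparable across columns.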

.. GENERATED FROM PYTHON SOURCE LINES 136-140

.. code-block:: Python


    results = comparison.compare(df)
    results.style.highlight_min(color="lightsteelblue", axis=1)


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <style type="text/css">
    #T_ebaea_row0_col1, #T_ebaea_row1_col1, #T_ebaea_row2_col1, #T_ebaea_row3_col1, #T_ebaea_row4_col0, #T_ebaea_row4_col1, #T_ebaea_row5_col1, #T_ebaea_row6_col1, #T_ebaea_row7_col1, #T_ebaea_row8_col1, #T_ebaea_row9_col0, #T_ebaea_row10_col1, #T_ebaea_row11_col1, #T_ebaea_row12_col1, #T_ebaea_row13_col1, #T_ebaea_row14_col0 {
      background-color: lightsteelblue;
    }
    </style>
    <table id="T_ebaea">
      <thead>
        <tr>
          <th class="blank" >&nbsp;</th>
          <th class="blank level0" >&nbsp;</th>
          <th id="T_ebaea_level0_col0" class="col_heading level0 col0" >mean</th>
          <th id="T_ebaea_level0_col1" class="col_heading level0 col1" >median</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th id="T_ebaea_level0_row0" class="row_heading level0 row0" rowspan="5">kl_columnwise</th>
          <th id="T_ebaea_level1_row0" class="row_heading level1 row0" >criticaltemp</th>
          <td id="T_ebaea_row0_col0" class="data row0 col0" >29.699655</td>
          <td id="T_ebaea_row0_col1" class="data row0 col1" >28.770511</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row1" class="row_heading level1 row1" >mean_FusionHeat</th>
          <td id="T_ebaea_row1_col0" class="data row1 col0" >28.283599</td>
          <td id="T_ebaea_row1_col1" class="data row1 col1" >13.688591</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row2" class="row_heading level1 row2" >mean_ThermalConductivity</th>
          <td id="T_ebaea_row2_col0" class="data row2 col0" >25.868046</td>
          <td id="T_ebaea_row2_col1" class="data row2 col1" >25.545719</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row3" class="row_heading level1 row3" >mean_Valence</th>
          <td id="T_ebaea_row3_col0" class="data row3 col0" >32.290397</td>
          <td id="T_ebaea_row3_col1" class="data row3 col1" >28.979781</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row4" class="row_heading level1 row4" >mean_atomic_mass</th>
          <td id="T_ebaea_row4_col0" class="data row4 col0" >24.258423</td>
          <td id="T_ebaea_row4_col1" class="data row4 col1" >24.258423</td>
        </tr>
        <tr>
          <th id="T_ebaea_level0_row5" class="row_heading level0 row5" rowspan="5">mae</th>
          <th id="T_ebaea_level1_row5" class="row_heading level1 row5" >criticaltemp</th>
          <td id="T_ebaea_row5_col0" class="data row5 col0" >29.505923</td>
          <td id="T_ebaea_row5_col1" class="data row5 col1" >27.669155</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row6" class="row_heading level1 row6" >mean_FusionHeat</th>
          <td id="T_ebaea_row6_col0" class="data row6 col0" >8.112540</td>
          <td id="T_ebaea_row6_col1" class="data row6 col1" >6.944263</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row7" class="row_heading level1 row7" >mean_ThermalConductivity</th>
          <td id="T_ebaea_row7_col0" class="data row7 col0" >29.626863</td>
          <td id="T_ebaea_row7_col1" class="data row7 col1" >29.174941</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row8" class="row_heading level1 row8" >mean_Valence</th>
          <td id="T_ebaea_row8_col0" class="data row8 col0" >0.825231</td>
          <td id="T_ebaea_row8_col1" class="data row8 col1" >0.775617</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row9" class="row_heading level1 row9" >mean_atomic_mass</th>
          <td id="T_ebaea_row9_col0" class="data row9 col0" >20.671858</td>
          <td id="T_ebaea_row9_col1" class="data row9 col1" >20.683861</td>
        </tr>
        <tr>
          <th id="T_ebaea_level0_row10" class="row_heading level0 row10" rowspan="5">wmape</th>
          <th id="T_ebaea_level1_row10" class="row_heading level1 row10" >criticaltemp</th>
          <td id="T_ebaea_row10_col0" class="data row10 col0" >0.855155</td>
          <td id="T_ebaea_row10_col1" class="data row10 col1" >0.801891</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row11" class="row_heading level1 row11" >mean_FusionHeat</th>
          <td id="T_ebaea_row11_col0" class="data row11 col0" >0.568664</td>
          <td id="T_ebaea_row11_col1" class="data row11 col1" >0.486700</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row12" class="row_heading level1 row12" >mean_ThermalConductivity</th>
          <td id="T_ebaea_row12_col0" class="data row12 col0" >0.331483</td>
          <td id="T_ebaea_row12_col1" class="data row12 col1" >0.326426</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row13" class="row_heading level1 row13" >mean_Valence</th>
          <td id="T_ebaea_row13_col0" class="data row13 col0" >0.262622</td>
          <td id="T_ebaea_row13_col1" class="data row13 col1" >0.246824</td>
        </tr>
        <tr>
          <th id="T_ebaea_level1_row14" class="row_heading level1 row14" >mean_atomic_mass</th>
          <td id="T_ebaea_row14_col0" class="data row14 col0" >0.232943</td>
          <td id="T_ebaea_row14_col1" class="data row14 col1" >0.233077</td>
        </tr>
      </tbody>
    </table>

    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 141-142

Let's visualize this dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 142-152

.. code-block:: Python


    n_metrics = len(metrics)
    fig = plt.figure(figsize=(14, 3 * n_metrics))
    for i, metric in enumerate(metrics):
        fig.add_subplot(n_metrics, 1, i + 1)
        plot.multibar(results.loc[metric], decimals=2)
        plt.ylabel(metric)
    plt.show()


.. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_003.png
   :alt: plot tuto mean median
   :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_003.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 153-157

Finally, let's take a look at the imputations.
Whatever the method, we observe that the imputations
are relatively poor, since each imputer fills every hole of a column
with a single constant. Other imputation methods are therefore
necessary (see folder `imputations`).
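
One way to quantify this weakness is to look at the spread of a column
before and after imputation. The sketch below uses synthetic data and NumPy
only (it is an illustration, not part of the Qolmat workflow):
filling 30% of a column with its median noticeably shrinks
the standard deviation.

.. code-block:: Python

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic column (made-up data) with roughly 30% of missing values.
    x_true = rng.normal(loc=50.0, scale=10.0, size=5000)
    x = x_true.copy()
    x[rng.random(x.size) < 0.3] = np.nan

    # Constant imputation by the median of the observed values.
    x_imputed = np.where(np.isnan(x), np.nanmedian(x), x)

    std_observed = np.nanstd(x)  # spread of the observed values
    std_imputed = np.std(x_imputed)  # spread after imputation
    print(std_observed, std_imputed)

All imputed points sit at the same value, so the imputed column is more
concentrated than the observed data.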

.. GENERATED FROM PYTHON SOURCE LINES 157-173

.. code-block:: Python


    dfs_imputed = {
        name: imp.fit_transform(df) for name, imp in dict_imputers.items()
    }

    for col in cols_to_impute:
        fig, ax = plt.subplots(figsize=(10, 3))
        values_orig = df[col]
        plt.plot(values_orig[15000:], ".", color="black", label="original")
        for name in dict_imputers:
            values_imp = dfs_imputed[name][col].copy()
            values_imp[values_orig.notna()] = np.nan
            plt.plot(values_imp[15000:], ".", label=name, alpha=1)
        plt.ylabel(col, fontsize=16)
        plt.legend()
        plt.show()


.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_004.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_004.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_005.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_005.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_006.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_006.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_007.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_007.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_008.png
         :alt: plot tuto mean median
         :srcset: /examples/tutorials/images/sphx_glr_plot_tuto_mean_median_008.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.819 seconds)


.. _sphx_glr_download_examples_tutorials_plot_tuto_mean_median.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_tuto_mean_median.ipynb <plot_tuto_mean_median.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_tuto_mean_median.py <plot_tuto_mean_median.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_tuto_mean_median.zip <plot_tuto_mean_median.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_