Comparison of basic imputers

In this tutorial, we show how to use the Qolmat comparator (qolmat.benchmark.comparator) to choose between two of the simplest imputation methods, mean and median imputation (ImputerSimple). The dataset is the numerical superconduct dataset, which contains information on 21263 superconductors. We generate holes uniformly at random with UniformHoleGenerator.

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from sklearn import utils as sku

from qolmat.benchmark import comparator, missing_patterns
from qolmat.imputations import imputers
from qolmat.utils import data, plot

seed = 1234
rng = sku.check_random_state(seed)

1. Data

The data contains information on 21263 superconductors. Originally, the first 81 columns contain extracted features and the 82nd column contains the critical temperature, which is used as the target variable. The data contains no missing values, so for the purpose of this notebook we corrupt it with the qolmat.utils.data.add_holes() function. As a result, each column has missing values.

df = data.add_holes(
    data.get_data("Superconductor"), ratio_masked=0.2, mean_size=120, random_state=rng
)

The dataset contains 82 columns. For simplicity, we keep only five of them.

columns = [
    "criticaltemp",
    "mean_atomic_mass",
    "mean_FusionHeat",
    "mean_ThermalConductivity",
    "mean_Valence",
]
df = df[columns]
cols_to_impute = df.columns

Let’s take a look at the missing data. In this plot, a white (resp. black) box represents a missing (resp. observed) value.

plt.figure(figsize=(15, 4))
plt.imshow(
    df.notna().values.T, aspect="auto", cmap="binary", interpolation="none"
)
plt.yticks(range(len(df.columns)), df.columns)
plt.xlabel("Samples", fontsize=12)
plt.grid(False)
plt.show()

2. Imputation

This part is devoted to the imputation methods. In this tutorial, we only focus on mean and median imputation. In order to use the comparator, we have to define a dictionary of imputers, a way to generate holes (additional missing values on which the imputers will be evaluated) and a list of metrics.

imputer_mean = imputers.ImputerSimple(strategy="mean")
imputer_median = imputers.ImputerSimple(strategy="median")
dict_imputers = {"mean": imputer_mean, "median": imputer_median}

metrics = ["mae", "wmape", "kl_columnwise"]
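To make the first two metric names concrete, here is a minimal sketch of MAE and WMAPE computed only on the cells that were masked for evaluation. It uses the common textbook definitions in plain pandas and is not Qolmat's implementation; the helper names mae and wmape and the toy frames are hypothetical. The third metric, kl_columnwise, is, judging by its name, a column-wise Kullback-Leibler divergence between distributions and is not reproduced here.

```python
import pandas as pd


def mae(df_true, df_imputed, mask):
    """Mean absolute error per column, restricted to masked cells."""
    abs_err = (df_true - df_imputed).abs().where(mask)
    return abs_err.mean()  # NaN (unmasked) cells are skipped


def wmape(df_true, df_imputed, mask):
    """Sum of absolute errors divided by the sum of absolute true values."""
    abs_err = (df_true - df_imputed).abs().where(mask)
    abs_true = df_true.abs().where(mask)
    return abs_err.sum() / abs_true.sum()


# Toy data (assumption): ground truth, an imputed version, and the mask
# marking which cells were hidden before imputation.
df_true = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]}
)
df_imputed = pd.DataFrame(
    {"a": [1.0, 2.5, 3.0, 3.0], "b": [10.0, 20.0, 25.0, 40.0]}
)
mask = pd.DataFrame(
    {"a": [False, True, False, True], "b": [False, False, True, False]}
)

mae_scores = mae(df_true, df_imputed, mask)
wmape_scores = wmape(df_true, df_imputed, mask)
print(mae_scores)    # a: 0.75, b: 5.0
print(wmape_scores)  # a: 0.25, b: 0.1667 (rounded)
```

Because WMAPE is normalised by the magnitude of the true values, it can be compared across columns with different scales, whereas MAE is expressed in the units of each column.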

Concretely, the comparator takes as input a dictionary of imputers (those defined above), a list of metrics and a generator of holes, which specifies the type of holes to create, the proportion of values to mask and the columns in which to create them. The dataframe to impute is passed when calling compare. In this example, we use the uniform hole generator, asking for 10% of the data to be masked (ratio_masked=0.1), with holes created only in the columns subset=cols_to_impute:

generator_holes = missing_patterns.UniformHoleGenerator(
    n_splits=2, subset=cols_to_impute, ratio_masked=0.1, random_state=rng
)
df_mask = generator_holes.generate_mask(df)
df_mask = np.invert(df_mask).astype("int")

df_tot = df.copy()
df_tot[df.notna()] = 0
df_tot[df.isna()] = 2
df_tot += df_mask

colorsList = [(1, 0, 0), (0, 0, 0), (1, 1, 1)]
custom_cmap = matplotlib.colors.ListedColormap(colorsList)

plt.figure(figsize=(15, 4))
plt.imshow(
    df_tot.values.T, aspect="auto", cmap=custom_cmap, interpolation="none"
)
plt.yticks(range(len(df_tot.columns)), df_tot.columns)
plt.xlabel("Samples", fontsize=12)
plt.grid(False)
plt.show()
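The idea behind uniform hole generation can be sketched in a few lines of pandas. This is an illustration only, not the code of UniformHoleGenerator: the helper uniform_holes is hypothetical, and it assumes that the ratio is taken relative to the observed cells of each column and that holes are never placed on values that are already missing.

```python
import numpy as np
import pandas as pd


def uniform_holes(df, ratio_masked, rng):
    """Return a boolean mask that is True where an artificial hole is created.

    Hypothetical helper: in each column, a fraction ratio_masked of the
    observed cells is chosen uniformly at random, without replacement.
    """
    mask = pd.DataFrame(False, index=df.index, columns=df.columns)
    for col in df.columns:
        # Row positions of the cells that are currently observed.
        observed = np.flatnonzero(df[col].notna().to_numpy())
        n_holes = int(round(ratio_masked * len(observed)))
        chosen = rng.choice(observed, size=n_holes, replace=False)
        mask.iloc[chosen, mask.columns.get_loc(col)] = True
    return mask


rng_toy = np.random.default_rng(0)
df_toy = pd.DataFrame(rng_toy.normal(size=(1000, 2)), columns=["x", "y"])
df_toy.iloc[:100, 0] = np.nan  # pre-existing missing values in column "x"

mask_toy = uniform_holes(df_toy, ratio_masked=0.1, rng=rng_toy)
print(mask_toy.sum())  # x: 90 holes (10% of 900 observed), y: 100 holes
```

Since the true values under the artificial holes are known, the imputers can be scored on exactly those cells.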

Now that we’ve seen how hole generation behaves, we can use it in the comparator.

comparison = comparator.Comparator(
    dict_imputers,
    generator_holes=generator_holes,
    metrics=metrics,
    max_evals=5,
)
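Conceptually, a comparison of this kind repeats three steps: hide some observed values, impute them, and score the imputation on the hidden cells only. The sketch below illustrates that loop with plain pandas on toy data. It is not the Comparator's implementation; the evaluate function, its parameters and the toy column are hypothetical, and MAE is the only metric computed.

```python
import numpy as np
import pandas as pd


def evaluate(df, fill_values, ratio_masked=0.1, n_repeats=5, seed=0):
    """Average masked-cell MAE per column over several random masks.

    fill_values maps a DataFrame with holes to one fill value per column,
    for example pd.DataFrame.mean or pd.DataFrame.median.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        # Hide a random subset of the observed cells.
        mask = df.notna() & (rng.random(df.shape) < ratio_masked)
        df_corrupted = df.mask(mask)
        # Fit the statistic on what is still visible, then fill every hole.
        df_imputed = df_corrupted.fillna(fill_values(df_corrupted))
        # Score only the cells that were hidden on purpose.
        scores.append((df - df_imputed).abs().where(mask).mean())
    return pd.concat(scores, axis=1).mean(axis=1)


# Skewed toy data (assumption): the mean and the median differ noticeably.
data_rng = np.random.default_rng(42)
df_toy = pd.DataFrame({"skewed": data_rng.exponential(scale=10.0, size=5000)})

# The same seed is used for both strategies, so they see the same masks.
results_toy = pd.DataFrame(
    {
        "mean": evaluate(df_toy, pd.DataFrame.mean),
        "median": evaluate(df_toy, pd.DataFrame.median),
    }
)
print(results_toy)
```

On skewed data such as this, the median tends to give the lower MAE, because the median is the constant that minimises the expected absolute deviation.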

On the basis of the results, imputation by the median yields reconstruction errors lower than or equal to those of imputation by the mean, except for mean_atomic_mass, where the mean is marginally better in MAE and WMAPE and the two methods tie in kl_columnwise.

results = comparison.compare(df)
results.style.highlight_min(color="lightsteelblue", axis=1)
                                               mean     median
kl_columnwise  criticaltemp               29.699655  28.770511
               mean_FusionHeat            28.283599  13.688591
               mean_ThermalConductivity   25.868046  25.545719
               mean_Valence               32.290397  28.979781
               mean_atomic_mass           24.258423  24.258423
mae            criticaltemp               29.505923  27.669155
               mean_FusionHeat             8.112540   6.944263
               mean_ThermalConductivity   29.626863  29.174941
               mean_Valence                0.825231   0.775617
               mean_atomic_mass           20.671858  20.683861
wmape          criticaltemp                0.855155   0.801891
               mean_FusionHeat             0.568664   0.486700
               mean_ThermalConductivity    0.331483   0.326426
               mean_Valence                0.262622   0.246824
               mean_atomic_mass            0.232943   0.233077


Let’s visualize this dataframe.

n_metrics = len(metrics)
fig = plt.figure(figsize=(14, 3 * n_metrics))
for i, metric in enumerate(metrics):
    fig.add_subplot(n_metrics, 1, i + 1)
    plot.multibar(results.loc[metric], decimals=2)
    plt.ylabel(metric)
plt.show()

And finally, let’s take a look at the imputations. Whatever the method, we observe that the imputations are relatively poor. Other imputation methods are therefore necessary (see folder imputations).

dfs_imputed = {
    name: imp.fit_transform(df) for name, imp in dict_imputers.items()
}

for col in cols_to_impute:
    fig, ax = plt.subplots(figsize=(10, 3))
    values_orig = df[col]
    plt.plot(values_orig[15000:], ".", color="black", label="original")
    for name in dict_imputers:
        values_imp = dfs_imputed[name][col].copy()
        values_imp[values_orig.notna()] = np.nan
        plt.plot(values_imp[15000:], ".", label=name, alpha=1)
    plt.ylabel(col, fontsize=16)
    plt.legend()
    plt.show()
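One way to see why both strategies look poor in these plots: with no grouping, mean or median imputation fills every hole of a column with the same constant, so any local structure in the data is lost. The following sketch shows this on a toy series using plain pandas rather than ImputerSimple; the signal and the position of the hole are made up for illustration.

```python
import numpy as np
import pandas as pd

rng_toy = np.random.default_rng(0)
# Toy signal with a slow trend, so neighbouring rows carry information.
t = np.arange(500)
signal = pd.Series(
    np.sin(t / 40.0) + 0.1 * rng_toy.normal(size=t.size), name="signal"
)

# Remove one contiguous block of values.
corrupted = signal.copy()
corrupted.iloc[200:260] = np.nan

# Column-level statistics are computed on the observed values only.
imputed_mean = corrupted.fillna(corrupted.mean())
imputed_median = corrupted.fillna(corrupted.median())

hole = corrupted.isna()
print("distinct imputed values (mean):  ", imputed_mean[hole].nunique())
print("distinct imputed values (median):", imputed_median[hole].nunique())
print("std of true values in the hole:  ", signal[hole].std())
```

Both strategies produce a single distinct value inside the hole, while the true values vary, which is the flat line visible in the plots.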
