Model Selection

Qolmat provides a convenient way to select the most suitable data imputation technique for a dataset by leveraging scikit-learn-compatible algorithms. Users can compare imputation methods on several evaluation metrics.

1. General approach

Let \(X_{obs}\) be the observed dataset containing \(n\) observations and \(d\) features. Let \(I_{obs} \subseteq [1,n] \times [1,d]\) be the set of observed indices.

In order to assess the performance of the imputations (without a downstream task), we use the standard approach of masking additional data, imputing these additional missing values and computing a score. This procedure is repeated \(K\) times. More precisely, for \(k=1, ..., K\), we define new sets \(I_{mis}^{(k)} \subseteq I_{obs}\), meaning that we add missing values to the original dataset (see 3. Hole generator). The associated datasets are denoted \(X_{obs}^{(k)}\). We compute the completed dataset \(\hat{X}^{(k)}\) from the partial observations \(X_{obs}^{(k)}\) and then evaluate the imputation (see 2. Metrics) on the indices of the additional missing data \(I_{mis}^{(k)}\), i.e. \(s\left( \hat{X}^{(k)}, X_{obs}\right)\). Finally, we take the average score over the \(K\) realisations: \(\bar{s} = \frac{1}{K} \sum_{k=1}^K s\left( \hat{X}^{(k)}, X_{obs}\right)\).
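The procedure above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not Qolmat's implementation: the names `mean_imputer` and `evaluate_imputer` are hypothetical, the additional holes are drawn uniformly at random, and the score \(s\) is taken to be the mean absolute error on the masked entries.

```python
import numpy as np


def mean_imputer(x):
    """Toy imputer: fill each NaN with the mean of its column's observed values."""
    col_means = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_means, x)


def evaluate_imputer(x_obs, imputer, n_splits=5, ratio_masked=0.1, seed=0):
    """Average score over n_splits maskings (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x_obs)  # I_obs
    scores = []
    for _ in range(n_splits):
        # I_mis^(k): a random subset of the observed indices
        extra_mask = observed & (rng.random(x_obs.shape) < ratio_masked)
        x_k = np.where(extra_mask, np.nan, x_obs)  # X_obs^(k)
        x_hat = imputer(x_k)  # completed dataset for this masking
        # Score s: mean absolute error, restricted to the newly masked entries
        scores.append(np.mean(np.abs(x_hat[extra_mask] - x_obs[extra_mask])))
    return float(np.mean(scores))


# Synthetic data with about 5% of values already missing
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3))
x[rng.random(x.shape) < 0.05] = np.nan

score = evaluate_imputer(x, mean_imputer)
print(f"average MAE over the maskings: {score:.3f}")
```

Because the score is computed only on entries that were observed before masking, the ground truth is always available, which is what makes the comparison between imputers possible.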

2. Metrics

Metric | Description | Metric types | Data types
------ | ----------- | ------------ | ----------
mean_squared_error | Mean squared error, based on mean_squared_error of sklearn. | Column-wise | Numerical
root_mean_squared_error | Root mean squared error, based on root_mean_squared_error of sklearn. | Column-wise | Numerical
mean_absolute_error | Mean absolute error, based on mean_absolute_error of sklearn. | Column-wise | Numerical
mean_absolute_percentage_error | Mean absolute percentage error, based on mean_absolute_percentage_error of sklearn. | Column-wise | Numerical
weighted_mean_absolute_percentage_error | Weighted mean absolute percentage error. Its definition can be found in MAPE. | Column-wise | Numerical
dist_wasserstein | Wasserstein distance, based on wasserstein_distance of scipy. | Column-wise | Numerical
kolmogorov_smirnov_test | Kolmogorov-Smirnov test statistic, based on ks_2samp of scipy. | Column-wise | Numerical
total_variance_distance | Total variation distance, based on TVComplement of SDMetrics. | Column-wise | Categorical
mean_difference_correlation_matrix_numerical_features | Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the Pearson correlation coefficient or on the p-value for testing non-correlation. | Column-wise | Numerical
mean_difference_correlation_matrix_categorical_features | Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the Chi-square test of independence of variables (the test statistic or the p-value). | Column-wise | Categorical
mean_diff_corr_matrix_categorical_vs_numerical_features | Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the one-way ANOVA (the test statistic or the p-value). | Column-wise | Categorical, Numerical
sum_energy_distances | Sum of energy distances between two dataframes, based on energy-distance of dcor. | Row-wise | Numerical
sum_pairwise_distances | Sum of pairwise distances based on a predefined distance metric. It is based on cdist of scipy. | Row-wise | Numerical
frechet_distance | The Fréchet distance between two dataframes (Dowson, D. C., and B. V. Landau, 1982). | Dataframe-wise | Numerical
kl_divergence | Estimation of the Kullback-Leibler divergence between two empirical distributions. Three methods are implemented: columnwise (relying on uniform binning and taking only marginals into account), gaussian (relying on a Gaussian approximation), random_forest (experimental). | Column-wise, Dataframe-wise | Numerical
distance_anticorr | Score based on the distance anticorrelation between two empirical distributions. The theoretical basis is described in distance-correlation of dcor. | Dataframe-wise | Numerical
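To make the "column-wise" metric type concrete, here is a minimal NumPy sketch of two such scores evaluated only on the masked entries. The function names are hypothetical and do not reproduce Qolmat's API. The weighted MAPE is assumed here to be the sum of absolute errors divided by the sum of absolute true values, which is a common definition but may differ in detail from the one Qolmat uses.

```python
import numpy as np


def columnwise_mae(x_true, x_imputed, mask):
    """Mean absolute error per column, restricted to entries where mask is True."""
    abs_err = np.where(mask, np.abs(x_true - x_imputed), 0.0)
    return abs_err.sum(axis=0) / mask.sum(axis=0)


def columnwise_wmape(x_true, x_imputed, mask):
    """Assumed WMAPE per column: sum |error| / sum |true value| on masked entries."""
    abs_err = np.where(mask, np.abs(x_true - x_imputed), 0.0)
    abs_true = np.where(mask, np.abs(x_true), 0.0)
    return abs_err.sum(axis=0) / abs_true.sum(axis=0)


x_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_imputed = np.array([[1.0, 12.0], [2.5, 20.0], [3.0, 27.0]])
# True where a value was artificially masked and then imputed
mask = np.array([[False, True], [True, False], [False, True]])

mae = columnwise_mae(x_true, x_imputed, mask)      # one value per column: 0.5 and 2.5
wmape = columnwise_wmape(x_true, x_imputed, mask)  # one value per column: 0.25 and 0.125
```

A column-wise metric returns one score per feature, so the second column's larger scale shows up in its MAE but is normalised away in its WMAPE.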

3. Hole generator

Evaluating the imputers requires generating holes that are representative of the holes already present in the data. The missingness mechanisms have been classified by Rubin [1] into MCAR, MAR and MNAR.

Suppose we have \(X_{obs}\), a subset of a complete data model \(X = (X_{obs}, X_{mis})\), which is not fully observable (\(X_{mis}\) is the missing part). We define the matrix \(M\) such that \(M_{ij}=1\) if \(X_{ij}\) is missing, and 0 otherwise, and we assume that the distribution of \(M\) is parametrised by \(\psi\).

The observations are said to be Missing Completely at Random (MCAR) if the probability that an observation is missing is independent of the variables and observations in the dataset. Formally,

\[P(M | X_{obs}, X_{mis}, \psi) = P(M | \psi), \quad \forall \psi.\]

The observations are said to be Missing at Random (MAR) if the probability that an observation is missing depends only on the observed values. Formally,

\[P(M | X_{obs}, X_{mis}, \psi) = P(M | X_{obs}, \psi), \quad \forall \psi, X_{mis}.\]

Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if \(P(M | X_{obs}, X_{mis}, \psi)\) does not simplify.
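The three mechanisms can be illustrated on synthetic data. The sketch below is an assumption-laden toy example, not part of Qolmat: holes are created in a column `x2`, with a missingness probability that is constant (MCAR), that depends on an always-observed column `x1` (MAR), or that depends on the value of `x2` itself (MNAR). The logistic link and its slope are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)  # always observed
x2 = rng.normal(size=n)  # column that receives the holes


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: the probability of being missing is the same for every entry
m_mcar = rng.random(n) < 0.3
# MAR: the probability depends only on the observed column x1
m_mar = rng.random(n) < sigmoid(2.0 * x1)
# MNAR: the probability depends on the value that goes missing
m_mnar = rng.random(n) < sigmoid(2.0 * x2)


def mean_gap(values, mask):
    """Difference between the mean of masked and unmasked values."""
    return values[mask].mean() - values[~mask].mean()


print("MCAR, gap in x2:", round(mean_gap(x2, m_mcar), 3))  # close to zero
print("MAR,  gap in x1:", round(mean_gap(x1, m_mar), 3))   # clearly positive
print("MNAR, gap in x2:", round(mean_gap(x2, m_mnar), 3))  # clearly positive
```

In the MNAR case the gap is computed from values that would not be available in practice, which is why this mechanism cannot be diagnosed from the observed data alone.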

Qolmat can generate new missing values in an existing dataset, but only in the MCAR case.

Here are the different classes to generate missing data. We recommend the last 3 for time series.

  1. UniformHoleGenerator: This is the simplest way to generate missing data, i.e. the holes are generated uniformly at random.

  2. GroupedHoleGenerator: The holes are generated from groups, specified by the user: a given group can either be fully observed or fully missing.

  3. GeometricHoleGenerator: The holes are generated following a one-dimensional Markov process, which means that missing data are created column by column. Given the mask \(M\) of the observed dataset, we associate with each column of \(M\) a two-state transition matrix between the observed and missing states. We then construct a Markov process from this transition matrix.

  4. MultiMarkovHoleGenerator: This method is similar to GeometricHoleGenerator except that each row of the mask (a vector) represents a state of the Markov chain; we no longer proceed column by column. In the end, a single Markov chain is created to obtain the final mask.

  5. EmpiricalHoleGenerator: The distribution of holes is learned from the data. Missing data are created column by column, based on the distribution of hole sizes.
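The two-state Markov idea behind the column-wise generator can be sketched as follows. This is a simplified illustration and not Qolmat's code: `estimate_transitions` and `sample_mask` are hypothetical helpers, the transition matrix is assumed to be estimated by counting transitions in the existing mask, and both states are assumed to occur in that mask.

```python
import numpy as np


def estimate_transitions(mask_col):
    """Estimate the 2x2 transition matrix of a boolean mask column.

    State 0 is "observed" and state 1 is "missing". Assumes both states
    appear at least once before the last position.
    """
    states = mask_col.astype(int)
    counts = np.zeros((2, 2))
    # Count every transition from states[t] to states[t + 1]
    np.add.at(counts, (states[:-1], states[1:]), 1)
    return counts / counts.sum(axis=1, keepdims=True)


def sample_mask(transition, n_steps, rng):
    """Simulate a new mask from the chain, starting in the observed state."""
    states = np.zeros(n_steps, dtype=int)
    for t in range(1, n_steps):
        p_missing = transition[states[t - 1], 1]
        states[t] = rng.random() < p_missing
    return states.astype(bool)


rng = np.random.default_rng(0)
original = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
transition = estimate_transitions(original)  # rows: [0.8, 0.2] and [0.4, 0.6]
new_mask = sample_mask(transition, n_steps=1000, rng=rng)
print("fraction of holes:", new_mask.mean())
```

Under such a chain the length of each hole follows a geometric distribution, so holes come in contiguous runs rather than as isolated points, which is the behaviour one typically wants for time series.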

4. Hyperparameter optimization

Qolmat can be used to search for hyperparameters in imputation functions. Suppose the imputation function \(f_{\theta}\) has \(n\) hyperparameters \(\theta = (\theta_1, ..., \theta_n)\) and configuration space \(\Theta = \Theta_1 \times ... \times \Theta_n\). The procedure to find the best hyperparameter set \(\theta^*\) is based on cross-validation, and is the same as that explained in the 1. General approach section, i.e. via the creation of \(L\) additional subsets \(I_{mis}^{(l)}, \, l=1,...,L\). We use Bayesian optimisation with a Gaussian process, where the function to minimise is the average reconstruction error over the \(L\) realisations, i.e.

\[\theta^* = \underset{\theta \in \Theta}{\mathrm{argmin}} \frac{1}{L} \sum_{l=1}^L \left\Vert X_{obs}^{(l)} - f_{\theta}\left(X_{obs}^{(l)} \right) \right\Vert_1.\]
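The following sketch illustrates the objective being minimised, under several assumptions. For brevity it replaces the Bayesian optimisation by an exhaustive grid search over a single hyperparameter, and it uses a toy imputer (`window_mean_impute`, a hypothetical function) whose hyperparameter is the half-width of an averaging window. The reconstruction error is the L1 error on the artificially masked entries, compared against the held-out observed values.

```python
import numpy as np


def window_mean_impute(x, half_width):
    """Fill each NaN with the mean of observed values within half_width steps."""
    filled = x.copy()
    global_mean = np.nanmean(x)
    for i in np.flatnonzero(np.isnan(x)):
        window = x[max(0, i - half_width): i + half_width + 1]
        observed = window[~np.isnan(window)]
        # Fall back to the global mean when the window holds no observed value
        filled[i] = observed.mean() if observed.size else global_mean
    return filled


def average_l1_error(x_obs, half_width, n_splits=5, ratio_masked=0.2, seed=0):
    """Average L1 reconstruction error over n_splits maskings."""
    rng = np.random.default_rng(seed)  # same masks for every candidate value
    observed = ~np.isnan(x_obs)
    errors = []
    for _ in range(n_splits):
        extra = observed & (rng.random(x_obs.shape) < ratio_masked)
        x_l = np.where(extra, np.nan, x_obs)
        x_hat = window_mean_impute(x_l, half_width)
        errors.append(np.abs(x_hat[extra] - x_obs[extra]).sum())
    return float(np.mean(errors))


# Noisy sine wave as a toy time series
t = np.linspace(0, 4 * np.pi, 400)
series = np.sin(t) + np.random.default_rng(1).normal(scale=0.3, size=t.size)

grid = [1, 2, 5, 10, 20, 50, 200]  # candidate configuration space
errors = {w: average_l1_error(series, w) for w in grid}
best = min(errors, key=errors.get)
print("best half-width:", best)
```

Reusing the same seed for every candidate keeps the masks identical across hyperparameter values, so differences in the objective reflect the hyperparameter rather than the random masking. A Bayesian optimiser would evaluate the same objective but choose the next candidate adaptively instead of scanning a fixed grid.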

References

[1] Rubin, Donald B. Inference and missing data. Biometrika 63.3 (1976): 581-592.