Model Selection

Qolmat provides a convenient way to select the most suitable data imputation technique by leveraging scikit-learn-compatible algorithms. Users can compare several methods using different evaluation metrics.

1. General approach

Let \(X_{obs}\) be the observed dataset containing \(n\) observations and \(d\) features. Let \(I_{obs} \subseteq [1,n] \times [1,d]\) be the set of observed indices.

To assess the performance of the imputations (without a downstream task), we use the standard approach of masking additional data, imputing these additional missing values and computing a score. This procedure is repeated \(K\) times. More precisely, for \(k=1, ..., K\), we define new sets \(I_{mis}^{(k)} \subseteq I_{obs}\), meaning that we add missing values to the original dataset (see 3. Hole generator). The associated datasets are denoted \(X_{obs}^{(k)}\). We compute the completed dataset \(\hat{X}^{(k)}\) from the partial observations \(X_{obs}^{(k)}\) and then evaluate the imputation (see 2. Metrics) on the indices of the additional missing data \(I_{mis}^{(k)}\), i.e. \(s\left( \hat{X}^{(k)}, X_{obs}\right)\). Finally, we obtain the average score over the \(K\) realisations: \(\bar{s} = \frac{1}{K} \sum_{k=1}^K s\left( \hat{X}^{(k)}, X_{obs}\right)\).
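The procedure above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not Qolmat's implementation: the helper names `mean_impute` and `average_masked_score`, the uniform masking with ratio `ratio`, and the choice of mean absolute error as the score \(s\) are all hypothetical choices made for the example.

```python
import numpy as np

def mean_impute(X):
    """Hypothetical baseline imputer: fill NaNs with the column mean."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def average_masked_score(X_obs, impute, K=5, ratio=0.1, seed=0):
    """Average score (here: MAE) over K rounds of additional masking."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X_obs)                 # I_obs
    scores = []
    for _ in range(K):
        # I_mis^(k): a random subset of the observed entries
        extra = observed & (rng.random(X_obs.shape) < ratio)
        X_k = X_obs.copy()
        X_k[extra] = np.nan                     # X_obs^(k)
        X_hat = impute(X_k)                     # completed dataset
        # Score only on the additionally masked entries
        scores.append(np.mean(np.abs(X_hat[extra] - X_obs[extra])))
    return float(np.mean(scores))

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.05] = np.nan          # pre-existing holes
s_bar = average_masked_score(X, mean_impute)
```

Because the score is computed only where the true value is known (the entries of \(I_{mis}^{(k)}\)), the pre-existing holes never enter the evaluation.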

2. Metrics

Metric	Description	Metric types	Data types
`mean_squared_error`	Mean squared error, based on mean_squared_error of sklearn.	Column-wise	Numerical
`root_mean_squared_error`	Root mean squared error, based on root_mean_squared_error of sklearn.	Column-wise	Numerical
`mean_absolute_error`	Mean absolute error, based on mean_absolute_error of sklearn.	Column-wise	Numerical
`mean_absolute_percentage_error`	Mean absolute percentage error, based on mean_absolute_percentage_error of sklearn.	Column-wise	Numerical
`weighted_mean_absolute_percentage_error`	Weighted mean absolute percentage error. Its definition can be found in MAPE.	Column-wise	Numerical
`dist_wasserstein`	Wasserstein distances, based on wasserstein_distance of scipy.	Column-wise	Numerical
`kolmogorov_smirnov_test`	Kolmogorov-Smirnov test statistic, based on ks_2samp of scipy.	Column-wise	Numerical
`total_variance_distance`	Total variation distance, based on TVComplement of SDMetrics.	Column-wise	Categorical
`mean_difference_correlation_matrix_numerical_features`	Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the Pearson correlation coefficient or on the p-value for testing non-correlation.	Column-wise	Numerical
`mean_difference_correlation_matrix_categorical_features`	Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the Chi-square test of independence of variables (the test statistic or the p-value).	Column-wise	Categorical
`mean_diff_corr_matrix_categorical_vs_numerical_features`	Mean absolute difference between the correlation matrices of two dataframes. The correlation matrices are based on the one-way ANOVA (the test statistic or the p-value).	Column-wise	Categorical, Numerical
`sum_energy_distances`	Sum of energy distances between two dataframes, based on energy-distance of dcor	Row-wise	Numerical
`sum_pairwise_distances`	Sum of pairwise distances based on a predefined distance metric. It is based on cdist of scipy	Row-wise	Numerical
`frechet_distance`	The Fréchet distance between two dataframes (Dowson, D. C., and Landau, B. V., 1982).	Dataframe-wise	Numerical
`kl_divergence`	Estimation of the Kullback-Leibler divergence between two empirical distributions. Three methods are implemented: columnwise (relying on a uniform binarization and only taking marginals into account), gaussian (relying on a Gaussian approximation), random_forest (experimental).	Column-wise, Dataframe-wise	Numerical
`distance_anticorr`	Score based on the distance anticorrelation between two empirical distributions. The theoretical basis can be found on distance-correlation of dcor.	Dataframe-wise	Numerical
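To illustrate what "column-wise" means in the table, the sketch below computes two of the listed metrics per column, restricted to the masked entries. It is a hand-written NumPy example, not Qolmat's code: the helper `columnwise_metric` is hypothetical, and the weighted MAPE is assumed to follow the usual definition \(\sum |x - \hat{x}| / \sum |x|\).

```python
import numpy as np

def columnwise_metric(X_true, X_imputed, mask, metric):
    """Apply `metric` to each column, using only the masked rows."""
    out = []
    for j in range(X_true.shape[1]):
        rows = mask[:, j]
        if rows.any():
            out.append(metric(X_true[rows, j], X_imputed[rows, j]))
        else:
            out.append(np.nan)      # no masked entry in this column
    return np.array(out)

def mae(x, x_hat):
    return np.mean(np.abs(x - x_hat))

def wmape(x, x_hat):
    # Assumed definition: total absolute error over total absolute value
    return np.sum(np.abs(x - x_hat)) / np.sum(np.abs(x))

X_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_imputed = np.array([[1.0, 10.0], [2.5, 20.0], [3.0, 33.0], [3.0, 40.0]])
mask = np.array([[False, False], [True, False], [False, True], [True, False]])

mae_per_col = columnwise_metric(X_true, X_imputed, mask, mae)      # [0.75, 3.0]
wmape_per_col = columnwise_metric(X_true, X_imputed, mask, wmape)  # [0.25, 0.1]
```

A column-wise metric returns one value per feature, whereas a dataframe-wise metric would reduce both dataframes to a single number.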

3. Hole generator

Evaluating the imputers requires generating holes that are representative of the holes at hand. Rubin [1] classified missingness mechanisms into MCAR, MAR and MNAR.

Suppose we have \(X_{obs}\), a subset of a complete data model \(X = (X_{obs}, X_{mis})\), which is not fully observable (\(X_{mis}\) is the missing part). We define the matrix \(M\) such that \(M_{ij}=1\) if \(X_{ij}\) is missing, and 0 otherwise, and we assume that the distribution of \(M\) is parametrised by \(\psi\).

The observations are said to be Missing Completely at Random (MCAR) if the probability that an observation is missing is independent of the variables and observations in the dataset. Formally,

\[P(M | X_{obs}, X_{mis}, \psi) = P(M | \psi), \quad \forall \psi.\]

The observations are said to be Missing at Random (MAR) if the probability that an observation is missing depends only on the observed values. Formally,

\[P(M | X_{obs}, X_{mis}, \psi) = P(M | X_{obs}, \psi), \quad \forall \psi, X_{mis}.\]

Finally, the observations are said to be Missing Not at Random (MNAR) in all other cases, i.e. if \(P(M | X_{obs}, X_{mis}, \psi)\) does not simplify.
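The three mechanisms can be made concrete with a small simulation. The sketch below is purely illustrative and does not use Qolmat: the two-column toy data, the logistic link and its coefficients are assumptions chosen for the example. Column `x1` is always observed and holes are placed in `x2`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)   # always observed
x2 = rng.normal(size=n)   # the column that receives holes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: the missingness probability is a constant
m_mcar = rng.random(n) < 0.2
# MAR: the probability depends only on the observed column x1
m_mar = rng.random(n) < sigmoid(2.0 * x1 - 1.5)
# MNAR: the probability depends on the value that goes missing
m_mnar = rng.random(n) < sigmoid(2.0 * x2 - 1.5)

# Difference in means between rows with and without a hole
gap_mcar = x1[m_mcar].mean() - x1[~m_mcar].mean()   # close to zero
gap_mar = x1[m_mar].mean() - x1[~m_mar].mean()      # clearly positive
gap_mnar = x2[m_mnar].mean() - x2[~m_mnar].mean()   # clearly positive
```

In practice `gap_mnar` cannot be computed, since the values of `x2` under the mask are not available; this is what makes MNAR hard to detect from the observed data alone.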

Qolmat can generate new missing values in an existing dataset, but only in the MCAR case.

The following classes generate missing data. We recommend the last three for time series.

UniformHoleGenerator: This is the simplest way to generate missing data, i.e. the holes are generated uniformly at random.
GroupedHoleGenerator: The holes are generated from groups, specified by the user: a given group can either be fully observed or fully missing.
GeometricHoleGenerator: The holes are generated following a one-dimensional Markov process, which means that missing data are created column by column. Given the mask \(M\) of the observed dataset, we associate with each column of \(M\) a two-state transition matrix between the observed and missing states, and then construct a Markov process from this transition matrix.
MultiMarkovHoleGenerator: This method is similar to GeometricHoleGenerator, except that each row of the mask (a vector) represents a state of the Markov chain; we no longer proceed column by column. In the end, a single Markov chain is created to obtain the final mask.
EmpiricalHoleGenerator: The distribution of holes is learned from the data. Missing data are created column by column, based on the distribution of hole sizes.
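The two-state Markov idea behind GeometricHoleGenerator can be sketched for a single column as follows. This is a simplified illustration and not Qolmat's implementation: the function `markov_column_mask` is hypothetical, and the transition probabilities are fixed by hand here, whereas the description above derives them from the observed mask.

```python
import numpy as np

def markov_column_mask(n_rows, p_start_hole, p_end_hole, rng):
    """Simulate a two-state (observed/missing) Markov chain for one column.

    p_start_hole: P(missing at t | observed at t-1)
    p_end_hole:   P(observed at t | missing at t-1)
    """
    mask = np.zeros(n_rows, dtype=bool)   # True means missing
    missing = False                       # the chain starts in the observed state
    for t in range(n_rows):
        u = rng.random()
        if missing:
            missing = u >= p_end_hole     # stay missing with prob. 1 - p_end_hole
        else:
            missing = u < p_start_hole    # open a hole with prob. p_start_hole
        mask[t] = missing
    return mask

rng = np.random.default_rng(0)
mask = markov_column_mask(50_000, p_start_hole=0.05, p_end_hole=0.25, rng=rng)
hole_fraction = mask.mean()
```

With these probabilities the stationary fraction of missing values is \(0.05 / (0.05 + 0.25) \approx 0.17\), and hole lengths follow a geometric distribution with mean \(1 / 0.25 = 4\). Holes therefore come in contiguous runs, which is why this kind of generator suits time series better than uniform masking.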

4. Hyperparameter optimization

Qolmat can be used to search for hyperparameters in imputation functions. Suppose the imputation function \(f_{\theta}\) has \(n\) hyperparameters \(\theta = (\theta_1, ..., \theta_n)\) and configuration space \(\Theta = \Theta_1 \times ... \times \Theta_n\). The procedure to find the best hyperparameter set \(\theta^*\) is based on cross-validation, and is the same as that explained in the 1. General approach section, i.e. via the creation of \(L\) additional subsets \(I_{mis}^{(l)}, \, l=1,...,L\). We use Bayesian optimisation with Gaussian processes, where the function to minimise is the average reconstruction error over the \(L\) realisations, i.e.

\[\theta^* = \underset{\theta \in \Theta}{\mathrm{argmin}} \frac{1}{L} \sum_{l=1}^L \left\Vert X_{obs}^{(l)} - f_{\theta}\left(X_{obs}^{(l)} \right) \right\Vert_1.\]
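The objective above can be illustrated with a small search over a single hyperparameter. To keep the example short and dependency-free, it swaps the Bayesian optimisation for a plain grid search, so it only shows the objective and not the optimiser. The imputer `quantile_impute`, the grid and the masking ratio are hypothetical, and the error is assumed to be measured on the additionally masked entries against their original values.

```python
import numpy as np

def quantile_impute(X, theta):
    """Hypothetical imputer f_theta: fill NaNs with the column theta-quantile."""
    fill = np.nanquantile(X, theta, axis=0)
    return np.where(np.isnan(X), fill, X)

def mean_l1_error(X_obs, theta, L=5, ratio=0.2, seed=0):
    """Average L1 reconstruction error over L sets of additional holes."""
    rng = np.random.default_rng(seed)   # same masks for every theta
    observed = ~np.isnan(X_obs)
    errors = []
    for _ in range(L):
        extra = observed & (rng.random(X_obs.shape) < ratio)   # I_mis^(l)
        X_l = X_obs.copy()
        X_l[extra] = np.nan                                    # X_obs^(l)
        X_hat = quantile_impute(X_l, theta)
        errors.append(np.abs(X_hat[extra] - X_obs[extra]).sum())
    return float(np.mean(errors))

rng = np.random.default_rng(1)
X = rng.exponential(size=(500, 2))      # skewed toy data
grid = [0.1, 0.3, 0.5, 0.7, 0.9]        # configuration space Theta
errors = {theta: mean_l1_error(X, theta) for theta in grid}
best_theta = min(errors, key=errors.get)
```

Reusing the same seed for every candidate means all values of \(\theta\) are compared on identical masks, which removes one source of noise from the comparison. Since the median minimises the expected absolute error, one would expect the selected quantile to lie near the middle of the grid.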

References

[1] Rubin, Donald B. Inference and missing data. Biometrika 63.3 (1976): 581-592.