Imputers

All imputers can be found in the qolmat.imputations folder.

1. Simple (mean/median/mode)

Imputes the missing values using a basic simple statistics: the mode (most frequent value) for the categorical columns, and the mean,median or mode (depending on the user parameter) for the numerical columns. See ImputerSimple.

2. Shuffle

Imputes the missing values using a random value sampled in the same column. See ImputerShuffle.

3. LOCF

Imputes the missing values using the last observation carried forward. See ImputerLOCF.

4. Time interpolation and TSA decomposition

Imputes missing using some interpolation strategies supported by pd.Series.interpolate. It is done column by column. See the ImputerInterpolation class. When data are temporal with clear seasonal decomposition, we can interpolate on the residuals instead of directly interpolate the raw data. Series are de-seasonalised based on statsmodels.tsa.seasonal.seasonal_decompose, residuals are imputed via linear interpolation, then residuals are re-seasonalised. It is also done column by column. See ImputerResiduals.

5. MICE

Multiple Imputation by Chained Equation: multiple imputations based on ICE. It uses IterativeImputer. See the ImputerMICE class.

6. RPCA

Robust Principal Component Analysis (RPCA) is a modification of the statistical procedure of PCA which allows to work with a data matrix \(\mathbf{D} \in \mathbb{R}^{n \times d}\) containing missing values and grossly corrupted observations. We consider here the imputation task alone, but these methods can also tackle anomaly correction.

Two cases are considered.

RPCA via Principal Component Pursuit (PCP) [1, 12]

The class RpcaPcp implements a matrix decomposition \(\mathbf{D} = \mathbf{M} + \mathbf{A}\) where \(\mathbf{M}\) has low-rank and \(\mathbf{A}\) is sparse. It relies on the following optimisation problem

\[\text{min}_{\mathbf{M} \in \mathbb{R}^{m \times n}} \quad \Vert \mathbf{M} \Vert_* + \lambda \Vert P_\Omega(\mathbf{D-M}) \Vert_1\]

with \(\mathbf{A} = \mathbf{D} - \mathbf{M}\). The operator \(P_{\Omega}\) is the projection operator on the set of observed data \(\Omega\), so that there is no penalization for the components of \(A\) corresponding to unobserved data. The imputed values are then given by the matrix \(M\) on the unobserved data. See the ImputerRpcaPcp class for implementation details.

Noisy RPCA [2, 3, 4]

The class RpcaNoisy implements a recommended improved version, which relies on a decomposition \(\mathbf{D} = \mathbf{M} + \mathbf{A} + \mathbf{E}\). The additional term encodes a Gaussian noise and makes the numerical convergence more reliable. This class also implements a time-consistency penalization for time series, parametrized by the \(\eta_k\). By defining \(\Vert \mathbf{MH_k} \Vert_p\) is either \(\Vert \mathbf{MH_k} \Vert_1\) or \(\Vert \mathbf{MH_k} \Vert_F^2\), the optimisation problem is the following

\[\text{min}_{\mathbf{M, A} \in \mathbb{R}^{m \times n}} \quad \frac 1 2 \Vert P_{\Omega} (\mathbf{D}-\mathbf{M}-\mathbf{A}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_* + \lambda \Vert \mathbf{A} \Vert_1 + \sum_{k=1}^K \eta_k \Vert \mathbf{M H_k} \Vert_p\]

with \(\mathbf{E} = \mathbf{D} - \mathbf{M} - \mathbf{A}\). See the ImputerRpcaNoisy class for implementation details.

7. SoftImpute

SoftImpute is an iterative method for matrix completion that uses nuclear-norm regularization [11]. It is a faster alternative to RPCA, although it is much less robust due to the quadratic penalization. Given a matrix \(\mathbf{D} \in \mathbb{R}^{n \times d}\) with observed entries indexed by the set \(\Omega\), this algorithm solves the following problem:

\[\text{minimise}_{\mathbf{M} \in \mathbb{R}^{n \times d}, rg(M) \leq r} \quad \Vert P_{\Omega}(\mathbf{D} - \mathbf{M}) \Vert_F^2 + \tau \Vert \mathbf{M} \Vert_*\]

The imputed values are then given by the matrix \(M=LQ\) on the unobserved data. See the ImputerSoftImpute class for implementation details.

8. KNN

K-nearest neighbors, based on KNNImputer. See the ImputerKNN class.

9. EM sampler

Imputes missing values via EM algorithm [5], and more precisely via MCEM algorithm [6]. See the ImputerEM class. Suppose the data \(\mathbf{X}\) has a density \(p_\theta\) parametrized by some parameter \(\theta\). The EM algorithm allows to draw samples from this distribution by alternating between the expectation and maximization steps.

Expectation

Draw samples of \(\mathbf{X}\) assuming a fixed \(\theta\), conditionally on the values of \(\mathbf{X}_\mathrm{obs}\). This is done by MCMC using a projected Langevin algorithm. This process is characterized by a time step \(h\). Given an initial station \(X_0\), one can update the state at iteration t as

\[\widetilde X_n = X_{n-1} + \Gamma \nabla L_X(X_{n-1}, \theta_n) (X_{n-1} - \mu) h + (2 h \Gamma)^{1/2} Z_n,\]

where \(Z_n\) is a vector of independent standard normal random variables and \(L\) is the log-likelihood. The sampled distribution tends to the target one in the limit \(h \rightarrow 0\) and the number of iterations \(n \rightarrow \infty\). Sampling from the conditional distribution \(p(\mathbf{X}_{mis} \vert \mathbf{X}_{obs} ; \theta^{(n)})\) (see MCEM [6]) is achieved by projecting the samples at each step.

\[X_n = Proj_{obs} \left( \widetilde X_n \right),\]

where \(Proj_{obs}\) is the orthogonal projection onto the subspace of matrices that vanish outside the index of OBS (\(\mathbf{X}_{obs}\) remains unchanged, we only sample \(\mathbf{X}_{mis}\)).

Maximization

We estimate the distribution parameter \(\theta\) by likelihood maximization, given the samples of \(\mathbf{X}\). In practice we keep only the last n_samples samples, assuming they are drawn under the target distribution.

Imputation

Once the parameter \(\theta^*\) has been estimated the final data imputation can be done in two different ways, depending on the value of the argument method:

  • mle: Returns the maximum likelihood estimator

\[X^* = \mathrm{argmax}_X L(X, \theta^*)\]
  • sample: Returns a single sample of \(X\) from the conditional distribution \(p(X | \theta^*)\). Multiple imputation can be achieved by calling the transform method multiple times.

Two parametric distributions are implemented:

  • MultiNormalEM: \(\mathbf{X_i} \in \mathbb{R}^{n \times d} \sim N_d(\mathbf{m}, \mathbf{\Sigma})\) i.i.d. with parameters \(\mathbf{\mu} \in \mathbb{R}^d\) and \(\mathbf{\Sigma} \in \mathbb{R}^{d \times d}\), so that \(\theta = (\mu, \Sigma)\).

  • VARpEM: [7]: \(\mathbf{X} \in \mathbb{R}^{n \times d} \sim VAR_p(\nu, B_1, ..., B_p)\) is generated by a VAR(p) process such that \(X_t = \nu + B_1 X_{t-1} + ... + B_p X_{t-p} + u_t\) where \(\nu \in \mathbb{R}^d\) is a vector of intercept terms, the \(B_i \in \mathbb{R}^{d \times d}\) are the lags coefficient matrices and \(u_t\) is white noise nonsingular covariance matrix \(\Sigma_u \mathbb{R}^{d \times d}\), so that \(\theta = (\nu, B_1, ..., B_p, \Sigma_u)\).

10. TabDDPM

TabDDPM is a deep learning imputer based on Denoising Diffusion Probabilistic Models (DDPMs) [8] for handling multivariate tabular data. Our implementation mainly follows the works of [8, 9]. Diffusion models focus on modeling the process of data transitions from noisy and incomplete observations to the underlying true data. They include two main processes:

  • Forward process perturbs observed data to noise until all the original data structures are lost. The perturbation is done over a series of steps. Let \(X_{obs}\) be observed data, \(T\) be the number of steps that noises \(\epsilon \sim N(0,I)\) are added into the observed data. Therefore, \(X_{obs}^t = \bar{\alpha}_t \times X_{obs} + \sqrt{1-\bar{\alpha}_t} \times \epsilon\) where \(\bar{\alpha}_t\) controls the right amount of noise.

  • Reverse process removes noise and reconstructs the observed data. At each step \(t\), we train an autoencoder \(\epsilon_\theta\) based on ResNet [10] to predict the added noise \(\epsilon_t\) based on the rest of the observed data. The objective function is the error between the noise added in the forward process and the noise predicted by \(\epsilon_\theta\).

In training phase, we use the self-supervised learning method of [9] to train incomplete data. In detail, our model randomly masks a part of observed data and computes loss from these masked data. Moving on to the inference phase, (1) missing data are replaced by Gaussian noises \(\epsilon \sim N(0,I)\), (2) at each noise step from \(T\) to 0, our model denoises these missing data based on \(\epsilon_\theta\).

In the case of time-series data, we also propose TsDDPM (built on top of TabDDPM) to capture time-based relationships between data points in a dataset. In fact, the dataset is pre-processed by using sliding window method to obtain a set of data partitions. The noise prediction of the model \(\epsilon_\theta\) takes into account not only the observed data at the current time step but also data from previous time steps. These time-based relationships are encoded by using a transformer-based architecture [9].

References (Imputers)

[1] Candès, Emmanuel J., et al. Robust principal component analysis? Journal of the ACM (JACM) 58.3 (2011): 1-37.

[2] Botterman, HL., Roussel, J., Morzadec, T., Jabbari, A., Brunel, N. Robust PCA for Anomaly Detection and Data Imputation in Seasonal Time Series in International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer Nature Switzerland (2022).

[3] Chen, Yuxin, et al. Bridging convex and nonconvex optimization in robust PCA: Noise, outliers, and missing data. Annals of statistics 49.5 (2021): 2948.

[4] Wang, Xuehui, et al. An improved robust principal component analysis model for anomalies detection of subway passenger flow. Journal of advanced transportation 2018 (2018).

[5] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society: series B (methodological) 39.1 (1977): 1-22.

[6] Wei, Greg CG, and Martin A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American statistical Association 85.411 (1990): 699-704.

[7] Lütkepohl, Helmut. New introduction to multiple time series analysis. Springer Science & Business Media, 2005.

[8] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020): 6840-6851.

[9] Tashiro, Yusuke, et al. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems 34 (2021): 24804-24816.

[10] Kotelnikov, Akim, et al. Tabddpm: Modelling tabular data with diffusion models. International Conference on Machine Learning. PMLR, 2023.

[11] Hastie, Trevor, et al. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research 16.1 (2015): 3367-3402.

[12] Fanhua, Shang, et al. Robust Principal Component Analysis with Missing Data Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (2014).