qolmat.imputations.em_sampler.VARpEM

class qolmat.imputations.em_sampler.VARpEM(method: Literal['mle', 'sample'] = 'sample', max_iter_em: int = 200, n_iter_ou: int = 50, ampli: float = 1, random_state: Optional[Union[int, RandomState]] = None, dt: float = 0.02, tolerance: float = 0.0001, stagnation_threshold: float = 0.005, stagnation_loglik: float = 2, period: int = 1, verbose: bool = False, p: Union[None, int] = None, max_lagp: int = 2)

VAR(p) EM imputer.

Imputation of missing values using a vector autoregressive model, fitted through EM optimization and sampled with a projected Ornstein-Uhlenbeck process. Equations and notations are taken from the following reference, with matrices transposed for consistency: Lütkepohl (2005), New Introduction to Multiple Time Series Analysis.

X^n+1 = nu + sum_k A_k^T @ X_k^n + G_n @ S
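To make the recursion concrete, here is a minimal NumPy sketch that simulates a VAR(p) path. It is an illustration only, not qolmat's code: the helper `simulate_varp`, the row-vector convention (each observation is a row, so lag matrices multiply on the right) and the reading of the last term as Gaussian noise with covariance S are all assumptions.

```python
import numpy as np


def simulate_varp(nu, A_list, S, n_steps, rng):
    """Simulate a VAR(p) path, one observation per row (hypothetical helper)."""
    p, d = len(A_list), len(nu)
    X = np.zeros((n_steps + p, d))  # first p rows: zero initial conditions
    noise = rng.multivariate_normal(np.zeros(d), S, size=n_steps)
    for t in range(p, n_steps + p):
        # Assumed row convention: x_t = nu + sum_k x_{t-k} @ A_k + eps_t
        lagged = sum(X[t - k] @ A_list[k - 1] for k in range(1, p + 1))
        X[t] = nu + lagged + noise[t - p]
    return X[p:]  # drop the artificial initial conditions


rng = np.random.default_rng(0)
nu = np.array([0.5, -0.2])
A_list = [np.array([[0.5, 0.1], [0.0, 0.3]]), 0.1 * np.eye(2)]  # p = 2
S = 0.01 * np.eye(2)
X_sim = simulate_varp(nu, A_list, S, n_steps=200, rng=rng)
print(X_sim.shape)  # (200, 2)
```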

Parameters
method: Literal["mle", "sample"]

Imputation method, either "sample" or "mle".

max_iter_em: int, optional

Maximum number of steps in the EM algorithm

n_iter_ou: int, optional

Number of iterations for the Gibbs sampling method (+ noise addition), necessary for convergence, by default 50.

ampli: float, optional

Whether to sample the posterior (1) or to maximise likelihood (0), by default 1.

random_state: int, optional

The seed of the pseudo random number generator to use, for reproducibility.

dt: float

Process integration time step, a large value increases the sample bias and can make the algorithm unstable, but compensates for a smaller n_iter_ou. By default, 2e-2.

tolerance: float, optional

Threshold below which an L-infinity norm difference indicates the convergence of the parameters

stagnation_threshold: float, optional

Threshold below which an L-infinity norm difference indicates the convergence of the parameters

stagnation_loglik: float, optional

Threshold below which an absolute difference of the log likelihood indicates the convergence of the parameters

period: int, optional

Integer used to fold the temporal data periodically

verbose: bool, optional

Verbosity flag, by default False.
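The roles of n_iter_ou, dt and ampli can be illustrated with a short sketch of an Euler-Maruyama discretisation of an Ornstein-Uhlenbeck (Langevin) process on a plain Gaussian target. This is a simplified stand-in, not the library's sampler: the function `ou_sample` is hypothetical, the target is a multivariate normal rather than a VAR(p) model, and no projection matrix is applied.

```python
import numpy as np


def ou_sample(X, mask_na, mean, cov_inv, n_iter_ou=50, dt=0.02, ampli=1.0, rng=None):
    """Run OU steps on the masked entries only (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    X = X.copy()
    for _ in range(n_iter_ou):
        # Gradient of the Gaussian log-density for each row (cov_inv is symmetric)
        grad = -(X - mean) @ cov_inv
        # ampli=1 adds the diffusion noise (sampling); ampli=0 removes it (likelihood ascent)
        noise = ampli * np.sqrt(2 * dt) * rng.standard_normal(X.shape)
        X[mask_na] += (dt * grad + noise)[mask_na]
    return X


mean = np.zeros(2)
cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
X0 = np.array([[1.0, 0.0]])          # second entry is a placeholder for a missing value
mask_na = np.array([[False, True]])  # True marks entries to impute
X_mle = ou_sample(X0, mask_na, mean, cov_inv, n_iter_ou=500, ampli=0.0)
print(X_mle)  # second entry converges to the conditional mean 0.8
```

With ampli=0 the update reduces to gradient ascent on the log-likelihood, which is why the missing entry settles on the conditional mean. A larger dt needs fewer iterations but gives a coarser discretisation, in line with the trade-off described for dt above.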

Examples

>>> import numpy as np
>>> from qolmat.imputations.em_sampler import VARpEM
>>> imputer = VARpEM(method="sample", random_state=11)
>>> X = np.array([[1, 1, 1, 1], [np.nan, np.nan, 3, 2], [1, 2, 2, 1], [2, 2, 2, 2]])
>>> imputer.fit_transform(X)  
Attributes
X_intermediate: list

List of pd.DataFrame giving the results of the EM process as a function of the iteration number.

__init__(method: Literal['mle', 'sample'] = 'sample', max_iter_em: int = 200, n_iter_ou: int = 50, ampli: float = 1, random_state: Optional[Union[int, RandomState]] = None, dt: float = 0.02, tolerance: float = 0.0001, stagnation_threshold: float = 0.005, stagnation_loglik: float = 2, period: int = 1, verbose: bool = False, p: Union[None, int] = None, max_lagp: int = 2) -> None
combine_parameters() -> None

Combine statistics computed for each sample in the update step.

The estimation of nu and B corresponds to the MLE, whereas S is approximated.
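For intuition about the MLE of nu and B, the sketch below fits a VAR(p) model to a fully observed matrix by ordinary least squares, which coincides with the Gaussian MLE of the regression coefficients. It is an assumption-laden illustration: `fit_varp_least_squares` is a hypothetical helper, the stacking of lag matrices in B is a choice made here, and S is simply the residual covariance rather than the approximation used by the library.

```python
import numpy as np


def fit_varp_least_squares(X, p):
    """Least-squares estimates of (nu, B, S) for a complete matrix X (hypothetical helper)."""
    n_rows = X.shape[0]
    Y = X[p:]                                            # targets x_t for t >= p
    lags = [X[p - k : n_rows - k] for k in range(1, p + 1)]  # x_{t-1}, ..., x_{t-p}
    Z = np.hstack([np.ones((n_rows - p, 1))] + lags)     # design matrix with intercept
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    nu, B = coef[0], coef[1:]                            # intercept row, stacked lag matrices
    resid = Y - Z @ coef
    S = resid.T @ resid / len(Y)                         # residual covariance
    return nu, B, S


# Synthetic VAR(1) data with known parameters
rng = np.random.default_rng(0)
nu_true = np.array([0.5, -0.2])
B_true = np.array([[0.5, 0.1], [0.0, 0.3]])
X_full = np.zeros((5000, 2))
for t in range(1, 5000):
    X_full[t] = nu_true + X_full[t - 1] @ B_true + 0.1 * rng.standard_normal(2)

nu_hat, B_hat, S_hat = fit_varp_least_squares(X_full, p=1)
```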

get_gamma(n_cols: int) -> ndarray[tuple[int, ...], dtype[_ScalarType_co]]

Compute gamma.

If the noise matrix is not full-rank, defines the projection matrix keeping the sampling process in the relevant subspace. Rescales the process to avoid instabilities.

Parameters
n_cols: int

Number of variables in the data matrix

Returns
NDArray

Gamma matrix

get_loglikelihood(X: ndarray[tuple[int, ...], dtype[_ScalarType_co]]) -> float

Get the log-likelihood.

Value of the log-likelihood up to a constant for the provided X, using the attributes nu, B and S for the VAR(p) distribution.

Parameters
X: NDArray

Input matrix with variables in columns

Returns
float

Computed value
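As a point of reference, a Gaussian VAR(1) log-likelihood up to an additive constant can be written in a few lines. This sketch assumes p = 1, a row-vector convention and a particular choice of dropped constant (the 2π term); the function `var1_loglik` is hypothetical and the library's exact expression may differ.

```python
import numpy as np


def var1_loglik(X, nu, B, S):
    """Gaussian VAR(1) log-likelihood of the rows of X, without the 2*pi term."""
    resid = X[1:] - nu - X[:-1] @ B          # one-step prediction errors
    S_inv = np.linalg.inv(S)
    _, logdet = np.linalg.slogdet(S)
    quad = np.einsum("ti,ij,tj->", resid, S_inv, resid)
    return -0.5 * quad - 0.5 * len(resid) * logdet


nu = np.array([0.5, -0.2])
B = np.array([[0.5, 0.1], [0.0, 0.3]])
S = np.eye(2)

# A path that follows the recursion exactly has zero residuals
X_exact = np.zeros((4, 2))
for t in range(1, 4):
    X_exact[t] = nu + X_exact[t - 1] @ B

ll_exact = var1_loglik(X_exact, nu, B, S)
ll_perturbed = var1_loglik(X_exact + np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]), nu, B, S)
```

Perturbing any entry of the exact path lowers the value, since the quadratic term penalises every departure from the one-step prediction.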

gradient_X_loglik(X: ndarray[tuple[int, ...], dtype[_ScalarType_co]]) -> ndarray[tuple[int, ...], dtype[_ScalarType_co]]

Compute the gradient of the log-likelihood for the provided X.

It uses the attributes means and cov_inv for the VAR(p) distribution.

Parameters
X: NDArray

Input matrix with variables in columns

Returns
NDArray

The gradient of the log-likelihood with respect to the input variable X.
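The structure of such a gradient can be sketched for a VAR(1) model, where each observation enters two residuals: its own and the next one. The code below is a hedged illustration with hypothetical helpers (`var1_quadratic_loglik`, `var1_grad_loglik`) and explicit arguments nu, B and S_inv; it is checked against central finite differences rather than against the library.

```python
import numpy as np


def var1_quadratic_loglik(X, nu, B, S_inv):
    """Quadratic part of the VAR(1) log-likelihood (constants dropped)."""
    resid = X[1:] - nu - X[:-1] @ B
    return -0.5 * np.sum((resid @ S_inv) * resid)


def var1_grad_loglik(X, nu, B, S_inv):
    """Gradient of var1_quadratic_loglik with respect to every entry of X."""
    resid = X[1:] - nu - X[:-1] @ B
    W = resid @ S_inv          # rows r_t S^{-1}, with S_inv assumed symmetric
    grad = np.zeros_like(X)
    grad[1:] -= W              # x_t enters r_t with coefficient +1
    grad[:-1] += W @ B.T       # x_{t-1} enters r_t through -x_{t-1} @ B
    return grad


rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
nu = np.array([0.5, -0.2])
B = np.array([[0.5, 0.1], [0.0, 0.3]])
S_inv = np.linalg.inv(np.array([[1.0, 0.3], [0.3, 2.0]]))

# Central finite differences as an independent check
grad_fd = np.zeros_like(X)
eps = 1e-6
for idx in np.ndindex(*X.shape):
    step = np.zeros_like(X)
    step[idx] = eps
    grad_fd[idx] = (
        var1_quadratic_loglik(X + step, nu, B, S_inv)
        - var1_quadratic_loglik(X - step, nu, B, S_inv)
    ) / (2 * eps)

grad = var1_grad_loglik(X, nu, B, S_inv)
```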

init_imputation(X: ndarray[tuple[int, ...], dtype[_ScalarType_co]]) -> ndarray[tuple[int, ...], dtype[_ScalarType_co]]

First simple imputation before iterating.

Parameters
X: NDArray

Data matrix, with missing values

Returns
NDArray

Imputed matrix
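One plausible form of a simple first imputation is column-wise linear interpolation along the time axis. The sketch below is only an assumption about what "simple" might mean; the helper `simple_init_imputation` is hypothetical and the strategy actually used by the library may differ.

```python
import numpy as np


def simple_init_imputation(X):
    """Fill NaNs column by column with linear interpolation (hypothetical helper)."""
    X = np.asarray(X, dtype=float).copy()
    rows = np.arange(X.shape[0])
    for j in range(X.shape[1]):
        missing = np.isnan(X[:, j])
        if missing.all():
            X[:, j] = 0.0  # no information in this column: arbitrary fallback
        elif missing.any():
            # np.interp holds the nearest observed value beyond the first/last observation
            X[missing, j] = np.interp(rows[missing], rows[~missing], X[~missing, j])
    return X


X = np.array([[1, 1, 1, 1], [np.nan, np.nan, 3, 2], [1, 2, 2, 1], [2, 2, 2, 2]])
X_init = simple_init_imputation(X)
print(X_init[1])  # [1.  1.5 3.  2. ]
```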

pretreatment(X, mask_na) -> Tuple[ndarray[tuple[int, ...], dtype[_ScalarType_co]], ndarray[tuple[int, ...], dtype[_ScalarType_co]]]

Pretreat the data before imputation by EM, making it more robust.

In the case of the VAR(p) model, the naive imputation is frozen on the first observations if all variables are missing there, to avoid explosive imputations.

Parameters
X: NDArray

Data matrix without nans

mask_na: NDArray

Boolean matrix indicating which entries are to be imputed

Returns
Tuple[NDArray, NDArray]

A tuple containing:

- X: the pretreated data matrix
- mask_na: the updated mask
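The freezing step can be pictured as a mask update: rows among the first observations that are entirely missing are removed from the set of entries to impute, so their naive values are kept. The sketch below assumes that "first observations" means the first p rows; that reading, and the helper `freeze_leading_empty_rows`, are hypothetical.

```python
import numpy as np


def freeze_leading_empty_rows(X, mask_na, p):
    """Unmask fully missing rows among the first p rows (hypothetical helper)."""
    mask_na = mask_na.copy()
    all_missing = mask_na[:p].all(axis=1)   # rows with every variable missing
    mask_na[:p] &= ~all_missing[:, None]    # frozen rows are no longer imputed
    return X, mask_na


X = np.zeros((4, 2))  # already naively imputed, no NaNs left
mask_na = np.array([[True, True], [True, False], [True, True], [False, False]])
_, mask_frozen = freeze_leading_empty_rows(X, mask_na, p=2)
print(mask_frozen[0])  # [False False]
```

Only row 0 is frozen here: row 1 is partially observed, and row 2 lies beyond the first p rows.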

reset_learned_parameters()

Reset lists of parameters before starting a new estimation.

set_parameters(B: ndarray[tuple[int, ...], dtype[_ScalarType_co]], S: ndarray[tuple[int, ...], dtype[_ScalarType_co]])

Set the model parameters from a user value.

Parameters
B: NDArray

Specified value for the autoregression matrix

S: NDArray

Specified value for the noise covariance matrix

update_criteria_stop(X: ndarray[tuple[int, ...], dtype[_ScalarType_co]])

Update the variables used to compute the stopping criteria.

Parameters
X: NDArray

Input matrix with variables in columns
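A stopping rule of the kind suggested by the tolerance and stagnation_loglik parameters could look like the sketch below: stop when consecutive parameter estimates differ by less than a tolerance in L-infinity norm, or when the log-likelihood has barely moved over a few iterations. The function `has_converged` and its `window` argument are hypothetical, and the library's actual criteria may combine these quantities differently.

```python
import numpy as np


def has_converged(list_params, list_logliks, tolerance=1e-4, stagnation_loglik=2.0, window=5):
    """Check parameter convergence and log-likelihood stagnation (hypothetical helper)."""
    if len(list_params) < 2:
        return False
    # L-infinity norm of the change between the last two parameter estimates
    param_change = np.max(np.abs(list_params[-1] - list_params[-2]))
    if param_change < tolerance:
        return True
    # Log-likelihood stagnation over the last `window` iterations
    if len(list_logliks) > window:
        loglik_change = abs(list_logliks[-1] - list_logliks[-1 - window])
        return bool(loglik_change < stagnation_loglik)
    return False


stable = [np.array([1.0, 2.0]), np.array([1.00001, 2.00001])]
moving = [np.array([1.0, 2.0]), np.array([1.5, 2.0])]
print(has_converged(stable, [0.0, 1.0]))    # True
print(has_converged(moving, [0.0, 100.0]))  # False
```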

update_parameters(X: ndarray[tuple[int, ...], dtype[_ScalarType_co]]) -> None

Retain statistics relative to the current sample.

Parameters
X: NDArray

Input matrix with variables in columns