qolmat.benchmark.comparator.Comparator

class qolmat.benchmark.comparator.Comparator(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)

Comparator class.

This class implements a comparator for evaluating different imputation methods.

Parameters
dict_models: Dict[str, Any]

dictionary of imputation methods

columnwise_evaluation: Optional[bool], optional

whether the metric should be calculated column-wise or not, by default False

dict_config_opti: Optional[Dict[str, Dict[str, Union[str, float, int]]]]

dictionary giving the hyperparameter search space of each imputation method, by default {}

max_evals: int = 10

number of calls of the optimization algorithm, by default 10

__init__(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)

compare(df_origin: DataFrame, use_parallel: bool = True, n_jobs: int = -1, parallel_over: str = 'auto') → DataFrame

Compare different imputers in parallel, with hyperparameter optimization.

Parameters
df_origin: pd.DataFrame

DataFrame with missing values

n_splits: int, optional

number of ‘splits’, i.e. copies of the dataframe with artificial holes, by default 10

use_parallel: bool, optional

whether to run the comparison in parallel, by default True

n_jobs: int, optional

number of jobs to use for the parallelisation, by default -1

parallel_over: str, optional

what to parallelise over, ‘splits’ or ‘imputers’, by default “auto”

Returns
pd.DataFrame

DataFrame of results with a 2-level index. Columns are the imputers. Level 0 of the index holds the metrics and level 1 holds the column names.
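The layout described above can be explored with ordinary pandas indexing. The sketch below builds a stand-in frame by hand rather than calling compare; the imputer names, column names, index level names and all numbers are made up for illustration and are not output from qolmat.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by compare():
# level 0 of the index = metric, level 1 = column name, columns = imputers.
index = pd.MultiIndex.from_product(
    [["mae", "wmape"], ["col_a", "col_b"]],
    names=["metric", "column"],  # level names are an assumption
)
results = pd.DataFrame(
    {
        "imputer_1": [0.10, 0.20, 0.30, 0.40],  # illustrative values only
        "imputer_2": [0.15, 0.10, 0.35, 0.20],
    },
    index=index,
)

# All columns for a single metric (level 0 is dropped from the index).
mae_per_column = results.loc["mae"]

# Imputer with the lowest error for each (metric, column) pair.
best_per_row = results.idxmin(axis=1)

# Average of each metric over the columns, one row per metric.
mean_per_metric = results.groupby(level=0).mean()
```

Taking the minimum is only meaningful for metrics where lower is better, which holds for error measures such as mae and wmape.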

get_errors(df_origin: DataFrame, df_imputed: DataFrame, df_mask: DataFrame) → DataFrame

Get errors: estimate the quality of the reconstruction.

Parameters
df_origin: pd.DataFrame

reference/original signal

df_imputed: pd.DataFrame

imputed signal

df_mask: pd.DataFrame

masked dataframe (NA)

Returns
pd.DataFrame

DataFrame of results obtained via different metrics
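To illustrate the idea of scoring a reconstruction against a reference, here is a minimal sketch of a column-wise mean absolute error restricted to masked cells. It is not qolmat's implementation: masked_mae is a hypothetical helper, and it assumes df_mask is a boolean frame in which True marks the cells that were hidden before imputation.

```python
import pandas as pd


def masked_mae(
    df_origin: pd.DataFrame, df_imputed: pd.DataFrame, df_mask: pd.DataFrame
) -> pd.Series:
    """Column-wise mean absolute error over the masked cells only (hypothetical helper)."""
    abs_err = (df_origin - df_imputed).abs()
    # Keep errors where the mask is True; other cells become NaN and are
    # skipped by mean().
    return abs_err.where(df_mask).mean()


df_origin = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
df_mask = pd.DataFrame(
    {"a": [False, True, False, True], "b": [True, False, False, False]}
)
# Imputed values differ from the reference only on masked cells.
df_imputed = pd.DataFrame({"a": [1.0, 2.5, 3.0, 3.0], "b": [12.0, 20.0, 30.0, 40.0]})

mae = masked_mae(df_origin, df_imputed, df_mask)
print(mae)  # a: 0.75, b: 2.0
```

Restricting the score to masked cells matters because unmasked cells are copied through unchanged and would otherwise pull the error toward zero.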

static get_optimal_n_jobs(split_data: List, n_jobs: int = -1) → int

Determine the optimal number of parallel jobs to use.

If n_jobs is specified by the user (that is, not left at its default of -1), that value is used. Otherwise, the function returns the minimum of the number of CPU cores and the number of tasks (the length of split_data), so that no more jobs than tasks are launched.

Parameters
split_data: List

A collection of data to be processed in parallel. The length of this collection determines the number of tasks.

n_jobs: int

The number of jobs (parallel workers) to use, by default -1

Returns
int

The optimal number of jobs to run in parallel
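The rule described above fits in a few lines. The following is a sketch only, written with the standard library; optimal_n_jobs_sketch is a hypothetical name, and treating n_jobs == -1 as "not specified by the user" is an assumption based on the documented default.

```python
import os
from typing import List


def optimal_n_jobs_sketch(split_data: List, n_jobs: int = -1) -> int:
    """Pick a worker count following the documented rule (hypothetical helper)."""
    if n_jobs != -1:
        # Assumption: any value other than -1 is an explicit user choice
        # and is returned unchanged.
        return n_jobs
    n_cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    # Never launch more workers than there are tasks.
    return min(n_cpus, len(split_data))


tasks = ["split_0", "split_1", "split_2"]
explicit = optimal_n_jobs_sketch(tasks, n_jobs=2)  # explicit request is kept
automatic = optimal_n_jobs_sketch(tasks)  # at most len(tasks) workers
```

With three tasks, the automatic value can never exceed 3, whatever the number of cores on the machine.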

process_imputer(imputer_data: Tuple[str, Any, List[DataFrame], DataFrame]) → Tuple[str, DataFrame]

Process an imputer.

Parameters
imputer_data: Tuple[str, Any, List[pd.DataFrame], pd.DataFrame]

contains (imputer_name, imputer, all_masks, df_origin)

Returns
Tuple[str, pd.DataFrame]

imputer name, errors results

process_split(split_data: Tuple[int, DataFrame, DataFrame]) → DataFrame

Process a split.

Parameters
split_data: Tuple

contains (split_idx, df_mask, df_origin)

Returns
pd.DataFrame

errors results

Examples using qolmat.benchmark.comparator.Comparator

Benchmark for categorical data

Comparison of basic imputers

Tutorial for imputers based on diffusion models

Benchmark for time series
