`qolmat.benchmark.comparator`.Comparator

class qolmat.benchmark.comparator.Comparator(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)

Comparator class.

This class implements a comparator for evaluating different imputation methods.

Parameters

dict_models: Dict[str, Any]: dictionary of imputation methods
columnwise_evaluation: Optional[bool], optional: whether the metric should be calculated column-wise or not, by default False
dict_config_opti: Optional[Dict[str, Dict[str, Union[str, float, int]]]]: dictionary of search spaces, one per imputation method, by default {}
max_evals: int: number of calls of the optimization algorithm, by default 10

__init__(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)

compare(df_origin: DataFrame, use_parallel: bool = True, n_jobs: int = -1, parallel_over: str = 'auto') → DataFrame

Compare the imputers, optionally in parallel, with hyperparameter optimization.

Parameters

df_origin: pd.DataFrame: dataframe with missing values
n_splits: int, optional: number of ‘splits’, i.e. fake dataframes with artificial holes, by default 10
use_parallel: bool, optional: whether to run in parallel, by default True
n_jobs: int, optional: number of jobs to use for the parallelisation, by default -1
parallel_over: str, optional: ‘splits’ or ‘imputers’, by default “auto”

Returns

pd.DataFrame: DataFrame of results with a 2-level index. Columns are the imputers. Level 0 of the index holds the metrics and level 1 holds the column names.
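To make the described layout concrete, the sketch below builds a small DataFrame by hand with the same shape (imputers as columns, metrics at index level 0, column names at index level 1) and shows how it could be queried. It does not call `compare`: the imputer names, data column names, index level names and scores are all made up for illustration.

```python
import pandas as pd

# Hypothetical layout: two metrics (index level 0) x two data columns (index level 1).
# The level names "metric" and "column" are an assumption, not taken from the library.
index = pd.MultiIndex.from_product(
    [["mae", "wmape"], ["temperature", "pressure"]],
    names=["metric", "column"],
)

# Made-up scores for two hypothetical imputers (one DataFrame column per imputer).
results = pd.DataFrame(
    {
        "imputer_a": [0.5, 1.2, 0.10, 0.20],
        "imputer_b": [0.4, 1.5, 0.08, 0.25],
    },
    index=index,
)

# Select every row of one metric through level 0 of the index.
mae_scores = results.loc["mae"]

# For each (metric, column) pair, name the imputer with the lowest error.
best_imputer = results.idxmin(axis=1)

print(mae_scores)
print(best_imputer)
```

Selecting on level 0 with `.loc` returns a frame indexed by column name only, which is convenient for comparing imputers on a single metric.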

get_errors(df_origin: DataFrame, df_imputed: DataFrame, df_mask: DataFrame) → DataFrame

Compute errors to estimate the quality of the reconstruction.

Parameters

df_origin: pd.DataFrame: reference/original signal
df_imputed: pd.DataFrame: imputed signal
df_mask: pd.DataFrame: masked dataframe (NA)

Returns

pd.DataFrame: DataFrame of results obtained via different metrics
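As an illustration of the idea behind this method, the sketch below computes one metric, a column-wise mean absolute error, restricted to the masked cells. It is not the library's implementation: the helper `masked_mae` is hypothetical, the data is made up, and it assumes that `True` in the mask marks cells that were hidden before imputation.

```python
import pandas as pd

# Made-up reference values and imputed values for two columns.
df_origin = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
df_imputed = pd.DataFrame({"a": [1.0, 2.5, 3.0], "b": [10.0, 20.0, 27.0]})

# Assumed convention: True marks a cell that was artificially hidden.
df_mask = pd.DataFrame({"a": [False, True, False], "b": [False, False, True]})


def masked_mae(
    df_origin: pd.DataFrame, df_imputed: pd.DataFrame, df_mask: pd.DataFrame
) -> pd.Series:
    """Hypothetical helper: mean absolute error per column, on masked cells only."""
    abs_err = (df_origin - df_imputed).abs()
    # Cells outside the mask become NaN, and mean() skips NaN by default.
    return abs_err.where(df_mask).mean()


mae = masked_mae(df_origin, df_imputed, df_mask)
print(mae)
```

Restricting the error to masked cells matters because unmasked cells are copied through unchanged by an imputer and would otherwise pull the score towards zero.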

static get_optimal_n_jobs(split_data: List, n_jobs: int = -1) → int

Determine the optimal number of parallel jobs to use.

If n_jobs is specified by the user, that value is used. Otherwise, the function returns the minimum between the number of CPU cores and the number of tasks (i.e., the length of split_data), ensuring that no more jobs than tasks are launched.
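The rule described above can be sketched as follows. This is a standalone illustration, not the library's code: the function name `optimal_n_jobs_sketch` is hypothetical, and it assumes that the default value -1 is what "not specified by the user" means.

```python
import os
from typing import List


def optimal_n_jobs_sketch(split_data: List, n_jobs: int = -1) -> int:
    """Hypothetical re-statement of the rule described above."""
    # Assumption: -1 is the sentinel for "not specified by the user".
    if n_jobs != -1:
        return n_jobs
    # os.cpu_count() can return None, so fall back to a single core.
    n_cores = os.cpu_count() or 1
    # Never launch more jobs than there are tasks.
    return min(n_cores, len(split_data))


print(optimal_n_jobs_sketch(["task_1", "task_2", "task_3"], n_jobs=2))
print(optimal_n_jobs_sketch(["only_task"]))
```

Capping the job count at the number of tasks avoids starting workers that would have nothing to do.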

Parameters

split_data: List: A collection of data to be processed in parallel. The length of this collection determines the number of tasks.
n_jobs: int: The number of jobs (parallel workers) to use, by default -1

Returns