qolmat.benchmark.comparator.Comparator
- class qolmat.benchmark.comparator.Comparator(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)
Comparator class.
This class implements a comparator for evaluating different imputation methods.
- Parameters
- dict_models: Dict[str, Any]
dictionary of imputation methods
- columnwise_evaluation: Optional[bool], optional
whether the metric should be calculated column-wise or not, by default False
- dict_config_opti: Optional[Dict[str, Dict[str, Union[str, float, int]]]]
dictionary giving the hyperparameter search space for each imputation method, by default {}
- max_evals: int = 10
number of calls to the optimization algorithm, by default 10
- __init__(dict_models: Dict[str, Any], generator_holes: _HoleGenerator, metrics: List = ['mae', 'wmape', 'kl_columnwise'], dict_config_opti: Optional[Dict[str, Any]] = {}, metric_optim: str = 'mse', max_evals: int = 10, verbose: bool = False)
- compare(df_origin: DataFrame, use_parallel: bool = True, n_jobs: int = -1, parallel_over: str = 'auto') -> DataFrame
Compare different imputers in parallel, with hyperparameter optimization.
- Parameters
- df_origin: pd.DataFrame
DataFrame with missing values
- n_splits: int, optional
number of ‘splits’, i.e. copies of the dataframe with artificial holes, by default 10
- use_parallel: bool, optional
whether to run the comparison in parallel, by default True
- n_jobs: int, optional
number of jobs to use for the parallelisation, by default -1
- parallel_over: str, optional
what to parallelise over, ‘splits’ or ‘imputers’, by default ‘auto’
- Returns
- pd.DataFrame
DataFrame of results with a 2-level index. Columns are the imputers. Index level 0 holds the metrics and index level 1 holds the column names.
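The layout described above can be explored with ordinary pandas indexing. The following is a minimal sketch that builds a stand-in for the returned DataFrame by hand; the imputer names (`median`, `interpolation`), the column names (`col_a`, `col_b`) and every number are hypothetical placeholders, not actual Qolmat output.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by compare():
# index level 0 = metric, index level 1 = column name, columns = imputers.
index = pd.MultiIndex.from_product([["mae", "wmape"], ["col_a", "col_b"]])
results = pd.DataFrame(
    {
        "median": [0.5, 0.7, 0.1, 0.2],          # placeholder values
        "interpolation": [0.3, 0.9, 0.05, 0.25],  # placeholder values
    },
    index=index,
)

# Selecting one metric drops index level 0 and leaves one row per column.
mae_table = results.loc["mae"]

# For each data column, the imputer with the lowest MAE.
best_per_column = mae_table.idxmin(axis=1)
print(best_per_column)
```

Selecting on the first index level with `.loc` is usually the simplest way to compare imputers metric by metric.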
- get_errors(df_origin: DataFrame, df_imputed: DataFrame, df_mask: DataFrame) -> DataFrame
Get errors - estimate the reconstruction’s quality.
- Parameters
- df_origin: pd.DataFrame
reference/original signal
- df_imputed: pd.DataFrame
imputed signal
- df_mask: pd.DataFrame
masked dataframe (NA)
- Returns
- pd.DataFrame
DataFrame of results obtained via different metrics
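To make the idea of scoring a reconstruction concrete, here is a minimal sketch of a mean absolute error restricted to the masked cells. It is not the library's implementation: it assumes that `df_mask` is a boolean DataFrame in which `True` marks the cells that were hidden before imputation, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only).
df_origin = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]}
)
df_imputed = pd.DataFrame(
    {"a": [1.0, 2.5, 3.0, 4.0], "b": [10.0, 20.0, 33.0, 41.0]}
)
# Assumption: True marks a cell that was artificially hidden.
df_mask = pd.DataFrame(
    {"a": [False, True, False, False], "b": [False, False, True, True]}
)

# Keep absolute errors on masked cells only; other cells become NaN
# and are skipped by mean().
abs_err = (df_origin - df_imputed).abs().where(df_mask)
mae_per_column = abs_err.mean()
print(mae_per_column)
```

Restricting the error to masked cells matters because the unmasked cells are copied from the original and would otherwise pull the error toward zero.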
- static get_optimal_n_jobs(split_data: List, n_jobs: int = -1) -> int
Determine the optimal number of parallel jobs to use.
If n_jobs is specified by the user, that value is used. Otherwise, the function returns the minimum between the number of CPU cores and the number of tasks (i.e., the length of split_data), ensuring that no more jobs than tasks are launched.
- Parameters
- split_data: List
A collection of data to be processed in parallel. The length of this collection determines the number of tasks.
- n_jobs: int
The number of jobs (parallel workers) to use, by default -1
- Returns
- int
The optimal number of jobs to run in parallel
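The rule described above can be sketched in a few lines of standard-library Python. This is an illustrative re-implementation, not the library's code: the function name `optimal_n_jobs_sketch` is hypothetical, and it assumes that `-1` is the value meaning "not specified by the user".

```python
import os
from typing import Sequence


def optimal_n_jobs_sketch(split_data: Sequence, n_jobs: int = -1) -> int:
    """Illustrative version of the job-count rule (hypothetical helper)."""
    # Assumption: -1 means the caller did not request a specific count.
    if n_jobs != -1:
        return n_jobs
    # os.cpu_count() can return None on some platforms; fall back to 1.
    n_cores = os.cpu_count() or 1
    # Never launch more workers than there are tasks.
    return min(n_cores, len(split_data))


print(optimal_n_jobs_sketch(["split_1", "split_2", "split_3"], n_jobs=2))
print(optimal_n_jobs_sketch(["split_1"]))
```

Capping the worker count at the number of tasks avoids starting processes that would have nothing to do.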
- process_imputer(imputer_data: Tuple[str, Any, List[DataFrame], DataFrame]) -> Tuple[str, DataFrame]
Process an imputer.
- Parameters
- imputer_data: Tuple[str, Any, List[pd.DataFrame], pd.DataFrame]
tuple of (imputer_name, imputer, all_masks, df_origin)
- Returns
- Tuple[str, pd.DataFrame]
the imputer name and the DataFrame of its error results