diagnostics.sample_sufficiency
sample_sufficiency(
df,
input_cols,
outcome_col,
skip_validation=False,
max_gap_ratio=0.2,
min_r2_score=0.5,
max_avg_cv=0.15,
max_max_cv=0.3,
)
Runs a suite of statistical diagnostics to evaluate whether the current sample size is sufficient.
This function runs three tests: input space coverage, basic model fit (signal-to-noise), and prediction stability via bootstrapping. Each metric is compared against a user-defined threshold to decide whether the sample meets the sufficiency criteria required for reliable probability of detection (PoD) analysis.
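To make the bootstrap stability idea concrete, here is a minimal sketch using NumPy only. It is not the library's implementation: the helper name `bootstrap_relative_width`, the straight-line model, the number of resamples, and the use of a coefficient of variation (standard deviation divided by absolute mean) as the "relative width" are all assumptions made for illustration.

```python
import numpy as np

def bootstrap_relative_width(x, y, n_boot=200, degree=1, seed=0):
    """Hypothetical helper: relative spread of bootstrap predictions on a grid.

    Assumes a simple polynomial model; the model used by sample_sufficiency
    itself is not specified here.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.linspace(x.min(), x.max(), 20)
    preds = np.empty((n_boot, grid.size))
    for b in range(n_boot):
        # Resample rows with replacement and refit the model
        idx = rng.integers(0, x.size, size=x.size)
        coeffs = np.polyfit(x[idx], y[idx], deg=degree)
        preds[b] = np.polyval(coeffs, grid)
    # Coefficient of variation at each grid point: std / |mean|
    return preds.std(axis=0) / np.abs(preds.mean(axis=0))

length = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
signal = [2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 15.9, 18.2, 20.0]
cv = bootstrap_relative_width(length, signal)
print("average relative width:", cv.mean())
print("largest relative width:", cv.max())
```

Under these assumptions, a small average and maximum relative width indicate that refitting on resampled data barely changes the predictions, which is the kind of stability that `max_avg_cv` and `max_max_cv` are meant to bound.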
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| df | pd.DataFrame | The simulation dataset containing inputs and outcomes. | required |
| input_cols | List[str] | A list of the input parameter column names. | required |
| outcome_col | str | The name of the outcome/signal column. | required |
| skip_validation | bool | If True, skips the initial data cleaning step. Defaults to False. | False |
| max_gap_ratio | float | The maximum allowable gap between data points as a fraction of the total range. Defaults to 0.20. | 0.2 |
| min_r2_score | float | The minimum cross-validated R-squared score required to pass the fit test. Defaults to 0.50. | 0.5 |
| max_avg_cv | float | The maximum allowable average relative width of the bootstrap predictions. Defaults to 0.15. | 0.15 |
| max_max_cv | float | The maximum allowable relative width at the tail ends (10th and 90th percentiles) of the predictions. Defaults to 0.30. | 0.3 |
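As an illustration of what `max_gap_ratio` bounds, the sketch below computes the largest gap between neighbouring sorted values as a fraction of the total range, following the parameter description above. The helper `largest_gap_ratio` is hypothetical, and the library's exact calculation may differ (for example in how it handles duplicates or multiple input columns).

```python
import numpy as np

def largest_gap_ratio(values):
    """Hypothetical helper: largest neighbouring gap divided by the total range."""
    v = np.unique(np.asarray(values, dtype=float))  # sorted unique values
    if v.size < 2:
        return float("nan")  # a gap is undefined with fewer than two distinct points
    return float(np.diff(v).max() / (v[-1] - v[0]))

# Evenly spaced inputs: every gap is 1 over a range of 9
even = largest_gap_ratio([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(round(even, 3))   # 0.111

# Inputs with a hole in the middle: the largest gap is 5 over a range of 9
holey = largest_gap_ratio([1, 2, 3, 8, 9, 10])
print(round(holey, 3))  # 0.556
```

With the default `max_gap_ratio=0.2`, the evenly spaced sample would fall under the threshold in this sketch, while the sample with a hole would not.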
Returns
| Name | Type | Description |
|---|---|---|
| | pd.DataFrame | A formatted table detailing the results of each diagnostic test, including the variable tested, the calculated metric, the target threshold, and a boolean ‘Pass’ status. |
Examples
import pandas as pd
from digiqual.diagnostics import sample_sufficiency

# Build a small example DataFrame of simulation results
df = pd.DataFrame({
    'Length': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Signal': [2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 15.9, 18.2, 20.0],
})

# Run diagnostics with custom, stricter thresholds
results_df = sample_sufficiency(
    df=df,
    input_cols=['Length'],
    outcome_col='Signal',
    max_gap_ratio=0.15,  # Require tighter spacing
    min_r2_score=0.70,   # Require a stronger signal fit
)
print(results_df)
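A common follow-up is to gate further analysis on the returned table. The sketch below assumes the boolean status column is named `Pass`, as the Returns description suggests. The other column names and the values in `mock_results` are made-up stand-ins, not real output, so the snippet runs without the library installed.

```python
import pandas as pd

def all_checks_pass(results, pass_col="Pass"):
    """Hypothetical helper: True only if every diagnostic row passed."""
    return bool(results[pass_col].astype(bool).all())

# Stand-in for the table returned by sample_sufficiency (illustrative values only)
mock_results = pd.DataFrame({
    "Test": ["Coverage", "Model fit", "Bootstrap stability"],
    "Variable": ["Length", "Signal", "Signal"],
    "Pass": [True, True, False],
})

# Rows that did not meet their threshold
failed = mock_results.loc[~mock_results["Pass"].astype(bool)]

if all_checks_pass(mock_results):
    print("Sample looks sufficient; proceed to PoD analysis.")
else:
    print("Insufficient sample; failing checks:")
    print(failed)
```

In real use, `mock_results` would be replaced by `results_df`, with the column names adjusted to match the actual output.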