diagnostics.sample_sufficiency

sample_sufficiency(
    df,
    input_cols,
    outcome_col,
    skip_validation=False,
    max_gap_ratio=0.2,
    min_r2_score=0.5,
    max_avg_cv=0.15,
    max_max_cv=0.3,
)

Runs a suite of statistical diagnostics to evaluate whether the current sample size is sufficient.

This function tests input space coverage, basic model fit (signal-to-noise), and prediction stability via bootstrapping. It compares each metric against user-defined thresholds to determine whether the sample meets the sufficiency criteria required for reliable PoD analysis.
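To make the bootstrap stability idea concrete, here is a minimal sketch using only NumPy. It is not the library's implementation: the linear model, the number of resamples, the prediction grid, and the definition of relative width as standard deviation divided by absolute mean are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data mirroring the example on this page.
length = np.arange(1, 11, dtype=float)
signal = np.array([2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 15.9, 18.2, 20.0])

# Points at which predictions from each resampled fit are compared (assumed grid).
grid = np.linspace(length.min(), length.max(), 25)

n_boot = 200  # assumed number of bootstrap resamples
preds = np.empty((n_boot, grid.size))
for b in range(n_boot):
    # Resample rows with replacement and refit a straight line.
    idx = rng.integers(0, length.size, size=length.size)
    slope, intercept = np.polyfit(length[idx], signal[idx], deg=1)
    preds[b] = slope * grid + intercept

# Relative spread of the bootstrap predictions at each grid point
# (assumed definition: standard deviation / |mean|).
cv = preds.std(axis=0) / np.abs(preds.mean(axis=0))
avg_cv = float(cv.mean())
max_cv = float(cv.max())

print(f"average relative width: {avg_cv:.4f}")
print(f"largest relative width: {max_cv:.4f}")
```

Under these assumptions, a small average and maximum relative width suggest that refitting on a resampled dataset barely changes the predictions, which is the kind of stability the `max_avg_cv` and `max_max_cv` thresholds are meant to bound.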

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | pd.DataFrame | The simulation dataset containing inputs and outcomes. | required |
| input_cols | List[str] | A list of the input parameter column names. | required |
| outcome_col | str | The name of the outcome/signal column. | required |
| skip_validation | bool | If True, skips the initial data cleaning step. | False |
| max_gap_ratio | float | The maximum allowable gap between data points as a fraction of the total range. | 0.2 |
| min_r2_score | float | The minimum cross-validated R-squared score required to pass the fit test. | 0.5 |
| max_avg_cv | float | The maximum allowable average relative width of the bootstrap predictions. | 0.15 |
| max_max_cv | float | The maximum allowable relative width at the tail ends (10th and 90th percentiles) of the predictions. | 0.3 |
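The coverage threshold can be illustrated with a short sketch. The helper below, `largest_gap_ratio`, is hypothetical and not part of the library; it assumes the metric is the largest gap between neighbouring sorted values of one input column, divided by that column's total range, which is one reading of the `max_gap_ratio` description.

```python
import numpy as np
import pandas as pd


def largest_gap_ratio(values: pd.Series) -> float:
    """Largest gap between neighbouring sorted values, as a fraction of the range.

    Hypothetical helper for illustration; returns NaN when the ratio is undefined.
    """
    x = np.sort(values.dropna().to_numpy(dtype=float))
    if x.size < 2:
        return float("nan")
    value_range = x[-1] - x[0]
    if value_range == 0:
        return float("nan")
    return float(np.diff(x).max() / value_range)


# Evenly spaced inputs: every gap is 1 over a range of 9.
even = largest_gap_ratio(pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

# The same range with a hole between 3 and 8: the largest gap is 5 over a range of 9.
holed = largest_gap_ratio(pd.Series([1, 2, 3, 8, 9, 10]))

print(even <= 0.2)   # True: 1/9 is below the default threshold
print(holed <= 0.2)  # False: 5/9 exceeds it
```

Under this reading, evenly spaced samples pass the default threshold while a dataset with an unsampled region in the middle of the range does not, even though both span the same interval.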

Returns

| Name | Type | Description |
| --- | --- | --- |
| | pd.DataFrame | A formatted table detailing the results of each diagnostic test, including the variable tested, the calculated metric, the target threshold, and a boolean ‘Pass’ status. |
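A returned table of this shape can be summarised with ordinary pandas operations. The sketch below uses a stand-in DataFrame rather than real output: the values are made up, and every column name other than `Pass` is an assumption, since the exact column labels are not listed here.

```python
import pandas as pd

# Stand-in for the returned table. Values are invented for illustration and
# all column names except 'Pass' are hypothetical.
results_df = pd.DataFrame({
    "Test": ["Coverage", "Model fit", "Bootstrap stability"],
    "Variable": ["Length", "Signal", "Signal"],
    "Metric": [0.11, 0.99, 0.21],
    "Threshold": [0.15, 0.70, 0.15],
    "Pass": [True, True, False],
})

# Rows where the diagnostic did not meet its threshold.
failed = results_df.loc[~results_df["Pass"]]

# The sample is treated as sufficient only if every diagnostic passed.
is_sufficient = bool(results_df["Pass"].all())

print(failed)
print("Sufficient:", is_sufficient)
```

Collapsing the boolean column with `.all()` gives a single go/no-go flag, while the filtered rows show which diagnostic to address, for example by adding simulations in a sparsely covered region.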

Examples

import pandas as pd
from digiqual.diagnostics import sample_sufficiency

# Build a small example DataFrame standing in for loaded simulation results
df = pd.DataFrame({
    'Length': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Signal': [2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 15.9, 18.2, 20.0]
})

# Run diagnostics with custom stricter thresholds
results_df = sample_sufficiency(
    df=df,
    input_cols=['Length'],
    outcome_col='Signal',
    max_gap_ratio=0.15,  # Require tighter spacing
    min_r2_score=0.70    # Require a stronger signal fit
)

print(results_df)