diagnostics.sample_sufficiency

sample_sufficiency(
    df,
    input_cols,
    outcome_col,
    skip_validation=False,
    max_gap_ratio=0.2,
    min_r2_score=0.5,
    max_avg_cv=0.15,
    max_max_cv=0.3,
    max_allowed_vif=5.0,
)

Runs a suite of statistical diagnostics to evaluate whether the current sample size is sufficient.

This function tests four things: input space coverage, basic model fit (signal-to-noise), prediction stability via bootstrapping, and multicollinearity. Each metric is compared against a user-defined threshold to decide whether the sample meets the sufficiency criteria required for reliable PoD analysis.
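To make the coverage check concrete, the sketch below computes one plausible definition of a gap ratio: the largest gap between adjacent sorted values of an input, divided by that input's total range. This is an assumption based on the `max_gap_ratio` description, not the library's confirmed implementation, and the helper name `largest_gap_ratio` is hypothetical.

```python
import numpy as np

def largest_gap_ratio(values):
    """Largest gap between adjacent sorted values, as a fraction of the range.

    Hypothetical helper: illustrates the idea behind max_gap_ratio only.
    """
    x = np.unique(np.asarray(values, dtype=float))  # sorted, de-duplicated
    if x.size < 2:
        return float("nan")  # the range is undefined for fewer than two points
    return float(np.diff(x).max() / (x[-1] - x[0]))

# Ten evenly spaced points: every gap is 1 and the range is 9.
lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ratio = largest_gap_ratio(lengths)
print(round(ratio, 3))
```

Under this definition, evenly spaced data gives the smallest possible ratio for a given sample size, so a failing coverage check would point to a region of the input range with no simulations.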

Parameters

Name	Type	Description	Default
df	pd.DataFrame	The simulation dataset containing inputs and outcomes.	required
input_cols	List[str]	A list of the input parameter column names.	required
outcome_col	str	The name of the outcome/signal column.	required
skip_validation	bool	If True, skips the initial data cleaning step.	`False`
max_gap_ratio	float	The maximum allowable gap between adjacent data points, as a fraction of the total input range.	`0.2`
min_r2_score	float	The minimum cross-validated R-squared score required to pass the fit test.	`0.5`
max_avg_cv	float	The maximum allowable average relative width of the bootstrap predictions.	`0.15`
max_max_cv	float	The maximum allowable relative width at the tail ends (10th and 90th percentiles) of the predictions.	`0.3`
max_allowed_vif	float	The maximum allowable Variance Inflation Factor (VIF); inputs with a VIF above this value are flagged as multicollinear.	`5.0`
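For readers unfamiliar with the `max_allowed_vif` threshold, the following is a minimal sketch of the textbook VIF calculation: regress each input on all the others and take 1 / (1 - R²). It uses only NumPy, the helper name `variance_inflation_factors` and the synthetic columns are hypothetical, and the library's own computation may differ.

```python
import numpy as np

def variance_inflation_factors(X):
    """Textbook VIF for each column of X (rows are samples).

    Hypothetical helper: VIF_j = 1 / (1 - R2_j), where R2_j comes from
    regressing column j on the remaining columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        design = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        ss_res = float(np.sum((y - design @ coef) ** 2))
        ss_tot = float(np.sum((y - y.mean()) ** 2))
        # 1 / (1 - R2) simplifies to ss_tot / ss_res
        vifs.append(float("inf") if ss_res == 0.0 else ss_tot / ss_res)
    return vifs

# Synthetic inputs: 'c' is nearly a copy of 'a', while 'b' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
flagged = [name for name, v in zip("abc", vifs) if v > 5.0]
print(flagged)
```

With a single input column there is nothing to regress against, so this sketch returns a VIF of 1 for it.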

Returns

Name	Type	Description
	pd.DataFrame	A formatted table detailing the results of each diagnostic test, including the variable tested, the calculated metric, the target threshold, and a boolean ‘Pass’ status.
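Since the returned table carries a boolean ‘Pass’ status, a common follow-up is to list only the checks that failed. The sketch below uses a hand-built stand-in table rather than real output: only the `Pass` column comes from the description above, while the other column names and all values are hypothetical and purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the returned table. Only 'Pass' is documented;
# the other column names and every value here are made up for illustration.
results_df = pd.DataFrame({
    "Test": ["Input Coverage", "Model Fit", "Multicollinearity"],
    "Variable": ["Length", "Signal", "Length"],
    "Metric": [0.11, 0.42, 1.0],
    "Threshold": [0.20, 0.50, 5.0],
    "Pass": [True, False, True],
})

# Keep only the rows whose check did not pass.
failed = results_df[~results_df["Pass"]]

# Overall verdict: sufficient only if every check passed.
all_passed = bool(results_df["Pass"].all())

print(failed)
print("Sample sufficient:", all_passed)
```

If the real table stores the status under a different column name or as text, the filter would need adjusting accordingly.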

Examples

import pandas as pd
from digiqual.diagnostics import sample_sufficiency

# Build a small example DataFrame of simulation results
df = pd.DataFrame({
    'Length': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Signal': [2.1, 4.0, 6.2, 8.1, 9.9, 12.0, 14.1, 15.9, 18.2, 20.0]
})

# Run diagnostics with custom stricter thresholds
results_df = sample_sufficiency(
    df=df,
    input_cols=['Length'],
    outcome_col='Signal',
    max_gap_ratio=0.15,  # Require tighter spacing
    min_r2_score=0.70    # Require a stronger signal fit
)

print(results_df)