A streamlined guide using the SimulationStudy manager.
Use the SimulationStudy manager to handle the entire lifecycle: storage, diagnostics, active refinement, analysis, and intelligent caching.
Note
In this tutorial, we will intentionally feed the study “bad” data (with a large gap in the input space) to demonstrate how the active learning module automatically identifies and fixes issues. We will also demonstrate how the new Layered Caching engine makes exploring multi-dimensional parameter spaces nearly instantaneous.
1. Setup, Generation & Smart Initialization
We’ll define a simulated physics function with heteroscedastic noise and generate an initial design containing a deliberate gap between 3mm and 7mm.
Notice that we no longer need to explicitly define our input columns when creating the SimulationStudy; the manager will automatically infer them from the data!
import numpy as npimport pandas as pdimport timefrom digiqual import SimulationStudy# --- Define the Physics ---def apply_physics(df):"""Simulates a signal with quadratic trends and heteroscedastic noise.""" signal = (10.0+ (2.0* df["Length"])+ (0.5* df["Length"] **2)- (0.1* np.abs(df["Angle"])) ) noise_scale =0.5+ (1.5* df["Roughness"]) noise = np.random.normal(loc=0, scale=noise_scale, size=len(df))return signal + noise# --- Create Flawed Data (Gap between 3mm and 7mm) ---df1 = pd.DataFrame( {"Length": np.random.uniform(0.1, 3.0, 40),"Angle": np.random.uniform(-10, 10, 40),"Roughness": np.random.uniform(0, 0.5, 40), })df2 = pd.DataFrame( {"Length": np.random.uniform(7.0, 10.0, 40),"Angle": np.random.uniform(-10, 10, 40),"Roughness": np.random.uniform(0, 0.5, 40), })df_initial = pd.concat([df1, df2], ignore_index=True)df_initial["Signal"] = apply_physics(df_initial)# --- The NEW Smarter Initialization ---# 1. Initialize with a clean slatestudy = SimulationStudy()# 2. Add data: It automatically infers 'Length', 'Angle', 'Roughness' as inputs!study.add_data(df_initial, outcome_col="Signal", overwrite=True)# Test the UI Helper methodsprint(f"Inferred Inputs: {study.inputs}")threshold_defaults = study.get_data_summary("Signal")print(f"Auto-Threshold suggestion (Median Signal): {threshold_defaults['median']:.2f}")
Data initialized/overwritten. Total rows: 80
Inferred Inputs: ['Length', 'Angle', 'Roughness']
Auto-Threshold suggestion (Median Signal): 34.23
2. Diagnosis (Detecting the Issue)
We ask the SimulationStudy manager to diagnose the health of our experiment.
Warning
Because of our deliberate gap, we expect the diagnostic engine to flag the Input Coverage test as failed.
Instead of manually guessing where to add new simulations, we use the refine() method.
Detect: It automatically targets the exact region where the gap or uncertainty exists.
Generate: It outputs a dataframe of new input coordinates.
Execute & Store: Appending the results back using add_data() automatically clears the mathematical caches to ensure your next analysis uses the updated physics!
# Generate active learning samples (20 points to fill the gap)new_samples = study.refine(n_points=20)ifnot new_samples.empty:# Apply the exact same physics model to the new points new_samples["Signal"] = apply_physics(new_samples)# Add back to study (This strips metadata and drops the caches!) study.add_data(new_samples)# Re-run diagnostics to confirm the green lightstudy.diagnose()
Diagnostics flagged issues. Initiating Active Learning...
-> Strategy: Exploration (Filling gaps in Length)
Data appended. Total rows: 100
Running validation...
Validation passed. 100 valid rows ready.
Checking sample sufficiency...
Test
Variable
Metric
Value
Threshold
Pass
0
Input Coverage
Length
Max Gap Ratio
0.0785
< 0.20
True
1
Input Coverage
Angle
Max Gap Ratio
0.0651
< 0.20
True
2
Input Coverage
Roughness
Max Gap Ratio
0.0598
< 0.20
True
3
Model Fit (CV)
Signal
Mean R2 Score
0.9962
> 0.50
True
4
Bootstrap Convergence
Signal
Avg CV (Rel Std Dev)
0.0136
< 0.15
True
5
Bootstrap Convergence
Signal
Max CV (Rel Std Dev)
0.0282
< 0.30
True
4. Reliability Analysis & Layered Caching
With a validated dataset, we can now run the complete PoD pipeline.
Tip
The SimulationStudy engine uses a Layered Caching Architecture.
When you run .pod() for the first time, it evaluates all possible models and stores the mathematical heavy-lifting in memory. If you want to change a parameter slice (e.g., investigating a different ‘Angle’), you can use .update_slice() to instantly generate a new PoD curve without refitting the model!
The single pod() method automatically handles:
Monte-Carlo Integration (marginalising out Nuisance variables)
Model Selection (finding the best polynomial fit)
Distribution Inference (finding the best error distribution)
Bootstrapping (generating confidence bounds)
# --- RUN 1: Baseline Analysis ---# Expect this to take a moment as it trains models and optimizes variance bandwidtht0 = time.time()_ = study.pod( poi_col="Length", nuisance_col=["Roughness"], # Integrate over roughness slice_values={"Angle": 0.0}, # Hold Angle strictly at 0.0! threshold=34.0, n_boot=0, # Skipping bootstrap for the speed demonstration)print(f"Run 1 completed in: {time.time() - t0:.4f} seconds")# --- RUN 2: Interactive Slicing ---# Changing the constant 'Angle' slice triggers a Cache Hit for the model and variance!t1 = time.time()new_results = study.update_slice({"Angle": 5.0})print(f"Run 2 completed in: {time.time() - t1:.4f} seconds (Near Instant!)")print(f"New a90/95 point at Angle 5.0: {new_results['a90_95']:.3f}")
--- Starting Reliability Analysis (PoIs: ['Length'] - Nuisance: ['Roughness']) ---
-> Slicing surface at: {'Angle': 0.0}
1. Training all surrogate models (Cache Miss)...
-> Selected Model: Polynomial (Degree 2)
2. Fitting Variance Model & Inferring Distribution (Cache Miss)...
-> Optimizing bandwidth via LOO-CV...
-> Calculating Total-Order Sobol Indices...
-> Smoothing Bandwidth: 3.3288
-> Selected Distribution: norm
3. Integrating PoD Curve from Scratch (Cache Miss)...
4. Skipping Bootstrap (n_boot=0)...
--- Analysis Complete ---
Run 1 completed in: 0.8890 seconds
--- Starting Reliability Analysis (PoIs: ['Length'] - Nuisance: ['Roughness']) ---
-> Slicing surface at: {'Angle': 5.0}
1. Loading surrogate models from cache (Cache Hit)...
-> Selected Model: Polynomial (Degree 2)
2. Loading Variance Model, Distribution & Sobol from cache (Cache Hit)...
-> Smoothing Bandwidth: 3.3288
-> Selected Distribution: norm
3. Integrating PoD Curve from Scratch (Cache Miss)...
4. Skipping Bootstrap (n_boot=0)...
--- Analysis Complete ---
Run 2 completed in: 0.2263 seconds (Near Instant!)
New a90/95 point at Angle 5.0: 5.498
# Visualise the results of the final active slicestudy.visualise()
(a) Best Fitting Model (Statistics)
(b) Signal Response Model (Physics)
(c) Probability of Detection Curve (Reliability)
Figure 1: Reliability Analysis Results
Appendix: Full Script
import numpy as npimport pandas as pdimport timefrom digiqual import SimulationStudy# --- 1. Define the Physics ---def apply_physics(df):"""Simulates a signal with quadratic trends and heteroscedastic noise.""" signal = (10.0+ (2.0* df["Length"])+ (0.5* df["Length"] **2)- (0.1* np.abs(df["Angle"])) ) noise_scale =0.5+ (1.5* df["Roughness"]) noise = np.random.normal(loc=0, scale=noise_scale, size=len(df))return signal + noise# --- 2. Create Flawed Data ---df1 = pd.DataFrame( {"Length": np.random.uniform(0.1, 3.0, 40),"Angle": np.random.uniform(-10, 10, 40),"Roughness": np.random.uniform(0, 0.5, 40), })df2 = pd.DataFrame( {"Length": np.random.uniform(7.0, 10.0, 40),"Angle": np.random.uniform(-10, 10, 40),"Roughness": np.random.uniform(0, 0.5, 40), })df_initial = pd.concat([df1, df2], ignore_index=True)df_initial["Signal"] = apply_physics(df_initial)# --- 3. Initialize & Ingest ---study = SimulationStudy()study.add_data(df_initial, outcome_col="Signal", overwrite=True)# --- 4. Diagnose & Refine ---study.diagnose()new_samples = study.refine(n_points=20)ifnot new_samples.empty: new_samples["Signal"] = apply_physics(new_samples) study.add_data(new_samples)# --- 5. Fast-Path Reliability Exploration (Caching) ---# Calculate the base physics model (n_boot=0) to prime Layers 1 and 2study.pod(poi_col="Length", threshold=15.0, n_boot=0)# Generate a 100-threshold spectrum matrix (Layer 4 Cache) instantlyspectrum = study.compute_pod_spectrum(poi_col="Length")print("Spectrum generated for thresholds:", spectrum["thresholds"][:3], "...")# --- 6. Uncertainty Quantification (Bootstrapping) ---# Run the heavy, parallelised bootstrap to get 95% Confidence Intervalsresults = study.pod(poi_col="Length", threshold=15.0, n_boot=1000, n_jobs=-1)print(f"Classical a90/95: {results['a90_95']:.2f}")# --- 7. Visualisation ---study.visualise()