SimulationStudy Class

A streamlined guide using the SimulationStudy manager.

Use the SimulationStudy manager to handle the entire lifecycle: storage, diagnostics, active refinement, analysis, and intelligent caching.

Note

In this tutorial, we will intentionally feed the study “bad” data (with a large gap in the input space) to demonstrate how the active learning module automatically identifies and fixes issues. We will also demonstrate how the new Layered Caching engine makes exploring multi-dimensional parameter spaces nearly instantaneous.

1. Setup, Generation & Smart Initialization

We’ll define a simulated physics function with heteroscedastic noise and generate an initial design containing a deliberate gap between 3mm and 7mm.

Notice that we no longer need to explicitly define our input columns when creating the SimulationStudy; the manager will automatically infer them from the data!

import numpy as np
import pandas as pd
import time
from digiqual import SimulationStudy


# --- Define the Physics ---
def apply_physics(df):
    """Simulates a signal with quadratic trends and heteroscedastic noise."""
    signal = (
        10.0
        + (2.0 * df["Length"])
        + (0.5 * df["Length"] ** 2)
        - (0.1 * np.abs(df["Angle"]))
    )
    noise_scale = 0.5 + (1.5 * df["Roughness"])
    noise = np.random.normal(loc=0, scale=noise_scale, size=len(df))
    return signal + noise


# --- Create Flawed Data (Gap between 3mm and 7mm) ---
df1 = pd.DataFrame(
    {
        "Length": np.random.uniform(0.1, 3.0, 40),
        "Angle": np.random.uniform(-10, 10, 40),
        "Roughness": np.random.uniform(0, 0.5, 40),
    }
)
df2 = pd.DataFrame(
    {
        "Length": np.random.uniform(7.0, 10.0, 40),
        "Angle": np.random.uniform(-10, 10, 40),
        "Roughness": np.random.uniform(0, 0.5, 40),
    }
)

df_initial = pd.concat([df1, df2], ignore_index=True)
df_initial["Signal"] = apply_physics(df_initial)

# --- The NEW Smarter Initialization ---
# 1. Initialize with a clean slate
study = SimulationStudy()

# 2. Add data: It automatically infers 'Length', 'Angle', 'Roughness' as inputs!
study.add_data(df_initial, outcome_col="Signal", overwrite=True)

# Test the UI Helper methods
print(f"Inferred Inputs: {study.inputs}")
threshold_defaults = study.get_data_summary("Signal")
print(f"Auto-Threshold suggestion (Median Signal): {threshold_defaults['median']:.2f}")
Data initialized/overwritten. Total rows: 80
Inferred Inputs: ['Length', 'Angle', 'Roughness']
Auto-Threshold suggestion (Median Signal): 34.23

2. Diagnosis (Detecting the Issue)

We ask the SimulationStudy manager to diagnose the health of our experiment.

Warning

Because of our deliberate gap, we expect the diagnostic engine to flag the Input Coverage test as failed.

study.diagnose()
Running validation...
Validation passed. 80 valid rows ready.
Checking sample sufficiency...
Test Variable Metric Value Threshold Pass
0 Input Coverage Length Max Gap Ratio 0.4200 < 0.20 False
1 Input Coverage Angle Max Gap Ratio 0.0749 < 0.20 True
2 Input Coverage Roughness Max Gap Ratio 0.0598 < 0.20 True
3 Model Fit (CV) Signal Mean R2 Score 0.9978 > 0.50 True
4 Bootstrap Convergence Signal Avg CV (Rel Std Dev) 0.0138 < 0.15 True
5 Bootstrap Convergence Signal Max CV (Rel Std Dev) 0.0275 < 0.30 True

3. Adaptive Refinement (The Fix)

Instead of manually guessing where to add new simulations, we use the refine() method.

  1. Detect: It automatically targets the exact region where the gap or uncertainty exists.
  2. Generate: It outputs a dataframe of new input coordinates.
  3. Execute & Store: Appending the results back using add_data() automatically clears the mathematical caches to ensure your next analysis uses the updated physics!
# Generate active learning samples (20 points to fill the gap)
new_samples = study.refine(n_points=20)

if not new_samples.empty:
    # Apply the exact same physics model to the new points
    new_samples["Signal"] = apply_physics(new_samples)

    # Add back to study (This strips metadata and drops the caches!)
    study.add_data(new_samples)

# Re-run diagnostics to confirm the green light
study.diagnose()
Diagnostics flagged issues. Initiating Active Learning...
 -> Strategy: Exploration (Filling gaps in Length)
Data appended. Total rows: 100
Running validation...
Validation passed. 100 valid rows ready.
Checking sample sufficiency...
Test Variable Metric Value Threshold Pass
0 Input Coverage Length Max Gap Ratio 0.0785 < 0.20 True
1 Input Coverage Angle Max Gap Ratio 0.0651 < 0.20 True
2 Input Coverage Roughness Max Gap Ratio 0.0598 < 0.20 True
3 Model Fit (CV) Signal Mean R2 Score 0.9962 > 0.50 True
4 Bootstrap Convergence Signal Avg CV (Rel Std Dev) 0.0136 < 0.15 True
5 Bootstrap Convergence Signal Max CV (Rel Std Dev) 0.0282 < 0.30 True

4. Reliability Analysis & Layered Caching

With a validated dataset, we can now run the complete PoD pipeline.

Tip

The SimulationStudy engine uses a Layered Caching Architecture.

When you run .pod() for the first time, it evaluates all possible models and stores the mathematical heavy-lifting in memory. If you want to change a parameter slice (e.g., investigating a different ‘Angle’), you can use .update_slice() to instantly generate a new PoD curve without refitting the model!

The single pod() method automatically handles:

  • Monte-Carlo Integration (marginalising out Nuisance variables)
  • Model Selection (finding the best polynomial fit)
  • Distribution Inference (finding the best error distribution)
  • Bootstrapping (generating confidence bounds)
# --- RUN 1: Baseline Analysis ---
# Expect this to take a moment as it trains models and optimizes variance bandwidth
t0 = time.time()
_ = study.pod(
    poi_col="Length",
    nuisance_col=["Roughness"],  # Integrate over roughness
    slice_values={"Angle": 0.0},  # Hold Angle strictly at 0.0!
    threshold=34.0,
    n_boot=0,  # Skipping bootstrap for the speed demonstration
)
print(f"Run 1 completed in: {time.time() - t0:.4f} seconds")


# --- RUN 2: Interactive Slicing ---
# Changing the constant 'Angle' slice triggers a Cache Hit for the model and variance!
t1 = time.time()
new_results = study.update_slice({"Angle": 5.0})
print(f"Run 2 completed in: {time.time() - t1:.4f} seconds (Near Instant!)")
print(f"New a90/95 point at Angle 5.0: {new_results['a90_95']:.3f}")
--- Starting Reliability Analysis (PoIs: ['Length'] - Nuisance: ['Roughness']) ---
-> Slicing surface at: {'Angle': 0.0}
1. Training all surrogate models (Cache Miss)...
-> Selected Model: Polynomial (Degree 2)
2. Fitting Variance Model & Inferring Distribution (Cache Miss)...
   -> Optimizing bandwidth via LOO-CV...
-> Calculating Total-Order Sobol Indices...
   -> Smoothing Bandwidth: 3.3288
   -> Selected Distribution: norm
3. Integrating PoD Curve from Scratch (Cache Miss)...
4. Skipping Bootstrap (n_boot=0)...
--- Analysis Complete ---
Run 1 completed in: 0.8890 seconds
--- Starting Reliability Analysis (PoIs: ['Length'] - Nuisance: ['Roughness']) ---
-> Slicing surface at: {'Angle': 5.0}
1. Loading surrogate models from cache (Cache Hit)...
-> Selected Model: Polynomial (Degree 2)
2. Loading Variance Model, Distribution & Sobol from cache (Cache Hit)...
   -> Smoothing Bandwidth: 3.3288
   -> Selected Distribution: norm
3. Integrating PoD Curve from Scratch (Cache Miss)...
4. Skipping Bootstrap (n_boot=0)...
--- Analysis Complete ---
Run 2 completed in: 0.2263 seconds (Near Instant!)
New a90/95 point at Angle 5.0: 5.498
# Visualise the results of the final active slice
study.visualise()
(a) Best Fitting Model (Statistics)
(b) Signal Response Model (Physics)
(c) Probability of Detection Curve (Reliability)
Figure 1: Reliability Analysis Results

Appendix: Full Script

import numpy as np
import pandas as pd
import time
from digiqual import SimulationStudy


# --- 1. Define the Physics ---
def apply_physics(df):
    """Simulates a signal with quadratic trends and heteroscedastic noise."""
    signal = (
        10.0
        + (2.0 * df["Length"])
        + (0.5 * df["Length"] ** 2)
        - (0.1 * np.abs(df["Angle"]))
    )
    noise_scale = 0.5 + (1.5 * df["Roughness"])
    noise = np.random.normal(loc=0, scale=noise_scale, size=len(df))
    return signal + noise


# --- 2. Create Flawed Data ---
df1 = pd.DataFrame(
    {
        "Length": np.random.uniform(0.1, 3.0, 40),
        "Angle": np.random.uniform(-10, 10, 40),
        "Roughness": np.random.uniform(0, 0.5, 40),
    }
)
df2 = pd.DataFrame(
    {
        "Length": np.random.uniform(7.0, 10.0, 40),
        "Angle": np.random.uniform(-10, 10, 40),
        "Roughness": np.random.uniform(0, 0.5, 40),
    }
)
df_initial = pd.concat([df1, df2], ignore_index=True)
df_initial["Signal"] = apply_physics(df_initial)

# --- 3. Initialize & Ingest ---
study = SimulationStudy()
study.add_data(df_initial, outcome_col="Signal", overwrite=True)

# --- 4. Diagnose & Refine ---
study.diagnose()
new_samples = study.refine(n_points=20)

if not new_samples.empty:
    new_samples["Signal"] = apply_physics(new_samples)
    study.add_data(new_samples)

# --- 5. Fast-Path Reliability Exploration (Caching) ---
# Calculate the base physics model (n_boot=0) to prime Layers 1 and 2
study.pod(poi_col="Length", threshold=15.0, n_boot=0)

# Generate a 100-threshold spectrum matrix (Layer 4 Cache) instantly
spectrum = study.compute_pod_spectrum(poi_col="Length")
print("Spectrum generated for thresholds:", spectrum["thresholds"][:3], "...")

# --- 6. Uncertainty Quantification (Bootstrapping) ---
# Run the heavy, parallelised bootstrap to get 95% Confidence Intervals
results = study.pod(poi_col="Length", threshold=15.0, n_boot=1000, n_jobs=-1)

print(f"Classical a90/95: {results['a90_95']:.2f}")

# --- 7. Visualisation ---
study.visualise()