Architecture¶
This document describes the high-level architecture and design principles of gwmock.
Overview¶
gwmock is designed as an orchestration layer that leverages existing third-party packages for the physics layer. The package provides:
- Configuration Management: YAML-based configuration with inheritance and template expansion
- Reproducible Workflows: Full state tracking with checksums and metadata
- Protocol-backed Extensibility: Third-party backends plug in through public protocols and backend resolution
- Adapter-backed Layout: In-tree signal/, noise/, and population/ packages expose adapters only
Core Design Principles¶
1. Avoid Reinventing the Wheel¶
gwmock wraps existing, battle-tested libraries rather than reimplementing signal processing algorithms. This approach:
- Ensures correctness by relying on established implementations
- Reduces maintenance burden
- Allows users to leverage decades of gravitational-wave research
Key dependencies:
- gwmock-signal: public signal protocol and adapter surface
- gwmock-noise: public noise protocol and adapter surface
- gwmock-pop: public population protocol and adapter surface
- typer and pydantic: CLI and configuration plumbing
2. Stable CLI Interface¶
The command-line interface remains unchanged regardless of backend changes. New
behavior is added by updating adapters, orchestration, and the public protocols,
not by adding physics implementations to gwmock.
3. Orchestration Helpers (simulator/, utils/)¶
The package keeps shared orchestration helpers for deterministic seeds, state tracking, and output layout. These helpers support the adapters, but they are not physics implementations.
Benefits:
- Deterministic orchestration
- Clean separation between adapters and backend physics
- Centralized checkpoint and seed handling
- Simple to extend without changing the CLI surface
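Deterministic seed handling can be illustrated with a small sketch. This is not the code in simulator/seeds.py; the function name `derive_seed` and the label scheme are assumptions made for illustration. It hashes a base seed together with a component label and a batch label, so each component and batch gets its own reproducible seed.

```python
import hashlib


def derive_seed(base_seed: int, *labels: str) -> int:
    """Derive a reproducible 32-bit child seed from a base seed and labels.

    Hypothetical helper: the real seed derivation in gwmock may differ.
    """
    # Join with a separator so that ("noise", "1") and ("noise1",)
    # produce different hash input.
    material = "/".join([str(base_seed), *labels]).encode("utf-8")
    digest = hashlib.sha256(material).digest()
    # Interpret the first four bytes as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big")


noise_seed = derive_seed(1234, "noise", "batch-0")
signal_seed = derive_seed(1234, "signal", "batch-0")
```

Because the result depends only on the inputs, rerunning a batch after a restart yields the same seed without storing any extra state.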
Project Structure¶
gwmock/
├── __init__.py
├── cli/
│   ├── __init__.py
│   ├── main.py                  # Typer CLI entry point
│   ├── simulate.py              # Simulation command
│   ├── batch.py                 # Batch helpers
│   ├── merge.py                 # Merge helpers
│   ├── config.py                # Configuration utilities
│   ├── validate.py              # Validation helpers
│   ├── adapter_orchestration.py
│   └── simulate_utils.py
├── simulator/
│   ├── __init__.py
│   ├── base.py                  # Base Simulator class
│   ├── state.py                 # StateAttribute descriptor and checkpoint state helpers
│   └── seeds.py                 # Deterministic seed derivation
├── signal/
│   ├── __init__.py
│   └── adapter.py               # Signal adapter
├── noise/
│   ├── __init__.py
│   └── adapter.py               # Noise adapter
├── population/
│   ├── __init__.py
│   └── adapter.py               # Population adapter
├── data/
│   ├── __init__.py
│   └── ...                      # Data utilities
├── monitor/
│   ├── __init__.py
│   └── resource.py              # Resource monitoring helpers
├── repository/
│   ├── __init__.py
│   └── zenodo.py                # Repository metadata helpers
├── utils/
│   ├── __init__.py
│   ├── io.py                    # File I/O utilities
│   ├── log.py                   # Logging setup
│   ├── random.py                # Random number management
│   ├── download.py              # Download helpers
│   └── validation.py            # Configuration validation
└── version.py                   # Version information
Key Components¶
1. CLI Layer (cli/)¶
Purpose: User-facing command-line interface
Key files:
- main.py: Typer application with commands
- simulate.py: Main simulation command
- utils/: Configuration loading, checkpointing, templating
Features:
- Commands: gwmock simulate config.yaml
- Flags: --overwrite, --dry-run, --metadata
- Argument validation and help text
2. Simulator Framework (simulator/)¶
Purpose: Core simulator interface and registration
Key classes:
- Simulator: Abstract base with state management
- StateAttribute: Descriptor for state tracking
- PopulationIterationState: Legacy population checkpoint state for orchestration resume
4. Adapter Layer (signal/, noise/, population/)¶
Purpose: Translate orchestration configs into the public subpackage protocols.
These packages do not contain physics implementations. They resolve public backends, validate conformance, and hand off to the relevant subpackage or third-party class.
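The validate-then-delegate pattern described here might look roughly like the sketch below. The protocol `NoiseBackend`, its `generate` method, and the `NoiseAdapter` class are all hypothetical; the actual protocols are defined by the public subpackages and may have a different shape.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NoiseBackend(Protocol):
    """Hypothetical stand-in for a public noise protocol."""

    def generate(self, duration: float, sample_rate: float) -> list[float]: ...


class ZeroNoise:
    """Toy backend that returns silence; not a physics implementation."""

    def generate(self, duration: float, sample_rate: float) -> list[float]:
        return [0.0] * int(duration * sample_rate)


class NoiseAdapter:
    """Validates a backend against the protocol, then delegates to it."""

    def __init__(self, backend: object) -> None:
        # isinstance() on a runtime_checkable Protocol only checks that the
        # required methods exist, not their signatures.
        if not isinstance(backend, NoiseBackend):
            raise TypeError(
                f"{type(backend).__name__} does not implement NoiseBackend"
            )
        self._backend = backend

    def run(self, duration: float, sample_rate: float) -> list[float]:
        # The adapter adds no physics of its own; it only hands off.
        return self._backend.generate(duration, sample_rate)


samples = NoiseAdapter(ZeroNoise()).run(duration=1.0, sample_rate=4.0)
```

Structural typing means a third-party class does not need to inherit from anything in gwmock to qualify; it only needs the required methods.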
5. Backend Integration¶
Purpose: Third-party backend support through the public contracts.
Backends may be shipped by gwmock, discovered through entry points, or
referenced directly as module:Class.
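As an illustration of the module:Class form, a reference string can be resolved with the standard library alone. The function name `resolve_backend` is an assumption, and this sketch does not cover entry-point discovery or protocol validation, which gwmock may perform as well.

```python
import importlib


def resolve_backend(reference: str) -> type:
    """Resolve a 'module:Class' string to the class it names.

    Hypothetical helper for illustration only.
    """
    module_name, sep, class_name = reference.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:Class', got {reference!r}")
    module = importlib.import_module(module_name)
    try:
        backend = getattr(module, class_name)
    except AttributeError as exc:
        raise ImportError(
            f"{module_name!r} has no attribute {class_name!r}"
        ) from exc
    if not isinstance(backend, type):
        raise TypeError(f"{reference!r} does not name a class")
    return backend


# A standard-library class stands in for a real backend here.
cls = resolve_backend("collections:OrderedDict")
```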
6. Configuration System (cli/utils/config.py)¶
Features:
- YAML parsing and validation
- Jinja2 template expansion
- Configuration inheritance
- Runtime variable substitution
Example flow:
config.yaml (user input)
↓
YAML parsing
↓
Inheritance resolution (if inherits field present)
↓
Template expansion (Jinja2)
↓
Backend resolution
↓
Validated SimulationPlan
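The inheritance step in this flow can be sketched as a recursive merge of two already-parsed mappings. The merge rules shown (nested mappings merge, everything else is replaced by the child) and the example keys are assumptions; the semantics gwmock applies to the inherits field may differ.

```python
from copy import deepcopy
from typing import Any


def merge_configs(parent: dict[str, Any], child: dict[str, Any]) -> dict[str, Any]:
    """Return parent overlaid with child; nested mappings merge recursively."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            # Scalars and lists in the child replace the parent's value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical keys, chosen only to show the merge behavior.
base = {"output": {"directory": "out", "overwrite": False}, "seed": 1}
override = {"output": {"overwrite": True}}
resolved = merge_configs(base, override)
```

Copying rather than mutating the parent keeps a shared base configuration safe to reuse across several child configurations.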
7. Checkpointing (cli/utils/checkpoint.py)¶
Purpose: Resume interrupted simulations
Checkpoint structure:
{
  "last_completed_batch": 5,
  "last_completed_file": "file.gwf",
  "random_state": {...},
  "processed_samples": 5,
  "timestamp": "2025-01-01T12:00:00Z"
}
Resume logic:
- Load checkpoint file
- Restore random state
- Skip completed batches
- Continue from last incomplete batch
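The resume steps above can be sketched with the standard library. This is a simplified illustration, not the gwmock checkpoint code: it uses Python's `random.Random` in place of whatever generator gwmock manages, stores only two of the fields shown in the structure above, encodes the random state as a nested list, and assumes batches are numbered sequentially. The function names are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, batch: int, rng: random.Random) -> None:
    """Record the last completed batch and the RNG state after it."""
    version, internal, gauss_next = rng.getstate()
    payload = {
        "last_completed_batch": batch,
        # JSON has no tuples, so the state is stored as nested lists.
        "random_state": [version, list(internal), gauss_next],
    }
    path.write_text(json.dumps(payload))


def load_checkpoint(path: Path, rng: random.Random) -> int:
    """Restore the RNG and return the index of the first batch still to run."""
    if not path.exists():
        return 0
    payload = json.loads(path.read_text())
    version, internal, gauss_next = payload["random_state"]
    # setstate() requires the internal state as a tuple again.
    rng.setstate((version, tuple(internal), gauss_next))
    return payload["last_completed_batch"] + 1


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "simulation.checkpoint.json"
    rng = random.Random(42)
    rng.random()  # stands in for the work done in batch 0
    save_checkpoint(checkpoint, batch=0, rng=rng)
    expected = rng.random()  # what an uninterrupted run would draw next

    resumed = random.Random()
    next_batch = load_checkpoint(checkpoint, resumed)
    resumed_draw = resumed.random()
```

Restoring the generator state before skipping ahead is what makes a resumed run draw the same numbers as an uninterrupted one.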
8. State Management (simulator/state.py)¶
Purpose: Track simulator state across batches
StateAttribute descriptor:
class StateAttribute:
    """Descriptor for state tracking without class-level pollution."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._state.get(self.name)

    def __set__(self, obj, value):
        obj._state[self.name] = value
Key feature: values are stored in each instance's _state dictionary rather than on the class, so one simulator's state cannot leak into another, for example between tests.
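A short usage sketch shows the isolation in practice. The descriptor is repeated so the example runs on its own; `CountingSimulator` is a hypothetical class, and it assumes the owning object creates its own `_state` dictionary in `__init__`.

```python
class StateAttribute:
    """Descriptor that stores values in the owning instance's _state dict."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._state.get(self.name)

    def __set__(self, obj, value):
        obj._state[self.name] = value


class CountingSimulator:
    """Hypothetical simulator with one tracked attribute."""

    counter = StateAttribute()

    def __init__(self):
        # Each instance owns its state dict; the descriptor itself holds
        # only the attribute name.
        self._state = {}
        self.counter = 0


a = CountingSimulator()
b = CountingSimulator()
a.counter += 5
```

After this runs, `a.counter` is 5 while `b.counter` is still 0, and `a._state` holds everything a checkpoint would need to serialize for that instance.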
Data Flow¶
Simulation Workflow¶
User Input (config.yaml)
↓
CLI parsing (Typer)
↓
Configuration Loading
- Parse YAML
- Resolve inheritance
- Expand templates
↓
Validation
- Check file paths
- Validate classes
- Verify parameters
↓
SimulationPlan creation
↓
Checkpoint check
- Load if exists
- Skip completed batches
↓
Simulator instantiation
- Resolve class from registry
- Inject configuration
↓
Batch iteration
├── Generate data
├── Create time series
├── Write GWF file
├── Generate metadata
└── Update checkpoint
↓
Output
- Data files (*.gwf)
- Metadata files (*.metadata.json)
- Checkpoint file (.gwmock_checkpoint/simulation.checkpoint.json)
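The metadata step of the batch loop could be sketched as follows. The field names, the choice of SHA-256, and the rule for naming the metadata file after the data file are assumptions for illustration; the metadata gwmock actually writes may contain different fields.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_metadata(data_file: Path, extra: dict) -> Path:
    """Write '<stem>.metadata.json' next to data_file, including its SHA-256."""
    sha256 = hashlib.sha256()
    with data_file.open("rb") as handle:
        # Hash in 1 MiB chunks so large files are never fully in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            sha256.update(chunk)
    metadata = {
        "file": data_file.name,
        "size_bytes": data_file.stat().st_size,
        "sha256": sha256.hexdigest(),
        **extra,
    }
    target = data_file.with_name(data_file.stem + ".metadata.json")
    target.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return target


with tempfile.TemporaryDirectory() as tmp:
    frame = Path(tmp) / "batch-0.gwf"
    payload = b"placeholder bytes, not a real frame file"
    frame.write_bytes(payload)
    written = write_metadata(frame, {"batch": 0})
    written_name = written.name
    record = json.loads(written.read_text())
```

Storing the checksum alongside each file lets a later run confirm that an output has not changed since it was generated.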
Data Generation¶
Adapter.resolve()
↓
Public protocol backend
↓
Generated strain or population data
↓
Adapter output formatting
↓
gwf file + metadata
Extension Points¶
Adding a Third-Party Backend¶
- Implement the upstream protocol in your package.
- Expose the class through an entry point or importable module:Class reference.
- Reference it in config:

    orchestration:
      noise:
        backend: my_package.noise:MyCustomNoise
        arguments:
          param1: value1
Adding an Orchestration Helper¶
- Create the helper in simulator/ or utils/:

    class MyHelper:
        """Provides orchestration-only functionality."""

        def my_method(self):
            pass

- Use it from an adapter or CLI helper, not from a physics package.
Thread Safety & Concurrency¶
Current implementation:
- Single-threaded batch processing
- Checkpointing ensures fault tolerance
- Random state management prevents seed collisions
Future considerations:
- Thread-pool execution for batch parallelization
- Process-pool for computationally intensive simulations
- Distributed simulation across multiple machines
Testing Strategy¶
Unit Tests¶
- Mock third-party libraries
- Test configuration parsing
- Test state management
- Test CLI argument handling
Integration Tests¶
- End-to-end simulation workflows
- Checkpoint/resume functionality
- File I/O operations
Performance Tests¶
- Benchmark common operations
- Memory profiling for large datasets
- Stress testing with extended simulations
Design Decisions¶
Why Mixins?¶
- Flexibility: Combine features as needed
- Reusability: Same mixin in multiple simulators
- Maintainability: Changes in one mixin don't affect others
- Testability: Easy to mock individual mixins
Why StateAttribute?¶
- Instance isolation: Prevents test interference
- Clean interface: Transparent to users
- Automatic tracking: Integrated with checkpointing
Why Registry?¶
- Dynamic loading: Simulators added without code changes
- Configuration-driven: Full control via YAML
- Third-party integration: Easy to wrap external libraries
- Discovery: Automatic detection of available simulators
Performance Considerations¶
- Lazy loading: Simulators instantiated only when needed
- Streaming: Process data in chunks to reduce memory
- Caching: Cache compiled templates and registry lookups
- Checkpointing: Resume from intermediate states
- Parallelization: Process multiple batches concurrently