Threading and Memory Scaling
Threading, memory scaling, and operational sizing guidance.
Use this section when the one-handle rule is already settled and the next question is scale: how much memory will N models use, when does NUMA matter, when is GPU sharing still reasonable, and which model sizes deserve extra caution.
Related Chapters: For concurrency rules governing individual handles, see Concurrency and Shutdown. For multi-model deployment patterns, see Multi-Model Patterns.
EDA integrations often run many TDSE models in parallel to maximize throughput. This chapter is about capacity planning and resource envelopes, not about same-handle ownership policy.
Capacity Planning For N Parallel Models
Per-Model Memory Footprint
Each TDSE model handle has a memory footprint that depends on:
- Port count (number of ports np)
- History depth (for IR models)
- Matrix size and sparsity
- Solver backend (dense vs sparse, CPU vs GPU)
- Internal workspace buffers
Think about the footprint in layers:
- Model metadata: ~1-10 KB (constant overhead)
- Matrix storage: scales with np^2 for dense, O(nnz) for sparse
- Solver workspace: backend-dependent (LU factors, sparse symbolic structure)
- IR convolution buffers: scales with history depth × port count
- Internal scratch buffers: temporary workspace during step operations
Linear Scaling Model
For N parallel model handles, memory scales approximately linearly:
Total memory ≈ N × (per-model footprint) + shared_runtime_overhead
The shared runtime overhead is typically small (< 10 MB) and does not scale with N.
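The linear model above is simple enough to turn into a planning helper. The following is a minimal sketch, not part of the TDSE API: the function names and the MIB macro are hypothetical, and the per-model footprint is assumed to come from your own measurement.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Linear model from the text: total = N * per-model + shared overhead. */
uint64_t total_memory_bytes(uint64_t n_models, uint64_t per_model_bytes,
                            uint64_t shared_overhead_bytes) {
    return n_models * per_model_bytes + shared_overhead_bytes;
}

/* Inverse question: how many handles fit in a memory budget?
   Returns 0 if the overhead alone exceeds the budget or the
   per-model size is zero. */
uint64_t max_models_for_budget(uint64_t budget_bytes, uint64_t per_model_bytes,
                               uint64_t shared_overhead_bytes) {
    if (per_model_bytes == 0 || budget_bytes < shared_overhead_bytes) {
        return 0;
    }
    return (budget_bytes - shared_overhead_bytes) / per_model_bytes;
}
```

For example, 32 handles at a measured 50 MiB each with 10 MiB of shared overhead come to 1610 MiB. Treat the result as a first estimate and confirm it by profiling.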
Practical Planning Ranges
Based on typical workloads:
- Small models (np < 10, sparse): ~1-10 MB per model
- Medium models (10 ≤ np ≤ 100, moderate density): ~10-100 MB per model
- Large models (np > 100, dense or IR-heavy): ~100 MB - 1 GB per model
These are rough estimates, not support limits. Actual memory usage depends on matrix sparsity, history depth, and backend selection.
Ways To Reduce The Footprint
To reduce memory footprint with many parallel models:
- Reuse one model sequentially when throughput is not the goal: for repeated sweeps of the same pack, one model plus tdse_model_reset(...) can be cheaper than many parallel copies. See Multi-Model Patterns.
- Limit history depth: IR convolution memory scales linearly with history depth
- Prefer sparse backends: For sparse matrices, sparse backends use significantly less memory than dense
- Batch sweeps: For AC sweeps with many frequency points, use sweep parallelism instead of separate models
What To Measure First
Use the following to monitor memory:
- tdse_model_info(...) returns model metadata including port count
- OS-level tools (Task Manager, top, ps) for process memory
- For GPU workloads, use CUDA tools (nvidia-smi) to monitor GPU memory
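Process memory can also be sampled from inside the host application. The sketch below uses the POSIX getrusage call; the wrapper name peak_rss is hypothetical and not part of TDSE. The value is the peak resident set size, so it only ever grows, and its unit is platform-dependent (kilobytes on Linux, bytes on macOS).

```c
#include <sys/resource.h>

/* Peak resident set size of the calling process, or -1 on failure.
   Units follow the platform: kilobytes on Linux, bytes on macOS. */
long peak_rss(void) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return -1;
    }
    return usage.ru_maxrss;
}
```

Sampling before and after creating a batch of model handles gives a rough upper bound on how much the batch added to the process footprint.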
NUMA-Aware Allocation Strategy
Current NUMA Support
The runtime does not currently implement explicit NUMA-aware allocation. Memory allocation follows the OS default policy:
- On Linux: by default, pages are placed on the NUMA node of the thread that first touches them (first-touch), which is usually the allocating thread
- On Windows: memory allocation follows Windows NUMA policies
Best Practices For NUMA Systems
For optimal performance on multi-socket NUMA systems:
- Thread locality: Keep the thread that creates a model handle on the same NUMA node as the thread that steps it
- Avoid remote ownership paths: create and step a model on the same node whenever possible
- Bind threads to NUMA nodes: Use OS tools (numactl, SetProcessAffinityMask) to pin threads to specific NUMA nodes
- Memory-first strategy: Allocate models on NUMA nodes with sufficient memory
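The memory-first strategy can be sketched as a greedy placement step in host code. This is an illustration under assumptions, not TDSE behavior: pick_numa_node is a hypothetical helper, and the per-node free-memory figures are supplied by the caller (on Linux they could come from libnuma's numa_node_size64).

```c
#include <stddef.h>
#include <stdint.h>

/* Picks the node with the most free memory that can still hold the model.
   On success the model size is deducted from that node's free count and the
   node index is returned; returns -1 if no node has enough room. */
int pick_numa_node(uint64_t *free_bytes, size_t n_nodes, uint64_t model_bytes) {
    int best = -1;
    for (size_t i = 0; i < n_nodes; ++i) {
        if (free_bytes[i] >= model_bytes &&
            (best < 0 || free_bytes[i] > free_bytes[(size_t)best])) {
            best = (int)i;
        }
    }
    if (best >= 0) {
        free_bytes[best] -= model_bytes;
    }
    return best;
}
```

The thread that creates and steps the model would then be bound to the returned node, so that allocation and stepping stay on the same node.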
Recommended NUMA Pattern
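# On Linux, from the shell: run on NUMA node 0 with memory from node 0
numactl --cpunodebind=0 --membind=0 ./your_simulation

/* Or programmatically with libnuma (#include <numa.h>, link with -lnuma) */
if (numa_available() != -1) {
    numa_set_preferred(0); /* Prefer node 0 for subsequent allocations */
}
/* Create model handles here */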
When NUMA Usually Matters
- TDSE does not provide explicit NUMA control APIs
- NUMA effects are most visible with very large models (> 1 GB) or high model counts (> 100)
- For typical workloads (< 50 models, < 100 MB each), OS default policies are often sufficient
GPU Multi-Stream Parallelism
Current GPU Parallelism Model
The current GPU parallelism model comes down to three points:
- Single-stream per model: Each model handle uses one CUDA stream
- AC sweep parallelism: Multiple frequency points can be solved in parallel using CUDA batched operations
- No explicit multi-stream API: TDSE does not expose CUDA stream management to the host
AC Sweep Parallelism on GPU
When CUDA sweep parallelism is enabled:
- The adapter uses multiple worker threads, each with its own CUDA stream
- Frequency points are distributed across workers and streams
- Batched solves (2-4 points per chunk) improve GPU utilization
Diagnostics fields to monitor:
- sweep_parallel_enabled: Whether parallelism activated
- sweep_parallel_workers: Number of worker threads/streams
- sweep_parallel_points: Number of frequency points processed in parallel
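To make the distribution of frequency points concrete, the sketch below splits a sweep into fixed-size chunks and assigns them round-robin to workers. It is an assumption-laden illustration: sweep_chunk and plan_sweep_chunks are hypothetical names, and the adapter's actual scheduling policy is not documented here and may differ.

```c
#include <stddef.h>

typedef struct {
    size_t first;   /* index of the first frequency point in the chunk */
    size_t count;   /* number of points in the chunk (1..chunk_size) */
    size_t worker;  /* worker/stream the chunk is assigned to */
} sweep_chunk;

/* Splits n_points into chunks of at most chunk_size points and assigns the
   chunks round-robin to n_workers. Writes at most max_chunks entries to out
   and returns the total number of chunks the sweep needs. */
size_t plan_sweep_chunks(size_t n_points, size_t chunk_size, size_t n_workers,
                         sweep_chunk *out, size_t max_chunks) {
    if (chunk_size == 0 || n_workers == 0) {
        return 0;
    }
    size_t n_chunks = 0;
    for (size_t first = 0; first < n_points; first += chunk_size) {
        size_t remaining = n_points - first;
        if (n_chunks < max_chunks) {
            out[n_chunks].first = first;
            out[n_chunks].count = remaining < chunk_size ? remaining : chunk_size;
            out[n_chunks].worker = n_chunks % n_workers;
        }
        ++n_chunks;
    }
    return n_chunks;
}
```

With 10 points, a chunk size of 4, and 2 workers, this yields three chunks of 4, 4, and 2 points, with the first and last chunk on worker 0.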
Multi-Model GPU Parallelism
For multiple model handles on GPU:
- Each model handle uses its own CUDA context/stream
- Multiple models can run concurrently on the same GPU
- GPU memory is shared across all models on the same device
GPU Memory Management
GPU memory considerations:
- Per-model GPU memory: Matrix + solver workspace + convolution buffers
- Shared GPU memory: CUDA context overhead, driver overhead
- Memory limits: Use nvidia-smi to monitor GPU memory usage
- Out-of-memory handling: TDSE returns TDSE_ERR_OUT_OF_MEMORY if GPU allocation fails
Recommendations
- Monitor GPU memory: Use nvidia-smi to ensure sufficient GPU memory for your model count
- Batch sweeps: Use AC sweep parallelism instead of separate models when possible
- Prefer CPU for very small models: Small models may not do enough work per step to amortize GPU launch and transfer overhead
- Limit concurrent GPU models: If GPU memory is constrained, limit the number of concurrent GPU models
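One way to limit concurrent GPU models is a host-side admission check against a byte budget. The sketch below uses C11 atomics so several threads can reserve and release concurrently. All names (gpu_budget and its functions) are hypothetical, and the per-model byte estimate is assumed to come from your own measurements, not from a TDSE query.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    size_t capacity_bytes;         /* budget, e.g. device memory minus headroom */
    atomic_size_t reserved_bytes;  /* bytes currently reserved by live models */
} gpu_budget;

void gpu_budget_init(gpu_budget *b, size_t capacity_bytes) {
    b->capacity_bytes = capacity_bytes;
    atomic_init(&b->reserved_bytes, 0);
}

/* Reserves `bytes` and returns 1 if they fit in the remaining budget,
   otherwise leaves the budget untouched and returns 0. */
int gpu_budget_try_reserve(gpu_budget *b, size_t bytes) {
    size_t cur = atomic_load(&b->reserved_bytes);
    for (;;) {
        if (bytes > b->capacity_bytes - cur) {
            return 0;
        }
        /* On failure the CAS reloads `cur`, and the fit check runs again. */
        if (atomic_compare_exchange_weak(&b->reserved_bytes, &cur, cur + bytes)) {
            return 1;
        }
    }
}

/* Returns a previous reservation; call once per successful reserve. */
void gpu_budget_release(gpu_budget *b, size_t bytes) {
    atomic_fetch_sub(&b->reserved_bytes, bytes);
}
```

A caller would reserve before creating a GPU model and release after destroying it. The check is a planning guard only, so a TDSE_ERR_OUT_OF_MEMORY result still has to be handled.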
Practical Limits On Port Count And History Depth
Port Count Limits
There is no hard-coded maximum port count. Practical limits are determined by:
- Memory: Dense matrices scale as O(np^2), sparse as O(nnz)
- Performance: Factorization cost grows super-linearly with matrix size
- Numerical stability: Very large matrices may have conditioning issues
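The gap between O(np^2) and O(nnz) storage is easy to quantify. The sketch below compares dense storage with the standard compressed sparse row (CSR) layout. It assumes CSR only for illustration, since TDSE's internal sparse format is not specified here, and it counts matrix storage only, not factorization fill-in or workspace.

```c
#include <stdint.h>

/* Dense storage: np * np elements. */
uint64_t dense_matrix_bytes(uint64_t np, uint64_t elem_bytes) {
    return np * np * elem_bytes;
}

/* CSR storage: nnz values, nnz column indices, and np + 1 row pointers. */
uint64_t csr_matrix_bytes(uint64_t np, uint64_t nnz,
                          uint64_t elem_bytes, uint64_t index_bytes) {
    return nnz * (elem_bytes + index_bytes) + (np + 1) * index_bytes;
}
```

For np = 1000 with 16-byte complex entries, dense storage is 16,000,000 bytes, while a matrix with 10,000 nonzeros (1% density) and 4-byte indices needs 204,004 bytes in CSR.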
Empirical guidance:
- < 10 ports: Fast, suitable for dense backends
- 10-100 ports: Medium scale, sparse backends often beneficial
- 100-1000 ports: Large scale, sparse backends recommended
- > 1000 ports: Very large, sparse backends essential; performance depends heavily on sparsity
History Depth Limits
For IR-based models, history depth affects:
- Memory: Linear scaling with history depth × port count
- Performance: Convolution cost scales with history depth
- Accuracy: Deeper history improves accuracy for long-time responses
Empirical guidance:
- < 100 samples: Fast, suitable for narrowband responses
- 100-1000 samples: Medium depth, typical for many applications
- 1000-10000 samples: Deep history, for wideband or long-time responses
- > 10000 samples: Very deep, may require significant memory and time
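Why memory and cost both grow with history depth can be seen in a single-channel, direct-form convolution over a ring buffer. This is a sketch under assumptions: ir_channel, ir_step, and ir_history_bytes are hypothetical names, and TDSE's actual IR kernel is not described here and may use a different method.

```c
#include <stddef.h>

typedef struct {
    const double *h;  /* impulse response taps, length depth */
    double *x;        /* ring buffer of past inputs, length depth (zeroed) */
    size_t depth;     /* history depth in samples */
    size_t head;      /* index of the most recent input */
} ir_channel;

/* Pushes one input sample and returns y[n] = sum_k h[k] * x[n - k].
   The loop runs `depth` times, so per-step cost is linear in depth. */
double ir_step(ir_channel *c, double input) {
    c->head = (c->head + 1) % c->depth;
    c->x[c->head] = input;
    double y = 0.0;
    size_t idx = c->head;
    for (size_t k = 0; k < c->depth; ++k) {
        y += c->h[k] * c->x[idx];
        idx = (idx == 0) ? c->depth - 1 : idx - 1;
    }
    return y;
}

/* History storage for `ports` channels of double-precision samples. */
size_t ir_history_bytes(size_t depth, size_t ports) {
    return depth * ports * sizeof(double);
}
```

Feeding a unit impulse followed by zeros returns the taps in order, which is a quick sanity check for the buffer indexing.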
Solver Backend Interaction
Port count and history depth interact with solver backend selection:
- Dense backends: Port count dominates memory and performance
- Sparse backends: Sparsity pattern matters more than raw port count
- IR convolution: History depth dominates memory and performance
- GPU backends: Benefit from larger problems to amortize transfer overhead
When To Expect Issues
Watch for these warning signs:
- Memory spikes: Sudden large memory increases with small parameter changes
- Performance degradation: Non-linear performance drop with increasing size
- Numerical warnings: near_singular flags, large condition numbers
- GPU out-of-memory: CUDA allocation failures
Diagnostics to monitor:
- solver_backend: Which backend is selected
- matrix_nnz, matrix_density: Matrix characteristics
- solver_rcond_estimate, solver_rgrowth: Numerical health
- near_singular: Near-singularity warning
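A host application can fold these diagnostics into a single health check. In the sketch below, the diag_snapshot struct is hypothetical: its members only mirror the diagnostics field names above, and how the values are actually retrieved from TDSE is not shown. The rcond floor is a caller-chosen threshold, not a TDSE default.

```c
/* Hypothetical snapshot; member names mirror the diagnostics fields. */
typedef struct {
    double solver_rcond_estimate;  /* reciprocal condition number estimate */
    int near_singular;             /* nonzero if the solver raised the flag */
} diag_snapshot;

enum {
    HEALTH_OK = 0,
    HEALTH_NEAR_SINGULAR = 1,
    HEALTH_ILL_CONDITIONED = 2
};

/* Returns a bitmask of warnings. A small rcond means a large condition
   number, so values below rcond_floor are flagged as ill-conditioned. */
int numerical_health_flags(const diag_snapshot *d, double rcond_floor) {
    int flags = HEALTH_OK;
    if (d->near_singular) {
        flags |= HEALTH_NEAR_SINGULAR;
    }
    if (d->solver_rcond_estimate < rcond_floor) {
        flags |= HEALTH_ILL_CONDITIONED;
    }
    return flags;
}
```

Logging the returned flags alongside port count and history depth makes it easier to tie a warning to the parameter change that triggered it.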
Summary Checklist
Use this checklist when you are sizing a deployment, not when you are debugging a same-handle race:
- Estimate per-model memory: Use port count, history depth, and backend selection
- Plan for N × scaling: Total memory ≈ N × per-model footprint
- Consider NUMA: On multi-socket systems, bind threads to NUMA nodes
- Monitor GPU memory: Use nvidia-smi for GPU workloads
- Start with conservative limits: Begin with moderate port counts and history depths
- Use diagnostics: Monitor solver_backend, matrix metrics, and numerical health
- Profile scaling: Test with your actual workload to verify scaling behavior
