Time-Domain System Equivalent logoTime-Domain System EquivalentLinear dynamics, solved faster.Discuss Integration

Threading and Memory Scaling

Threading, memory scaling, and operational sizing guidance.

Use this section when the one-handle rule is already settled and the next question is scale: how much memory will N models use, when does NUMA matter, when is GPU sharing still reasonable, and which model sizes deserve extra caution.

Related Chapters For concurrency rules governing individual handles, see Concurrency and Shutdown. For multi-model deployment patterns, see Multi-Model Patterns.

EDA integrations often run many TDSE models in parallel to maximize throughput. This chapter is about capacity planning and resource envelopes, not about same-handle ownership policy.

Capacity Planning For N Parallel Models

Per-Model Memory Footprint

Each TDSE model handle has a memory footprint that depends on:

  • Port count (number of ports np)
  • History depth (for IR models)
  • Matrix size and sparsity
  • Solver backend (dense vs sparse, CPU vs GPU)
  • Internal workspace buffers

Think about the footprint in layers:

  • Model metadata: ~1-10 KB (constant overhead)
  • Matrix storage: scales with np^2 for dense, O(nnz) for sparse
  • Solver workspace: backend-dependent (LU factors, sparse symbolic structure)
  • IR convolution buffers: scales with history depth × port count
  • Internal scratch buffers: temporary workspace during step operations

Linear Scaling Model

For N parallel model handles, memory scales approximately linearly:

Total memory approximately N x (per-model footprint) + shared_runtime_overhead

The shared runtime overhead is typically small (< 10 MB) and does not scale with N.

Practical Planning Ranges

Based on typical workloads:

  • Small models (np < 10, sparse): ~1-10 MB per model
  • Medium models (np < 100, moderate density): ~10-100 MB per model
  • Large models (np > 100, dense or IR-heavy): ~100 MB - 1 GB per model

These are rough estimates, not support limits. Actual memory usage depends on matrix sparsity, history depth, and backend selection.

Ways To Reduce The Footprint

To reduce memory footprint with many parallel models:

  1. Reuse one model sequentially when throughput is not the goal: for repeated sweeps of the same pack, one model plus tdse_model_reset(...) can be cheaper than many parallel copies. See Multi-Model Patterns.
  2. Limit history depth: IR convolution memory scales linearly with history depth
  3. Prefer sparse backends: For sparse matrices, sparse backends use significantly less memory than dense
  4. Batch sweeps: For AC sweeps with many frequency points, use sweep parallelism instead of separate models

What To Measure First

Use the following to monitor memory:

  • tdse_model_info(...) returns model metadata including port count
  • OS-level tools (Task Manager, top, ps) for process memory
  • For GPU workloads, use CUDA tools (nvidia-smi) to monitor GPU memory

NUMA-Aware Allocation Strategy

Current NUMA Support

Runtime does not currently implement explicit NUMA-aware allocation. Memory allocation follows the OS default policy:

  • On Linux: memory is allocated from the NUMA node where the allocating thread is running
  • On Windows: memory allocation follows Windows NUMA policies

Best Practices For NUMA Systems

For optimal performance on multi-socket NUMA systems:

  1. Thread locality: Keep the thread that creates a model handle on the same NUMA node as the thread that steps it
  2. Avoid remote ownership paths: create and step a model on the same node whenever possible
  3. Bind threads to NUMA nodes: Use OS tools (numactl, SetProcessAffinityMask) to pin threads to specific NUMA nodes
  4. Memory-first strategy: Allocate models on NUMA nodes with sufficient memory
// On Linux with numactl
// Run on NUMA node 0 with memory from node 0
numactl --cpunodebind=0 --membind=0 ./your_simulation

// Or programmatically:
numa_set_preferred(0);  // Prefer node 0
// Create model handles here

When NUMA Usually Matters

  • TDSE does not provide explicit NUMA control APIs
  • NUMA effects are most visible with very large models (> 1 GB) or high model counts (> 100)
  • For typical workloads (< 50 models, < 100 MB each), OS default policies are often sufficient

GPU Multi-Stream Parallelism

Current GPU Parallelism Model

Think of GPU sharing in two layers:

  • Single-stream per model: Each model handle uses one CUDA stream
  • AC sweep parallelism: Multiple frequency points can be solved in parallel using CUDA batched operations
  • No explicit multi-stream API: TDSE does not expose CUDA stream management to the host

AC Sweep Parallelism on GPU

When CUDA sweep parallelism is enabled:

  • The adapter uses multiple worker threads, each with its own CUDA stream
  • Frequency points are distributed across workers and streams
  • Batched solves (2-4 points per chunk) improve GPU utilization

Diagnostics fields to monitor:

  • sweep_parallel_enabled: Whether parallelism activated
  • sweep_parallel_workers: Number of worker threads/streams
  • sweep_parallel_points: Number of frequency points processed in parallel

Multi-Model GPU Parallelism

For multiple model handles on GPU:

  • Each model handle uses its own CUDA context/stream
  • Multiple models can run concurrently on the same GPU
  • GPU memory is shared across all models on the same device

GPU Memory Management

GPU memory considerations:

  • Per-model GPU memory: Matrix + solver workspace + convolution buffers
  • Shared GPU memory: CUDA context overhead, driver overhead
  • Memory limits: Use nvidia-smi to monitor GPU memory usage
  • Out-of-memory handling: TDSE returns TDSE_ERR_OUT_OF_MEMORY if GPU allocation fails

Recommendations

  1. Monitor GPU memory: Use nvidia-smi to ensure sufficient GPU memory for your model count
  2. Batch sweeps: Use AC sweep parallelism instead of separate models when possible
  3. Prefer CPU for very small models: Small models may not benefit from GPU overhead
  4. Limit concurrent GPU models: If GPU memory is constrained, limit the number of concurrent GPU models

Practical Limits On Port Count And History Depth

Port Count Limits

There is no hard-coded maximum port count. Practical limits are determined by:

  • Memory: Dense matrices scale as O(np^2), sparse as O(nnz)
  • Performance: Factorization cost grows super-linearly with matrix size
  • Numerical stability: Very large matrices may have conditioning issues

Empirical guidance:

  • < 10 ports: Fast, suitable for dense backends
  • 10-100 ports: Medium scale, sparse backends often beneficial
  • 100-1000 ports: Large scale, sparse backends recommended
  • > 1000 ports: Very large, sparse backends essential; performance depends heavily on sparsity

History Depth Limits

For IR-based models, history depth affects:

  • Memory: Linear scaling with history depth × port count
  • Performance: Convolution cost scales with history depth
  • Accuracy: Deeper history improves accuracy for long-time responses

Empirical guidance:

  • < 100 samples: Fast, suitable for narrowband responses
  • 100-1000 samples: Medium depth, typical for many applications
  • 1000-10000 samples: Deep history, for wideband or long-time responses
  • > 10000 samples: Very deep, may require significant memory and time

Solver Backend Interaction

Port count and history depth interact with solver backend selection:

  • Dense backends: Port count dominates memory and performance
  • Sparse backends: Sparsity pattern matters more than raw port count
  • IR convolution: History depth dominates memory and performance
  • GPU backends: Benefit from larger problems to amortize transfer overhead

When To Expect Issues

Watch for these warning signs:

  • Memory spikes: Sudden large memory increases with small parameter changes
  • Performance degradation: Non-linear performance drop with increasing size
  • Numerical warnings: near_singular flags, large condition numbers
  • GPU out-of-memory: CUDA allocation failures

Diagnostics to monitor:

  • solver_backend: Which backend is selected
  • matrix_nnz, matrix_density: Matrix characteristics
  • solver_rcond_estimate, solver_rgrowth: Numerical health
  • near_singular: Near-singularity warning

Summary Checklist

Use this checklist when you are sizing a deployment, not when you are debugging a same-handle race:

  1. Estimate per-model memory: Use port count, history depth, and backend selection
  2. Plan for N x scaling: Total memory approximately N x per-model footprint
  3. Consider NUMA: On multi-socket systems, bind threads to NUMA nodes
  4. Monitor GPU memory: Use nvidia-smi for GPU workloads
  5. Start with conservative limits: Begin with moderate port counts and history depths
  6. Use diagnostics: Monitor solver_backend, matrix metrics, and numerical health
  7. Profile scaling: Test with your actual workload to verify scaling behavior