Use this section when the one-handle rule is already settled and the next question is scale: how much memory will N models use, when does NUMA matter, when is GPU sharing still reasonable, and which model sizes deserve extra caution.

Related Chapters: for concurrency rules governing individual handles, see Concurrency and Shutdown. For multi-model deployment patterns, see Multi-Model Patterns.

EDA integrations often run many TDSE models in parallel to maximize throughput. This chapter is about capacity planning and resource envelopes, not about same-handle ownership policy.

Capacity Planning For N Parallel Models

Per-Model Memory Footprint

Each TDSE model handle has a memory footprint that depends on:

Port count (number of ports np)
History depth (for IR models)
Matrix size and sparsity
Solver backend (dense vs sparse, CPU vs GPU)
Internal workspace buffers

Think about the footprint in layers:

Model metadata: ~1-10 KB (constant overhead)
Matrix storage: scales as O(np^2) for dense, O(nnz) for sparse
Solver workspace: backend-dependent (LU factors, sparse symbolic structure)
IR convolution buffers: scales with history depth × port count
Internal scratch buffers: temporary workspace during step operations
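
The layers above can be turned into a rough lower-bound estimate. The sketch below is illustrative only: model_shape and all three functions are hypothetical helpers, not part of the TDSE API. It assumes CSR storage with 32-bit indices for sparse matrices and one element per history sample per port for IR buffers, and it deliberately leaves out solver workspace and scratch buffers because those are backend-dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one model; not a TDSE type. */
typedef struct {
    uint64_t ports;          /* np */
    uint64_t history_depth;  /* IR history samples; 0 for non-IR models */
    uint64_t nnz;            /* stored nonzeros; used only when dense is false */
    bool     dense;          /* dense vs sparse matrix storage */
    uint64_t elem_bytes;     /* 8 for double, 16 for complex double */
} model_shape;

/* Matrix storage: np^2 elements when dense. When sparse, assume CSR:
 * values plus 32-bit column indices plus (np + 1) 32-bit row pointers. */
uint64_t matrix_bytes(const model_shape *s) {
    assert(s->elem_bytes > 0);
    if (s->dense) {
        return s->ports * s->ports * s->elem_bytes;
    }
    return s->nnz * (s->elem_bytes + 4u) + (s->ports + 1u) * 4u;
}

/* IR convolution buffers: assumed one element per history sample per port. */
uint64_t ir_buffer_bytes(const model_shape *s) {
    return s->history_depth * s->ports * s->elem_bytes;
}

/* Lower bound: metadata + matrix + IR buffers. Solver workspace and
 * scratch buffers are not included. */
uint64_t footprint_lower_bound_bytes(const model_shape *s) {
    const uint64_t metadata_bytes = 10u * 1024u;  /* upper end of ~1-10 KB */
    return metadata_bytes + matrix_bytes(s) + ir_buffer_bytes(s);
}
```

Treat the result as a floor for planning, then compare it against measured process memory before committing to a model count.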

Linear Scaling Model

For N parallel model handles, memory scales approximately linearly:

total_memory ≈ N × per_model_footprint + shared_runtime_overhead

The shared runtime overhead is typically small (< 10 MB) and does not scale with N.
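
The linear model is easy to apply in both directions: from a model count to a total, and from a memory budget to a maximum model count. The helpers below are a minimal sketch with hypothetical names; the per-model footprint and the overhead are inputs you measure or estimate yourself.

```c
#include <assert.h>
#include <stdint.h>

/* total = N * per-model footprint + shared runtime overhead */
uint64_t total_memory_bytes(uint64_t n_models,
                            uint64_t per_model_bytes,
                            uint64_t shared_overhead_bytes) {
    return n_models * per_model_bytes + shared_overhead_bytes;
}

/* Largest N whose estimated total still fits in budget_bytes. */
uint64_t max_models_for_budget(uint64_t budget_bytes,
                               uint64_t per_model_bytes,
                               uint64_t shared_overhead_bytes) {
    assert(per_model_bytes > 0);
    if (budget_bytes <= shared_overhead_bytes) {
        return 0;  /* the overhead alone exhausts the budget */
    }
    return (budget_bytes - shared_overhead_bytes) / per_model_bytes;
}
```

For example, 8 models at 50 MiB each with 10 MiB of shared overhead come to 410 MiB, and a 1024 MiB budget with the same inputs admits at most 20 models.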

Practical Planning Ranges

Based on typical workloads:

Small models (np < 10, sparse): ~1-10 MB per model
Medium models (10 ≤ np < 100, moderate density): ~10-100 MB per model
Large models (np ≥ 100, dense or IR-heavy): ~100 MB - 1 GB per model

These are rough estimates, not support limits. Actual memory usage depends on matrix sparsity, history depth, and backend selection.

Ways To Reduce The Footprint

To reduce memory footprint with many parallel models:

Reuse one model sequentially when throughput is not the goal: for repeated sweeps of the same pack, one model plus tdse_model_reset(...) can be cheaper than many parallel copies. See Multi-Model Patterns.
Limit history depth: IR convolution memory scales linearly with history depth.
Prefer sparse backends: for sparse matrices, a sparse backend uses significantly less memory than a dense one.
Batch sweeps: for AC sweeps with many frequency points, use sweep parallelism instead of separate models.

What To Measure First

Use the following to monitor memory:

tdse_model_info(...) returns model metadata including port count
OS-level tools (Task Manager, top, ps) for process memory
For GPU workloads, use nvidia-smi to monitor GPU memory
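
Process memory can also be sampled from inside the host application, which makes it easy to log the footprint before and after creating a batch of models. The sketch below uses the POSIX getrusage call; the function names are hypothetical, and the unit noted in the comment applies to Linux (other platforms report ru_maxrss in different units).

```c
#include <assert.h>
#include <sys/resource.h>

/* Peak resident set size of the calling process.
 * On Linux, ru_maxrss is reported in KiB. Returns -1 on failure. */
long peak_rss_kib(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        return -1;
    }
    return ru.ru_maxrss;
}

/* Convenience conversion for log output. */
double kib_to_mib(long kib) {
    assert(kib >= 0);
    return (double)kib / 1024.0;
}
```

Because this reports a peak rather than the current value, sample it after each batch of model creations and compare consecutive readings to see how much each batch added.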

NUMA-Aware Allocation Strategy

Current NUMA Support

The runtime does not currently implement explicit NUMA-aware allocation. Memory allocation follows the OS default policy:

On Linux: pages are placed on the NUMA node of the thread that first touches them, which is normally the allocating thread
On Windows: memory allocation follows Windows NUMA policies

Best Practices For NUMA Systems

For optimal performance on multi-socket NUMA systems:

Thread locality: create and step each model handle on the same NUMA node, so the stepping thread does not reach across to memory on a remote node
Bind to NUMA nodes: use OS facilities to pin the process or its threads to a node (numactl on Linux; SetProcessAffinityMask or SetThreadAffinityMask on Windows)
Memory-first strategy: place models on NUMA nodes with enough free memory, so allocations do not spill onto a remote node

Recommended NUMA Pattern

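# Shell, on Linux with numactl:
# run the whole process on NUMA node 0, with memory bound to node 0
numactl --cpunodebind=0 --membind=0 ./your_simulation

/* Or programmatically with libnuma (#include <numa.h>, link with -lnuma): */
if (numa_available() != -1) {
    numa_set_preferred(0);  /* prefer node 0 for this thread's allocations */
}
/* Create model handles here */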
When NUMA Usually Matters

TDSE does not provide explicit NUMA control APIs
NUMA effects are most visible with very large models (> 1 GB) or high model counts (> 100)
For typical workloads (< 50 models, < 100 MB each), OS default policies are often sufficient

GPU Multi-Stream Parallelism

Current GPU Parallelism Model

Keep three points in mind about GPU sharing:

Single-stream per model: Each model handle uses one CUDA stream
AC sweep parallelism: Multiple frequency points can be solved in parallel using CUDA batched operations
No explicit multi-stream API: TDSE does not expose CUDA stream management to the host

AC Sweep Parallelism on GPU

When CUDA sweep parallelism is enabled:

The adapter uses multiple worker threads, each with its own CUDA stream
Frequency points are distributed across workers and streams
Batched solves (2-4 points per chunk) improve GPU utilization
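
To make the chunking idea concrete, the sketch below splits a sweep into fixed-size chunks and assigns them to workers round-robin. This is an illustration only: the function names are hypothetical, and the adapter's actual distribution scheme is not documented here, so round-robin is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Number of chunks needed to cover n_points (ceiling division). */
size_t sweep_chunk_count(size_t n_points, size_t chunk_size) {
    assert(chunk_size > 0);
    return (n_points + chunk_size - 1) / chunk_size;
}

/* Number of frequency points in a given chunk; the last chunk may be short. */
size_t sweep_chunk_len(size_t n_points, size_t chunk_size, size_t chunk_index) {
    assert(chunk_index < sweep_chunk_count(n_points, chunk_size));
    size_t start = chunk_index * chunk_size;
    size_t remaining = n_points - start;
    return remaining < chunk_size ? remaining : chunk_size;
}

/* Round-robin assignment of chunks to workers (assumed policy). */
size_t sweep_worker_for_chunk(size_t chunk_index, size_t n_workers) {
    assert(n_workers > 0);
    return chunk_index % n_workers;
}
```

With 10 frequency points and a chunk size of 4, this yields three chunks of 4, 4, and 2 points; with two workers, the first and third chunks go to worker 0 and the second to worker 1.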

Diagnostics fields to monitor:

sweep_parallel_enabled: Whether parallelism activated
sweep_parallel_workers: Number of worker threads/streams
sweep_parallel_points: Number of frequency points processed in parallel

Multi-Model GPU Parallelism

For multiple model handles on GPU:

Each model handle uses its own CUDA stream
Multiple models can run concurrently on the same GPU
GPU memory is shared across all models on the same device

GPU Memory Management

GPU memory considerations:

Per-model GPU memory: Matrix + solver workspace + convolution buffers
Shared GPU memory: CUDA context overhead, driver overhead
Memory limits: Use nvidia-smi to monitor GPU memory usage
Out-of-memory handling: TDSE returns TDSE_ERR_OUT_OF_MEMORY if GPU allocation fails

Recommendations

Monitor GPU memory: Use nvidia-smi to ensure sufficient GPU memory for your model count
Batch sweeps: Use AC sweep parallelism instead of separate models when possible
Prefer CPU for very small models: small models may not do enough work per step to amortize GPU launch and transfer overhead
Limit concurrent GPU models: If GPU memory is constrained, limit the number of concurrent GPU models
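
One way to limit concurrent GPU models is a host-side byte budget that each worker reserves against before creating a GPU model and releases after destroying it. The sketch below is a hypothetical helper built on C11 atomics, not a TDSE facility; the capacity and the per-model size are estimates you supply, and a failed TDSE allocation (TDSE_ERR_OUT_OF_MEMORY) still has to be handled separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical host-side budget for estimated GPU memory use. */
typedef struct {
    size_t capacity_bytes;      /* memory you are willing to hand to models */
    _Atomic size_t used_bytes;  /* sum of outstanding reservations */
} gpu_budget;

void gpu_budget_init(gpu_budget *b, size_t capacity_bytes) {
    b->capacity_bytes = capacity_bytes;
    atomic_init(&b->used_bytes, 0);
}

/* Try to reserve `bytes`; returns false if that would exceed capacity.
 * Safe to call from several threads at once. */
bool gpu_budget_try_reserve(gpu_budget *b, size_t bytes) {
    size_t used = atomic_load(&b->used_bytes);
    for (;;) {
        if (bytes > b->capacity_bytes - used) {
            return false;
        }
        /* On failure, `used` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&b->used_bytes, &used, used + bytes)) {
            return true;
        }
    }
}

/* Return a reservation made earlier with gpu_budget_try_reserve. */
void gpu_budget_release(gpu_budget *b, size_t bytes) {
    assert(bytes <= atomic_load(&b->used_bytes));
    atomic_fetch_sub(&b->used_bytes, bytes);
}
```

A worker that fails to reserve can wait and retry, or fall back to a CPU backend for that model. Leaving some headroom below the device's total memory accounts for the shared context and driver overhead mentioned above.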

Practical Limits On Port Count And History Depth

Port Count Limits

There is no hard-coded maximum port count. Practical limits are determined by:

Memory: Dense matrices scale as O(np^2), sparse as O(nnz)
Performance: Factorization cost grows super-linearly with matrix size
Numerical stability: Very large matrices may have conditioning issues

Empirical guidance:

< 10 ports: Fast, suitable for dense backends
10-100 ports: Medium scale, sparse backends often beneficial
100-1000 ports: Large scale, sparse backends recommended
> 1000 ports: Very large, sparse backends essential; performance depends heavily on sparsity
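
If an integration needs to pick a default backend automatically, the empirical ranges can be encoded as a small lookup. The sketch below uses hypothetical names and is not a TDSE API; where the ranges above share a boundary, it assumes that 10 ports counts as medium and that 100 and 1000 ports count as large.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SCALE_SMALL,       /* < 10 ports */
    SCALE_MEDIUM,      /* 10 to 99 ports */
    SCALE_LARGE,       /* 100 to 1000 ports */
    SCALE_VERY_LARGE   /* > 1000 ports */
} port_scale;

/* Map a port count to one of the guidance ranges. */
port_scale classify_port_count(size_t np) {
    if (np < 10)    return SCALE_SMALL;
    if (np < 100)   return SCALE_MEDIUM;
    if (np <= 1000) return SCALE_LARGE;
    return SCALE_VERY_LARGE;
}

/* Human-readable hint that mirrors the guidance text. */
const char *backend_hint(port_scale scale) {
    switch (scale) {
    case SCALE_SMALL:      return "dense backend is suitable";
    case SCALE_MEDIUM:     return "sparse backend often beneficial";
    case SCALE_LARGE:      return "sparse backend recommended";
    case SCALE_VERY_LARGE: return "sparse backend essential";
    }
    assert(!"unknown port_scale value");
    return "";
}
```

The hint is a starting point only; actual sparsity matters more than the port count once models get large.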

History Depth Limits

For IR-based models, history depth affects:

Memory: Linear scaling with history depth × port count
Performance: Convolution cost scales with history depth
Accuracy: Deeper history improves accuracy for long-time responses
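
The linear scaling in both memory and cost is easiest to see in a direct-form convolution over a circular history buffer. The sketch below is a generic single-channel illustration with hypothetical names; it is not TDSE's implementation, which may use a different algorithm. Each channel stores depth samples, and each step performs depth multiply-adds.

```c
#include <assert.h>
#include <stddef.h>

/* One impulse-response channel with a circular input history. */
typedef struct {
    const double *kernel;  /* impulse response h[0..depth-1] */
    double *history;       /* last `depth` inputs, zero-initialized by caller */
    size_t depth;          /* history depth */
    size_t head;           /* index of the most recent input */
} ir_channel;

/* Push one input sample and return y[n] = sum_k h[k] * x[n-k].
 * Memory is O(depth) per channel; work is O(depth) per step. */
double ir_channel_step(ir_channel *ch, double input) {
    assert(ch->depth > 0);
    ch->head = (ch->head + 1) % ch->depth;
    ch->history[ch->head] = input;  /* overwrites the oldest sample */

    double acc = 0.0;
    size_t idx = ch->head;          /* walk backwards from the newest sample */
    for (size_t k = 0; k < ch->depth; ++k) {
        acc += ch->kernel[k] * ch->history[idx];
        idx = (idx == 0) ? ch->depth - 1 : idx - 1;
    }
    return acc;
}
```

Feeding a unit impulse followed by zeros returns the kernel samples in order, which is a quick sanity check. Doubling the depth doubles both the buffer size and the loop length, which is the trade-off against the accuracy gain for long-time responses.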