Time-Domain System Equivalent (TDSE)

Backend Selection and Performance

Backend selection, runtime plans, CUDA, precision, thread control, and benchmark guidance.

Audience: Integration engineers choosing execution backends for production deployments, and anyone tuning GPU or CPU performance for large models.

Use this chapter when you are choosing an execution backend for deployment or trying to improve throughput on existing hardware. It shows how to discover available backends, pick the right one for a model, and tune the settings that matter most.

For best results, read this chapter after Profiler. The Profiler tells you how to measure your real workload; this chapter tells you how to interpret the measurements, pick a backend policy, and separate sizing guidance from stronger deployment evidence.

Backend Overview

TDSE runtime backends control how convolution and linear algebra operations are executed. The backend is selected per model before the first step.

Discovering Available Backends

uint32_t count = tdse_backend_registry_count();
for (uint32_t i = 0; i < count; ++i) {
    tdse_backend_capability_t cap;
    tdse_backend_registry_get(i, &cap);
    printf("backend %u: %s (np_max=%zu, cuda=%d)\n",
           i, tdse_backend_id_name(cap.id), (size_t)cap.max_np, cap.has_cuda);
}

Backend Identifier Reference

| Backend ID | Description | Best For |
| --- | --- | --- |
| CPU_GENERIC | portable CPU backend | small models, any hardware |
| CPU_BLAS | BLAS-accelerated CPU | medium models with OpenBLAS/MKL |
| CPU_BLAS_SPARSE | sparse BLAS CPU | large sparse models |
| CUDA | NVIDIA GPU backend | large models, high throughput |

Not all backends are available in every build. Use the registry API to check what is compiled in.

Setting the Backend

tdse_backend_id_t id = tdse_backend_id_from_name("CPU_BLAS");
tdse_status_t st = tdse_backend_set(model, id);

Important: tdse_backend_set() must be called before the first successful tdse_step_begin(). After the first step begins, configuration is frozen and tdse_backend_set() returns TDSE_EXT_STATUS_UNSUPPORTED.

Query the active backend at any time:

tdse_backend_id_t active = tdse_backend_get_active(model);

Runtime Plans

For repeatable deployments, prefer tdse_backend_apply_plan() over manual backend selection. A runtime plan is a JSON document that captures backend selection and optional per-scenario overrides in one place.

Applying a Plan

tdse_backend_apply_plan(model, plan_json, plan_json_len);

The plan JSON structure:

{
  "default": {
    "backend": "CPU_BLAS",
    "local_threads": 4,
    "compute_precision": "fp64"
  },
  "scenarios": {
    "large_np": {
      "backend": "CUDA",
      "cuda_config": {
        "pipeline_mode": "async"
      }
    }
  }
}

The default section applies to all models. The scenarios section provides named overrides that the host can activate when a model or workload needs a different policy.
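A host that assembles a plan at run time, rather than loading a Profiler-generated file, could format it along the lines of the sketch below. The helper name build_default_plan is hypothetical and not part of the TDSE API; only the JSON keys shown above are taken from this chapter, and the resulting buffer and length would still be handed to tdse_backend_apply_plan().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the TDSE API): formats a plan that
 * contains only a "default" section, using the keys shown in this chapter.
 * Returns the JSON length (excluding the NUL terminator), or 0 if the
 * buffer is too small or formatting fails. Inputs are not JSON-escaped,
 * so pass plain identifiers such as "CPU_BLAS" and "fp64". */
static size_t build_default_plan(char *buf, size_t cap,
                                 const char *backend,
                                 unsigned local_threads,
                                 const char *precision)
{
    assert(buf != NULL && backend != NULL && precision != NULL);
    int n = snprintf(buf, cap,
                     "{\"default\":{\"backend\":\"%s\","
                     "\"local_threads\":%u,"
                     "\"compute_precision\":\"%s\"}}",
                     backend, local_threads, precision);
    if (n < 0 || (size_t)n >= cap) {
        return 0; /* encoding error or truncated output */
    }
    return (size_t)n;
}
```

Returning 0 on truncation keeps a partially written plan from ever reaching the runtime; the caller can retry with a larger buffer.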

Plans are typically generated by the TDSE Profiler. See Profiler for details.

CUDA Configuration

When the CUDA backend is available, configure per-model options:

tdse_cuda_backend_config_t cuda_cfg;
tdse_cuda_backend_get_config(model, &cuda_cfg);
cuda_cfg.pipeline_mode = TDSE_CUDA_PIPELINE_ASYNC;
tdse_cuda_backend_set_config(model, &cuda_cfg);

Pipeline Modes

| Mode | Description | When to Use |
| --- | --- | --- |
| sync | synchronous host-device transfer | debugging, small models |
| async | overlapping compute and transfer | production, large models |

GPU Memory Management

If GPU allocation fails, tdse_model_create() or step APIs return TDSE_ERR_OUT_OF_MEMORY. Monitor GPU memory with nvidia-smi.

For multi-model GPU deployment, resource sharing, and memory planning, see Multi-Model Deployment.

GPU Recommendations

  • Prefer GPU for models with np > 10 and nh > 100
  • Small models may not amortize the CPU-GPU transfer overhead
  • Limit concurrent GPU models if memory is constrained

Compute Precision

Control the floating-point precision used for history convolution:

tdse_compute_precision_set(model, TDSE_COMPUTE_PRECISION_FP32);

| Precision | Description | Impact |
| --- | --- | --- |
| FP64 | double precision (default) | highest accuracy, moderate performance |
| FP32 | single precision | faster, reduced accuracy in history term |

When to use FP32: large nh where the history term dominates step cost and the application tolerates reduced mantissa precision in the history accumulation.

When to stay with FP64: stiff systems, high-frequency dynamics, or when bit-exact reproducibility is required across hardware.
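The effect of reduced mantissa precision on a long accumulation can be illustrated without the TDSE runtime. The sketch below sums the same nh products once in float and once in double. It is a generic stand-in for a history accumulation, not the runtime's actual kernel, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Generic stand-in for a history accumulation: sum of h[k] * x[k].
 * Not the TDSE kernel; it only contrasts FP32 and FP64 rounding. */
static float accumulate_fp32(const float *h, const float *x, size_t nh)
{
    assert(h != NULL && x != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < nh; ++k) {
        acc += h[k] * x[k];              /* every add rounds to 24 bits */
    }
    return acc;
}

/* Same inputs, but products and running sum are kept in double. */
static double accumulate_fp64(const float *h, const float *x, size_t nh)
{
    assert(h != NULL && x != NULL);
    double acc = 0.0;
    for (size_t k = 0; k < nh; ++k) {
        acc += (double)h[k] * (double)x[k];
    }
    return acc;
}
```

Feeding both functions identical inputs (for example nh = 4096 entries of 0.1f multiplied by 1.0f) shows the float sum drifting from the exact value while the double sum does not. Whether that drift is acceptable depends on the application, which is why guard metrics should be monitored when switching precision.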

Query current setting:

tdse_compute_precision_t prec = tdse_compute_precision_get(model);

Thread Control

Override the number of CPU threads used internally by the runtime:

tdse_local_threads_set(model, 4);

By default, the runtime uses the number of logical CPU cores reported by tdse_ext_runtime_logical_cores().

Guidelines:

  • For single-model workflows, set local_threads to the number of physical cores
  • For N-model parallel workflows, divide available cores across models
  • Setting local_threads higher than available cores provides no benefit
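The division rule for N-model workflows can be sketched as a small helper. The function name threads_per_model is hypothetical; the value it returns is what a host would pass to tdse_local_threads_set() for each model, and the physical core count is assumed to come from the host's own hardware query.

```c
#include <assert.h>

/* Hypothetical helper: even share of physical cores per model, never
 * below one thread. Remainder cores are left unassigned. */
static unsigned threads_per_model(unsigned physical_cores, unsigned n_models)
{
    assert(n_models > 0);
    unsigned share = physical_cores / n_models;
    return share > 0 ? share : 1u;
}
```

For example, threads_per_model(16, 4) returns 4. When there are more models than cores the helper still returns 1, so the cores are oversubscribed and the models contend for them.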

Backend Selection Guide

How large is your model (np)?
├─ np < 10
│  └─ CPU_GENERIC or CPU_BLAS
│     └─ GPU overhead exceeds benefit
├─ np 10-100
│  └─ CPU_BLAS (with OpenBLAS/MKL)
│     └─ Consider CUDA if throughput matters
└─ np > 100
   └─ CUDA (if available) or CPU_BLAS_SPARSE
      └─ GPU transfer overhead is amortized
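As a starting point before profiling, the tree above can be written as a small function. This is a sketch: suggest_backend is a hypothetical name, the availability flags are assumed to come from the registry query shown earlier, and the fallback to CPU_GENERIC when no BLAS backend is compiled in is an assumption rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper mirroring the np thresholds of the decision tree.
 * Returns a backend name suitable for tdse_backend_id_from_name().
 * Sparsity, nh, and measured latency should still override the result. */
static const char *suggest_backend(size_t np, bool has_cuda, bool has_blas)
{
    assert(np > 0);
    if (np > 100) {
        /* Transfer overhead is amortized; confirm that CPU_BLAS_SPARSE
         * is compiled in before relying on the fallback. */
        return has_cuda ? "CUDA" : "CPU_BLAS_SPARSE";
    }
    /* np < 10 and np 10-100 both stay on the CPU by default; for
     * np 10-100, CUDA remains worth testing if throughput matters. */
    return has_blas ? "CPU_BLAS" : "CPU_GENERIC";
}
```

The function encodes only the size thresholds, so treat its answer as the first candidate to benchmark rather than a final choice.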

Additional factors:

| Factor | Backend Impact |
| --- | --- |
| Sparse matrix structure | CPU_BLAS_SPARSE may outperform dense even at moderate np |
| Multiple models in parallel | Each gets its own handle; divide CPU threads or GPU memory |
| Variable dt | Fast-path backends (CPU_BLAS, CUDA) optimize for uniform stepping |
| Pack size | Large nh increases convolution cost; GPU benefits more |

Build Features

Check what features the current build was compiled with:

/* Query required buffer size */
size_t json_len = 0;
tdse_perf_get_build_features_json(NULL, &json_len);

/* Allocate and query; skip the second call if allocation fails */
char* json = (json_len > 0) ? malloc(json_len) : NULL;
if (json != NULL) {
    tdse_perf_get_build_features_json(json, &json_len);
    printf("Features: %s\n", json);
    free(json);
}

This returns a JSON object listing compiled-in features such as CUDA support, BLAS backend, telemetry, and other optional components.

Performance Benchmarks

The numbers below are early sizing guidance, not guarantees and not release evidence. They help you estimate whether TDSE is in the right range for your deployment before you run your own measurements. Unless noted otherwise, all data uses the default CPU backend (CPU_GENERIC) on a representative x86_64 workstation.

Base procurement, PoC, and real-time sign-off decisions on measurements from your target machine, target pack shape, and target host-loop policy.

Use the numbers for triage questions such as "is CPU enough?", "is CUDA worth testing?", or "is this pack likely to fit in memory?". Do not use them as substitutes for customer acceptance criteria, target-machine qualification, or published product guarantees.

Representative Step Latency

Measurements with nh = 256, dt = 1e-6, and 10,000 warm-up steps followed by 10,000 measured steps:

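| np | nq | Backend | Step Latency (μs) | Throughput (steps/s) |
| --- | --- | --- | --- | --- |
| 1 | 1 | CPU_GENERIC | 0.3-0.8 | 1.2M-3.3M |
| 3 | 3 | CPU_GENERIC | 0.5-1.5 | 670K-2.0M |
| 3 | 3 | CPU_BLAS | 0.4-1.2 | 830K-2.5M |
| 10 | 10 | CPU_GENERIC | 2-8 | 125K-500K |
| 10 | 10 | CPU_BLAS | 1-4 | 250K-1.0M |
| 10 | 10 | CUDA | 5-15* | 67K-200K |
| 50 | 50 | CPU_BLAS | 20-80 | 12K-50K |
| 50 | 50 | CUDA | 8-25 | 40K-125K |
| 100 | 100 | CPU_BLAS | 100-400 | 2.5K-10K |
| 100 | 100 | CUDA | 15-50 | 20K-67K |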

*CUDA numbers include host-device transfer overhead. Small models may not benefit from GPU due to transfer latency.
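The throughput column is the reciprocal of the step latency. A minimal sketch of the conversion, with hypothetical helper names, is useful when translating your own measured latencies into steps per second or total run time:

```c
#include <assert.h>

/* Steps per second for a given per-step latency in microseconds. */
static double steps_per_second(double latency_us)
{
    assert(latency_us > 0.0);
    return 1.0e6 / latency_us;
}

/* Wall-clock seconds needed to execute n_steps at that latency. */
static double run_time_seconds(double latency_us, double n_steps)
{
    assert(latency_us > 0.0 && n_steps >= 0.0);
    return latency_us * n_steps * 1.0e-6;
}
```

For example, the 8 μs upper bound for CPU_GENERIC at np = 10 corresponds to 125,000 steps/s, which matches the lower end of the 125K-500K throughput range in that row.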

Memory Footprint

Approximate per-model memory usage:

| np | nh | Dense Memory | Notes |
| --- | --- | --- | --- |
| 3 | 256 | ~50 KB | Small model, typical transmission line |
| 10 | 1024 | ~2 MB | Medium model, multi-port subsystem |
| 50 | 2048 | ~80 MB | Large model, distribution network |
| 100 | 4096 | ~600 MB | Very large, dense subsystem |

Memory scales approximately as nh * nq * np * 8 bytes for the H tensor plus nq * np * 8 bytes for workspace.
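The scaling formula can be evaluated directly for a quick fit check. The sketch below implements only the two terms named above; the function name is hypothetical, and any allocations beyond the H tensor and workspace are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Dense memory estimate from the formula in this section:
 * H tensor = nh * nq * np * 8 bytes, workspace = nq * np * 8 bytes. */
static uint64_t dense_memory_bytes(uint64_t np, uint64_t nq, uint64_t nh)
{
    assert(np > 0 && nq > 0 && nh > 0);
    const uint64_t elem = 8;               /* bytes per stored value */
    uint64_t h_tensor = nh * nq * np * elem;
    uint64_t workspace = nq * np * elem;
    return h_tensor + workspace;
}
```

Assuming nq = np (the table does not list nq), dense_memory_bytes(10, 10, 1024) returns 820,000 bytes, while the table lists ~2 MB for that shape. The tabulated figures are roughly two to three times the formula's result across all four rows, so treat the formula as a lower bound and the table as the more conservative planning number.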

Scaling Behavior

  • Linear in nh: doubling history depth approximately doubles per-step time
  • Quadratic in np: port count has the strongest impact; minimize ports where possible
  • Linear in model count: N independent models consume approximately N times the memory and can run in parallel on separate threads
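These rules of thumb can be used to extrapolate a single measurement to a nearby model shape. The sketch below applies the linear-in-nh and quadratic-in-np rules; the function name is hypothetical, and the result is a rough estimate rather than a substitute for profiling.

```c
#include <assert.h>

/* Scales a measured per-step latency to another model shape:
 * quadratic in the np ratio, linear in the nh ratio. */
static double scale_latency_us(double measured_us,
                               double np_from, double nh_from,
                               double np_to, double nh_to)
{
    assert(measured_us > 0.0 && np_from > 0.0 && nh_from > 0.0);
    double np_ratio = np_to / np_from;
    double nh_ratio = nh_to / nh_from;
    return measured_us * np_ratio * np_ratio * nh_ratio;
}
```

For example, 2 μs measured at np = 10, nh = 256 extrapolates to 16 μs at np = 20, nh = 512. The rule is approximate: the CPU_BLAS rows in the latency table grow about 20x from np = 10 to np = 50, against the 25x that pure quadratic scaling predicts.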

Benchmarking Your Workload

To measure performance for your specific model:

tdse profiler calibrate --np <your_np> --nh <your_nh> --dtype 64 \
  --out-json ./profile.json --out-md ./profile.md

The profiler report includes:

  • per-step latency for each available backend
  • optimal backend recommendation
  • generated runtime plan for tdse_backend_apply_plan()

For real-time deployments, also measure WCET with deterministic mode enabled (see Platform Notes).
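If you also time the host loop yourself, a warm-up phase followed by a measured phase (as used for the latency table) can be sketched as below. The callback type and the count_step placeholder are hypothetical stand-ins for your own step loop, and the maximum observed latency is only an empirical indicator, not a WCET bound. The sketch assumes a POSIX system that provides clock_gettime() with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*step_fn)(void *ctx);   /* placeholder for one host-loop step */

typedef struct {
    double mean_us;   /* average latency over the measured steps */
    double max_us;    /* largest single-step latency observed */
} latency_stats;

/* Placeholder workload: counts calls. Replace with your real step. */
static void count_step(void *ctx)
{
    ++*(size_t *)ctx;
}

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec * 1e-3;
}

/* Runs warmup untimed steps, then times each of the measured steps. */
static latency_stats measure_steps(step_fn step, void *ctx,
                                   size_t warmup, size_t measured)
{
    assert(step != NULL && measured > 0);
    for (size_t i = 0; i < warmup; ++i) {
        step(ctx);
    }

    latency_stats s = {0.0, 0.0};
    double total = 0.0;
    for (size_t i = 0; i < measured; ++i) {
        double t0 = now_us();
        step(ctx);
        double dt = now_us() - t0;
        total += dt;
        if (dt > s.max_us) {
            s.max_us = dt;
        }
    }
    s.mean_us = total / (double)measured;
    return s;
}
```

For sub-microsecond steps the cost of reading the clock is comparable to the step itself, so timing batches of steps and dividing by the batch size gives a more trustworthy mean.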

For procurement, PoC, or customer-facing reporting, keep three evidence classes separate:

  • sizing guidance from this chapter
  • profiler output for your exact pack and hardware
  • target-machine timing or field qualification records from your deployment program

When sharing performance numbers outside the immediate engineering team, record at least:

  • CPU and GPU model
  • operating system and compiler/toolchain
  • build type and enabled backend/features
  • model shape (np, nq, nh, dt)
  • whether the host used fixed-step, variable-step, single-model, or multi-model execution
  • whether telemetry, deterministic mode, or additional tracing was enabled during measurement

Performance Monitoring Checklist

  • Verify backend selection with tdse_backend_get_active() after calling tdse_backend_set() or applying a plan
  • Compare step latency between backends for your model
  • Set local_threads intentionally rather than relying on the default
  • Monitor guard metrics when tuning precision or dt strategy
  • Use Profiler to derive an optimal plan rather than manual tuning
  • Archive the runtime plan alongside pack artifacts for release evidence

Before Deployment Sign-Off

Choose the next chapter based on the risk you are trying to close:

| If the open question is... | Go next |
| --- | --- |
| target platform support or qualification boundary | Platform Notes |
| plugin deployment, manifest, or ABI compatibility | Plugin System |
| circuit-input subset or netlist support risk | Element Reference |
| long-running observability rather than benchmark measurement | Telemetry |