Backend selection, runtime plans, CUDA, precision, thread control, and benchmark guidance.

Audience: Integration engineers choosing execution backends for production deployments, and anyone tuning GPU or CPU performance for large models.

Use this chapter when you are choosing an execution backend for deployment or trying to improve throughput on existing hardware. It shows how to discover available backends, pick the right one for a model, and tune the settings that matter most.

For best results, read this chapter after Profiler. The Profiler tells you how to measure your real workload; this chapter tells you how to interpret the measurements, pick a backend policy, and separate sizing guidance from stronger deployment evidence.

Backend Overview

TDSE runtime backends control how convolution and linear algebra operations are executed. The backend is selected per model before the first step.

Discovering Available Backends

/* Enumerate every backend compiled into this build */
uint32_t count = tdse_backend_registry_count();
for (uint32_t i = 0; i < count; ++i) {
    tdse_backend_capability_t cap;
    tdse_backend_registry_get(i, &cap);
    /* Print the name, the maximum supported np, and whether CUDA is used */
    printf("backend %u: %s (np_max=%zu, cuda=%d)\n",
           i, tdse_backend_id_name(cap.id), (size_t)cap.max_np, cap.has_cuda);
}

Backend Identifier Reference

Backend ID	Description	Best For
`CPU_GENERIC`	portable CPU backend	small models, any hardware
`CPU_BLAS`	BLAS-accelerated CPU	medium models with OpenBLAS/MKL
`CPU_BLAS_SPARSE`	sparse BLAS CPU	large sparse models
`CUDA`	NVIDIA GPU backend	large models, high throughput

Not all backends are available in every build. Use the registry API to check what is compiled in.

Setting the Backend

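/* Resolve the backend by name, then bind it to the model before stepping */
tdse_backend_id_t id = tdse_backend_id_from_name("CPU_BLAS");
tdse_status_t st = tdse_backend_set(model, id);
/* Check st: the call is rejected once the first step has begun */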

Important: tdse_backend_set() must be called before the first successful tdse_step_begin(). After the first step begins, configuration is frozen and tdse_backend_set() returns TDSE_EXT_STATUS_UNSUPPORTED.

Query the active backend at any time:

tdse_backend_id_t active = tdse_backend_get_active(model);

Runtime Plans

For repeatable deployments, prefer tdse_backend_apply_plan() over manual backend selection. A runtime plan is a JSON document that captures backend selection and optional per-scenario overrides in one place.

Applying a Plan

tdse_backend_apply_plan(model, plan_json, plan_json_len);

The plan JSON structure:

{
  "default": {
    "backend": "CPU_BLAS",
    "local_threads": 4,
    "compute_precision": "fp64"
  },
  "scenarios": {
    "large_np": {
      "backend": "CUDA",
      "cuda_config": {
        "pipeline_mode": "async"
      }
    }
  }
}

The default section applies to all models. The scenarios section provides named overrides that the host can activate when a model or workload needs a different policy.

Plans are typically generated by the TDSE Profiler. See Profiler for details.

CUDA Configuration

When the CUDA backend is available, configure per-model options:

tdse_cuda_backend_config_t cuda_cfg;
tdse_cuda_backend_get_config(model, &cuda_cfg);
cuda_cfg.pipeline_mode = TDSE_CUDA_PIPELINE_ASYNC;
tdse_cuda_backend_set_config(model, &cuda_cfg);

Pipeline Modes

Mode	Description	When to Use
sync	synchronous host-device transfer	debugging, small models
async	overlapping compute and transfer	production, large models

GPU Memory Management

If GPU allocation fails, tdse_model_create() or step APIs return TDSE_ERR_OUT_OF_MEMORY. Monitor GPU memory with nvidia-smi.

For multi-model GPU deployment, resource sharing, and memory planning, see Multi-Model Deployment.

GPU Recommendations

Consider GPU for models with np > 10 and nh > 100; it is the default recommendation only above np = 100 (see Backend Selection Guide)
Small models may not amortize the CPU-GPU transfer overhead
Limit concurrent GPU models if memory is constrained

Compute Precision

Control the floating-point precision used for history convolution:

tdse_compute_precision_set(model, TDSE_COMPUTE_PRECISION_FP32);

Precision	Description	Impact
`FP64`	double precision (default)	highest accuracy, moderate performance
`FP32`	single precision	faster, reduced accuracy in history term

When to use FP32: large nh where the history term dominates step cost and the application tolerates reduced mantissa precision in the history accumulation.

When to stay with FP64: stiff systems, high-frequency dynamics, or when bit-exact reproducibility is required across hardware.
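
The accuracy trade-off described above can be illustrated with a generic accumulation loop. The sketch below is not the TDSE history kernel; sum_fp32 and sum_fp64 are hypothetical helpers that add the same terms in single and double precision so the drift can be compared. A production kernel may use a different summation order or compensation, so treat the effect as qualitative only.

```c
#include <stddef.h>

/* Hypothetical helper: accumulate terms in single precision.
 * Each term is rounded to float, and so is every partial sum. */
float sum_fp32(const double *terms, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += (float)terms[i];
    }
    return acc;
}

/* Hypothetical helper: accumulate the same terms in double precision. */
double sum_fp64(const double *terms, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += terms[i];
    }
    return acc;
}
```

Feeding both helpers a long run of identical small terms shows the single-precision sum drifting much further from the exact result than the double-precision sum. The drift grows with the number of accumulated terms, which is why large nh is where FP32 error deserves a check against your accuracy budget.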

Query current setting:

tdse_compute_precision_t prec = tdse_compute_precision_get(model);

Thread Control

Override the number of CPU threads used internally by the runtime:

tdse_local_threads_set(model, 4);

By default, the runtime uses the number of logical CPU cores reported by tdse_ext_runtime_logical_cores().

Guidelines:

For single-model workflows, set local_threads to the number of physical cores
For N-model parallel workflows, divide available cores across models
Setting local_threads higher than available cores provides no benefit
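
The guidelines above can be sketched as a small helper. threads_per_model is a hypothetical function, not part of the TDSE API, and the even split with a floor of one thread is an assumption of this sketch rather than a documented runtime policy.

```c
/* Hypothetical helper (not a TDSE API): divide the physical cores evenly
 * across n_models, never returning fewer than one thread per model. */
unsigned threads_per_model(unsigned physical_cores, unsigned n_models)
{
    if (physical_cores == 0 || n_models == 0) {
        return 1;  /* degenerate input: fall back to a single thread */
    }
    unsigned share = physical_cores / n_models;  /* integer division rounds down */
    return share > 0 ? share : 1;
}
```

The result can be passed to tdse_local_threads_set() for each model. With 8 physical cores, a single model gets 8 threads, two models get 4 each, and three models get 2 each, leaving two cores for the host loop.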

Backend Selection Guide

How large is your model (np)?
├─ np < 10
│  └─ CPU_GENERIC or CPU_BLAS
│     └─ GPU overhead exceeds benefit
├─ np 10-100
│  └─ CPU_BLAS (with OpenBLAS/MKL)
│     └─ Consider CUDA if throughput matters
└─ np > 100
   └─ CUDA (if available) or CPU_BLAS_SPARSE
      └─ GPU transfer overhead is amortized
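
The tree above can be written as a first-pass selection function. This is a minimal sketch: backend_choice_t, suggest_backend, and backend_choice_name are hypothetical names, the tree's two small-model options are collapsed to CPU_GENERIC, and the fallbacks used when BLAS or CUDA is unavailable are assumptions, not documented TDSE behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for this sketch; not a TDSE API type. */
typedef enum {
    CHOICE_CPU_GENERIC,
    CHOICE_CPU_BLAS,
    CHOICE_CPU_BLAS_SPARSE,
    CHOICE_CUDA
} backend_choice_t;

/* Apply the np thresholds from the selection tree.
 * The fallbacks for unavailable backends are assumptions of this sketch. */
backend_choice_t suggest_backend(uint32_t np, bool has_blas, bool has_cuda)
{
    if (np < 10) {
        return CHOICE_CPU_GENERIC;  /* GPU overhead exceeds benefit */
    }
    if (np <= 100) {
        return has_blas ? CHOICE_CPU_BLAS : CHOICE_CPU_GENERIC;
    }
    if (has_cuda) {
        return CHOICE_CUDA;  /* GPU transfer overhead is amortized */
    }
    return has_blas ? CHOICE_CPU_BLAS_SPARSE : CHOICE_CPU_GENERIC;
}

/* Map a choice to the backend name used in the identifier table. */
const char *backend_choice_name(backend_choice_t choice)
{
    switch (choice) {
    case CHOICE_CPU_BLAS:        return "CPU_BLAS";
    case CHOICE_CPU_BLAS_SPARSE: return "CPU_BLAS_SPARSE";
    case CHOICE_CUDA:            return "CUDA";
    default:                     return "CPU_GENERIC";
    }
}
```

The returned name can be handed to tdse_backend_id_from_name(). The function looks only at np, so treat its answer as a starting point and let profiler measurements for your pack override it.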

Additional factors:

Factor	Backend Impact
Sparse matrix structure	CPU_BLAS_SPARSE may outperform dense even at moderate np
Multiple models in parallel	Each gets its own handle; divide CPU threads or GPU memory
Variable dt	Fast-path backends (CPU_BLAS, CUDA) are optimized for uniform stepping, so variable dt may reduce their advantage
Pack size	Large nh increases convolution cost; GPU benefits more

Build Features

Check what features the current build was compiled with:

/* Query required buffer size */
size_t json_len = 0;
tdse_perf_get_build_features_json(NULL, &json_len);

/* Allocate and query; skip the second call if allocation fails */
char* json = malloc(json_len);
if (json != NULL) {
    tdse_perf_get_build_features_json(json, &json_len);
    printf("Features: %s\n", json);
    free(json);
}

This returns a JSON object listing compiled-in features such as CUDA support, BLAS backend, telemetry, and other optional components.

Performance Benchmarks

The numbers below are early sizing guidance, not guarantees or release evidence. They help you estimate whether TDSE is in the right range for your deployment before you run your own measurements. Unless noted otherwise, all data uses the default CPU backend (CPU_GENERIC) on a representative x86_64 workstation.

Use them for triage questions such as "is CPU enough?", "is CUDA worth testing?", or "is this pack likely to fit in memory?". Base procurement, PoC, and real-time sign-off decisions on measurements from your target machine, target pack shape, and target host-loop policy. These numbers do not substitute for customer acceptance criteria, target-machine qualification, or published product guarantees.

Representative Step Latency

Measurements with nh = 256, dt = 1e-6, and 10,000 warm-up steps followed by 10,000 measured steps:

np	nq	Backend	Step Latency (μs)	Throughput (steps/s)
1	1	CPU_GENERIC	0.3-0.8	1.2M-3.3M
3	3	CPU_GENERIC	0.5-1.5	670K-2.0M
3	3	CPU_BLAS	0.4-1.2	830K-2.5M
10	10	CPU_GENERIC	2-8	125K-500K
10	10	CPU_BLAS	1-4	250K-1.0M
10	10	CUDA	5-15*	67K-200K
50	50	CPU_BLAS	20-80	12K-50K
50	50	CUDA	8-25	40K-125K
100	100	CPU_BLAS	100-400	2.5K-10K
100	100	CUDA	15-50	20K-67K

*CUDA numbers include host-device transfer overhead. Small models may not benefit from GPU due to transfer latency.
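
The warm-up-then-measure procedure used for the table can be sketched as a small timing harness. Everything here is illustrative: step_fn_t, counting_step, mean_step_latency_us, and throughput_steps_per_s are hypothetical names, the callback stands in for whatever your host loop does per step, and reporting the mean is a choice made for this sketch (the table reports ranges).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* exposes clock_gettime() under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call performs one simulation step. */
typedef void (*step_fn_t)(void *ctx);

/* Stand-in workload for the sketch: counts how often it was called. */
void counting_step(void *ctx)
{
    ++*(size_t *)ctx;
}

/* Read the monotonic clock in seconds. */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Run `warmup` untimed steps, then time `measured` steps and return the
 * mean latency per step in microseconds. */
double mean_step_latency_us(step_fn_t step, void *ctx,
                            size_t warmup, size_t measured)
{
    assert(step != NULL && measured > 0);
    for (size_t i = 0; i < warmup; ++i) {
        step(ctx);
    }
    double t0 = now_seconds();
    for (size_t i = 0; i < measured; ++i) {
        step(ctx);
    }
    double t1 = now_seconds();
    return (t1 - t0) * 1e6 / (double)measured;
}

/* Convert a per-step latency in microseconds to steps per second. */
double throughput_steps_per_s(double latency_us)
{
    assert(latency_us > 0.0);
    return 1e6 / latency_us;
}
```

The throughput column follows directly from the latency column: a 15 μs step corresponds to 1e6 / 15, or roughly 67K steps per second, which matches the CUDA row at np = 10. A mean hides tail latency, so it is not a substitute for a worst-case measurement in real-time work.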

Memory Footprint

Approximate per-model memory usage:

np	nh	Dense Memory	Notes
3	256	~50 KB	Small model, typical transmission line
10	1024	~2 MB	Medium model, multi-port subsystem
50	2048	~80 MB	Large model, distribution network
100	4096	~600 MB	Very large, dense subsystem

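Memory scales approximately as nh * nq * np * 8 bytes for the H tensor plus nq * np * 8 bytes for workspace. Assuming nq = np, the table totals are roughly two to three times the H-tensor term, so use the formula to judge scaling rather than absolute size.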
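
The two terms of the scaling rule can be evaluated directly. The helpers below are hypothetical (not TDSE APIs) and compute only the H tensor and workspace terms named above, so the result is an estimate of those two terms, not of total per-model memory.

```c
#include <stdint.h>

/* H tensor term: nh * nq * np entries of 8 bytes each. */
uint64_t estimate_h_tensor_bytes(uint64_t np, uint64_t nq, uint64_t nh)
{
    return nh * nq * np * 8u;
}

/* Workspace term: nq * np entries of 8 bytes each. */
uint64_t estimate_workspace_bytes(uint64_t np, uint64_t nq)
{
    return nq * np * 8u;
}

/* Sum of the two terms from the scaling rule. */
uint64_t estimate_model_bytes(uint64_t np, uint64_t nq, uint64_t nh)
{
    return estimate_h_tensor_bytes(np, nq, nh) + estimate_workspace_bytes(np, nq);
}
```

For np = nq = 3 and nh = 256 this gives 18,504 bytes, and for np = nq = 100 and nh = 4096 it gives 327,760,000 bytes. Doubling nh doubles the H-tensor term, while doubling both np and nq quadruples it.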

Scaling Behavior

Linear in nh: doubling history depth approximately doubles per-step time
Quadratic in np: port count has the strongest impact; minimize ports where possible
Linear in model count: N independent models consume approximately N times the memory and can run in parallel on separate threads
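
The first two rules can be combined into a rough extrapolation from one measured point. scale_latency_us is a hypothetical helper that assumes exactly linear scaling in nh and exactly quadratic scaling in np; real backends deviate from both, so use it only to decide what is worth measuring.

```c
#include <assert.h>

/* Hypothetical helper: extrapolate a measured per-step latency to a new
 * model shape, assuming linear cost in nh and quadratic cost in np. */
double scale_latency_us(double base_us, double base_np, double base_nh,
                        double np, double nh)
{
    assert(base_np > 0.0 && base_nh > 0.0);
    double np_ratio = np / base_np;
    double nh_ratio = nh / base_nh;
    return base_us * nh_ratio * np_ratio * np_ratio;
}
```

As a sanity check against the latency table, scaling the CPU_BLAS row at np = 10 (1-4 μs) to np = 50 at the same nh predicts 25-100 μs, against 20-80 μs in the table. The estimate lands in the right range but is not exact.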

Benchmarking Your Workload

To measure performance for your specific model:

tdse profiler calibrate --np <your_np> --nh <your_nh> --dtype 64 \
  --out-json ./profile.json --out-md ./profile.md

The profiler report includes:

per-step latency for each available backend
optimal backend recommendation
generated runtime plan for tdse_backend_apply_plan()

For real-time deployments, also measure worst-case execution time (WCET) with deterministic mode enabled (see Platform Notes).

For procurement, PoC, or customer-facing reporting, keep three evidence classes separate:

sizing guidance from this chapter
profiler output for your exact pack and hardware
target-machine timing or field qualification records from your deployment program

When sharing performance numbers outside the immediate engineering team, record at least:

CPU and GPU model
operating system and compiler/toolchain
build type and enabled backend/features
model shape (np, nq, nh, dt)
whether the host used fixed-step, variable-step, single-model, or multi-model execution
whether telemetry, deterministic mode, or additional tracing was enabled during measurement

Performance Monitoring Checklist

Verify backend selection with tdse_backend_get_active() after calling tdse_backend_set()
Compare step latency between backends for your model
Set local_threads intentionally rather than relying on default
Monitor guard metrics when tuning precision or dt strategy
Use Profiler to derive an optimal plan rather than manual tuning
Archive the runtime plan alongside pack artifacts for release evidence

Before Deployment Sign-Off

Choose the next chapter based on the risk you are trying to close:

If the open question is...	Go to
target platform support or qualification boundary	Platform Notes
plugin deployment, manifest, or ABI compatibility	Plugin System
circuit-input subset or netlist support risk	Element Reference
long-running observability rather than benchmark measurement	Telemetry
