Troubleshooting
Diagnostic surfaces, failure modes, and the evidence to collect before escalation.
Use this chapter when something fails and you need the shortest path from a symptom to a likely cause. It starts with the fastest checks, then moves into diagnostics, common failure modes, and the information worth saving before you escalate an issue.
This chapter is useful before a failure happens too. Use it while designing host-side error handling, evidence capture, and support workflows so the first real incident does not force you to invent a diagnostics contract under time pressure.
For PoC or customer-facing evaluations, save one small evidence bundle for every blocking issue: the failing input artifact, exact command or API sequence, returned status, and the minimum diagnostic bundle named in this chapter. That habit makes later reproduction and support much faster.
Start With the Symptom
Use this table before reading the detailed status-code reference.
| Symptom | First place to look | What to collect |
|---|---|---|
| Model create fails | create diagnostics and pack validation | failing API, status, pack_error_code, pack provenance |
| First step output looks wrong | prime-step order and Builder dimensions | tdse_model_info, step index, t, dt, first accepted primary vector |
| step_ir fails during a run | IR horizon and committed step count | ir_nsteps, committed step index, intended simulation horizon |
| destroy or shutdown hangs | same-handle ownership and wait policy | active worker thread, destroy wait budget, tdse_model_destroy_result_t |
| Adapter cannot parse or solve input | adapter diagnostics and CLI JSON output | command line, --json-out - output, netlist or RAW provenance |
| Performance is worse than expected | backend selection and profiler output | backend info, thread count, profiler recommendation JSON |
Use This Chapter In Two Passes
When you are under time pressure, do not read this chapter from top to bottom.
Use it in two passes:
- use the symptom table above, then jump to Minimum Diagnostic Bundle and First Five Minutes: Evidence Order
- return to the deeper diagnostics or reference tails only if the fast path did not explain the failure
The back half of this chapter is intentionally deeper reference material. It is there to support difficult incidents, not to slow down first-response troubleshooting.
First Response Checklist
When a failure first appears, this is the shortest reliable response path:
- identify the failing API or command
- capture the returned status or CLI error
- gather the minimum diagnostic bundle
- classify the problem as create, step-loop, concurrency, shutdown, shape, adapter, or performance
- use the symptom playbook below before reading the deeper reference tails
If that does not explain the failure, use the diagnostic interpretation sections that follow.
Pick The Right Failure Surface
Many incident reports take longer to resolve than they need to because the team starts by debugging the wrong layer. Use this split before going deeper:
| If the failure starts here | Treat it first as | Go next |
|---|---|---|
| tdse_model_create(...), step APIs, or destroy/close | Runtime lifecycle or step failure | Symptom Playbook |
| tdse_builder_* or pack validation | Builder or handoff failure | Builder and Data Contracts and this chapter's create/shape symptoms |
| tdse adapter circuit ... command or adapter API | Adapter input, solver, or conversion failure | Adapter Circuit and CLI Reference |
| profiler command or runtime-plan mismatch | performance qualification failure | Profiler and Backend and Performance |
As a rule of thumb, do not start with Runtime if the adapter never produced trustworthy output, and do not start with Builder if the first bad signal is already a live step-loop failure.
Diagnostic Interpretation
TDSE Core exposes structured diagnostics so you do not have to infer every failure from logs alone. The sections below show where to look first, how to read the snapshots, and how to turn them into the next troubleshooting step.
The Main Diagnostics Surfaces
When Builder or Runtime fails, these are usually the most useful places to look first:
- Builder and runtime status codes
- tdse_status_message(...) and tdse_builder_status_message(...)
- tdse_model_create_diagnostics_t
- tdse_model_info(...)
- tdse_model_state_info(...)
- tdse_model_last_error_info(...)
- exact step index, t, and dt at failure time
- destroy wait budget and tdse_model_destroy_result_t when shutdown is involved
For most Runtime failures, read them in this order:
- the failing API name and returned status code
- the create call's diagnostics struct, if create failed
- state snapshot
- sticky last-error snapshot
- logs and any saved diagnostics
What Each Surface Answers
Three snapshot APIs matter most:
| API | Use It For | Operational Question |
|---|---|---|
| tdse_model_info(...) | static metadata such as np, nq, nh, dt, IR presence, and ir_nsteps | what model did I actually load? |
| tdse_model_state_info(...) | current active-step and committed-step execution state | what state is the handle in right now? |
| tdse_model_last_error_info(...) | most recent non-OK runtime result on the handle | what was the last runtime failure worth remembering? |
Recommended baseline:
- capture tdse_model_info(...) once at model create
- capture tdse_model_state_info(...) and tdse_model_last_error_info(...) on every non-OK runtime path
- archive the failing API name, status, step index, t, and dt next to those snapshots
The key difference is time perspective:
- tdse_model_info(...) is static model metadata
- tdse_model_state_info(...) is the current state at query time
- tdse_model_last_error_info(...) is the failure-time state captured when the last non-OK runtime call happened
If the current state and the last-error snapshot disagree, that is usually informative rather than suspicious. It often means the host continued running after the failure or a different thread changed ownership.
Create Diagnostics
When tdse_model_create(...) fails, start with its diagnostics struct:
tdse_model_create_diagnostics_t diag = tdse_model_create_diagnostics_init();
tdse_status_t st = tdse_model_create(pack_data, pack_size, &diag, &model);
Check:
- st
- diag.status
- diag.pack_error_code
- tdse_pack_error_token(diag.pack_error_code)
This is the standard create failure path. Do not build new integrations around historical process-global create diagnostics.
Recommended support record for every create failure:
- pack identity and provenance
- pack validation result on the same bytes
- diag.status
- diag.pack_error_code
- tdse_pack_error_token(...)
- whether diagnostics came from tdse_model_create_diagnostics_init()
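As a minimal sketch, the support record above can be flattened into one log line. The struct and function names here are hypothetical host-side helpers, and the field types are assumptions. In a real integration the values would be copied from the diagnostics struct and from tdse_pack_error_token(...).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical host-side record. Field names and types are illustrative
 * assumptions, not the layout of tdse_model_create_diagnostics_t. */
typedef struct {
    const char *pack_id;     /* pack identity and provenance */
    int validate_ok;         /* pack validation result on the same bytes */
    int status;              /* copied from diag.status */
    int pack_error_code;     /* copied from diag.pack_error_code */
    const char *pack_token;  /* text obtained via tdse_pack_error_token(...) */
    int diag_initialized;    /* 1 if the init helper prepared the struct */
} create_failure_record;

/* Formats the record as one key=value log line.
 * Returns 0 on success, -1 if the line did not fit in the buffer. */
static int format_create_failure(const create_failure_record *r,
                                 char *out, size_t cap)
{
    assert(r != NULL && out != NULL && cap > 0);
    int n = snprintf(out, cap,
                     "api=tdse_model_create pack=%s validate_ok=%d status=%d "
                     "pack_error_code=%d pack_token=%s diag_init=%d",
                     r->pack_id, r->validate_ok, r->status,
                     r->pack_error_code, r->pack_token, r->diag_initialized);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Keeping the whole record on one line makes it easy to attach to a report without reassembling it from scattered log entries.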
Current-State Snapshot Semantics
tdse_model_state_info_t reports the current execution state only.
It does not tell you what failed earlier on the handle.
The highest-value fields are:
| Field | Meaning | Support Interpretation |
|---|---|---|
| step_active | a trial step is active right now | query-side state is inside a begun-but-not-yet-committed step |
| has_committed_step | at least one commit has succeeded | dr can become valid only after this is true |
| committed_steps | number of accepted commits | use as the committed-history index, not the number of attempted trials |
| committed_t / committed_dt | latest committed sample time and step size | anchor for the last accepted step |
| active_t / active_dt | active trial step coordinates | meaningful only while step_active != 0 |
| sim_time | accumulated committed simulation time | should advance only through successful commit |
| dr_last_valid | direct-response query is currently legal | 1 means post-commit dr is queryable with no active trial step |
Read the fields as a coherent state, not as isolated booleans. For example:
- step_active=1 and has_committed_step=0 means the handle is in its first trial before any accepted history exists
- step_active=0, has_committed_step=1, and dr_last_valid=1 means the handle has committed history and is in the normal post-commit idle state
- step_active=1 with dr_last_valid=0 is the expected shape during an ordinary trial step and explains why tdse_step_dr(...) would be rejected
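One way to enforce that habit is to fold the three flags into a single phase value before logging. The sketch below uses a stand-in struct that carries only those flags. The type names, the phase enum, and the int field types are assumptions for illustration, not part of the Runtime API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in holding only the three flags discussed above. The real
 * tdse_model_state_info_t has more fields; the int types are assumed. */
typedef struct {
    int step_active;
    int has_committed_step;
    int dr_last_valid;
} state_view;

typedef enum {
    PHASE_FRESH,             /* no trial open, nothing committed yet */
    PHASE_FIRST_TRIAL,       /* first trial before any accepted history */
    PHASE_TRIAL,             /* ordinary trial on top of committed history */
    PHASE_POST_COMMIT_IDLE,  /* committed history, dr queryable */
    PHASE_UNEXPECTED         /* combination not described in this chapter */
} handle_phase;

/* Maps a flag combination to one phase so logs carry a single label. */
static handle_phase classify_phase(const state_view *s)
{
    assert(s != NULL);
    if (s->step_active) {
        if (s->dr_last_valid)
            return PHASE_UNEXPECTED;
        return s->has_committed_step ? PHASE_TRIAL : PHASE_FIRST_TRIAL;
    }
    if (!s->has_committed_step)
        return s->dr_last_valid ? PHASE_UNEXPECTED : PHASE_FRESH;
    return s->dr_last_valid ? PHASE_POST_COMMIT_IDLE : PHASE_UNEXPECTED;
}
```

A PHASE_UNEXPECTED result is worth archiving as-is: it marks a combination this chapter does not explain, which is a better signal than silently picking the nearest known phase.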
Sticky Last-Error Snapshot Semantics
tdse_model_last_error_info_t is a per-handle sticky snapshot of the most recent non-OK runtime result.
Two rules matter in practice:
- successful runtime calls do not clear it
- tdse_model_reset() clears it
That means the last-error snapshot answers "what most recently failed on this handle?" not "what is failing right now?"
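The two rules are easy to misremember, so it can help to encode them in a toy model for host-side unit tests. The following is not TDSE code. It is a minimal sketch in which 0 stands for an OK status and any other integer for a non-OK result.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the stickiness rules only; this is not TDSE code.
 * Status 0 stands for OK, any other value for a non-OK result. */
typedef struct {
    int valid;   /* 1 once a failure has been recorded */
    int status;  /* most recent non-OK status */
} sticky_error;

/* Rule 1: only non-OK results overwrite the snapshot, so a later
 * success leaves the previous failure in place. */
static void sticky_record(sticky_error *e, int status)
{
    assert(e != NULL);
    if (status != 0) {
        e->valid = 1;
        e->status = status;
    }
}

/* Rule 2: an explicit reset clears the snapshot. */
static void sticky_reset(sticky_error *e)
{
    assert(e != NULL);
    e->valid = 0;
    e->status = 0;
}
```

A host that logs the snapshot after every call will therefore keep seeing the same failure until it resets the handle or a new failure replaces it.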
The highest-value fields are:
| Field | Meaning | Support Interpretation |
|---|---|---|
valid | a failure snapshot exists | 0 means the handle has no recorded runtime failure since create/reset |
status | most recent non-OK runtime status | classify the failure family first from this |
api_kind | runtime API that returned status | tells you whether failure came from begin, query, commit, close, or destroy |
step_active / has_committed_step | lifecycle snapshot at failure time | tells you whether failure happened pre-step, mid-trial, or post-commit |
committed_steps | accepted-history depth at failure time | shows whether this was startup, first-step, or late-run behavior |
committed_t / committed_dt | last accepted step at failure time | useful for post-commit incidents |
active_t / active_dt | trial step coordinates at failure time | useful for begin/query/commit incidents |
sim_time | committed simulation time at failure time | useful when the host kept running after the fault |
dr_last_valid | whether committed dr was queryable at failure time | often separates pre-commit misuse from a post-commit query |
api_kind is especially valuable because it collapses ambiguity fast.
Common values to watch are:
| api_kind | Meaning |
|---|---|
| TDSE_RUNTIME_API_STEP_BEGIN | failure while creating or re-entering a trial step |
| TDSE_RUNTIME_API_STEP_HR / TDSE_RUNTIME_API_STEP_IR | query failure during a trial |
| TDSE_RUNTIME_API_STEP_COMMIT | accepted-primary handoff failed |
| TDSE_RUNTIME_API_STEP_DR | direct-response queried in the wrong lifecycle state |
| TDSE_RUNTIME_API_MODEL_CLOSE | explicit close raced with in-flight same-handle work |
| TDSE_RUNTIME_API_MODEL_DESTROY | bounded destroy timed out or otherwise failed |
Pairing Current State With Last Error
The strongest diagnostic reading comes from comparing the current state snapshot with the sticky last-error snapshot.
Use this interpretation table:
| Current State | Last Error | Likely Story |
|---|---|---|
| idle, no active step | invalid-state from STEP_HR or STEP_IR with no committed step | caller queried step terms before begin |
| active trial at (t, dt) | last error from STEP_DR with step_active=1 | caller tried to read dr during a live trial |
| post-commit idle | last error still points to an older STEP_HR failure | host recovered or continued; sticky snapshot has not been reset |
| query APIs now return TDSE_ERR_INVALID_STATE | last error shows MODEL_DESTROY timeout or MODEL_CLOSE overlap | shutdown or ownership transition has begun |
| committed step count unchanged across retries | repeated query/solve failures without commit | host is retrying trials without advancing committed history |
The practical support question is:
- is the host looking at the present handle state, or at the last place the handle failed?
Issue reports often go sideways when those two are confused.
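To keep the two perspectives apart in host tooling, the pairing can be written as a small classifier over both snapshots. This is a sketch under stated assumptions: the EX_ enums are local stand-ins with placeholder values, the structs carry only the fields needed here, and the rules cover only some rows of the interpretation table.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for a few documented api_kind and status values.
 * The numeric values are placeholders, not the real header values. */
typedef enum { EX_API_STEP_HR, EX_API_STEP_IR, EX_API_STEP_DR,
               EX_API_MODEL_CLOSE, EX_API_MODEL_DESTROY } ex_api_kind;
typedef enum { EX_OK, EX_ERR_INVALID_STATE, EX_ERR_CONCURRENT_API_USE,
               EX_ERR_TIMEOUT } ex_status;

typedef struct { int step_active; int has_committed_step; } ex_state;
typedef struct {
    int valid;
    ex_status status;
    ex_api_kind api_kind;
    int step_active;         /* lifecycle snapshot at failure time */
    int has_committed_step;
} ex_last_error;

typedef enum {
    STORY_NO_RECORDED_FAILURE,
    STORY_QUERY_BEFORE_BEGIN,
    STORY_DR_DURING_TRIAL,
    STORY_OLDER_FAILURE_HOST_CONTINUED,
    STORY_SHUTDOWN_OR_OWNERSHIP,
    STORY_UNCLASSIFIED
} pair_story;

/* Combines "where the handle is now" with "where it last failed". */
static pair_story read_pair(const ex_state *now, const ex_last_error *last)
{
    assert(now != NULL && last != NULL);
    if (!last->valid)
        return STORY_NO_RECORDED_FAILURE;
    if ((last->api_kind == EX_API_MODEL_DESTROY && last->status == EX_ERR_TIMEOUT) ||
        (last->api_kind == EX_API_MODEL_CLOSE &&
         last->status == EX_ERR_CONCURRENT_API_USE))
        return STORY_SHUTDOWN_OR_OWNERSHIP;
    if (last->api_kind == EX_API_STEP_DR && last->step_active)
        return STORY_DR_DURING_TRIAL;
    if (last->api_kind == EX_API_STEP_HR || last->api_kind == EX_API_STEP_IR) {
        /* Post-commit idle now: the recorded failure is older history. */
        if (!now->step_active && now->has_committed_step)
            return STORY_OLDER_FAILURE_HOST_CONTINUED;
        if (last->status == EX_ERR_INVALID_STATE &&
            !last->step_active && !last->has_committed_step)
            return STORY_QUERY_BEFORE_BEGIN;
    }
    return STORY_UNCLASSIFIED;
}
```

The classifier takes both snapshots as separate arguments on purpose, so a caller cannot pass the last-error record where the current state is expected.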
High-Value Status Codes
The most important Runtime statuses for most integrations are:
| Status | Typical Meaning | First Action |
|---|---|---|
| TDSE_ERR_INVALID_ARG | shape, pointer, or struct-size problem | validate caller inputs, buffer dimensions, and struct_size tags |
| TDSE_ERR_INVALID_STATE | API is valid in general but wrong for the current lifecycle state | inspect call order, shutdown ownership, and whether a trial is active |
| TDSE_ERR_CONCURRENT_API_USE | same handle was entered concurrently | inspect thread ownership and same-handle overlap |
| TDSE_ERR_IR_STEP_OUT_OF_RANGE | current step falls outside IR support | inspect ir_nsteps, model dt, and host horizon |
| TDSE_ERR_TIMEOUT | bounded destroy exceeded its wait budget | inspect in-flight same-handle work and shutdown policy |
Status Families
It helps to group statuses by the question they answer:
| Family | Representative Status | Meaning |
|---|---|---|
| caller contract | TDSE_ERR_INVALID_ARG | the caller provided illegal shape, pointer, or struct input |
| lifecycle contract | TDSE_ERR_INVALID_STATE | the API is valid in general but wrong for the current handle state |
| ownership / concurrency | TDSE_ERR_CONCURRENT_API_USE | the same handle was entered concurrently |
| model horizon | TDSE_ERR_IR_STEP_OUT_OF_RANGE | runtime time advanced beyond packaged IR support |
| shutdown policy | TDSE_ERR_TIMEOUT | bounded destroy did not complete within budget |
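Host-side error handling can route on the family rather than on each individual status. The sketch below assumes a local stand-in enum (the EX_ names and their values are placeholders) and pairs each family with the first action listed for its representative status. The fallback text for unknown statuses is an illustrative choice, not documented behavior.

```c
#include <assert.h>

/* Local stand-ins; real status values come from the TDSE headers. */
typedef enum {
    EX_OK, EX_ERR_INVALID_ARG, EX_ERR_INVALID_STATE,
    EX_ERR_CONCURRENT_API_USE, EX_ERR_IR_STEP_OUT_OF_RANGE, EX_ERR_TIMEOUT
} ex_status;

typedef enum {
    FAMILY_NONE, FAMILY_CALLER_CONTRACT, FAMILY_LIFECYCLE_CONTRACT,
    FAMILY_OWNERSHIP, FAMILY_MODEL_HORIZON, FAMILY_SHUTDOWN_POLICY,
    FAMILY_UNKNOWN, FAMILY_COUNT
} status_family;

/* Groups a status by the question it answers. */
static status_family family_of(ex_status st)
{
    switch (st) {
    case EX_OK:                       return FAMILY_NONE;
    case EX_ERR_INVALID_ARG:          return FAMILY_CALLER_CONTRACT;
    case EX_ERR_INVALID_STATE:        return FAMILY_LIFECYCLE_CONTRACT;
    case EX_ERR_CONCURRENT_API_USE:   return FAMILY_OWNERSHIP;
    case EX_ERR_IR_STEP_OUT_OF_RANGE: return FAMILY_MODEL_HORIZON;
    case EX_ERR_TIMEOUT:              return FAMILY_SHUTDOWN_POLICY;
    }
    return FAMILY_UNKNOWN;  /* status not covered by this table */
}

/* Returns the first action to take for a family. */
static const char *family_first_action(status_family f)
{
    static const char *const actions[FAMILY_COUNT] = {
        "nothing to inspect",
        "validate caller inputs, buffer dimensions, and struct_size tags",
        "inspect call order, shutdown ownership, and whether a trial is active",
        "inspect thread ownership and same-handle overlap",
        "inspect ir_nsteps, model dt, and host horizon",
        "inspect in-flight same-handle work and shutdown policy",
        "collect the minimum diagnostic bundle before deciding"
    };
    assert((int)f >= 0 && (int)f < FAMILY_COUNT);
    return actions[f];
}
```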
Worked Snapshot Interpretations
Worked Example: Query Before begin
Observed behavior:
- failing API: tdse_step_hr(...)
- returned status: TDSE_ERR_INVALID_STATE
- current state: step_active=0, has_committed_step=0, committed_steps=0
- last error: valid=1, status=TDSE_ERR_INVALID_STATE, api_kind=TDSE_RUNTIME_API_STEP_HR
Interpretation:
- this is not a numerical problem
- the host never established a trial context before querying step terms
- because both current state and last-error state show no active or committed step, the incident is pure lifecycle misuse
Corrective action:
- ensure tdse_step_begin(model, t, dt) succeeds before tdse_step_hr(...), tdse_step_ir(...), or tdse_step_op(...)
- keep the API name with the status in logs so the misuse is visible immediately
Worked Example: dr Called During A Trial
Observed behavior:
- failing API: tdse_step_dr(...)
- returned status: TDSE_ERR_INVALID_STATE
- current state: step_active=1, has_committed_step=1, dr_last_valid=0
- last error: api_kind=TDSE_RUNTIME_API_STEP_DR, step_active=1
Interpretation:
- committed history exists, but the handle is currently inside a trial step
- dr is a post-commit query, not a trial-side query
- the failure is about timing inside the step lifecycle, not about missing history
Corrective action:
- move dr after commit
- do not query dr while step_active != 0
Worked Example: IR Horizon Miss Mid-Run
Observed behavior:
- failing API: tdse_step_ir(...)
- returned status: TDSE_ERR_IR_STEP_OUT_OF_RANGE
- model info: has_ir=1, ir_nsteps=<finite value>, dt=<model dt>
- state snapshot: committed_steps close to or beyond the supported IR horizon
- last error: api_kind=TDSE_RUNTIME_API_STEP_IR, active_t and active_dt identify the failing trial
Interpretation:
- Runtime is telling you the current trial step lies outside packaged IR support
- the most common causes are host horizon growth or Builder/Runtime dt mismatch
- because api_kind is STEP_IR, do not waste time on operator or history math first
Corrective action:
- compare host step index against ir_nsteps
- confirm Builder and Runtime dt match
- regenerate the pack if the simulation window legitimately grew
Worked Example: Close Overlap During Active Work
Observed behavior:
- failing API: tdse_model_close(...)
- returned status: TDSE_ERR_CONCURRENT_API_USE
- last error: valid=1, status=TDSE_ERR_CONCURRENT_API_USE, api_kind=TDSE_RUNTIME_API_MODEL_CLOSE
- current state may still show an active trial or may already be post-commit by the time you inspect it
Interpretation:
- close was attempted while another same-handle runtime API was in flight
- this is an ownership bug or a shutdown sequencing bug, not a random close failure
- if the current state later looks idle, that only means the conflicting call finished after the incident
Corrective action:
- identify which thread owns close/destroy initiation
- stop treating close as a background-safe cleanup signal
- move ordinary host shutdown to tdse_model_destroy(...) with an explicit wait policy
Worked Example: Destroy Timeout And Ownership Handoff
Observed behavior:
- tdse_model_destroy(...) returns TDSE_ERR_TIMEOUT
- tdse_model_destroy_result_t.timed_out == 1
- last error: status=TDSE_ERR_TIMEOUT, api_kind=TDSE_RUNTIME_API_MODEL_DESTROY
- later diagnostics may still succeed because the handle remains live after timeout
Interpretation:
- destroy did not acquire terminal ownership within the caller's wait budget
- the timeout does not mean the handle was destroyed
- the host must choose a next action explicitly: retry destroy, defer to finalizer cleanup, or report the issue
Support rule:
- record both the configured wait budget and the observed wait_ms
- if a follower thread later receives TDSE_ERR_INVALID_STATE from destroy or release, interpret that as ownership no longer being local to that thread
Typical Failure Patterns
Pack Rejected At Create
First check:
- tdse_pack_validate(...)
- tdse_model_create_diagnostics_t
- the pack token from tdse_pack_error_token(...)
Typical cause:
- malformed pack bytes
- incompatible pack structure
- mismatch between expected and supplied payload
Evidence to collect:
- exact pack token
- output of pack validation
- artifact provenance or source route
Invalid Lifecycle State
First check:
- whether begin was called before step queries
- whether dr was queried before the first commit
- whether a query was attempted after close or destroy began
- whether the host is reading a stale sticky last-error snapshot as if it were current state
Same-Handle Concurrent Use
First check:
- whether one handle is shared across threads
- whether the host wrapper allows overlapping entry on the same model handle
- whether shutdown paths can race with step APIs
Destroy Timeout
First check:
- the destroy wait budget
- the observed destroy wait result
- whether another same-handle API was still in flight
- whether the host expected destroy to behave like release
Symptom-To-Surface Table
| Symptom | First Surface To Read | Second Surface | Typical Class |
|---|---|---|---|
| create fails | tdse_model_create_diagnostics_t | pack token | pack or compatibility issue |
| runtime query fails | failing API + returned status | tdse_model_state_info(...) | lifecycle or concurrency |
| destroy times out | destroy result | tdse_model_last_error_info(...) | shutdown overlap |
| direct response invalid | tdse_model_state_info(...) | tdse_model_last_error_info(...) | misuse of dr |
| wrong dimensions suspected | tdse_model_info(...) | host allocation site | shape mismatch |
| logs look stale or contradictory | compare state snapshot with last-error snapshot | failing API timeline | sticky snapshot confusion |
Diagnostics By Lifecycle Phase
| Phase | Highest-Value Diagnostics |
|---|---|
| create | create diagnostics, pack token, pack validation |
| steady execution | model info, state info, last-error snapshot, failing API name, t, dt, step index |
| shutdown | destroy result, wait budget, state snapshot, last-error snapshot, ownership interpretation |
Production Archive Baseline
For production integrations, archive at least:
- tdse_model_create_diagnostics_t for non-OK create paths
- tdse_model_info(...)
- tdse_model_state_info(...)
- tdse_model_last_error_info(...)
- the failing API name
- the exact t, dt, and step index
If perf controls are in use, also archive the active execution-plan context.
And when shutdown behavior matters, also archive:
- destroy wait budget
- destroy wait result
- whether close, destroy, or release was used
- which thread or component initiated shutdown
Minimum Diagnostic Bundle
A minimally useful diagnostic bundle for runtime incidents contains:
- pack identity and validation result
- create diagnostics
- model info
- current state snapshot
- sticky last-error snapshot
- failing API name and returned status
- t, dt, and step index if execution is involved
- destroy wait budget and result if shutdown is involved
If any one of those is missing, it becomes much harder to reconstruct facts that the host already knew at failure time.
Troubleshooting Workflow
Use this sequence when triaging a new Runtime issue:
- identify the failing API
- capture the returned status
- if create failed, read the create diagnostics first
- if runtime execution failed, read current state and sticky last-error snapshots together
- verify host dimension assumptions against tdse_model_info(...)
- check whether the issue is lifecycle, concurrency, timing, shutdown, or pack-related
Diagnostic Anti-Patterns
Avoid:
- logging only the status text without the API name
- dropping create diagnostics on failed create paths
- triaging shape issues without tdse_model_info(...)
- reading only logs when snapshot APIs already explain the state
- assuming the last-error snapshot clears itself after later success
- treating release behavior as proof of destroy-policy correctness
- reporting a destroy timeout without recording wait_timeout_ms and observed wait_ms
When To Report
Report an issue quickly when:
- the same qualified pack regresses on an unchanged supported host
- create diagnostics indicate a pack-compatibility surprise on a previously qualified route
- threading stress or destroy-timeout behavior changes across the same release line
- state and last-error snapshots imply a runtime contract break rather than host misuse
Keep triaging locally first when:
- the symptom is clearly a shape mismatch
- the host thread-ownership model is still ambiguous
- the support bundle does not yet contain both runtime snapshots
Validation And Testing
Recommended user-facing validation order:
- validate a representative pack
- create a runtime model successfully
- run one minimal step loop end to end
- confirm op, hr, and ir dimensions
- intentionally exercise one known failure path
- confirm last-error stickiness and reset behavior in one local repro
For production confidence, keep these checks green:
- threading stress
- IR contract replay
- Builder IR ingest contract
- runtime direct-response contract
- lifecycle diagnostics around reset, close, and destroy
Symptom Playbook
Users usually start from a symptom, not from an API name. This section turns the runtime rules into a field guide you can use under pressure.
How To Use This Section
For each problem:
- match the visible symptom
- collect the named evidence
- classify the issue
- apply the first corrective action
- return to the Diagnostic Interpretation section above only if needed
First Five Minutes: Evidence Order
Under support pressure, collect evidence in this order before changing the host code:
- failing API name and returned tdse_status_t
- tdse_model_state_info(...)
- tdse_model_last_error_info(...)
- create diagnostics or destroy result, depending on phase
- exact t, dt, and step index from the host log
This order matters because it separates "what just failed" from "what state the handle was in" before logs or assumptions start to drift.
Minimal Incident Snapshot
Use this capture template on every non-OK path:
tdse_model_state_info_t state = tdse_model_state_info_init();
tdse_model_last_error_info_t last = tdse_model_last_error_info_init();
const tdse_status_t state_st = tdse_model_state_info(model, &state);
const tdse_status_t last_st = tdse_model_last_error_info(model, &last);
Archive together:
- failing API name
- returned status from the failing API
- state_st and last_st
- state.step_active, state.has_committed_step, state.committed_steps
- last.valid, last.status, last.api_kind
- host-side t, dt, and step index
If shutdown is involved, also archive tdse_model_destroy_result_t.
How To Read Runtime Snapshots
The fastest support skill is reading state_info and last_error_info as one combined record.
State Snapshot Interpretation
| Snapshot Field Pattern | What It Usually Means | First Question |
|---|---|---|
| step_active = 1 | a trial step is still open | which thread still owns the active trial? |
| has_committed_step = 0 | no commit has succeeded yet | did the host skip the prime step or first ordinary commit? |
| dr_last_valid = 0 with has_committed_step = 1 | committed history exists, but direct response is not currently queryable | did the host start a new trial before querying dr? |
| committed_steps stops increasing while host step index advances | host is retrying or rejecting trials without commit | is that intentional or is commit being skipped accidentally? |
Last-Error Snapshot Interpretation
| Snapshot Field Pattern | What It Usually Means | First Action |
|---|---|---|
last.valid = 0 | no prior non-OK runtime API has been captured on this handle | rely on current failing status and state snapshot |
| last.api_kind = TDSE_RUNTIME_API_STEP_IR with last.status = TDSE_ERR_IR_STEP_OUT_OF_RANGE | runtime time exceeded packaged IR window | compare host horizon with ir_nsteps and dt |
| last.api_kind = TDSE_RUNTIME_API_MODEL_DESTROY with last.status = TDSE_ERR_TIMEOUT | bounded destroy timed out before ownership was acquired | inspect in-flight same-handle work and retry policy |
| last.api_kind = TDSE_RUNTIME_API_MODEL_CLOSE with last.status = TDSE_ERR_CONCURRENT_API_USE | close raced with active same-handle traffic | stop treating close as a normal shutdown path |
Important reading rule:
- tdse_model_last_error_info(...) is sticky across later successful calls until reset
That means the snapshot is valuable evidence, but not necessarily proof that the most recent call failed for the same reason. Always pair it with the current failing API and current state snapshot.
Triage Matrix
| If You See | Check First | Likely Class | Immediate Containment |
|---|---|---|---|
| create path returns non-OK | create diagnostics + pack token | pack contract | stop runtime triage and validate the pack bytes |
| step API returns TDSE_ERR_INVALID_STATE | state.step_active and lifecycle order | sequencing or shutdown overlap | stop issuing more same-handle step calls |
| shutdown returns TDSE_ERR_TIMEOUT | destroy result + state.step_active | shutdown overlap | keep the handle live and decide retry vs report |
| sporadic TDSE_ERR_CONCURRENT_API_USE | thread ownership map + last-error api kind | same-handle overlap | isolate one handle per thread |
| dr fails after the host starts next step | state.step_active | misuse of post-commit query window | move dr earlier or stop depending on it there |
Symptom: Model Create Fails
Visible signs:
- tdse_model_create(...) returns non-OK
- no usable runtime handle is produced
Collect first:
- tdse_model_create_diagnostics_t
- tdse_pack_error_token(...)
- pack validation result
Most likely causes:
- malformed pack bytes
- incompatible pack version or structure
- caller forgot to initialize the diagnostics struct
First corrective actions:
- rerun pack validation on the same bytes
- log the stable pack token
- confirm tdse_model_create_diagnostics_init() was used
Support note:
- if tdse_model_create(...) fails, do not spend time on step-loop or shutdown analysis
- this is a pack or caller-contract problem until proven otherwise
Symptom: Step Loop Runs, Then step_ir Fails
Visible signs:
- early steps succeed
- later tdse_step_ir(...) returns TDSE_ERR_IR_STEP_OUT_OF_RANGE
Collect first:
- current step index
- t and dt
- ir_nsteps from tdse_model_info(...)
Most likely causes:
- host simulation horizon exceeds packaged IR support
- Builder and Runtime assumptions about dt do not match
First corrective actions:
- verify the intended simulation horizon
- compare Builder dt with Runtime model dt
- regenerate the pack if the IR support window is too short
Worked interpretation:
- if last.api_kind is TDSE_RUNTIME_API_STEP_IR
- and state.committed_steps is close to or beyond ir_nsteps
- then the integration has a horizon mismatch, not a random runtime failure
Support-facing rule:
- record the first failing step index and the total intended horizon
- that pair is usually enough to distinguish bad pack content from a bad host loop bound
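Both checks can run on the host before the step loop starts. The sketch below is an assumption-laden illustration: it treats step indices 0 through ir_nsteps - 1 as supported, which should be confirmed against the Runtime contract, and the helper names and the relative tolerance parameter are hypothetical.

```c
#include <assert.h>

/* Returns how many intended steps fall outside the IR window, 0 if none.
 * Assumes step indices 0 .. ir_nsteps-1 are supported; the exact boundary
 * is an assumption to confirm against the Runtime contract. */
static long long steps_beyond_ir(long long intended_steps, long long ir_nsteps)
{
    assert(intended_steps >= 0 && ir_nsteps >= 0);
    return intended_steps > ir_nsteps ? intended_steps - ir_nsteps : 0;
}

/* Returns 1 if the Builder dt and the Runtime model dt agree within a
 * relative tolerance chosen by the host, 0 otherwise. */
static int dt_matches(double builder_dt, double runtime_dt, double rel_tol)
{
    assert(builder_dt > 0.0 && runtime_dt > 0.0 && rel_tol >= 0.0);
    double diff = builder_dt - runtime_dt;
    if (diff < 0.0)
        diff = -diff;
    double scale = builder_dt > runtime_dt ? builder_dt : runtime_dt;
    return diff <= rel_tol * scale;
}
```

For example, an intended horizon of 5000 steps against ir_nsteps = 4096 leaves 904 steps uncovered, which points at the host loop bound or the pack's IR window rather than at step math.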
Symptom: dr Is Rejected
Visible signs:
- tdse_step_dr(...) returns TDSE_ERR_INVALID_STATE
Collect first:
- whether a trial step is still active
- whether at least one commit already happened
Most likely causes:
- dr called before the first commit
- dr called during an active trial step
First corrective actions:
- move dr after commit
- ensure the trial step has ended before the dr query
Fast discriminator:
- state.step_active = 1 means the host is still inside the trial window
- state.has_committed_step = 0 means no successful commit exists yet, so dr cannot succeed
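The discriminator is small enough to run as a host-side guard before each dr query. This is a minimal sketch with a stand-in struct. The names dr_state_view, dr_gate, and dr_gate_check are hypothetical, and reporting the missing commit first when both conditions hold is a design choice, not documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the two fields the discriminator reads; int types assumed. */
typedef struct {
    int step_active;
    int has_committed_step;
} dr_state_view;

typedef enum {
    DR_OK_TO_QUERY,         /* committed history exists, no trial open */
    DR_NO_COMMIT_YET,       /* dr cannot succeed before the first commit */
    DR_TRIAL_STILL_ACTIVE   /* host is still inside the trial window */
} dr_gate;

/* Explains why a dr query would be rejected, or that it may proceed.
 * A missing commit is reported first because ending the trial alone
 * would not make dr valid in that case. */
static dr_gate dr_gate_check(const dr_state_view *s)
{
    assert(s != NULL);
    if (!s->has_committed_step)
        return DR_NO_COMMIT_YET;
    if (s->step_active)
        return DR_TRIAL_STILL_ACTIVE;
    return DR_OK_TO_QUERY;
}
```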
Symptom: Runtime Calls Fail Sporadically In Multi-Thread Runs
Visible signs:
- intermittent TDSE_ERR_CONCURRENT_API_USE
- non-OK statuses appear only under load
Collect first:
- thread ownership model
- handle identity per worker
- failing API name
Most likely causes:
- one live handle is shared across worker threads
- supervisory code overlaps same-handle entry
First corrective actions:
- enforce one-handle-per-thread
- document which thread owns create, step, and shutdown
- rerun threading stress after redesign
Tell-tale evidence pattern:
- the same handle identifier appears in logs from multiple workers
- last.status = TDSE_ERR_CONCURRENT_API_USE
- the failing API changes between incidents
That pattern usually means the ownership model is wrong, not that one specific API is buggy.
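While the ownership model is being redesigned, a host wrapper can surface overlapping entry on its own side. The sketch below uses a C11 atomic_flag as a try-enter guard around one handle. The guarded_handle type and its functions are hypothetical, the handle is an opaque placeholder pointer, and the guard is a debugging aid under these assumptions rather than a way to make handle sharing safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical host-side wrapper; one instance per model handle. */
typedef struct {
    atomic_flag busy;  /* set while a same-handle call is in flight */
    void *handle;      /* opaque runtime handle owned by the wrapper */
} guarded_handle;

static void guarded_init(guarded_handle *g, void *handle)
{
    assert(g != NULL);
    atomic_flag_clear(&g->busy);
    g->handle = handle;
}

/* Returns 1 if the caller now owns entry, 0 if another call is in flight.
 * On 0 the caller should log its thread identity and skip the API call. */
static int guarded_try_enter(guarded_handle *g)
{
    assert(g != NULL);
    return !atomic_flag_test_and_set_explicit(&g->busy, memory_order_acquire);
}

/* Releases entry; call exactly once after a successful try_enter. */
static void guarded_leave(guarded_handle *g)
{
    assert(g != NULL);
    atomic_flag_clear_explicit(&g->busy, memory_order_release);
}
```

Wrapping every runtime call between guarded_try_enter and guarded_leave turns a sporadic runtime status into a deterministic host-side log entry that names the second caller.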
Symptom: Destroy Times Out
Visible signs:
- tdse_model_destroy(...) returns TDSE_ERR_TIMEOUT
- out_result->timed_out is true
Collect first:
- destroy wait budget
- destroy wait result
- whether another same-handle API was still active
Most likely causes:
- business logic started destroy while a step API was still running
- host assumed destroy behaved like silent release
First corrective actions:
- inspect in-flight same-handle activity
- decide whether to retry, report, or defer to finalizer cleanup
- document the intended shutdown policy instead of relying on defaults
Destroy interpretation matrix:
| Result | Meaning | What The Host Should Do Next |
|---|---|---|
| TDSE_OK | shutdown completed | clear references and continue |
| TDSE_ERR_TIMEOUT + timed_out = 1 | destroy did not acquire ownership within budget; handle remains live | retry later or report with evidence |
| TDSE_ERR_INVALID_STATE | another thread already started terminal cleanup | treat ownership as no longer local |
Operational rule:
- never reinterpret TDSE_ERR_TIMEOUT as "probably destroyed anyway"
- timeout is specifically the case where the host must assume the handle still exists
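The destroy interpretation matrix translates directly into a decision function, which keeps the shutdown policy explicit in host code. This is a sketch under stated assumptions: the EX_ status enum and ex_destroy_result struct are local stand-ins with assumed types, and treating an OK status with timed_out set as a reportable contradiction is an illustrative choice.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; real values come from the TDSE headers. */
typedef enum { EX_OK, EX_ERR_INVALID_STATE, EX_ERR_TIMEOUT, EX_ERR_OTHER } ex_status;

/* timed_out and wait_ms are named in this chapter; types are assumed. */
typedef struct { int timed_out; unsigned wait_ms; } ex_destroy_result;

typedef enum {
    NEXT_CLEAR_REFERENCES,     /* shutdown completed */
    NEXT_HANDLE_STILL_LIVE,    /* retry later or report with evidence */
    NEXT_OWNERSHIP_ELSEWHERE,  /* drop local references, do not retry */
    NEXT_REPORT                /* unexpected or contradictory outcome */
} next_action;

/* Chooses the host's next move after a destroy attempt. */
static next_action after_destroy(ex_status st, const ex_destroy_result *r)
{
    assert(r != NULL);
    switch (st) {
    case EX_OK:
        /* OK together with timed_out set would contradict the contract. */
        return r->timed_out ? NEXT_REPORT : NEXT_CLEAR_REFERENCES;
    case EX_ERR_TIMEOUT:
        /* A timeout always means the handle must be assumed to exist. */
        return NEXT_HANDLE_STILL_LIVE;
    case EX_ERR_INVALID_STATE:
        return NEXT_OWNERSHIP_ELSEWHERE;
    default:
        return NEXT_REPORT;
    }
}
```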
Symptom: Shapes Look Wrong
Visible signs:
- host buffer sizes do not match what Runtime appears to expect
- operator shape assumptions do not line up with outputs
Collect first:
- tdse_model_info(...)
- chosen operator view policy
- host allocation site
Most likely causes:
- confusion between np and nq
- mixing square and rectangular operator views across call sites
First corrective actions:
- read np, nq, and nh from tdse_model_info(...)
- pick one explicit operator-view policy
- update host allocations and asserts accordingly
Support note:
- shape problems are usually integration drift between allocation sites and the runtime contract
- they are rarely fixed by changing numerical tolerances or retry policy
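A cheap way to catch that drift is to compare the dimensions each allocation site assumed against the values reported at create time. The sketch below is illustrative only: ex_dims is a stand-in that carries np, nq, and nh as plain ints (the real field types are not stated here), and the mask constants are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the dimensions read from tdse_model_info(...). */
typedef struct { int np; int nq; int nh; } ex_dims;

enum { DIM_NP = 1, DIM_NQ = 2, DIM_NH = 4 };

/* Returns a bit mask of the dimensions on which the host's assumption
 * differs from the model; 0 means the allocation site agrees. */
static unsigned dims_mismatch(const ex_dims *host, const ex_dims *model)
{
    assert(host != NULL && model != NULL);
    unsigned mask = 0;
    if (host->np != model->np) mask |= DIM_NP;
    if (host->nq != model->nq) mask |= DIM_NQ;
    if (host->nh != model->nh) mask |= DIM_NH;
    return mask;
}
```

A result with both DIM_NP and DIM_NQ set, where the two values are simply swapped, is the typical signature of np and nq being confused at one call site.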
Symptom: Results Differ At The First Ordinary Step
Visible signs:
- later steps look plausible, but the first ordinary step does not
Collect first:
- whether the implementation primes at n = -1
- whether history was committed before the ordinary loop
Most likely causes:
- skipped or incorrect prime-step sequence
- misunderstanding of committed-history initialization
First corrective actions:
- compare implementation with the documented prime-step pattern
- commit the intended pre-step history explicitly
Useful evidence:
- state.has_committed_step = 0 at the start of the first ordinary step is a strong sign that the prime-step contract was skipped or never committed
Diagnostic Bundle for Reports
Before reporting an issue, collect:
- pack validation result
- create diagnostics
- model info
- state info
- last error info
- failing API name
- exact t, dt, and step index
- wait budget and wait result if shutdown is involved
Preferred incident bundle format:
api=tdse_step_ir
status=TDSE_ERR_IR_STEP_OUT_OF_RANGE
host_step=4096
t=4.096e-6
dt=1.000e-9
state.step_active=1
state.has_committed_step=1
state.committed_steps=4096
last.valid=1
last.status=TDSE_ERR_IR_STEP_OUT_OF_RANGE
last.api_kind=TDSE_RUNTIME_API_STEP_IR
This is the level of detail that lets support answer quickly without guessing.
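A host can produce this bundle mechanically rather than by hand. The following is a minimal sketch, assuming the host has already copied the relevant values out of the SDK snapshots. `host_incident_t` and `host_format_incident` are hypothetical names, and the field types are guesses chosen only to match the example values above.

```c
#include <stdio.h>

/* Hypothetical host-side record; field names follow the bundle keys. */
typedef struct {
    const char *api;
    const char *status;
    unsigned long host_step;
    double t;
    double dt;
    int step_active;
    int has_committed_step;
    unsigned long committed_steps;
    int last_valid;
    const char *last_status;
    const char *last_api_kind;
} host_incident_t;

/* Writes the bundle as key=value lines into buf.
   Returns the number of characters written, or -1 if buf is too small
   or formatting fails. */
int host_format_incident(const host_incident_t *in, char *buf, size_t cap)
{
    int n = snprintf(buf, cap,
                     "api=%s\n"
                     "status=%s\n"
                     "host_step=%lu\n"
                     "t=%.3e\n"
                     "dt=%.3e\n"
                     "state.step_active=%d\n"
                     "state.has_committed_step=%d\n"
                     "state.committed_steps=%lu\n"
                     "last.valid=%d\n"
                     "last.status=%s\n"
                     "last.api_kind=%s\n",
                     in->api, in->status, in->host_step, in->t, in->dt,
                     in->step_active, in->has_committed_step,
                     in->committed_steps, in->last_valid,
                     in->last_status, in->last_api_kind);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    return n;
}
```

Note that printf-style `%e` prints at least two exponent digits, so t appears as 4.096e-06 rather than 4.096e-6. Treating truncation as an error is a deliberate choice here: a partial bundle is easy to mistake for a complete one.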
Anti-Patterns During Triage
Avoid these moves even when the pressure is high:
- retrying the same failing API without capturing state_info first
- logging only status text without the API name
- treating a sticky last_error_info snapshot as proof of the current failure cause
- assuming destroy timeout means storage was already released
- reporting concurrency failures before proving one-handle-per-thread ownership
Worked Diagnostic Scenarios
Scenario A. Step Failure That Is Really A Horizon Mismatch
Observed:
- host reports TDSE_ERR_IR_STEP_OUT_OF_RANGE
- last.api_kind = TDSE_RUNTIME_API_STEP_IR
- state.committed_steps is near the configured ir_nsteps
Conclusion:
- the pack and runtime are behaving consistently
- the host loop exceeded the packaged IR support window
Best next action:
- shorten the run horizon or rebuild the pack with longer IR support
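A pre-flight check on the host side can catch this mismatch before the run starts. The sketch below is an illustration under one stated assumption: valid step indices run from 0 to ir_nsteps - 1, so a run of at most ir_nsteps steps fits the window. Confirm the exact boundary against the IR contract for your pack. Both function names are hypothetical.

```c
#include <stddef.h>

/* Nonzero when the planned run would exceed the packaged IR horizon.
   Assumption: a run of exactly ir_nsteps steps still fits. */
int host_exceeds_horizon(size_t planned_steps, size_t ir_nsteps)
{
    return planned_steps > ir_nsteps;
}

/* Number of planned steps that fit inside the IR support window. */
size_t host_clamp_steps(size_t planned_steps, size_t ir_nsteps)
{
    return planned_steps < ir_nsteps ? planned_steps : ir_nsteps;
}
```

Whether to clamp silently or fail the run is a host policy decision; logging both the planned and the clamped step count keeps that decision visible in the incident record.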
Scenario B. Shutdown Failure That Is Really Ownership Handoff
Observed:
- thread A starts destroy
- thread B later receives TDSE_ERR_INVALID_STATE from destroy or release
Conclusion:
- this is not a local retry problem
- terminal cleanup ownership already moved to another thread
Best next action:
- stop same-handle use on thread B and clean up local references only
Scenario C. Invalid dr Call That Is Really Step-Order Drift
Observed:
- tdse_step_dr(...) returns TDSE_ERR_INVALID_STATE
- state.step_active = 1
Conclusion:
- the host started a new trial before reading committed direct response
Best next action:
- move dr before the next begin, or stop depending on dr in that path
Report Versus Fix Locally
Usually fix locally first when:
- the issue is a visible shape mismatch
- the integration does not yet enforce one-handle-per-thread
- create diagnostics clearly indicate malformed input
Escalate sooner when:
- a previously qualified pack stops creating on a supported host
- the same qualified shutdown path starts timing out on unchanged inputs
- runtime snapshots contradict the expected lifecycle contract
Everything below this point is deep reference rather than first-response guidance. Use it when you already understand the symptom and only need stable reference material.
Deep Reference: Known Limitations — Core
The sections below describe what is intentionally out of scope for the current release. Read them as product boundary statements, not as an apology list.
Workflow Boundary Limits
Current TDSE Core boundaries are:
- TDSE Core does not parse or solve domain models directly
- TDSE Core expects validated H and optional IR, or an already built pack
- adapter layers remain responsible for domain-specific preprocessing and validation
Runtime Execution Limits
Current Runtime limits are:
- same-handle concurrent step entry is not a supported contract
- execution-affecting runtime controls are mutable only before the first successful tdse_step_begin(...)
- Runtime does not infer port names, engineering units, or sign conventions
- bounded destroy is explicit; Runtime does not invent host shutdown policy on behalf of the caller
These are deliberate simplifications that keep the runtime contract deterministic and reviewable.
Builder Limits
Current Builder limits are:
- Builder validates dimensions and layouts but does not replace adapter-side semantic validation
- spectrum-to-H conversion is limited to documented correction methods and contracts
- pack metadata is descriptive and does not alter Runtime mathematics
Integration Limits
Current integration limits are:
- IR access is bounded by the configured sequence horizon
- Runtime plan application depends on optional perf support and recognized plan compatibility
- RuntimeCore support claims are narrower than "all SDK features on all hosts"
Unsupported Assumptions
Do not assume any of the following unless a release note says otherwise:
- same-handle concurrent entry will be serialized for you
- Runtime will reconstruct missing domain semantics
- Builder metadata changes runtime mathematics
- Linux support implies all distro or architecture variants
Documentation Interpretation Rule
If a convenience path, platform promise, or feature stack is not documented in the public guide, treat it as unsupported until a release note or support addendum says otherwise.
Deep Reference: Known Limitations — Release
Current Release Limitations
The items below summarize limitations that matter during external evaluation. They are not theoretical restrictions; they reflect the current implementation boundaries described in this guide.
Element and Analysis Limits
- NPORT is supported in AC small-signal analysis only.
- NPORT returns unsupported for DC or transient paths in the current release.
- SW / CSW switch elements are supported in transient paths only.
- nonlinear Q and M device support is limited to transient paths.
- AC / Y / Z / AC-probe paths reject unsupported nonlinear combinations by design.
Frequency-Data Import Limits
- NPORT is intended for Y or Z datasets, not S-parameter import through the NPORT element.
- Touchstone parsing for NPORT accepts .sNp syntax, but S-parameter semantics are not accepted for NPORT itself.
- S-parameter Touchstone usage belongs to the S element path, not the NPORT path.
Parser and Compatibility Limits
- parser compatibility is broad but not equivalent to full general-purpose SPICE compatibility
- .model SW / CSW is supported; unrelated .model types may be ignored by this adapter parser
- relative file-path resolution depends on whether the source was provided as file input or text input
Operational Limits
- environment-variable solver overrides are suitable for experimentation, but explicit policy APIs are preferred for reproducible host integrations
- Builder handoff still requires the caller to choose appropriate spectrum-to-H conversion settings; Adapter Circuit does not replace Builder policy decisions
- node-set mode rejects mutual-inductor boundaries that cross only one side of a K coupling
- node-set mode rejects partial inclusion of S / NPORT multiport blocks
Evaluation Guidance
Before reporting a limitation as a bug, check whether it falls into one of these intended boundaries:
- AC-only frequency-domain block import
- transient-only nonlinear element support
- solver/backend availability that depends on build configuration
Future releases may extend analysis or device coverage, but until such extensions are documented in the public guide, treat the limits above as the current external boundary.
Deep Reference: Known Limitations — General
Current public-boundary limitations include:
- adapter-domain behavior is constrained by documented RAW/netlist scope
- some simulation/probe combinations require exact option sets
- profiler policy quality depends on representative case selection and machine stability
- profiler summaries are not schema-authoritative; the JSON report is authoritative
Practical guidance:
- treat command appendices as the contract source of truth
- keep replay cases minimal and deterministic
- archive reports with machine/build metadata for release use
Deep Reference: Status Code System
TDSE uses a structured status code system that spans all SDK modules. Understanding the structure helps you classify errors quickly and route them to the right team.
Status Domains
Every status code belongs to a domain that identifies which SDK module produced it:
| Domain | Meaning | Typical Source |
|---|---|---|
| TDSE_STATUS_DOMAIN_RUNTIME | Runtime execution errors | tdse_model_create, step APIs, lifecycle APIs |
| TDSE_STATUS_DOMAIN_BUILDER | Builder pack-generation errors | tdse_builder_apply_h, tdse_builder_write_pack |
| TDSE_STATUS_DOMAIN_CIRCUIT_AC | Adapter Circuit AC-path errors | circuit compilation, FRF extraction, netlist parsing |
Raw status codes from each domain overlap numerically. Always normalize to a unified code before logging or comparing:
tdse_ext_status_t unified = tdse_ext_status_from_runtime(st);
Classification
Every unified status code has a classification that tells you the error family:
| Classification | Meaning | Operational Response |
|---|---|---|
| OK | success | continue |
| Error | permanent failure | inspect inputs, shapes, or lifecycle state |
| Timeout | time-bounded operation expired | retry or escalate (see destroy timeout) |
| Unsupported | feature not available in this build or configuration | check build features or backend availability |
| InvalidState | API called in wrong lifecycle state | check call order and handle ownership |
Query the classification:
tdse_status_class_t cls = tdse_status_code_classify(unified);
const char* msg = tdse_status_code_message(unified);
Human-Readable Messages
Two message APIs exist:
| API | Scope | When To Use |
|---|---|---|
| tdse_status_message(st) | raw domain-local status | quick logging within one module |
| tdse_ext_status_message(unified) | unified cross-module status | production logging across module boundaries |
For production integrations, prefer the unified path:
tdse_ext_status_t unified = tdse_ext_status_from_runtime(st);
printf("error: %s\n", tdse_ext_status_message(unified));
Pack Error Tokens
Create failures produce a pack_error_code in the diagnostics struct. Decode it to a readable token:
const char* token = tdse_pack_error_token(diag.pack_error_code);
const char* name = tdse_ext_pack_error_name(diag.pack_error_code);
Common pack error tokens:
| Token | Meaning |
|---|---|
| CRC_MISMATCH | pack bytes are corrupted |
| INVALID_TOC | table of contents is malformed |
| MISSING_H_META | required H metadata chunk is absent |
| MISSING_H_DATA | required H data chunk is absent |
| SHAPE_MISMATCH | declared dimensions contradict data size |
| VERSION_UNSUPPORTED | pack version is not supported by this runtime build |
Complete Status Code Reference
Runtime Status Codes (tdse_status_t)
| Value | Constant | Meaning | Recovery |
|---|---|---|---|
| 0 | TDSE_OK | Success | continue |
| 1 | TDSE_ERR_INVALID_ARG | Invalid argument: null pointer, wrong struct size, bad shape | check caller inputs, buffer dimensions, and struct_size fields |
| 2 | TDSE_ERR_OUT_OF_MEMORY | Memory allocation failed | reduce model count, free unused handles, check GPU memory |
| 3 | TDSE_ERR_INTERNAL | Internal runtime error | report with full diagnostic bundle |
| 4 | TDSE_ERR_UNSUPPORTED | Feature not available in this build | check build features with tdse_perf_get_build_features_json |
| 6 | TDSE_ERR_IR_STEP_OUT_OF_RANGE | Step time exceeds IR sequence horizon | extend IR sequence, clamp simulation horizon, or rebuild pack |
| 8 | TDSE_ERR_CONCURRENT_API_USE | Same handle entered concurrently | enforce one-handle-per-thread ownership |
| 9 | TDSE_ERR_INVALID_STATE | API called in wrong lifecycle state | check step order, handle ownership, and shutdown sequencing |
| 10 | TDSE_ERR_TIMEOUT | Bounded destroy exceeded wait budget | check in-flight same-handle work, retry or escalate |
Builder Status Codes (tdse_builder_err_t)
| Value | Constant | Meaning | Recovery |
|---|---|---|---|
| 0 | TDSE_BUILDER_OK | Success | continue |
| 1 | TDSE_BUILDER_ERR_INVALID_ARG | Invalid builder argument | check descriptor fields, struct sizes, and dimension consistency |
| 2 | TDSE_BUILDER_ERR_OUT_OF_MEMORY | Builder memory allocation failed | reduce nh or np, free builder handle |
| 3 | TDSE_BUILDER_ERR_INTERNAL | Internal builder error | report with builder snapshot |
| 4 | TDSE_BUILDER_ERR_UNSUPPORTED | Unsupported builder operation | check SDK version and build configuration |
| 5 | TDSE_BUILDER_ERR_IO | File I/O failure during pack write | check output path, permissions, and disk space |
| 6 | TDSE_BUILDER_ERR_NP_MISMATCH | Port count mismatch between configure and descriptor | ensure np/nq in h_desc or ir_desc match builder options |
Extension Status Codes (tdse_ext_status_t)
| Value | Constant | Meaning | Recovery |
|---|---|---|---|
| 0 | TDSE_EXT_STATUS_OK | Success | continue |
| 1 | TDSE_EXT_STATUS_INVALID_ARG | Invalid extension API argument | check input parameters |
| 2 | TDSE_EXT_STATUS_OUT_OF_RANGE | Index or value out of range | check array bounds and valid ranges |
| 3 | TDSE_EXT_STATUS_IO_ERROR | I/O error in extension path | check file paths and permissions |
| 4 | TDSE_EXT_STATUS_FORMAT_ERROR | Format or parse error | check JSON/plan format |
| 5 | TDSE_EXT_STATUS_NUMERIC_ERROR | Numerical error in extension operation | check guard metrics and input validity |
| 6 | TDSE_EXT_STATUS_UNSUPPORTED | Unsupported extension operation | check build features |
| 7 | TDSE_EXT_STATUS_OUT_OF_MEMORY | Extension memory allocation failed | reduce workload or check resources |
| 8 | TDSE_EXT_STATUS_INTERNAL_ERROR | Internal extension error | report with diagnostic bundle |
| 9 | TDSE_EXT_STATUS_INVALID_STATE | Wrong state for extension API | check lifecycle state and call order |
| 10 | TDSE_EXT_STATUS_TIMEOUT | Extension operation timed out | retry or increase timeout |
Status Classification and Recovery Guide
| Classification | Recoverable? | Typical Response |
|---|---|---|
| OK | n/a | continue execution |
| INVALID_ARG | yes, fix caller code | validate inputs before retry |
| OUT_OF_MEMORY | yes, reduce load | free resources, reduce model count |
| INTERNAL | no (report) | collect diagnostics and report |
| UNSUPPORTED | no (design) | check build, use alternative path |
| IO | yes, fix environment | check paths, permissions, disk |
| INVALID_STATE | yes, fix sequence | check lifecycle order |
| CONCURRENT_API_USE | yes, fix ownership | enforce one-handle-per-thread |
| TIMEOUT | yes, retry or escalate | increase wait budget or quiesce first |
| IR_STEP_OUT_OF_RANGE | yes, extend horizon | rebuild pack with longer IR |
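The recovery guide can be encoded once in the host so that every call site reacts the same way. The sketch below is one possible mapping, not an SDK facility. The `HOST_ST_*` constants are hypothetical host-side mirrors of the runtime values listed in the Runtime Status Codes table (a real integration would use the constants from the SDK header), and sending unknown values to the report path is an assumption.

```c
/* Hypothetical host-side mirror of the runtime status values. */
enum {
    HOST_ST_OK = 0,
    HOST_ST_INVALID_ARG = 1,
    HOST_ST_OUT_OF_MEMORY = 2,
    HOST_ST_INTERNAL = 3,
    HOST_ST_UNSUPPORTED = 4,
    HOST_ST_IR_STEP_OUT_OF_RANGE = 6,
    HOST_ST_CONCURRENT_API_USE = 8,
    HOST_ST_INVALID_STATE = 9,
    HOST_ST_TIMEOUT = 10
};

typedef enum {
    HOST_ACT_CONTINUE,          /* success */
    HOST_ACT_FIX_LOCALLY,       /* caller code, load, ownership, or horizon */
    HOST_ACT_RETRY_OR_ESCALATE, /* bounded wait expired; handle may still exist */
    HOST_ACT_USE_ALTERNATIVE,   /* feature not in this build */
    HOST_ACT_REPORT             /* collect diagnostics and report */
} host_action_t;

/* Maps a raw runtime status value to the host's first response. */
host_action_t host_action_for_runtime_status(int st)
{
    switch (st) {
    case HOST_ST_OK:
        return HOST_ACT_CONTINUE;
    case HOST_ST_INVALID_ARG:
    case HOST_ST_OUT_OF_MEMORY:
    case HOST_ST_IR_STEP_OUT_OF_RANGE:
    case HOST_ST_CONCURRENT_API_USE:
    case HOST_ST_INVALID_STATE:
        return HOST_ACT_FIX_LOCALLY;
    case HOST_ST_TIMEOUT:
        return HOST_ACT_RETRY_OR_ESCALATE;
    case HOST_ST_UNSUPPORTED:
        return HOST_ACT_USE_ALTERNATIVE;
    case HOST_ST_INTERNAL:
    default:
        /* Assumption: anything unrecognized is treated like INTERNAL. */
        return HOST_ACT_REPORT;
    }
}
```

Keeping the mapping in one function means a policy change, such as treating repeated timeouts as an escalation, is made in one place.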
Cross-Module Logging
For integrations that span Builder, Runtime, and Adapter Circuit, normalize all statuses to the unified system before writing to the same log sink:
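/* Normalize a domain-local status and write one line to the shared log. */
void log_status(const char* source, tdse_status_t raw_st, int domain) {
    tdse_ext_status_t unified;
    switch (domain) {
    case DOMAIN_RUNTIME: unified = tdse_ext_status_from_runtime(raw_st); break;
    case DOMAIN_BUILDER: unified = tdse_ext_status_from_builder(raw_st); break;
    case DOMAIN_CIRCUIT: unified = tdse_ext_status_from_circuit_ac(raw_st); break;
    default:
        /* Unknown domain: log the raw value instead of reading an uninitialized status. */
        printf("[%s] unknown status domain %d (raw=%d)\n", source, domain, (int)raw_st);
        return;
    }
    printf("[%s] %s (class=%d)\n", source,
           tdse_ext_status_message(unified),
           (int)tdse_status_code_classify(unified));
}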
Deep Reference: Deterministic Mode
For reproducibility testing and regression comparison, the SDK provides a deterministic mode that disables non-deterministic parallel and random behavior.
Enabling Deterministic Mode
tdse_ext_set_deterministic_mode(1);
This is a process-global setting. Call it before any model creation if you need reproducibility.
What It Disables
When deterministic mode is active:
- Internal parallelism is reduced to a single-threaded path for all step operations.
- Any non-deterministic scheduling (e.g., work-stealing, dynamic thread pools) is replaced with deterministic sequential execution.
- Results are bit-identical across runs on the same hardware and build.
What It Does NOT Change
- Pack contents (determinism is a runtime execution property, not a Builder property).
- The mathematical result (the same convolution is computed; only the execution strategy changes).
- Backend selection (you can still use any backend, but internal parallelism is serialized).
Querying Current State
int is_deterministic = tdse_ext_get_deterministic_mode();
When To Use It
| Scenario | Deterministic Mode | Rationale |
|---|---|---|
| Regression testing | ON | Bit-exact comparison across runs |
| Performance benchmarking | OFF | Measures real-world throughput |
| Debugging numerical issues | ON | Eliminates thread-scheduling as a variable |
| Production deployment | OFF | Enables full parallelism for throughput |
IMPORTANT: Deterministic mode may significantly reduce throughput. Do not enable it in production unless reproducibility is a hard requirement.
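For the regression-testing case, the comparison itself should be bitwise rather than tolerance-based, since the claim being tested is bit-identical output. The helper below is a host-side sketch with a hypothetical name. It assumes the outputs of two runs have been captured as arrays of double; adjust the element type to match your buffers.

```c
#include <stddef.h>
#include <string.h>

/* Returns the index of the first element whose bit pattern differs,
   or n when the two buffers are bit-identical. */
size_t host_first_bit_mismatch(const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* memcmp compares representations, so +0.0 and -0.0 differ and
           NaNs with identical bits match, unlike the == operator. */
        if (memcmp(&a[i], &b[i], sizeof(double)) != 0) {
            return i;
        }
    }
    return n;
}
```

Reporting the first differing index, together with the step at which it occurred, is usually more useful in a regression log than a plain pass or fail.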
Deep Reference: Runtime Guard
The runtime guard monitors convolution stability and provides early warning when numerical behavior degrades. It does not change numerical outputs — it only observes and reports.
Configuration
tdse_ext_runtime_guard_config_t guard_cfg;
tdse_ext_get_runtime_guard_config(&guard_cfg);
guard_cfg.pivot_warning_threshold = 0.1;
tdse_ext_set_runtime_guard_config(&guard_cfg);
Reading Guard Metrics
tdse_ext_runtime_guard_metrics_t metrics;
tdse_ext_get_runtime_guard_metrics(model, &metrics);
Key metrics:
| Metric | Meaning | Watch For |
|---|---|---|
| max_abs_g0 | max absolute value of h[0] entries this step | model-dependent; use as trend indicator |
| pivot_min | minimum pivot from the instantaneous operator factorization | significant drop means operator is near-singular |
| pivot_ratio | current pivot_min divided by the baseline from the first step | sustained values below 0.1 warrant investigation |
growth_factor | ratio of ` | hr |
Interpretation Guide
pivot_ratio trend:
├─ > 0.5 → healthy, operator well-conditioned
├─ 0.1-0.5 → watch; may be normal for stiff systems at startup
└─ < 0.1 → investigate; operator may be losing rank
growth_factor trend:
├─ ~1.0 → steady state, history term is stable
├─ 1.0-1.5 → mild growth, may be transient startup behavior
└─ > 1.5 → sustained growth suggests potential divergence
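The thresholds in this guide can be applied per sample with a small classifier. This is a sketch with hypothetical names. How the exact boundary values are assigned, and treating growth at or below 1.0 as steady, are assumptions. Because the guide speaks of trends, a host would normally require several consecutive samples at a level before acting on it.

```c
typedef enum {
    HOST_GUARD_HEALTHY,
    HOST_GUARD_WATCH,
    HOST_GUARD_INVESTIGATE
} host_guard_level_t;

/* pivot_ratio: above 0.5 healthy, 0.1 to 0.5 watch, below 0.1 investigate.
   A NaN input falls through to INVESTIGATE. */
host_guard_level_t host_classify_pivot_ratio(double ratio)
{
    if (ratio > 0.5) return HOST_GUARD_HEALTHY;
    if (ratio >= 0.1) return HOST_GUARD_WATCH;
    return HOST_GUARD_INVESTIGATE;
}

/* growth_factor: up to 1.0 steady, 1.0 to 1.5 watch, above 1.5 investigate.
   The negated comparison makes a NaN input return INVESTIGATE. */
host_guard_level_t host_classify_growth_factor(double growth)
{
    if (!(growth <= 1.5)) return HOST_GUARD_INVESTIGATE;
    if (growth > 1.0) return HOST_GUARD_WATCH;
    return HOST_GUARD_HEALTHY;
}
```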
Resetting Metrics
tdse_ext_reset_runtime_guard_metrics(model);
Reset when starting a new simulation phase or after resolving a known transient event.
Relationship To Variable dt
When using variable time-stepping, the guard metrics are especially valuable for detecting whether aggressive dt changes are causing interpolation error accumulation. See Variable Time-Step Integration for the interaction between dt strategy and guard monitoring.
Deep Reference: Structured Logging
The SDK provides a structured logging system that integrations can route to their own log sinks.
Setting Up A Log Callback
void my_log_callback(tdse_log_level_t level, const char* message, void* user_data) {
    if (level >= TDSE_LOG_LEVEL_WARN) {
        fprintf(stderr, "[TDSE %d] %s\n", level, message);
    }
}
tdse_ext_set_log_callback(my_log_callback, NULL);
Log Levels
| Level | Name | Typical Content |
|---|---|---|
TDSE_LOG_LEVEL_TRACE | Trace | Very detailed internal diagnostics |
TDSE_LOG_LEVEL_DEBUG | Debug | Step-level internals, useful during development |
TDSE_LOG_LEVEL_INFO | Info | Normal operational messages (model created, backend set, etc.) |
TDSE_LOG_LEVEL_WARN | Warning | Guard threshold crossings, near-singular conditions, fallback paths |
TDSE_LOG_LEVEL_ERROR | Error | API failures, pack validation errors, resource exhaustion |
TDSE_LOG_LEVEL_FATAL | Fatal | Unrecoverable internal errors |
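A callback that prints the level as a bare integer is hard to read in a shared log sink, so a name lookup is a common addition. The sketch below uses a stand-in enum, `host_log_level_t`, so that it compiles without the SDK header. Its ordering follows the table above, but the numeric values are an assumption. In a real integration the switch would use the tdse_log_level_t constants directly.

```c
/* Stand-in for the SDK log levels; ordering follows the table above,
   numeric values are assumed. */
typedef enum {
    HOST_LOG_TRACE,
    HOST_LOG_DEBUG,
    HOST_LOG_INFO,
    HOST_LOG_WARN,
    HOST_LOG_ERROR,
    HOST_LOG_FATAL
} host_log_level_t;

/* Returns a short, stable name for a log level. */
const char *host_log_level_name(host_log_level_t level)
{
    switch (level) {
    case HOST_LOG_TRACE: return "TRACE";
    case HOST_LOG_DEBUG: return "DEBUG";
    case HOST_LOG_INFO:  return "INFO";
    case HOST_LOG_WARN:  return "WARN";
    case HOST_LOG_ERROR: return "ERROR";
    case HOST_LOG_FATAL: return "FATAL";
    }
    /* Reached only for values outside the enum. */
    return "UNKNOWN";
}
```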
Setting Log Level
tdse_ext_set_log_level(TDSE_LOG_LEVEL_WARN); /* only warn and above */
Query current level:
tdse_log_level_t level = tdse_ext_get_log_level();
Emitting Custom Log Messages
Integrations and adapters can use the same structured log system:
tdse_ext_log_emit(TDSE_LOG_LEVEL_INFO, "custom_module", "model step completed");
Plugin System
For plugin load failures, manifest issues, or backend routing problems, use
tdse_plugin_doctor to inspect installed plugins and their manifests:
TDSE_PLUGIN_MANIFEST=/opt/tdse/lib/plugins/sim/plugin_manifest.json \
tdse_plugin_doctor
The doctor reports ABI compatibility, health status, sha256 verification,
and manifest entry matches. Load failure diagnostics use stable categories
([file_not_found], [abi_mismatch], [hash_mismatch],
[manifest_invalid]) that can be parsed from logs.
Treat the doctor output as first-response evidence, not as the whole incident record. For blocking RC issues, save the exact package version, manifest path, failing backend request, and the smallest reproducer alongside the doctor report.
See the Plugin System chapter for full deployment and troubleshooting details.
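Because the load-failure categories are stable bracketed tokens, a host can classify log lines with a plain substring search. The sketch below assumes only that the token appears verbatim somewhere in the line; the rest of the line format is not specified here, and the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    HOST_LOAD_UNKNOWN,
    HOST_LOAD_FILE_NOT_FOUND,
    HOST_LOAD_ABI_MISMATCH,
    HOST_LOAD_HASH_MISMATCH,
    HOST_LOAD_MANIFEST_INVALID
} host_load_failure_t;

/* Returns the first known load-failure category found in a log line,
   or HOST_LOAD_UNKNOWN when none of the tokens is present. */
host_load_failure_t host_parse_load_failure(const char *line)
{
    static const struct {
        const char *tag;
        host_load_failure_t kind;
    } table[] = {
        {"[file_not_found]",   HOST_LOAD_FILE_NOT_FOUND},
        {"[abi_mismatch]",     HOST_LOAD_ABI_MISMATCH},
        {"[hash_mismatch]",    HOST_LOAD_HASH_MISMATCH},
        {"[manifest_invalid]", HOST_LOAD_MANIFEST_INVALID}
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; ++i) {
        if (strstr(line, table[i].tag) != NULL) {
            return table[i].kind;
        }
    }
    return HOST_LOAD_UNKNOWN;
}
```

Counting categories across a log makes it easy to tell a single broken plugin from a deployment-wide manifest or ABI problem.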
Integration Guidance
- Set the log level to WARN for production deployments.
- Set to DEBUG or TRACE only during active debugging; these levels produce significant output.
- Always pair structured logs with status code classification for complete diagnostics.
- Route TDSE log output through your host application's logging infrastructure for unified observability.
