Concurrency and Shutdown
Concurrency, synchronization, cancellation, and orderly runtime shutdown guidance.
Use this section when TDSE Runtime is live in more than one thread, worker, or shutdown path. It explains the one-handle rule, which calls can overlap safely, and how to reason about contention and teardown races.
Related Chapters: for the base lifecycle semantics, see Runtime Lifecycle; for threading and memory scaling with many models, see Threading and Scaling.
Most production Runtime incidents in this area are not numerical defects.
They are ownership defects: two threads think they own the same handle, shutdown starts while a step call is still live, or host wrappers treat release as a policy API instead of a finalizer.
Read this chapter together with Lifecycle and Ownership: lifecycle explains what local ownership means; this chapter explains how that ownership behaves under contention and teardown.
The One-Handle Rule
The core runtime concurrency rule is simple:
- one live tdse_model_t* handle must not be entered concurrently by same-handle runtime APIs
The protected same-handle surface, covering both step and teardown APIs, is:
- tdse_step_begin(...)
- tdse_step_op(...)
- tdse_step_hr(...)
- tdse_step_ir(...)
- tdse_step_commit(...)
- tdse_step_dr(...)
- tdse_model_close(...)
- tdse_model_destroy(...)
- tdse_model_release(...)
The runtime behavior is rejection, not automatic serialization. That is deliberate: silent serialization would hide ownership bugs and make timing behavior much harder to diagnose.
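A host wrapper can mirror the same reject-rather-than-queue behavior so that ownership bugs surface in host logs before a call ever reaches the Runtime. The following is a minimal sketch using C11 atomics; handle_guard_t and its functions are hypothetical host-side names, not part of the TDSE API, and this is not a description of how the Runtime implements its own guard.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical host-side guard: keep one next to each runtime handle. */
typedef struct {
    atomic_flag busy; /* set while some caller is inside the handle */
} handle_guard_t;

#define HANDLE_GUARD_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the caller now owns the handle.
   Returns false immediately if another caller is already inside.
   It never blocks and never queues the caller. */
bool handle_guard_try_enter(handle_guard_t *g) {
    return !atomic_flag_test_and_set(&g->busy);
}

/* Must be called only by the caller whose try_enter returned true. */
void handle_guard_leave(handle_guard_t *g) {
    atomic_flag_clear(&g->busy);
}
```

A rejected try_enter is worth logging with the thread identity of the caller, since it points at the same ownership defect the Runtime would otherwise report.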
Which APIs Are Guarded Versus Snapshot-Style
Not every API participates in the same-handle execution guard. The most important distinction is:
- execution APIs and teardown APIs are guard-sensitive
- metadata and diagnostics snapshot APIs are live-handle reads
Snapshot-style query APIs are:
- tdse_model_info(...)
- tdse_model_state_info(...)
- tdse_model_last_error_info(...)
These queries do not enter the same-handle execution guard.
On a live handle they are intended to remain readable even if step traffic is happening on another
thread.
They still stop being valid once close or destroy has already started on that handle, in which case
they return TDSE_ERR_INVALID_STATE.
Operational consequence:
- do not use snapshot-query success as evidence that a handle is safe to step from another thread
- do use snapshot queries to capture diagnostics before teardown begins or after a step failure
Runtime Guarantees Under Conflict
When runtime detects same-handle overlap on the guarded surface:
- conflicting entrants are rejected rather than queued
- step conflicts return TDSE_ERR_CONCURRENT_API_USE
- lifecycle conflicts may return TDSE_ERR_CONCURRENT_API_USE, TDSE_ERR_TIMEOUT, or TDSE_ERR_INVALID_STATE, depending on which API raced and whether ownership already moved
- runtime avoids deadlock as part of the supported behavior
Important nuance:
- forced overlap does not mean every entrant fails
- one caller may legitimately acquire the handle first and succeed
- conflicting entrants are rejected safely and observably
That distinction matters in stress tests. Under intentional race injection, "one winner plus one or more rejected entrants" is the expected shape.
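A stress harness can check that shape mechanically. The sketch below assumes the harness has collected one status per entrant for a round in which overlap on the step surface was actually forced. The demo_status_t enum is a stand-in for illustration; real code would use the status type and constants from the TDSE header, whose numeric values are not assumed here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in status codes for illustration only. */
typedef enum {
    DEMO_OK,
    DEMO_ERR_CONCURRENT_API_USE,
    DEMO_ERR_TIMEOUT,
    DEMO_ERR_INVALID_STATE
} demo_status_t;

/* True when a forced-overlap round on the step surface shows exactly one
   winner, at least one rejected entrant, and no other status. */
bool overlap_round_has_expected_shape(const demo_status_t *results, size_t n) {
    size_t winners = 0;
    size_t rejected = 0;
    for (size_t i = 0; i < n; ++i) {
        if (results[i] == DEMO_OK) {
            ++winners;
        } else if (results[i] == DEMO_ERR_CONCURRENT_API_USE) {
            ++rejected;
        } else {
            return false; /* unexpected status for a step conflict */
        }
    }
    return winners == 1 && rejected >= 1;
}
```

A round in which every entrant fails, or in which an entrant reports a status other than the conflict code, is the case worth escalating.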
Safe Parallelism Model
Supported host parallelism looks like this:
- one simulation worker owns one runtime handle
- different handles may run concurrently
- optional internal execution strategy may parallelize work inside one step call
- internal parallel execution does not make a single handle concurrently callable
Recommended host rule:
- assign both execution ownership and shutdown ownership to the same wrapper or thread controller
If those responsibilities are split across components, the ownership handoff protocol must be explicit rather than implied.
Teardown State Model
The shutdown APIs are easiest to reason about as an ownership state machine.
Two rules keep this understandable:
- once another thread acquires teardown ownership, the handle is no longer locally usable to you
- TDSE_ERR_TIMEOUT from destroy is the one status that means the handle is still live after return
Close, Destroy, And Release Under Contention
tdse_model_close(...)
close is the immediate-answer lifecycle API.
Use it when:
- you want a synchronous non-waiting lifecycle result
- you want to detect overlap instead of waiting through it
Under contention:
- if another guarded same-handle API is still in flight, close returns TDSE_ERR_CONCURRENT_API_USE
- if another thread already started close or destroy, close returns TDSE_ERR_INVALID_STATE
Operational reading:
- close is not a "fast destroy"
- it is the right API when overlap itself is the information you need
tdse_model_destroy(...)
destroy is the recommended business-logic shutdown API because it makes wait policy explicit.
Use it when:
- your host owns lifecycle policy
- you need bounded wait behavior
- you want structured wait telemetry
Destroy outcomes:
- TDSE_OK: teardown completed and storage is gone
- TDSE_ERR_TIMEOUT: destroy could not acquire teardown ownership within the budget; the handle remains valid
- TDSE_ERR_INVALID_STATE: another thread already owns close or destroy; local ownership is gone
tdse_model_release(...)
release is for terminal cleanup, not shutdown policy.
Use it when:
- a destructor must not become a policy engine
- a finally or unwind path needs best-effort terminal cleanup
Under contention:
- release waits for an in-flight same-handle API to leave the guard once it owns terminal cleanup
- if another thread already started close, destroy, or release, release returns TDSE_ERR_INVALID_STATE
Operational reading:
- release is acceptable in finalizers because it is cleanup-oriented
- release is a poor choice for ordinary host shutdown because it does not carry a bounded-wait policy
Race Matrix
Use this matrix when a shutdown report is unclear about which thread acted first.
| Situation | close sees | destroy sees | release sees | What the caller should assume |
|---|---|---|---|---|
| step call still in flight on same handle | TDSE_ERR_CONCURRENT_API_USE | TDSE_OK or TDSE_ERR_TIMEOUT depending on wait budget | waits until it can clean up | handle ownership is still local only if destroy timed out |
| another thread already started close/destroy | TDSE_ERR_INVALID_STATE | TDSE_ERR_INVALID_STATE | TDSE_ERR_INVALID_STATE | local ownership is gone |
| no same-handle activity, caller owns handle | TDSE_OK | TDSE_OK | TDSE_OK | storage is gone after return |
| destroy timed out while waiting | n/a | TDSE_ERR_TIMEOUT | n/a | handle is still live and policy must decide next step |
The support-facing rule is:
- only TDSE_ERR_TIMEOUT from destroy preserves local ownership after return
Query Behavior During Shutdown
Support incidents often ask whether a host can still query state while shutdown is underway. Use the strict answer:
- before close or destroy starts, snapshot queries are allowed on a live handle
- after close or destroy has started, tdse_model_info(...), tdse_model_state_info(...), and tdse_model_last_error_info(...) may return TDSE_ERR_INVALID_STATE
- after successful close, destroy, or release, no further handle use is valid
That means the host should capture evidence before starting teardown whenever possible.
Recommended diagnostic order on a failing live handle:
1. tdse_model_info(...)
2. tdse_model_state_info(...)
3. tdse_model_last_error_info(...)
4. chosen shutdown API and timeout policy
Bounded Destroy Policy
tdse_model_destroy_options_t.wait_timeout_ms is part of the public behavior, not a tuning footnote.
Interpret the wait budget as:
- negative value: intentional infinite wait
- small bounded value: supervisory shutdown that prefers a fast answer
- medium bounded value: ordinary business-logic cleanup where some overlap is tolerated
Recommended policy questions:
- Which thread is allowed to decide the wait budget?
- What does the host do on TDSE_ERR_TIMEOUT?
- At what point does the host stop retrying and defer to finalizer cleanup?
- Which path records the observed wait_ms into logs or crash bundles?
Example bounded-destroy pattern:
tdse_model_destroy_options_t opt = tdse_model_destroy_options_init();
tdse_model_destroy_result_t result = tdse_model_destroy_result_init();
opt.wait_timeout_ms = 250.0;

tdse_status_t st = tdse_model_destroy(model, &opt, &result);
if (st == TDSE_OK) {
    model = NULL;
} else if (st == TDSE_ERR_TIMEOUT) {
    log_warn("destroy timeout wait_ms=%.3f timed_out=%d", result.wait_ms, result.timed_out);
    /* handle remains live; supervisor decides retry or escalation */
} else if (st == TDSE_ERR_INVALID_STATE) {
    model = NULL; /* ownership already moved elsewhere */
}
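The policy questions above can be answered in one small host-side decision function, so that retry and escalation behavior lives in a single reviewable place. The sketch below is one possible shape, not a prescribed policy. The demo_status_t enum stands in for the real status type, and destroy_action_t, next_destroy_action, and the max_timeouts threshold are hypothetical host names.

```c
/* Stand-in status codes for illustration only; real code would use the
   status type and constants from the TDSE header. */
typedef enum {
    DEMO_OK,
    DEMO_ERR_CONCURRENT_API_USE,
    DEMO_ERR_TIMEOUT,
    DEMO_ERR_INVALID_STATE
} demo_status_t;

typedef enum {
    ACTION_CLEAR_REFERENCE, /* handle is no longer ours: drop the pointer */
    ACTION_RETRY_DESTROY,   /* handle is still live: try again later */
    ACTION_ESCALATE,        /* handle is still live: stop retrying, report an
                               ownership bug, leave cleanup to the finalizer */
    ACTION_REPORT_ERROR     /* status this policy does not cover */
} destroy_action_t;

/* timeouts_seen counts destroy timeouts so far, including the current one. */
destroy_action_t next_destroy_action(demo_status_t st,
                                     unsigned timeouts_seen,
                                     unsigned max_timeouts) {
    switch (st) {
    case DEMO_OK:                /* teardown completed, storage is gone */
    case DEMO_ERR_INVALID_STATE: /* another thread owns teardown: no retry */
        return ACTION_CLEAR_REFERENCE;
    case DEMO_ERR_TIMEOUT:       /* the only status that keeps the handle live */
        return timeouts_seen < max_timeouts ? ACTION_RETRY_DESTROY
                                            : ACTION_ESCALATE;
    default:
        return ACTION_REPORT_ERROR;
    }
}
```

Whichever path handles the timeout case is also the natural place to record the observed wait_ms.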
Worked Host Patterns
Pattern A. Worker-Owned Handle With Clean Shutdown
This is the preferred product integration pattern:
- worker thread creates the handle
- worker thread performs the step loop
- worker thread or its owner wrapper initiates destroy
- no other thread touches the handle after shutdown starts
Why it works:
- execution ownership and teardown ownership stay aligned
- there is no ambiguity about who records diagnostics or clears references
Pattern B. Supervisor Requests Stop, Worker Performs Destroy
This is often better than having the supervisor destroy directly:
- supervisor sets a stop request in host code
- worker exits its loop at a safe boundary
- worker performs tdse_model_destroy(...)
- supervisor observes the result through host telemetry
Why it works:
- it avoids same-handle races between a live step call and a remote destroy
- it keeps timeout policy near the code that already owns the handle
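The control flow of Pattern B can be sketched without touching the Runtime at all. In the sketch below the supervisor only raises an atomic flag, and the worker checks it between steps and then performs teardown itself. Every name is a hypothetical host-side name. The step and destroy callbacks stand in for the real step sequence and for tdse_model_destroy(...), and the demo callbacks simulate a stop request arriving during the third step.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical host-side worker state; none of these names come from TDSE. */
typedef struct worker_ctx {
    atomic_bool stop_requested; /* set by the supervisor, read by the worker */
    unsigned steps_done;        /* touched only by the worker */
    bool destroyed;             /* touched only by the worker */
} worker_ctx_t;

typedef bool (*step_fn)(worker_ctx_t *);    /* would wrap one step sequence */
typedef void (*destroy_fn)(worker_ctx_t *); /* would wrap the destroy call */

void worker_ctx_init(worker_ctx_t *w) {
    atomic_init(&w->stop_requested, false);
    w->steps_done = 0;
    w->destroyed = false;
}

/* Supervisor side: never enters the handle, only raises a flag. */
void supervisor_request_stop(worker_ctx_t *w) {
    atomic_store(&w->stop_requested, true);
}

/* Worker side: checks the flag only between steps, then tears down itself. */
void worker_run(worker_ctx_t *w, step_fn step, destroy_fn destroy) {
    while (!atomic_load(&w->stop_requested)) {
        if (!step(w)) {
            break; /* a failed step also ends the loop */
        }
        w->steps_done++;
    }
    destroy(w); /* no step call can be in flight at this point */
}

/* Demo callbacks: a stop request arrives while the third step is running. */
bool demo_step(worker_ctx_t *w) {
    if (w->steps_done + 1 == 3) {
        supervisor_request_stop(w);
    }
    return true;
}

void demo_destroy(worker_ctx_t *w) {
    w->destroyed = true;
}
```

Because the flag is read only at the loop boundary, the step that was running when the request arrived completes before teardown starts.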
Pattern C. Destructor-Only Final Cleanup
Use this only when ordinary business shutdown has already failed or is unavailable:
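class ModelGuard {
public:
    explicit ModelGuard(tdse_model_t* handle) noexcept : handle_(handle) {}
    // Non-copyable: two guards for one handle would release it twice.
    ModelGuard(const ModelGuard&) = delete;
    ModelGuard& operator=(const ModelGuard&) = delete;

    ~ModelGuard() noexcept {
        if (handle_ != nullptr) {
            // Best-effort terminal cleanup; the status is deliberately ignored.
            (void)tdse_model_release(handle_);
            handle_ = nullptr;
        }
    }

private:
    tdse_model_t* handle_ = nullptr;
};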
Why it is acceptable:
- destructors need a terminal cleanup target
- they should not be responsible for choosing timeout policy
Troubleshooting Shutdown Symptoms
Symptom: Destroy Times Out Repeatedly
Likely meaning:
- a same-handle step call is still live when destroy starts
- the host has no clear quiesce-before-destroy protocol
Collect:
- destroy wait_timeout_ms
- destroy wait_ms
- failing thread identities from host logs
- whether the worker loop had actually stopped before destroy
First corrective actions:
- move destroy to the owning worker or wrapper
- add an explicit stop-and-join phase before destroy
- keep bounded destroy, but treat repeated timeout as an ownership bug
Symptom: close Returns TDSE_ERR_CONCURRENT_API_USE
Likely meaning:
- another same-handle API is still active
Correct interpretation:
- runtime is working as designed
- the host attempted an immediate-answer close during active execution
Corrective action:
- use destroy if bounded waiting is desired
- keep close only when fast overlap detection is the intent
Symptom: destroy Or release Returns TDSE_ERR_INVALID_STATE
Likely meaning:
- another thread already owns teardown
Corrective action:
- clear local references
- stop issuing further same-handle calls
- repair the host ownership model instead of retrying locally
Recommended Evidence for Concurrency Issues
When a concurrency issue is reported, collect the smallest bundle that explains ownership:
- handle identity in host logs
- failing API name
- thread or worker identity on both sides of the race
- tdse_model_state_info(...) captured before teardown when available
- tdse_model_last_error_info(...) captured before teardown when available
- chosen shutdown API: close, destroy, or release
- destroy wait budget and returned wait_ms
- whether the host had already requested worker stop
- whether the issue reproduced with one handle per thread
This bundle is usually more valuable than a large raw trace with no ownership annotations.
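One way to keep that bundle small and consistent is to give it a fixed shape in host code. The sketch below assumes the host already has string identities for handles and threads. The struct, its field names, and the one-line log format are all hypothetical. Captured tdse_model_state_info(...) and tdse_model_last_error_info(...) output would be attached alongside it in whatever form the host stores them.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical host-side record of who did what to which handle. */
typedef struct {
    const char *handle_id;      /* host-assigned handle identity */
    const char *failing_api;    /* name of the API that reported the error */
    const char *caller_thread;  /* thread that saw the failure */
    const char *other_thread;   /* thread on the other side of the race */
    const char *shutdown_api;   /* "close", "destroy", or "release" */
    double wait_timeout_ms;     /* destroy wait budget */
    double wait_ms;             /* wait reported by destroy */
    bool stop_requested;        /* had the host already asked the worker to stop */
    bool one_handle_per_thread; /* did it reproduce with one handle per thread */
} concurrency_evidence_t;

/* Writes one log line into buf; returns what snprintf returns. */
int format_evidence(const concurrency_evidence_t *e, char *buf, size_t len) {
    return snprintf(buf, len,
                    "handle=%s api=%s threads=%s/%s shutdown=%s "
                    "wait_timeout_ms=%.1f wait_ms=%.1f "
                    "stop_requested=%d one_handle_per_thread=%d",
                    e->handle_id, e->failing_api,
                    e->caller_thread, e->other_thread, e->shutdown_api,
                    e->wait_timeout_ms, e->wait_ms,
                    (int)e->stop_requested, (int)e->one_handle_per_thread);
}
```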
Review Checklist
During integration review, ask:
- Which component owns each live handle?
- Which component is allowed to start close or destroy?
- Can a supervisor request stop without directly entering the handle?
- Where is the timeout policy for destroy chosen and logged?
- What happens to local references after TDSE_ERR_INVALID_STATE?
- Which path captures diagnostics before teardown starts?
Anti-Patterns
Avoid these patterns even if they look harmless in local tests:
- sharing one handle across workers and assuming Runtime will serialize it
- using release as the default business-logic shutdown API
- treating close as a faster version of destroy
- retrying local teardown after TDSE_ERR_INVALID_STATE
- treating TDSE_ERR_TIMEOUT as if the handle were already gone
- starting teardown before the host has a stop or quiesce protocol
