Handling Async Boundaries in Node.js and Python
Problem Framing
An order service finishes processing a request and hands work off to an async queue consumer. In Jaeger, the resulting spans appear as two disconnected root traces rather than a single waterfall. The parent_id field on the consumer spans is empty. The application code looks correct — the OpenTelemetry SDK is initialised, headers are present in the message, but the consumer span has no parent. This is an async boundary context-loss failure, one of the most common sources of broken traces in production Node.js and Python services. The active span context lives in execution-local storage that is not automatically transferred when execution shifts to a new event-loop iteration, a thread-pool worker, or a message-queue consumer loop.
Prerequisites
Before working through the implementation steps below, ensure the following are in place:
- Node.js 16+ (for stable
AsyncLocalStorage) or Node.js 18+ (forAsyncLocalStorage.bind()) - Python 3.7+ with
contextvarsavailable (standard library) @opentelemetry/api≥ 1.4 and@opentelemetry/sdk-node≥ 1.4 for Node.jsopentelemetry-api≥ 1.4 andopentelemetry-sdk-trace-base≥ 1.4 for Python- A working OpenTelemetry SDK setup for backend services that initialises before any HTTP server or consumer starts
OTEL_PROPAGATORS=tracecontext,baggageset on every service in the call chain- Familiarity with SDK Implementation & Context Propagation concepts, in particular the inject/extract lifecycle
Concept Deep-Dive: How Async Execution Breaks Context
The W3C TraceContext specification assumes that the active span context travels with the call stack. When execution is synchronous, this works implicitly. The moment execution crosses an async boundary — a setTimeout, a Promise resolved on a different microtask tick, an asyncio.create_task(), or a thread-pool submission — the runtime creates a new execution context. Whether that new context inherits the parent’s tracing data depends entirely on how the language’s concurrency primitive copies (or fails to copy) thread-local or context-local storage.
The diagram below shows the two diverging paths: a propagated context (spans linked) vs a lost context (orphaned root span).
Node.js: AsyncLocalStorage mechanics
AsyncLocalStorage (ALS) is built on the async_hooks API. When asyncLocalStorage.run(store, callback) is called, Node.js creates an async context that all child async operations — promises, timers, I/O callbacks — inherit automatically, as long as they are initiated within that run call. The critical failure mode is creating an async operation before entering run(), or storing a reference to context.active() but never passing it through asyncLocalStorage.run() when the deferred work executes.
Python: contextvars mechanics
Python’s contextvars.ContextVar stores data per execution context. When asyncio.create_task() spawns a coroutine, Python copies a snapshot of the current context at that exact moment, so changes made inside the coroutine do not affect the parent — but crucially, if the parent context has no active OpenTelemetry span yet, the snapshot is empty. ThreadPoolExecutor submissions receive no context copy at all by default, which is why contextvars.copy_context().run(fn, *args) is the required pattern.
Step-by-Step Implementation
Step 1 — Node.js: Configure AsyncLocalStorage for the request lifecycle
Seed AsyncLocalStorage with the active OTel context at the start of every inbound request. This one wrapper ensures all middleware, route handlers, and deferred operations within the same request inherit the same context automatically.
const { AsyncLocalStorage } = require('async_hooks');
const { trace, context } = require('@opentelemetry/api');
const als = new AsyncLocalStorage();
// Express middleware — runs before all route handlers
app.use((req, res, next) => {
const activeCtx = context.active(); // capture the propagated OTel context
als.run(activeCtx, () => {
// every async operation started inside next() inherits activeCtx
next();
});
});
// Any code anywhere in the request cycle can now read the context:
function getCurrentTraceId() {
const ctx = als.getStore() ?? context.active();
return trace.getSpan(ctx)?.spanContext().traceId;
}
Step 2 — Node.js: Restore context for worker threads and message queue consumers
Long-lived consumers — Kafka, RabbitMQ, BullMQ — run outside any request’s als.run() scope. Extract the W3C TraceContext headers from the message payload and re-enter als.run() with the restored context before creating any spans.
const { trace, propagation, context } = require('@opentelemetry/api');
const { AsyncLocalStorage } = require('async_hooks');
const als = new AsyncLocalStorage();
async function processKafkaMessage(message) {
// message.headers carries traceparent / tracestate / baggage from the producer
const carrier = Object.fromEntries(
Object.entries(message.headers).map(([k, v]) => [k, v.toString()])
);
const extractedCtx = propagation.extract(context.active(), carrier);
await als.run(extractedCtx, async () => {
const tracer = trace.getTracer('kafka-consumer', '1.0.0');
const span = tracer.startSpan('kafka.message.process', {}, extractedCtx);
const spanCtx = trace.setSpan(extractedCtx, span);
try {
await context.with(spanCtx, () => handleBusinessLogic(message));
span.setStatus({ code: 0 }); // SpanStatusCode.OK
} catch (err) {
span.recordException(err);
span.setStatus({ code: 2, message: err.message }); // SpanStatusCode.ERROR
throw err;
} finally {
span.end();
}
});
}
Step 3 — Node.js: Bind deferred callbacks explicitly
setTimeout and setInterval callbacks execute in new macrotask queue entries, after the current als.run() scope has returned. Use als.bind() to capture the current store reference at scheduling time.
const { AsyncLocalStorage } = require('async_hooks');
const { context } = require('@opentelemetry/api');
const als = new AsyncLocalStorage();
// Inside an als.run() block:
als.run(context.active(), () => {
const store = als.getStore();
// WRONG: setTimeout callback runs outside the als.run() scope
setTimeout(() => {
console.log(als.getStore()); // undefined — context lost
}, 100);
// CORRECT: bind the callback to the current store at scheduling time
setTimeout(
als.bind(() => {
console.log(als.getStore()); // active context — propagated correctly
}),
100
);
});
Step 4 — Python: Propagate context across asyncio tasks
Before calling asyncio.create_task(), capture context.get_current(). Pass it into the coroutine as a parameter and call context.attach() as the first action inside the task. Always pair every attach() with a detach() in a finally block to avoid context token leaks.
import asyncio
from opentelemetry import trace, context as otel_context
async def process_item(item: dict, parent_ctx: object) -> None:
# Restore the parent OTel context in this new asyncio task
token = otel_context.attach(parent_ctx)
try:
tracer = trace.get_tracer("worker", "1.0.0")
with tracer.start_as_current_span("worker.process_item") as span:
span.set_attribute("item.id", item["id"])
await asyncio.sleep(0) # simulate async I/O
finally:
otel_context.detach(token) # always detach to prevent token leakage
async def request_handler(items: list) -> None:
# Capture context BEFORE creating tasks — the snapshot at this point
# includes any span that was started by the inbound request middleware.
current_ctx = otel_context.get_current()
tasks = [
asyncio.create_task(process_item(item, current_ctx))
for item in items
]
await asyncio.gather(*tasks)
Step 5 — Python: Copy context into ThreadPoolExecutor workers
ThreadPoolExecutor workers receive no context copy by default. Wrap the submitted callable with contextvars.copy_context().run() to carry the full context snapshot — including the active span — into the worker thread.
import contextvars
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import trace, context as otel_context
def cpu_bound_task(data: bytes, ctx_snapshot: contextvars.Context) -> bytes:
# ctx_snapshot.run() sets all ContextVars from the snapshot for this thread
def _inner():
token = otel_context.attach(otel_context.get_current())
try:
tracer = trace.get_tracer("cpu-worker")
with tracer.start_as_current_span("cpu.task"):
return process_data(data) # actual CPU work
finally:
otel_context.detach(token)
return ctx_snapshot.run(_inner)
async def handle_upload(data: bytes) -> bytes:
ctx_snapshot = contextvars.copy_context() # copy BEFORE executor.submit
loop = asyncio.get_running_loop()
with ThreadPoolExecutor() as pool:
result = await loop.run_in_executor(
pool,
cpu_bound_task,
data,
ctx_snapshot,
)
return result
Step 6 — Python: Serialize context for subprocess and multiprocessing boundaries
When spawning a child process, contextvars cannot transfer across fork/exec. Encode the active W3C TraceContext headers into environment variables and read them back in the child.
import os
import subprocess
from opentelemetry.propagate import inject, extract
from opentelemetry import context as otel_context
def spawn_child_worker(payload_path: str) -> None:
headers: dict = {}
inject(headers) # encodes traceparent, tracestate, baggage
env = os.environ.copy()
env.update(headers) # child reads these as env vars
subprocess.run(
["python", "child_worker.py", payload_path],
env=env,
check=True,
)
# --- In child_worker.py ---
import os
from opentelemetry.propagate import extract
from opentelemetry import context as otel_context, trace
def restore_parent_context() -> None:
# Re-assemble a carrier dict from environment variables
carrier = {k: v for k, v in os.environ.items() if k.lower().startswith("trace")}
ctx = extract(carrier)
otel_context.attach(ctx)
Verification
After deploying the fixes, verify that context propagation is intact across every boundary you modified.
Query in Jaeger UI: Open the trace for a request that crosses the async boundary. Every span should show a populated parentSpanId. If any span appears as a root when it should be a child, the extract() call at that boundary is not restoring the context correctly.
CLI verification with OTLP output — export a single trace to stdout and inspect the JSON:
# Set the exporter to console output temporarily
OTEL_TRACES_EXPORTER=console node server.js
Look for "parentSpanId": "..." in the console output for every consumer span. An empty string indicates context loss.
Python context introspection — add this at any async boundary to confirm the context is active:
import contextvars
from opentelemetry import trace, context as otel_context
def debug_context(label: str) -> None:
ctx_items = list(contextvars.copy_context().items())
span = trace.get_current_span()
span_ctx = span.get_span_context()
print(f"[{label}] trace_id={span_ctx.trace_id:#018x} parent_valid={span_ctx.is_valid}")
print(f"[{label}] context vars: {[str(k) for k, _ in ctx_items]}")
Expected trace output for a healthy async boundary crossing:
{
"name": "kafka.message.process",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"parentSpanId": "a3ce929d0e0e4736",
"startTimeUnixNano": "1702300800000000000",
"endTimeUnixNano": "1702300800045000000",
"status": { "code": "STATUS_CODE_OK" }
}
Edge Cases and Gotchas
-
Unawaited promises in Node.js. A promise that is created but not awaited executes outside the current
als.run()scope’s lifetime. If you fire-and-forget withsomeAsyncFn()instead ofawait someAsyncFn(), any spans created inside that function will lose their parent. Either alwaysawaitor explicitly bind usingals.bind(). -
Python
asyncio.ensure_future(). This alias forcreate_task()behaves identically — it snapshots context at call time, so capturecontext.get_current()before calling it, not inside the coroutine. -
Django ORM in async views. Django’s ORM is synchronous. When called from an async view using
sync_to_async, the sync wrapper runs in a thread pool. Apply thecontextvars.copy_context().run()pattern inside thesync_to_asyncwrapper, or rely onasgiref≥ 3.7 which copies context automatically. Pinasgirefto avoid silent regressions on downgrade. -
Starlette/FastAPI background tasks.
BackgroundTasks.add_task()captures no context. Wrap the submitted callable with a closure that callscontext.attach(parent_ctx)before the real work begins. -
Context token double-detach. Each call to
context.attach()in Python returns a unique token. Callingcontext.detach(token)twice raises aRuntimeError. Use a singletry / finallyblock perattach()call — never store tokens in a list and detach them in bulk. -
Long-lived
AsyncLocalStoragestores. Storing large request objects (fullreq, full parsed body) inside ALS causes the entire object to be retained for the lifetime of all child async operations. Store only lightweight identifiers —traceId,spanId,userId— to avoid GC pressure. -
Promise.allwith mixed contexts. When parallel promises insidePromise.allwere each started in differentals.run()scopes, their context references diverge. Ensure all parallel work is initiated inside a singleals.run()block so they share the same store. -
gRPC streaming in Node.js. gRPC streaming calls persist across multiple event loop ticks. The initial
als.run()scope from the first message frame will have exited by the time subsequent frames arrive. Extract and re-attach context for each individual streaming message, just as with Kafka.
Performance and Scale Notes
ALS memory overhead. Each active als.run() scope retains a reference to the stored object. In high-throughput Node.js services handling 10 000 req/s, keeping only primitive identifiers (strings, numbers) in the store rather than full request objects eliminates a significant class of GC pause. Benchmark with --expose-gc and call global.gc() between load-test iterations to measure retained heap.
Python context snapshot cost. contextvars.copy_context() performs a shallow copy of all active ContextVar entries. For services with many active context variables, this copy happens once per task submission. It is typically sub-microsecond with fewer than 20 context variables, but grows linearly. Audit active ContextVar instances with contextvars.copy_context().items() and remove stale ones.
Batch processor tuning. When spans from async workers are exported in batch, the BatchSpanProcessor defaults (maxQueueSize=2048, scheduledDelayMillis=5000) may cause delayed export for high-volume consumers. Tune for lower latency at the cost of slightly higher CPU:
# OTEL_BSP_* environment variables (applies to both Node.js and Python)
OTEL_BSP_MAX_QUEUE_SIZE: 4096
OTEL_BSP_SCHEDULED_DELAY: 1000
OTEL_BSP_MAX_EXPORT_BATCH_SIZE: 512
Sampling interaction. When using head-based sampling, the sampling decision is made at the root span. All child spans — including those created across async boundaries — inherit the decision via TraceFlags. Ensure OTEL_TRACES_SAMPLER=parentbased_traceidratio is set so consumers honour the producer’s decision and do not independently re-sample, which would break the parent-child link.
Troubleshooting FAQ
Q: Why do spans lose their parent_id across an asyncio.create_task() call?
asyncio.create_task() snapshots the current context at call time. If the parent span was started after the create_task() call, or if context.attach() was called after the snapshot was taken, the task’s copy of the context does not include the parent span. Capture context.get_current() before the create_task() call, then call context.attach(parent_ctx) as the first line inside the coroutine.
Q: My Express middleware uses AsyncLocalStorage but context is missing inside a setTimeout callback. Why?
setTimeout schedules its callback in the macrotask queue after the current synchronous execution completes. If asyncLocalStorage.run() has already returned by then, the store is no longer active for that callback. Wrap the callback explicitly with als.run(store, callback) or use als.bind(callback) at scheduling time to capture the store reference before the scope exits.
Q: How do I propagate trace context through a Python ThreadPoolExecutor?
ThreadPoolExecutor workers receive no context by default. Use contextvars.copy_context().run(fn, *args) as the callable submitted to the executor. This copies the full context snapshot — including the active OTel span — into the worker thread’s environment. See Step 5 above for the complete pattern.
Q: What is the safest way to pass baggage across a Kafka consumer boundary in Node.js?
Inject both trace context and baggage into the Kafka message headers at produce time using propagation.inject(context.active(), headers). At consume time, call propagation.extract(context.active(), headers) before creating any spans, then run all span work inside als.run(extractedCtx, ...). This ensures the full context — trace headers and baggage — is restored for the entire consumer processing chain.
Q: After fixing context propagation, some spans still appear as roots in Jaeger. What should I check?
Verify that OTEL_PROPAGATORS is set identically on every service — a mismatch between a producer emitting W3C traceparent and a consumer configured for B3 headers causes extract() to return an empty context, making the new span a root. Also confirm the SDK is initialised before any HTTP server or queue consumer binds; spans created before SDK init cannot attach to a propagated context.
Related
- Fixing Dropped Spans in Async Python FastAPI Routes — deep dive into FastAPI middleware and dependency-injection context restoration
- Trace Context in Multi-Threaded Environments — extends these patterns to Java and Go concurrency models
- Auto-Instrumentation vs Manual Span Creation — when auto-instrumentation fails at async boundaries and manual wrapping is required
- Context Propagation Across Service Meshes — sidecar proxy and eBPF-layer context injection that complements SDK-level fixes
- Propagating Trace Context Through Kafka Consumers — end-to-end Kafka producer/consumer propagation patterns
↑ Back to SDK Implementation & Context Propagation