Manual span creation for custom business logic
When an async queue consumer, background scheduler, or in-process pipeline spawns work outside a standard HTTP/gRPC request lifecycle, the only reliable way to maintain trace continuity is to capture the active context propagation object before crossing the concurrency boundary and restore it explicitly inside the worker.
Context and when it matters
Framework-level auto-instrumentation patches known entry and exit points (Express routes, Django views, gRPC handlers), but it cannot infer execution semantics for work dispatched to a ThreadPoolExecutor, an executor call made from an asyncio event loop, a Celery worker, or a cron-triggered job. When that dispatch happens without an explicit context hand-off, the worker starts with an empty context, and the first span it creates becomes the root of a new trace. The result is orphaned spans: child work appears as an unrelated root trace with a different trace_id, the span hierarchy breaks, and latency attribution becomes meaningless.
This scenario is distinct from simply choosing between auto and manual instrumentation at project setup time. It arises specifically when business logic crosses a concurrency primitive — thread handoff, event-loop task creation, message-queue consumption — and the SDK’s context storage mechanism cannot follow automatically.
Core mechanism: why the context does not cross automatically
The OpenTelemetry SDK stores the active context in a runtime-specific scope object: a contextvars.ContextVar in Python (the default runtime context), an AsyncLocalStorage instance in Node.js, a ThreadLocal in Java, or a context.Context value passed explicitly through every Go function call. When a concurrency primitive creates a new execution scope that does not inherit or copy that storage, such as a pooled worker thread, the scope starts with an empty context that carries none of the SDK’s trace state.
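The behaviour described above can be reproduced with the standard library alone. The sketch below is illustrative only: active_trace is a hypothetical stand-in for the SDK's context slot, not an OpenTelemetry name, and the result assumes a standard CPython build where pooled threads do not inherit the caller's context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the SDK's context slot (not an OpenTelemetry name).
active_trace = contextvars.ContextVar("active_trace", default=None)


def read_slot():
    """Return whatever the current execution scope sees in the slot."""
    return active_trace.get()


def demo():
    active_trace.set("trace-abc")  # set on the calling thread
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: the worker thread has its own, empty context.
        plain = pool.submit(read_slot).result()
        # Explicit hand-off: run the callable inside a snapshot of the caller's context.
        snapshot = contextvars.copy_context()
        carried = pool.submit(snapshot.run, read_slot).result()
    return plain, carried


plain, carried = demo()
# plain is None, carried == "trace-abc"
```

The plain submission sees the default value because nothing copied the caller's context into the worker thread; the second submission sees the value because the snapshot was handed over explicitly.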
The observable failure symptoms are consistent across runtimes:
- Orphaned spans with parent_id equal to 0000000000000000 or absent entirely.
- Fragmented service dependency graphs where downstream calls appear as independent root traces.
- Inaccurate latency attribution, because wall-clock time is measured from the async boundary rather than the originating request.
- Silent baggage drops, causing correlation IDs or tenant metadata to vanish before reaching the worker.
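The first symptom in the list above can be checked mechanically against exported span data. The following is a minimal sketch under stated assumptions: SpanRecord, find_orphans, and the sample values are hypothetical and do not correspond to any exporter's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

ZERO_PARENT = "0" * 16  # all-zero parent_id, as described in the symptom list


@dataclass(frozen=True)
class SpanRecord:
    """Hypothetical flattened span row; not a real exporter schema."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str


def find_orphans(spans, entry_points):
    """Return names of root spans that are not known request entry points."""
    orphans = []
    for s in spans:
        is_root = s.parent_id is None or s.parent_id == ZERO_PARENT
        if is_root and s.name not in entry_points:
            orphans.append(s.name)
    return orphans


# Made-up sample data for illustration.
spans = [
    SpanRecord("t1", "a1", None, "POST /orders"),
    SpanRecord("t1", "b2", "a1", "validate_payment"),
    SpanRecord("t2", "c3", ZERO_PARENT, "background_fulfillment"),
]
print(find_orphans(spans, entry_points={"POST /orders"}))
# -> ['background_fulfillment']
```

A background span that shows up as a root under its own trace_id is the signature of a missing context hand-off.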
The fix in every case follows the same three-step pattern: capture → attach → detach.
Minimal working implementation
The code below shows the corrected Python pattern. Every numbered comment maps to the tracing concept it implements.
from concurrent.futures import ThreadPoolExecutor

from opentelemetry import context, trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Wire TracerProvider once at application startup
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service", "1.0.0")
executor = ThreadPoolExecutor(max_workers=4)

# validate() and fulfill() are the application's own functions, defined elsewhere.

def process_order(order_id: str) -> None:
    # HTTP handler, auto-instrumented or wrapped manually
    with tracer.start_as_current_span("validate_payment") as span:
        span.set_attribute("order.id", order_id)
        payment_status = validate(order_id)

        # 1. Capture the active context BEFORE crossing the thread boundary,
        #    while validate_payment is still the current span.
        #    context.get_current() reads the calling thread's context synchronously.
        current_ctx = context.get_current()

        def worker_wrapper(ctx, oid: str, status: str) -> None:
            # 2. Attach the captured context so this thread inherits the parent span.
            #    context.attach() returns a token needed for cleanup.
            token = context.attach(ctx)
            try:
                # 3. Any span started here becomes a child of validate_payment.
                with tracer.start_as_current_span("background_fulfillment") as ws:
                    ws.set_attribute("order.id", oid)
                    ws.set_attribute("messaging.operation", "process")
                    fulfill(oid, status)
            finally:
                # 4. Detach unconditionally to prevent context leaks across
                #    thread-pool reuse: a worker thread's context persists between tasks.
                context.detach(token)

        executor.submit(worker_wrapper, current_ctx, order_id, payment_status)
Runtime-specific attach/detach patterns
The same three-step logic applies in every runtime, but the primitives differ:
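# Python asyncio: use contextvars.copy_context() to snapshot ContextVars
import asyncio
import contextvars

def fulfillment_job(oid: str, status: str) -> None:
    # Runs inside ctx_copy.run(), so no explicit attach/detach is needed here.
    with tracer.start_as_current_span("background_fulfillment"):
        fulfill(oid, status)

async def http_handler(order_id: str) -> None:
    with tracer.start_as_current_span("validate_payment"):
        payment_status = validate(order_id)
        # copy_context() snapshots ALL ContextVars, including the OTel slot.
        # Take the snapshot while validate_payment is still the active span.
        ctx_copy = contextvars.copy_context()
        asyncio.get_running_loop().run_in_executor(
            None, ctx_copy.run, fulfillment_job, order_id, payment_status
        )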
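For the asyncio case, a standard-library alternative may be worth considering: asyncio.to_thread (Python 3.9+) propagates the current contextvars context to the worker thread, whereas loop.run_in_executor does not. The sketch below uses a hypothetical request_id variable as a stand-in for the SDK's slot and assumes a standard CPython build.

```python
import asyncio
import contextvars

# Hypothetical stand-in for any ContextVar-backed state, such as the OTel slot.
request_id = contextvars.ContextVar("request_id", default=None)


def blocking_job():
    """Blocking work that reads context-local state."""
    return request_id.get()


async def handler():
    request_id.set("req-42")
    loop = asyncio.get_running_loop()
    # run_in_executor does not copy the context into the worker thread.
    plain = await loop.run_in_executor(None, blocking_job)
    # asyncio.to_thread runs the function inside a copy of the current context.
    carried = await asyncio.to_thread(blocking_job)
    return plain, carried


plain, carried = asyncio.run(handler())
# plain is None, carried == "req-42"
```

When the work does not need a custom executor, to_thread removes the need for a manual snapshot; with a custom executor, the copy_context() pattern still applies.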
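// Node.js: AsyncLocalStorage is the SDK's context store
const { context, trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');

function processOrder(orderId) {
  const span = tracer.startSpan('validate_payment');
  // Build a context that carries validate_payment as the active span
  const ctx = trace.setSpan(context.active(), span);
  span.end();

  // context.bind() wraps the callback so it runs in the captured context
  const boundWorker = context.bind(ctx, () => {
    // Passing ctx explicitly keeps the parent link even without a context manager
    const ws = tracer.startSpan('background_fulfillment', {}, ctx);
    try {
      fulfill(orderId);
    } finally {
      ws.end();
    }
  });

  // Works for in-process callbacks (setImmediate, setTimeout, process.nextTick).
  // worker_threads and BullMQ cross a serialization boundary and need
  // propagation.inject()/extract() instead of bind().
  setImmediate(boundWorker);
}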
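// Java: wrap the Runnable with Context.current().wrap() before submitting.
// tracer, executorService, orderId, and paymentStatus are assumed to be in scope.
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import java.util.concurrent.ExecutorService;

Context capturedCtx = Context.current();
executorService.submit(capturedCtx.wrap(() -> {
    // wrap() already makes capturedCtx current on the worker thread;
    // setParent() states the parent explicitly.
    Span ws = tracer.spanBuilder("background_fulfillment")
        .setParent(capturedCtx)
        .startSpan();
    try (Scope s = ws.makeCurrent()) {
        fulfill(orderId, paymentStatus);
    } finally {
        ws.end();
    }
}));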
Decision criteria
Use explicit manual context capture when any of these conditions apply:
- Work is dispatched to a thread pool, process pool, or event-loop task after the framework handler has already returned its response.
- The consumer is triggered by a message queue (Kafka, RabbitMQ, SQS) or a scheduled job rather than an inbound HTTP/gRPC call.
- The worker may run minutes or hours after the originating request, in which case create a new root trace and add a SpanLink to the originating trace_id rather than holding the parent span open.
- You need baggage metadata (tenant ID, user ID, correlation key) to survive the concurrency boundary intact.
Use the parent’s context as-is when work is awaited directly within the same coroutine; asyncio keeps the context intact in that case.
Common pitfalls
- Calling context.get_current() inside the worker instead of before dispatch. The worker’s context slot is already empty at that point; you capture nothing useful.
- Omitting context.detach(token). Thread pools reuse threads across tasks. Without detach, the next task on that thread inherits stale trace state from the previous one, producing incorrect parent-child relationships.
- Mixing auto-instrumentation and manual spans on the same async boundary without isolation. Auto-instrumented libraries inject and extract context concurrently. On a shared boundary this causes race conditions, duplicate spans, and inflated latency figures. If you must mix both strategies, isolate the path with an explicit Context object rather than relying on the implicit active-context slot.
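The second pitfall can be observed without any tracing library. The sketch below is an illustration under assumptions: active_trace is a hypothetical stand-in for the SDK's slot, set() without reset() plays the role of attach without detach, and a single-worker pool forces thread reuse.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the SDK's context slot.
active_trace = contextvars.ContextVar("active_trace", default=None)


def clean_task(trace_id):
    token = active_trace.set(trace_id)   # analogous to context.attach()
    try:
        pass                             # real work would go here
    finally:
        active_trace.reset(token)        # analogous to context.detach()


def leaky_task(trace_id):
    active_trace.set(trace_id)           # "attach" with no matching reset


def observe():
    """What the next task on the same thread sees before doing anything."""
    return active_trace.get()


with ThreadPoolExecutor(max_workers=1) as pool:  # one thread, so it is reused
    pool.submit(clean_task, "trace-A").result()
    after_clean = pool.submit(observe).result()
    pool.submit(leaky_task, "trace-B").result()
    after_leak = pool.submit(observe).result()
# after_clean is None, after_leak == "trace-B"
```

The task that follows the leaky one starts with another request's state already in place, which is how unrelated spans end up under the wrong parent.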
Troubleshooting FAQ
Why does my worker produce a new root span even after I call context.attach()?
The most common cause is calling context.attach() after tracer.start_as_current_span(). The span constructor reads the active context at call time; attach must happen first.
How do I verify the fix without deploying?
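Use InMemorySpanExporter in a unit test. After the executor finishes, look up the two finished spans by name and assert that child.context.trace_id == parent.context.trace_id and child.parent.span_id == parent.context.span_id. Do not rely on list positions: spans are exported in the order they end, which depends on thread timing. If the assertion fails, the context hand-off is still broken.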
What happens if I forget context.detach() in a long-running service?
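The attached context stays active on that worker thread after the task finishes. Later tasks scheduled on the same thread inherit it, which over time produces incorrect parent-child relationships for unrelated requests. The stale context also keeps its span and baggage objects reachable for longer than intended, which can look like a slow memory leak in a long-running process.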
Related
- Fixing Dropped Spans in Async Python FastAPI Routes: detailed diagnosis for asyncio context detachment in FastAPI
- Handling Async Boundaries in Node.js and Python: AsyncLocalStorage and contextvars deep-dive
- Trace Context in Multi-Threaded Environments: Java ExecutorService wrappers and Go goroutine patterns
- Debugging Orphaned Spans in Async Workflows: how to read a span waterfall to identify the exact break point