OpenTelemetry SDK Setup for Backend Services

Problem Framing

Services that rely solely on auto-instrumentation agents emit telemetry for framework-level calls — HTTP routing, database drivers, gRPC stubs — but produce silent gaps wherever the interesting work actually happens: custom serialization, multi-step business transactions, background job dispatch, or fan-out to internal queues. When those gaps fall inside a trace ID chain, operators see a root span that takes 800 ms but contains only two child spans totalling 50 ms. The missing 750 ms is invisible, and root-cause analysis stalls. Programmatic OpenTelemetry SDK initialization resolves this by giving teams deterministic control over provider lifecycle, resource identity, export pipeline parameters, and context propagation boundaries — controls that agent defaults cannot provide.

Prerequisites

Concept Deep-Dive: SDK Initialization Flow

The diagram below shows the order in which SDK components must be assembled. Components at the top of the graph must be fully initialized before the components below them can function correctly. Registering the provider late is the most common cause of missing spans in new deployments: recent opentelemetry-api releases hand early callers of get_tracer() a proxy tracer that begins delegating once a provider is set, but any span started before that point is non-recording and is never exported.

[Figure: OpenTelemetry SDK initialization sequence. Resource.create() (service.name, version, env) feeds TracerProvider and LoggerProvider (resource=resource). These feed BatchSpanProcessor (queue, batch, delay) and BatchLogRecordProcessor respectively, which feed the OTLPSpanExporter (endpoint, TLS, timeout). Global registration via set_tracer_provider() and set_global_textmap() must precede all instrumentation library imports.]

The sequence matters: Resource first, then providers, then processors wrapping exporters, then global registration. Any instrumentation call that executes before step ⑤ receives a no-op implementation.

Step-by-Step Implementation

Step 1 — Define resource attributes and initialize providers

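Resource attributes establish the service identity embedded in every emitted span. Precedence between environment variables and programmatic values is easy to get wrong: in the Python SDK, attributes passed to Resource.create() are merged on top of those detected from OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES, so an explicitly passed key wins unless your code reads the environment itself, as the example below does for service.name. Other language SDKs may resolve this differently. Document the intended precedence explicitly to prevent configuration drift across CI/CD pipelines.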

import os
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk._logs import LoggerProvider      # logs SDK lives in the underscore-prefixed _logs package
from opentelemetry import trace
from opentelemetry._logs import set_logger_provider

resource = Resource.create({
    "service.name":           os.environ.get("OTEL_SERVICE_NAME", "payment-processor"),
    "service.version":        os.environ.get("APP_VERSION", "unknown"),
    "deployment.environment": os.environ.get("ENV", "development"),
    "host.id":                os.environ.get("HOSTNAME", "unknown"),
})

tracer_provider  = TracerProvider(resource=resource)
logger_provider  = LoggerProvider(resource=resource)

# Must happen before any library calls get_tracer() or get_logger()
trace.set_tracer_provider(tracer_provider)
set_logger_provider(logger_provider)

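Scoped vs. global providers: Global providers simplify cross-library instrumentation but complicate unit testing, because the global provider can be set only once per process. In test suites, instantiate a fresh TracerProvider per test and obtain tracers from it directly, either with provider.get_tracer(...) or by passing tracer_provider=provider to trace.get_tracer(), to avoid state leakage between tests.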

Step 2 — Configure the export pipeline

The BatchSpanProcessor is the appropriate choice for production. The SimpleSpanProcessor exports synchronously each time a span ends, so every span pays a full exporter round trip on the request path. Batch parameters must be tuned to match your service’s peak request rate and container memory limits.

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(
    endpoint="otel-collector.monitoring.svc:4317",
    insecure=False,   # Require TLS in production
    timeout=5,        # Exporter call timeout in seconds
)

span_processor = BatchSpanProcessor(
    otlp_exporter,
    max_queue_size=2048,         # Spans held in memory before export
    schedule_delay_millis=5000,  # Max wait before forced flush
    max_export_batch_size=512,   # Spans per network call
    export_timeout_millis=30000, # Per-batch export deadline
)

tracer_provider.add_span_processor(span_processor)

Queue sizing guidance: as a conservative upper bound, max_queue_size should accommodate your peak span rate multiplied by the schedule_delay_millis interval. If 1000 requests/second each produce three spans, up to 1000 × 3 × 5 = 15 000 spans can accumulate over a 5-second window, so a max_queue_size of 2048 risks dropping spans under that load. Watch for overflow before it becomes user-visible data loss: the Python SDK logs a warning when the queue is full, and a counter such as otel_span_processor_dropped_spans_total is useful if your setup exposes one.
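The sizing rule above is plain arithmetic and can be checked with a few lines of code. The sketch below is a back-of-the-envelope helper, not part of the OpenTelemetry API: the function name required_queue_size and the headroom factor are illustrative assumptions, and it models the worst case in which nothing is exported until the delay elapses.

```python
import math


def required_queue_size(requests_per_second: float,
                        spans_per_request: float,
                        schedule_delay_millis: int,
                        headroom: float = 1.0) -> int:
    """Worst-case number of spans that accumulate during one export interval.

    Hypothetical helper for capacity planning; `headroom` is an assumed
    safety multiplier, not an SDK setting.
    """
    spans_per_second = requests_per_second * spans_per_request
    interval_seconds = schedule_delay_millis / 1000
    return math.ceil(spans_per_second * interval_seconds * headroom)


# The example from the text: 1000 req/s, three spans each, 5-second delay.
needed = required_queue_size(1000, 3, 5000)
print(needed)         # 15000
print(needed > 2048)  # True: a 2048-span queue would overflow at this load
```

Passing headroom=1.5 gives 22 500, which leaves room for short bursts above the measured peak. Whether to raise max_queue_size or shorten schedule_delay_millis instead depends on the container's memory budget.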

Step 3 — Register propagators globally

W3C TraceContext (traceparent, tracestate) is the current standard for propagating trace context across service boundaries. B3 remains necessary when integrating with older services or Zipkin backends. Propagators must be registered before any HTTP or gRPC middleware executes, because middleware reads the global textmap at request time.

from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.propagators.b3 import B3MultiFormat

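set_global_textmap(CompositePropagator([
    B3MultiFormat(),                  # B3 for legacy services; extracted first
    TraceContextTextMapPropagator(),  # W3C last: overrides B3 when both are present
]))

Propagator ordering matters: CompositePropagator runs each extractor in order and passes the resulting context to the next one, so when a request carries both header formats the propagator listed last determines the extracted span context. Place W3C after B3 so modern services prefer the standard format. On injection, every listed propagator writes its own headers.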

Step 4 — Add manual spans for domain logic

Auto-instrumentation covers the infrastructure layer. Add manual spans at business logic boundaries — transaction stages, external API calls with retry logic, or multi-step workflows — where you need attribute-level visibility. See auto-instrumentation vs manual span creation for a decision matrix on where each approach pays off.

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("payment.domain", schema_url="https://opentelemetry.io/schemas/1.24.0")

# execute_gateway_call() and GatewayTimeoutError are application-defined placeholders.
def process_refund(order_id: str, amount: float) -> dict:
    with tracer.start_as_current_span("refund.process") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        span.set_attribute("payment.currency", "USD")

        try:
            result = execute_gateway_call(order_id, amount)
            span.set_status(StatusCode.OK)
            return result
        except GatewayTimeoutError as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, description="Gateway timeout")
            raise

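Always use start_as_current_span as a context manager: it guarantees span closure even when exceptions propagate. By default it also records any exception that escapes the block and sets the span status to ERROR, so the explicit record_exception() and set_status() calls in the example are strictly needed only if you disable that behavior with record_exception=False and set_status_on_exception=False. Calling span.end() manually is error-prone in exception-heavy code paths where early returns or nested raises can bypass the call.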

Step 5 — Handle async boundaries explicitly

Context propagation across async boundaries is the most common source of broken trace chains. Python’s contextvars module automatically propagates context within asyncio tasks created with asyncio.create_task(), but it does not copy context into threads. Always use explicit attach/detach when submitting work to a ThreadPoolExecutor or other thread pool. For Node.js patterns, see handling async boundaries in Node.js and Python.

import asyncio
from concurrent.futures import ThreadPoolExecutor
from opentelemetry import context, trace

tracer = trace.get_tracer("async.worker")

# One shared pool; creating an executor per request leaks threads.
# The pool size is illustrative.
executor = ThreadPoolExecutor(max_workers=8)

def blocking_db_query(ctx: context.Context, query: str) -> list:
    """Runs in a thread, so it must re-attach the caller's context."""
    token = context.attach(ctx)
    span = tracer.start_span("db.query")
    inner_token = context.attach(trace.set_span_in_context(span))
    try:
        return execute_query(query)  # application-defined blocking call
    finally:
        span.end()
        context.detach(inner_token)
        context.detach(token)

async def handle_request(query: str) -> list:
    with tracer.start_as_current_span("request.handler"):
        # Capture the active context before crossing the thread boundary.
        # Context objects are immutable, so no copy is needed.
        current_ctx = context.get_current()
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            executor,
            blocking_db_query,
            current_ctx,
            query,
        )

For trace context in multi-threaded environments, the rule is: always pass the captured context as an explicit argument to the worker function; never rely on implicit thread-local inheritance.
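The rule is easier to trust after seeing the underlying mechanism in isolation. The sketch below uses only the standard library: current_trace is a hypothetical ContextVar standing in for the slot where the active span is stored, and the worker names are illustrative. It shows, on standard CPython builds, that a task created with asyncio.create_task() sees the caller's value, a pool thread does not, and passing the value explicitly with a set/reset pair (the analogue of attach/detach) restores it.

```python
import asyncio
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the active-span slot kept in contextvars.
current_trace = contextvars.ContextVar("current_trace", default="none")


def worker_implicit() -> str:
    # Runs in a pool thread: sees the default, not the caller's value.
    return current_trace.get()


def worker_explicit(trace_value: str) -> str:
    # Re-establish the caller's value, then restore the previous state,
    # mirroring context.attach() / context.detach().
    token = current_trace.set(trace_value)
    try:
        return current_trace.get()
    finally:
        current_trace.reset(token)


async def child_task() -> str:
    # Tasks run in a copy of the creating coroutine's context.
    return current_trace.get()


async def main() -> dict:
    current_trace.set("trace-abc")
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=1) as pool:
        implicit = await loop.run_in_executor(pool, worker_implicit)
        explicit = await loop.run_in_executor(
            pool, worker_explicit, current_trace.get()
        )
    task_value = await asyncio.create_task(child_task())
    return {
        "task": task_value,
        "thread_implicit": implicit,
        "thread_explicit": explicit,
    }


results = asyncio.run(main())
print(results)
```

The thread_implicit entry is the broken-trace case: the worker runs, but anything it records has no parent.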

Step 6 — Add graceful shutdown

Failing to flush the export queue during shutdown causes telemetry loss during rolling deployments. Register shutdown handlers at application startup, not in request handlers.

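import atexit
import signal
import sys

def _shutdown():
    # Flush queued telemetry and close both pipelines.
    tracer_provider.shutdown()
    logger_provider.shutdown()

atexit.register(_shutdown)
# SIGTERM skips atexit handlers by default; turn it into a normal interpreter
# exit so _shutdown() runs exactly once.
signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))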

Verification

After deployment, confirm the pipeline is healthy with these checks:

# 1. Send a synthetic trace using otel-cli
otel-cli exec --service payment-processor \
  --name "smoke-test" \
  --otlp-endpoint http://otel-collector:4317 \
  -- echo "trace sent"

# 2. Confirm the trace appears in Jaeger
curl "http://jaeger:16686/api/traces?service=payment-processor&limit=1" | \
  python3 -m json.tool | grep traceID

# 3. Check the collector's span receipt metric
#    (8888 is the collector's default internal-metrics port; adjust if yours differs)
curl -s http://otel-collector:8888/metrics | \
  grep otelcol_receiver_accepted_spans

Expected trace output in the collector log:

{"traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7",
 "operationName":"refund.process","service":"payment-processor",
 "status":"STATUS_CODE_OK","attributes":{"order.id":"ord-7421","payment.amount":49.99}}

Confirm that:

  • The traceId in the child service matches the traceId injected by the parent
  • payment.amount and order.id attributes appear — not just HTTP framework attributes
  • No STATUS_CODE_UNSET spans in the refund path (means exception handling ran correctly)

Edge Cases & Gotchas

  1. Provider initialized after get_tracer() is called. Libraries that call get_tracer() at import time obtain their tracer before the SDK has set the global provider, and any span started before registration is a no-op. Move set_tracer_provider() to the top of your entry point, before importing any framework or instrumentation library.

  2. Thread-local storage leakage. New threads do not inherit the parent thread's contextvars state, and values stored in threading.local() are visible only to the thread that set them. Code that stores span context in threading.local and assumes it propagates to child threads will silently break traces. Always use context.attach() / context.detach() explicitly.

  3. Sidecar double-injection. Envoy and Istio sidecars can add traceparent headers to requests automatically. If the SDK and the sidecar use different propagation formats, or both write headers on the same outbound request, the trace can be split or corrupted. Note that OTEL_PROPAGATORS=tracecontext selects the W3C format for both extraction and injection; it does not make the SDK extract-only. Align the sidecar and SDK on one format and verify that they agree on the same trace ID before the request is forwarded. For full patterns, see context propagation across service meshes.

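  4. Environment variables and programmatic resource values disagree silently. If a Kubernetes pod spec sets OTEL_SERVICE_NAME=wrong-service and your code sets a different service.name via Resource.create(), only one of them reaches the backend, and which one depends on how the resource is built (the Step 1 example reads OTEL_SERVICE_NAME itself, so the environment variable wins there). A mismatch produces misrouted traces in backends that partition by service name.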

  5. Sampling decision not propagated to child spans. If OTEL_TRACES_SAMPLER=parentbased_traceidratio is set but the parent service does not set the sampled flag in the traceparent header, the child service drops all spans for that trace. Verify that the header carries the flag: the trace-flags field, the final two hex digits of traceparent, must be 01 for sampled traces.

  6. Clock skew between services. Child spans that appear to start before their parent, or after the parent has already ended, often indicate clock drift between hosts (the latter can also be legitimate fire-and-forget work). Skew makes waterfall views in Jaeger and Tempo display negative gaps. Enforce NTP synchronization and consider using monotonic clocks for duration calculation while using wall-clock time only for start_time.
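The sampled-flag check in gotcha 5 can be automated when inspecting captured headers. The following is a minimal sketch of a parser for the four-field traceparent form defined by W3C Trace Context; the names TraceParent and parse_traceparent are illustrative, and the sketch does not handle longer headers that future versions of the format may allow.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the trace-flags byte is the sampled flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


sampled = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
unsampled = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-00"
)
print(sampled.sampled, unsampled.sampled)  # True False
```

With a parent-based sampler, a downstream service that receives the second header records nothing for that trace, which is the behavior described in gotcha 5.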

Performance & Scale Notes

Attribute cardinality. Avoid using high-cardinality values — user IDs, request UUIDs, or raw SQL statements — as span attribute keys. Attribute keys drive backend index cardinality; misuse can exhaust Jaeger or Tempo index budgets in hours. Use fixed-cardinality keys (order.status, payment.method) and put high-cardinality values in attribute values where the backend can decide whether to index them.

Exporter protocol selection. OTLP over gRPC (OTLPSpanExporter from the proto.grpc package) generally performs well at high volume because it multiplexes streams over HTTP/2 and uses binary protobuf encoding. OTLP over HTTP is easier to route through standard proxies and to debug; note that the Python HTTP exporter also sends protobuf payloads rather than JSON. For deployments where spans cross a network boundary to a managed SaaS collector, HTTP is often the safer choice because some load balancers and proxies handle gRPC traffic poorly.

BatchSpanProcessor tuning table.

Parameter               Conservative           Moderate             Aggressive
max_queue_size          512                    2048                 8192
schedule_delay_millis   2000                   5000                 10000
max_export_batch_size   128                    512                  1024
Memory overhead         ~5 MB                  ~20 MB               ~80 MB
Best for                Low-traffic services   General production   High-throughput ingest

Context propagation overhead. Explicit context.attach() / context.detach() calls add roughly 10–50 µs per thread-boundary crossing. Avoid propagating context into tight loops or high-frequency micro-tasks (e.g., per-packet network callbacks). Instrument at the logical operation level, not the physical call level.

Head-based sampling at the SDK vs. Collector. SDK-level TraceIdRatioBased sampling is cheap and adds negligible latency, but it makes the sampling decision before the full trace is known, which means slow or erroring requests can be sampled out. For error-aware sampling, push the decision to the OpenTelemetry Collector’s tail sampling processor.
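To see why a head-based decision cannot be error-aware, it helps to look at what a ratio sampler has to work with. The sketch below illustrates the general technique of comparing the low 64 bits of the trace ID against a threshold derived from the ratio. It is a simplified illustration under that assumption, not the SDK's sampler class, and should_sample is a hypothetical name.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of the 128-bit trace ID


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head-based decision from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


trace_id = int("4bf92f3577b34da6a3ce929d0e0e4736", 16)
print(should_sample(trace_id, 0.5))   # False
print(should_sample(trace_id, 0.75))  # True
```

The only input is the trace ID, so every service that applies the same ratio reaches the same verdict without coordination. Duration and error status are not available when the decision is made, which is why error-aware policies belong in a tail sampler.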

Troubleshooting FAQ

Q: Why do I see NoOpTracer output in logs even though I initialized the SDK?

Auto-instrumentation libraries (FastAPI middleware, SQLAlchemy integration, requests session hooks) often call get_tracer() at import time, before your application bootstrap runs, and spans started before the provider is registered are no-ops. The fix is to call set_tracer_provider() before importing any instrumentation library. In Python, structure your entry point as: SDK bootstrap → provider registration → framework/library imports → application startup.

Q: Spans appear in the first service but not in downstream services. What is wrong?

The downstream service is not extracting the traceparent header. Confirm that: (a) set_global_textmap() has been called with TraceContextTextMapPropagator in the downstream process; (b) the HTTP framework instrumentation is installed and active; and (c) no reverse proxy or API gateway is stripping the traceparent header. Use curl -v to inspect the header before and after each hop.

Q: How do I prevent span drops under high traffic?

First, confirm drops are occurring: look for queue-full warnings in the SDK log, or for otel_span_processor_dropped_spans_total > 0 if your setup exposes that counter. Then increase max_queue_size and/or reduce schedule_delay_millis so the processor flushes more often. If the exporter is the bottleneck (slow network to the collector), horizontally scale the OpenTelemetry Collector and load-balance exporter endpoints.

Q: How do I propagate context from an asyncio task into a ThreadPoolExecutor?

Call context.get_current() in the async coroutine before calling run_in_executor(), pass the captured context as an argument to the worker function, and call context.attach(ctx) inside the worker, detaching the returned token when the work finishes. See Step 5 above for the full pattern. Prefer the OpenTelemetry context module over calling contextvars.copy_context() directly so the code stays compatible with the SDK’s context manager stack. For the complete Python async guide, see step-by-step OpenTelemetry Python SDK integration.

Q: Should I disable SDK propagation injection when behind an Envoy sidecar?

Not necessarily. The safest pattern: configure OTEL_PROPAGATORS=tracecontext so the SDK extracts the sidecar-injected traceparent on inbound requests and continues the same trace. On outbound requests, the SDK will inject into headers before the sidecar sees them. If the sidecar is also injecting, verify they agree on the trace ID — most modern Envoy configurations respect an existing traceparent header and do not overwrite it. Test by checking that traceId is identical across all hop-by-hop spans in Jaeger.

