Tenant Context Propagation in Multi-Tenant SaaS
Problem Framing
When a tenant identifier drops mid-flight in a multi-service SaaS system, the operational damage is immediate and hard to diagnose. Billing pipelines aggregate usage under the wrong account. Tenant-aware rate limiters see requests with no identity and either pass them all through or reject them all. Most critically, your observability pipeline loses the ability to correlate logs, metrics, and traces to a specific customer — turning every cross-tenant incident into a manual log trawl. The challenge is not injecting the tenant ID once at the edge; it is keeping that ID attached to every span, every async message, and every background job as the request fans out across dozens of services.
Prerequisites
Before implementing the patterns on this page, verify the following are in place:
- OpenTelemetry SDK v1.10 or later in all services (stable Baggage API)
- The W3C tracecontext and baggage propagators registered in every process, not just the ingress service
- A consistent tenant identifier format (UUIDv4 recommended) enforced at account creation
- Reverse proxies and API gateways configured to forward rather than strip the baggage header
- Familiarity with how OpenTelemetry Baggage differs from Span Attributes; both are used here, for different purposes
How Tenant Context Flows Through a Distributed Request
The diagram below shows the lifecycle of a tenant.id from the API gateway through synchronous service calls and then across an async Kafka boundary.
The key insight: synchronous HTTP calls inherit baggage automatically once propagators are configured. Async boundaries — Kafka, SQS, RabbitMQ — break that automatic flow. You must serialise context into message headers on the producer side and reconstruct it on the consumer side.
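To make the serialise-and-reconstruct idea concrete, here is a minimal, standard-library-only sketch of what ends up on the wire. It is not the OpenTelemetry implementation: encode_baggage and decode_baggage are hypothetical helpers, the tenant ID is a made-up example, and per-member properties are simply ignored.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialise entries into a W3C-style baggage header value."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a baggage header value, ignoring per-member properties."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        # Drop any optional ";property" suffix, then split key from value.
        key_value = member.split(";", 1)[0]
        if "=" not in key_value:
            continue
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


# Producer side: context becomes a plain message header.
message_headers = {
    "baggage": encode_baggage({"tenant.id": "3f2c9a1e-7b4d-4c8e-9f10-2a6b5d8e1c47"})
}
# Consumer side: the same value is recovered from the header.
restored = decode_baggage(message_headers["baggage"])
```

In a real service the SDK's propagator does this work; the sketch only shows why nothing crosses an async boundary unless something writes the header and something else reads it back.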
Step-by-Step Implementation
Step 1 — Extract and Validate at the Ingress Layer
The API gateway or ingress controller is the only point where you can trust the tenant identifier. Extract it from the JWT tenant or sub claim, from subdomain routing, or from an X-Tenant-ID header (in that priority order). Validate the format against a strict allowlist regex and cross-reference against a tenant registry before passing anything downstream.
// Node.js ingress middleware (Express-style)
const { propagation, context } = require('@opentelemetry/api');

const TENANT_REGEX = /^[a-zA-Z0-9_-]{8,64}$/;

function injectTenantContext(req, res, next) {
  // Prefer JWT claim over raw header: headers can be spoofed by callers.
  // extractFromJWT is an application-specific helper that is not shown here.
  const tenantId = extractFromJWT(req) ?? req.headers['x-tenant-id'];
  if (!tenantId) {
    return next(Object.assign(new Error('Missing tenant context'), { status: 401 }));
  }
  if (!TENANT_REGEX.test(tenantId)) {
    return next(Object.assign(new Error('Invalid tenant format'), { status: 400 }));
  }
  // Build a Baggage holding the validated tenant ID and make it active
  const bag = propagation.createBaggage({ 'tenant.id': { value: tenantId } });
  const ctx = propagation.setBaggage(context.active(), bag);
  context.with(ctx, () => next());
}
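The middleware above covers the JWT and header sources. The sketch below illustrates the full priority order (JWT claim, then subdomain, then header) together with the registry cross-reference. It is a standard-library illustration under stated assumptions: resolve_tenant, TenantError, the example.com base domain, the in-memory registry set, and the 403 status for unknown tenants are all hypothetical.

```python
import re
from typing import Mapping, Optional

# Same shape as TENANT_REGEX in the Node.js middleware.
TENANT_REGEX = re.compile(r"^[a-zA-Z0-9_-]{8,64}$")


class TenantError(Exception):
    """Carries the HTTP status the ingress should return."""

    def __init__(self, message: str, status: int) -> None:
        super().__init__(message)
        self.status = status


def resolve_tenant(
    jwt_claims: Mapping[str, str],
    host: str,
    headers: Mapping[str, str],
    registry: set[str],
    base_domain: str = "example.com",  # hypothetical base domain
) -> str:
    # 1. JWT claim, 2. subdomain, 3. raw header (least trusted).
    candidate: Optional[str] = jwt_claims.get("tenant") or jwt_claims.get("sub")
    if not candidate and host.endswith("." + base_domain):
        candidate = host[: -len(base_domain) - 1]
    if not candidate:
        candidate = headers.get("x-tenant-id")

    if not candidate:
        raise TenantError("Missing tenant context", 401)
    if not TENANT_REGEX.fullmatch(candidate):
        raise TenantError("Invalid tenant format", 400)
    # Cross-reference the registry before anything goes downstream.
    if candidate not in registry:
        raise TenantError("Unknown tenant", 403)
    return candidate
```

A production registry lookup would normally hit a cache or database rather than a set; the control flow is the part worth copying.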
Step 2 — Attach tenant.id to OpenTelemetry Baggage
Once validated, write the tenant ID into OpenTelemetry Baggage so every downstream SDK call automatically forwards it. Also record it as a Span Attribute on the root span so your tracing backend can filter and aggregate by tenant without having to decode the baggage header.
// Go: attach to Baggage and record as a Span Attribute
import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/baggage"
    "go.opentelemetry.io/otel/trace"
)

func attachTenantContext(ctx context.Context, tenantID string) (context.Context, error) {
    // NewMember returns an error for keys or values that are not valid baggage
    m, err := baggage.NewMember("tenant.id", tenantID)
    if err != nil {
        return ctx, err
    }
    b, err := baggage.New(m)
    if err != nil {
        return ctx, err
    }
    ctx = baggage.ContextWithBaggage(ctx, b)
    // Also stamp the active span so Jaeger/Tempo can index it
    span := trace.SpanFromContext(ctx)
    span.SetAttributes(attribute.String("tenant.id", tenantID))
    return ctx, nil
}
// Java: makeCurrent() propagates via ThreadLocal; always close the Scope
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;

// ~50 ns overhead per request from Scope allocation; negligible vs. network latency
try (Scope scope = Baggage.current().toBuilder()
        .put("tenant.id", tenantId)
        .build()
        .makeCurrent()) {
    Span.current().setAttribute("tenant.id", tenantId);
    processRequest(); // downstream calls inherit the context
}
Step 3 — Configure Propagators and Reverse Proxies
A tenant ID in Baggage is only useful if the baggage header survives every network hop. Two places commonly strip it silently:
Reverse proxies (Nginx)
# nginx.conf: forward Baggage through to upstream services
location /api/ {
    proxy_pass http://backend;
    proxy_set_header baggage $http_baggage;
    proxy_set_header traceparent $http_traceparent;
    proxy_set_header tracestate $http_tracestate;
}
SDK propagator configuration
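# Java agent: ensure both propagators are active
otel.propagators=tracecontext,baggage
# Restrict which keys pass through to prevent arbitrary metadata injection.
# Check this property against your agent version's documentation: it is not one
# of the standard autoconfigure properties, so key filtering may need a custom
# propagator or processor instead.
otel.baggage.keys=tenant.id,request.region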
// Node.js SDK initialisation
const { NodeSDK } = require('@opentelemetry/sdk-node');
const {
  W3CBaggagePropagator,
  W3CTraceContextPropagator,
  CompositePropagator,
} = require('@opentelemetry/core');

const sdk = new NodeSDK({
  textMapPropagator: new CompositePropagator({
    propagators: [new W3CTraceContextPropagator(), new W3CBaggagePropagator()],
  }),
});
sdk.start();
Service mesh sidecars (Envoy, Linkerd) need the same treatment: configure the baggage header in the proxy’s allowed-headers list or it will be dropped at the sidecar layer.
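Where no built-in key restriction is available, the same allowlist idea can be applied to the raw header at the edge. The following is a minimal sketch, assuming the two permitted keys named above; filter_baggage_header is a hypothetical helper, not part of any OpenTelemetry SDK.

```python
ALLOWED_BAGGAGE_KEYS = frozenset({"tenant.id", "request.region"})


def filter_baggage_header(
    header: str, allowed: frozenset[str] = ALLOWED_BAGGAGE_KEYS
) -> str:
    """Return the header with every member whose key is not allowlisted removed."""
    kept = []
    for member in header.split(","):
        member = member.strip()
        key = member.split("=", 1)[0].strip()
        # Keep only well-formed key=value members with an approved key.
        if "=" in member and key in allowed:
            kept.append(member)
    return ",".join(kept)
```

Running this on the inbound baggage header before it reaches application code means a caller cannot smuggle extra keys past the ingress, whatever the downstream SDK configuration looks like.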
Step 4 — Preserve Context Across Async Boundaries
Context propagation across Kafka consumers requires explicit serialisation. The W3C Baggage propagator cannot inject headers automatically into a Kafka record — you must do it in the producer interceptor.
# Python Kafka producer: serialise context into record headers
from confluent_kafka import Producer
from opentelemetry import propagate


def publish_with_context(producer: Producer, topic: str, payload: bytes) -> None:
    headers: dict[str, str] = {}
    # Inject traceparent, tracestate, and baggage (including tenant.id) into headers
    propagate.inject(headers)
    producer.produce(
        topic,
        value=payload,
        headers=[(k, v.encode()) for k, v in headers.items()],
    )
    # Flushing per message keeps the example simple; batch flushes under real load
    producer.flush()
# Python Kafka consumer: reconstruct context before processing
from opentelemetry import context, propagate


def consume_message(msg) -> None:
    # Decode headers from bytes and restore the full OTel context
    carrier = {k: v.decode() for k, v in (msg.headers() or []) if v is not None}
    ctx = propagate.extract(carrier)
    token = context.attach(ctx)
    try:
        process_message(msg)  # tenant.id is now in Baggage and propagates further
    finally:
        # Always detach so the next message does not inherit this tenant
        context.detach(token)
For dead-letter queues and retry workers, include tenant.id in both the baggage header and a dedicated application-level field in the message envelope. That way, routing logic can read the tenant ID without parsing trace headers, and the trace link is preserved independently.
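One way to picture that dual placement is the sketch below. The envelope field names (tenant_id, error, payload) and both helper functions are assumptions made for illustration, not a prescribed schema.

```python
import json


def build_dlq_record(
    payload: dict, tenant_id: str, trace_headers: dict[str, str], error: str
) -> tuple[bytes, list[tuple[str, bytes]]]:
    """Return (value, headers) for a dead-letter record."""
    # Application-level copy: routing logic reads this without touching headers.
    envelope = {"tenant_id": tenant_id, "error": error, "payload": payload}
    value = json.dumps(envelope).encode()
    # Trace-level copy: baggage and traceparent travel as record headers.
    headers = [(key, val.encode()) for key, val in trace_headers.items()]
    return value, headers


def routing_tenant(value: bytes) -> str:
    """Read the tenant from the envelope only, never from trace headers."""
    return json.loads(value)["tenant_id"]
```

The returned pair matches the value and headers arguments that the producer in the earlier example passes to produce(), so a retry worker can republish the record unchanged.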
Step 5 — Assert Propagation at Each Service Hop
Add a middleware assertion in every service that reads the baggage and logs a warning (or rejects the request, in strict-isolation mode) when tenant.id is absent. This turns propagation failures into visible signals rather than silent data-quality problems.
// Go: per-hop tenant assertion middleware.
// Assumes an upstream handler (for example otelhttp) has already extracted
// the incoming baggage header into the request context.
import (
    "net/http"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/baggage"
    "go.opentelemetry.io/otel/trace"
)

func AssertTenantMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        bag := baggage.FromContext(r.Context())
        tenantID := bag.Member("tenant.id").Value()
        if tenantID == "" {
            // In strict mode: reject. In permissive mode: log and continue.
            http.Error(w, "propagation failure: missing tenant.id", http.StatusInternalServerError)
            return
        }
        // Stamp the span so Tempo/Jaeger can index this hop by tenant
        trace.SpanFromContext(r.Context()).SetAttributes(
            attribute.String("tenant.id", tenantID),
        )
        next.ServeHTTP(w, r)
    })
}
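The Go middleware shows only the strict branch. As a rough illustration of switching between strict and permissive modes, here is a standard-library WSGI sketch. It reads the raw baggage header rather than the OpenTelemetry context, which is a simplifying assumption, and both AssertTenantMiddleware and tenant_from_baggage are hypothetical names.

```python
import logging
from urllib.parse import unquote

logger = logging.getLogger("tenant_assert")


def tenant_from_baggage(header: str) -> str:
    """Return the tenant.id value from a baggage header, or '' if absent."""
    for member in header.split(","):
        key, sep, value = member.split(";", 1)[0].partition("=")
        if sep and key.strip() == "tenant.id":
            return unquote(value.strip())
    return ""


class AssertTenantMiddleware:
    def __init__(self, app, strict: bool = True) -> None:
        self.app = app
        self.strict = strict

    def __call__(self, environ, start_response):
        tenant_id = tenant_from_baggage(environ.get("HTTP_BAGGAGE", ""))
        if not tenant_id:
            if self.strict:
                # Strict isolation: fail the request loudly.
                start_response(
                    "500 Internal Server Error", [("Content-Type", "text/plain")]
                )
                return [b"propagation failure: missing tenant.id"]
            # Permissive: surface the gap as a log signal and continue.
            logger.warning("missing tenant.id on %s", environ.get("PATH_INFO", ""))
        return self.app(environ, start_response)
```

Starting in permissive mode and watching the warning rate before switching a service to strict mode is one way to roll this out without causing an outage.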
Step 6 — Apply Security and Compliance Filters at the Collector
The OpenTelemetry Collector is the right place to apply tenant-level security controls before trace data reaches storage. Use the attributes processor to enforce an allowlist and mask any values that should not be persisted.
# otel-collector-config.yaml: normalise tenant.id, mask regulated values
processors:
  attributes/tenant_filter:
    actions:
      # Ensure tenant.id is set. upsert needs a source (value, from_attribute,
      # or from_context); the source key here is an assumption, so point it at
      # wherever your SDKs record the baggage value.
      - key: tenant.id
        from_attribute: baggage.tenant.id
        action: upsert
      # Mask any email that leaked into attributes
      - key: user.email
        action: hash
      # Delete the raw baggage dump so unapproved keys are not persisted
      - key: baggage.raw
        action: delete
  filter/drop_internal:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: tenant.id
            value: "^internal-.*"  # exclude internal health-check traffic
service:
  pipelines:
    traces:
      # receivers and exporters omitted for brevity
      processors: [attributes/tenant_filter, filter/drop_internal]
Verification
Query Jaeger or Tempo for spans where tenant.id is missing to identify propagation gaps:
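-- Tempo TraceQL: find spans where tenant context dropped.
-- Assumes a Tempo version that supports comparison against nil; the result
-- limit is a search parameter in Tempo rather than part of the query text.
{ span.tenant.id = nil || span.tenant.id = "" }
| select(resource.service.name, span.http.route)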
In Jaeger UI, filter by the tag tenant.id=<your-id> and inspect the waterfall for gaps. A gap — a span with no tenant.id attribute — points to the upstream service or proxy that stripped the header. Cross-reference that service’s ingress timestamp with the first missing span to isolate the break point.
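The same break-point search can be scripted against an exported trace. The sketch below assumes spans have been flattened into dictionaries with span_id, parent_id, service, start_time, and attributes keys; that shape is hypothetical and would need adapting to your backend's export format.

```python
from typing import Optional


def locate_break(spans: list[dict]) -> Optional[tuple[Optional[str], str]]:
    """Return (last service with tenant context, first service without it)."""
    by_id = {span["span_id"]: span for span in spans}
    missing = [s for s in spans if not s.get("attributes", {}).get("tenant.id")]
    if not missing:
        return None
    # The earliest span without tenant.id marks where propagation broke.
    first = min(missing, key=lambda s: s["start_time"])
    parent = by_id.get(first.get("parent_id"))
    return (parent["service"] if parent else None, first["service"])


trace_spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway", "start_time": 1,
     "attributes": {"tenant.id": "acme-corp-01"}},
    {"span_id": "b", "parent_id": "a", "service": "orders", "start_time": 2,
     "attributes": {"tenant.id": "acme-corp-01"}},
    {"span_id": "c", "parent_id": "b", "service": "billing", "start_time": 3,
     "attributes": {}},
]
suspect = locate_break(trace_spans)
```

With this sample data the suspect pair is orders and billing, so the hop to inspect is whatever sits between those two services (proxy, sidecar, or client library).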
You can also write a CI/CD assertion that replays a test request through your staging environment and asserts that every span in the resulting trace carries tenant.id:
# pytest integration test: assert tenant.id on all spans
# (trace_exporter and make_request are project-specific fixtures/helpers)
def test_tenant_id_propagates(trace_exporter):
    make_request(headers={"X-Tenant-ID": "test-tenant-abc"})
    spans = trace_exporter.get_finished_spans()
    assert len(spans) > 0, "No spans recorded"
    for span in spans:
        assert span.attributes.get("tenant.id") == "test-tenant-abc", (
            f"Missing tenant.id on span: {span.name}"
        )
Edge Cases and Gotchas
- Thread-pool context bleeding (Java/Go): In Java, when a pooled thread is reused across requests, a Scope that was not closed carries the previous request's tenant.id into the next one; always open scopes with try-with-resources. Go has no thread-local context, so the equivalent mistake is holding on to a stale context.Context: never store a context in a struct field or long-lived goroutine that outlives the request.
- Sampling drops tenant data before it is recorded: Head-based sampling makes the keep/drop decision at the root span, without regard to tenant. Baggage still propagates on unsampled requests, but no spans are exported, so any billing or SLA figure derived from spans undercounts. A parent-based sampler keeps the decision consistent across hops but still drops whole traces. If you rely on tenant.id on spans for billing or SLA reporting, use tail-based sampling, which can apply per-tenant policies once the trace is complete, or derive those figures from unsampled signals such as metrics.
- gRPC metadata key naming: gRPC metadata travels as HTTP/2 headers, which must be lowercase. Using Tenant-ID instead of tenant-id (or mapping to a custom metadata key without registering it in the interceptor chain) can cause the key to be normalised, rejected, or silently ignored, depending on the library. Keep all baggage keys lowercase.
- Baggage surviving a redirect: Some HTTP clients drop custom headers, including baggage, when following 301/302 redirects, particularly across origins. If your API gateway redirects requests (e.g., HTTP→HTTPS or path normalisation), ensure context is re-injected after the redirect rather than relying on the client to forward it.
- Out-of-order Kafka consumption: A consumer reading from multiple partitions may process messages for different tenants back to back or concurrently in the same thread. Never rely on thread-local context in async consumer loops; always extract and pass context explicitly into each message handler invocation.
- Baggage header size limits: Keep the total baggage header under 4 KB in practice (8 KB is a common proxy hard limit). Use a compact tenant ID format (UUID, 36 chars) rather than long human-readable slugs, and avoid adding unbounded metadata to baggage.
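The Kafka gotcha above is easiest to see with Python's contextvars. In this minimal sketch each message runs in its own fresh Context, so a tenant set for one message cannot leak into the next, even on the same thread. The message shape and the tenant_var, handle, process, and consume names are illustrative assumptions.

```python
import contextvars

tenant_var = contextvars.ContextVar("tenant_id", default="")


def handle(message: dict) -> str:
    # Business logic reads the tenant from whatever context it runs in.
    return f"{tenant_var.get()}:{message['value']}"


def process(message: dict) -> str:
    # Set the tenant explicitly from this message's own headers.
    tenant_var.set(message["headers"]["tenant.id"])
    return handle(message)


def consume(messages: list[dict]) -> list[str]:
    results = []
    for message in messages:
        # A fresh, empty Context per message: nothing set inside outlives the call.
        ctx = contextvars.Context()
        results.append(ctx.run(process, message))
    return results


batch = [
    {"headers": {"tenant.id": "tenant-a"}, "value": 1},
    {"headers": {"tenant.id": "tenant-b"}, "value": 2},
]
outputs = consume(batch)
```

After consume returns, tenant_var in the calling context still holds its default, which is the property that prevents cross-tenant bleed in a long-running consumer loop.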
Performance and Scale Notes
- Baggage extraction overhead: Parsing and serialising the baggage header is O(n) in the number of baggage members. Keep the number of baggage entries small (ideally one or two) to avoid degrading hot paths. For a single tenant.id entry the total overhead is typically well under 1 µs per hop.
- Span Attribute cardinality: Recording tenant.id as a Span Attribute is safe if your tenant count is bounded (thousands, not millions). For very high tenant counts, consider recording the attribute only on root and exit spans, not on every internal child span, to avoid cardinality explosion in any metrics derived from spans.
- Collector throughput: The attributes processor in the Collector runs synchronously in the pipeline. Keep allowlist rules simple; complex regex transforms at high throughput (>100k spans/s) can become a bottleneck. Pre-filter at the SDK level where possible.
- Context propagation in async workers: Using AsyncLocalStorage (Node.js) or contextvars (Python) for async boundary handling is safe for I/O-bound work but requires careful scoping in CPU-bound thread pools where tasks outlive the originating async context.
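For the thread-pool case in Python, one common approach is to snapshot the caller's context and run the task inside that snapshot. The sketch below uses only the standard library; submit_with_context and cpu_bound_task are hypothetical names.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor

tenant_var = contextvars.ContextVar("tenant_id", default="<unset>")


def cpu_bound_task() -> str:
    # Stands in for CPU-heavy work that needs to know the tenant.
    return tenant_var.get()


def submit_with_context(pool: ThreadPoolExecutor, fn, *args) -> Future:
    # Snapshot the caller's context at submit time and run fn inside it.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


with ThreadPoolExecutor(max_workers=1) as pool:
    tenant_var.set("tenant-a")
    with_ctx = submit_with_context(pool, cpu_bound_task).result()
```

Because the snapshot is taken at submit time, the task sees the tenant that was active when the work was queued, even if the worker thread runs it after the originating request has finished.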
Troubleshooting FAQ
Why does tenant.id disappear mid-trace even though my ingress injects it?
The most common cause is a reverse proxy stripping unrecognised headers. Add proxy_set_header baggage $http_baggage; in Nginx (or the equivalent directive in your gateway). The second most common cause is an SDK propagator list that omits the baggage propagator — verify otel.propagators includes baggage alongside tracecontext.
How do I carry tenant context through Kafka without losing the trace link?
Serialise both traceparent and baggage into Kafka record headers before producing (see Step 4 above). On the consumer, extract those headers back into an OpenTelemetry context before starting the consumer span. This preserves W3C TraceContext propagation across the async boundary and keeps the consumer span linked to the producer’s trace.
Should I store tenant.id in Baggage or as a Span Attribute?
Use both: Baggage propagates the value to every downstream service automatically; a Span Attribute makes it queryable in Jaeger or Tempo without decoding headers. See Baggage vs Span Attributes for the trade-offs in detail.
What is the safe maximum size for tenant-related baggage?
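The W3C Baggage specification defines minimums rather than a strict cap: platforms are expected to propagate baggage of at least 8192 bytes, and anything beyond that may be dropped. Many proxies and gRPC implementations also reject or truncate headers beyond 8 KB. Target under 4 KB for the entire baggage header in practice. UUIDs (36 chars) are ideal tenant ID formats.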
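A size check of this kind is cheap to add at the point where baggage is written. The sketch below assumes the 4 KB target above as the budget; baggage_within_budget is a hypothetical helper and the tenant ID is a made-up example.

```python
BAGGAGE_BUDGET_BYTES = 4096  # practical target, not a limit taken from a spec


def baggage_within_budget(
    header_value: str, budget: int = BAGGAGE_BUDGET_BYTES
) -> bool:
    """Check the encoded size of a baggage header against the budget."""
    # Measure bytes, not characters: proxies enforce limits on the wire size.
    return len(header_value.encode("utf-8")) <= budget


uuid_entry = "tenant.id=3f2c9a1e-7b4d-4c8e-9f10-2a6b5d8e1c47"
ok = baggage_within_budget(uuid_entry)
```

A single UUID entry like the one above is 46 bytes, so the budget only becomes a concern when other keys with unbounded values are added alongside it.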
How do I prevent external callers from injecting arbitrary baggage keys?
Validate the tenant ID format at the ingress layer before writing it into Baggage (regex or allowlist). Apply an allowlist-based attributes processor in the OpenTelemetry Collector to drop any baggage-derived attributes that are not explicitly permitted. This prevents log poisoning and stops cardinality attacks on your metrics backend.
Related
- Baggage vs Span Attributes: When to Use What
- How to Safely Propagate User IDs via OpenTelemetry Baggage
- Propagating Trace Context Through Kafka Consumers
- Security Boundaries in Distributed Tracing
- Handling Async Boundaries in Node.js and Python