When to use tail-based sampling for microservices

Use tail-based sampling when you need guaranteed retention of every error trace, latency outlier, or business-critical request — and probabilistic head-based rates are silently dropping the failures you need to debug.

Context and when it matters

Head-based sampling commits a keep/drop decision at the first span of a request, before the outcome is known. At a 1–5 % probabilistic rate, only 1–5 % of failing traces are retained, so a failure mode that affects fewer than 0.1 % of requests may leave no stored trace at all: the sampler discards the trace before any span has recorded an ERROR status. This is the practical threshold for adopting tail-based sampling: the point where probabilistic coverage of rare failures breaks down and deterministic, outcome-aware retention becomes necessary.
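
The arithmetic behind that threshold can be sketched in a few lines of Python. The function names and the workload figures below are illustrative assumptions, and the model treats every sampling decision as independent.

```python
def expected_error_traces_kept(requests: int, error_rate: float, sample_rate: float) -> float:
    """Expected number of failing traces that survive head-based sampling."""
    return requests * error_rate * sample_rate


def prob_no_error_kept(requests: int, error_rate: float, sample_rate: float) -> float:
    """Probability that not a single failing trace is stored.

    Assumes each request fails and is sampled independently of the others.
    """
    return (1.0 - error_rate * sample_rate) ** requests


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 requests, 0.1 % failures, 1 % head sampling.
    kept = expected_error_traces_kept(1_000_000, 0.001, 0.01)
    print(f"expected failing traces kept: {kept:.1f}")
    # Smaller window: 100,000 requests at the same rates.
    print(f"chance of storing no failure at all: {prob_no_error_kept(100_000, 0.001, 0.01):.2f}")
```

Under these assumed numbers, roughly ten of the thousand failing requests leave a trace, and over a shorter window there is a real chance that none do.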

Tail-based sampling moves the decision to a collector-side buffer. Every span for a given trace_id is held in memory for a configurable decision_wait window (typically 30–60 seconds), measured from the arrival of the trace's first span. Once the window closes, explicit policies evaluate the buffered trace and only traces that match a retention rule are exported. The cost is in-memory overhead that scales with throughput, plus an added decision latency; the benefit is retention of every trace that matches a policy, which is what root-cause analysis needs.
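
The buffering mechanism can be illustrated with a minimal, single-process sketch. This is not the Collector's implementation: `Span`, `TailBuffer`, and `has_error` are hypothetical names, time is passed in explicitly, and there is no cap on buffered traces.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    name: str
    status: str = "OK"  # "OK" or "ERROR"


@dataclass
class TailBuffer:
    decision_wait: float = 30.0                     # seconds a trace is held
    first_seen: dict = field(default_factory=dict)  # trace_id -> arrival time of first span
    spans: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, span: Span, now: float) -> None:
        """Buffer a span; the wait clock starts at the trace's first span."""
        self.first_seen.setdefault(span.trace_id, now)
        self.spans[span.trace_id].append(span)

    def flush(self, now: float, keep) -> list:
        """Decide every trace whose window has closed and return the kept ones."""
        due = [t for t, seen in self.first_seen.items() if now - seen >= self.decision_wait]
        kept = []
        for trace_id in due:
            trace = self.spans.pop(trace_id)
            del self.first_seen[trace_id]
            if keep(trace):
                kept.append(trace)
        return kept


def has_error(trace: list) -> bool:
    """Retention rule: keep the trace if any span reported ERROR."""
    return any(span.status == "ERROR" for span in trace)
```

Calling `flush` before the window closes returns nothing; once `decision_wait` has elapsed, a trace with an ERROR span is returned and a clean one is discarded.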

[Figure: Head-based vs tail-based sampling decision points. Head-based path: Service A (sample? DROP), Service B, Service C (ERROR), error never stored. Tail-based path: Service A, Service B, Service C (ERROR), collector buffer (decision_wait: 30 s), KEEP (error policy matched).]
Head-based sampling drops the trace before the error is visible. Tail-based sampling buffers all spans and applies retention policies after the full trace is assembled.

Head-based vs tail-based: side-by-side comparison

| Dimension | Head-based | Tail-based |
| --- | --- | --- |
| Decision point | First span, before outcome is known | Collector, after all spans arrive |
| Error retention guarantee | None; low-rate errors are dropped statistically | 100 % of traces containing an ERROR span, if the policy is set |
| P99 latency outlier capture | None; slow traces are dropped at the same rate as fast ones | Guaranteed; latency policy applied to the full trace |
| Memory overhead | Stateless; negligible | In-memory buffer: ~1–2 GB per 10 K traces/sec at a 30 s window |
| Decision latency added | Zero | decision_wait window (30–60 s typical) |
| Config complexity | Low; single SDK sampler | Medium; collector pipeline plus policy list |
| Async boundary handling | Fragile; independent samplers break trace continuity | Robust; all spans correlated before the decision |
| Best for | High-volume, low-severity background traffic | Error-critical, SLO-gated, or compliance-scoped workloads |

Implementation: OpenTelemetry Collector tail sampling configuration

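The tail_sampling processor in the OpenTelemetry Collector evaluates every policy in the list against each buffered trace and keeps the trace if any policy votes to sample it (absent inverted or drop-style policies), so list order does not change which traces are retained. Listing the deterministic rules before the probabilistic fallback is still worthwhile because it documents retention priorities for whoever reads the config.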

processors:
  tail_sampling:
    # Buffer window for trace completion.
    # 30 s is standard; raise to 60 s for high-latency async hops.
    decision_wait: 30s

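    # Max traces held in memory while awaiting a decision.
    # Must cover expected_new_traces_per_sec × decision_wait
    # (10 000 × 30 s = 300 000) plus headroom; when the limit is hit,
    # the oldest buffered traces are evicted before they are decided.
    num_traces: 400000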

    # Expected throughput for cache pre-allocation.
    expected_new_traces_per_sec: 10000

    policies:
      # 1. Deterministic error retention.
      #    Keeps any trace containing at least one ERROR span.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # 2. Latency SLO breach capture.
      #    Retains traces whose total duration (earliest span start to
      #    latest span end) reaches 2 000 ms.
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000

      # 3. Business-attribute matching (tenant tier, payment flows).
      #    Use coarse tags — avoid PII and high-cardinality fields.
      - name: keep-critical-tenant
        type: string_attribute
        string_attribute:
          key: tenant_tier
          values: ["enterprise", "vip"]

      # 4. Probabilistic fallback for baseline traffic coverage.
      #    Samples 5 % of all traces; it is the only route to retention
      #    for traces that no deterministic policy keeps.
      - name: probabilistic-fallback
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
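
To make the effect of the four policies concrete, here is a minimal Python sketch that mimics them on an in-memory trace. It is an illustration only, not the Collector's code: the span dictionary layout, the function names, and the SHA-256 bucketing used for the probabilistic rule are all assumptions. Because none of these policies rejects a trace, checking them in order and stopping at the first match gives the same keep/drop result as evaluating all of them.

```python
import hashlib

# A trace is a list of span dicts. The keys used here (trace_id, status,
# start_ms, end_ms, attrs) are illustrative, not the Collector's data model.


def keep_errors(trace):
    """status_code rule: at least one span has status ERROR."""
    return any(span["status"] == "ERROR" for span in trace)


def keep_slow(trace, threshold_ms=2000):
    """latency rule: earliest span start to latest span end exceeds the threshold."""
    duration = max(s["end_ms"] for s in trace) - min(s["start_ms"] for s in trace)
    return duration > threshold_ms


def keep_critical_tenant(trace, key="tenant_tier", values=("enterprise", "vip")):
    """string_attribute rule: any span carries one of the listed values."""
    return any(span.get("attrs", {}).get(key) in values for span in trace)


def keep_probabilistic(trace, percentage):
    """probabilistic rule: stable per-trace bucket derived from the trace ID.

    SHA-256 is an assumption for this sketch, not the Collector's hash.
    """
    digest = hashlib.sha256(trace[0]["trace_id"].encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


def decide(trace, fallback_percentage=5):
    """Return the name of a policy that keeps the trace, or None to drop it."""
    policies = [
        ("keep-errors", keep_errors),
        ("keep-slow", keep_slow),
        ("keep-critical-tenant", keep_critical_tenant),
        ("probabilistic-fallback", lambda t: keep_probabilistic(t, fallback_percentage)),
    ]
    for name, policy in policies:
        if policy(trace):
            return name
    return None
```

Hashing the trace ID rather than drawing a random number keeps the fallback decision stable if the same trace is evaluated more than once.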

Cache sizing formula:

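Memory (GiB) ≈ (traces_per_sec × decision_wait_sec × avg_trace_bytes) / 1_073_741_824

Here avg_trace_bytes is the average buffered size of one whole trace (average span size × average spans per trace).

Example: 10 000 traces/sec × 30 s × 5 120 bytes per trace ≈ 1.43 GiB baseline. Add 20–30 % for Go runtime and policy evaluation overhead, and provision your collector pods accordingly: if the collector is OOM-killed during a traffic spike, every trace still waiting in the buffer is lost.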
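
The same sizing arithmetic as a small Python helper, which makes it easier to rerun with your own numbers. The function and parameter names are hypothetical; `overhead` is the fractional allowance for runtime overhead, and the second helper estimates how many traces are in flight during one decision window, which is the minimum that num_traces has to cover.

```python
GIB = 1024 ** 3  # 1_073_741_824 bytes


def buffer_gib(traces_per_sec: float, decision_wait_sec: float,
               avg_bytes_per_trace: float, overhead: float = 0.0) -> float:
    """Estimated buffer memory in GiB, optionally padded by a fractional overhead."""
    raw_bytes = traces_per_sec * decision_wait_sec * avg_bytes_per_trace
    return raw_bytes * (1.0 + overhead) / GIB


def traces_in_flight(traces_per_sec: float, decision_wait_sec: float) -> int:
    """Traces held at once during a full decision window."""
    return int(traces_per_sec * decision_wait_sec)


if __name__ == "__main__":
    print(f"baseline: {buffer_gib(10_000, 30, 5_120):.2f} GiB")
    print(f"with 30 % overhead: {buffer_gib(10_000, 30, 5_120, overhead=0.30):.2f} GiB")
    print(f"traces in flight: {traces_in_flight(10_000, 30)}")
```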

Async boundary edge case: Kafka consumer span loss

Asynchronous message brokers are the most common source of broken trace continuity under head-based sampling. Suppose a producer samples at 10 % and publishes a message with a traceparent header, while an independently deployed consumer samples at 1 %. The consumer processes a message that triggers a database timeout. If the consumer's sampler makes a fresh keep/drop decision instead of honoring the sampled flag in the extracted W3C Trace Context, it drops the span in roughly 99 % of cases, and the error never reaches storage.

Diagnosing this:

  1. Query storage for partial traces: filter for span_count < expected_service_count grouped by trace_id. Identify gaps where the consumer hop is absent.
  2. Inspect broker message metadata for traceparent and tracestate headers. A missing or malformed traceparent breaks the parent-child linkage entirely; the consumer span becomes an orphan with a new root trace_id.
  3. Cross-reference application logs containing the database error with the missing trace_id values. If logs exist but spans do not, the consumer’s sampler is the culprit.
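
The first two diagnostic steps can be sketched with the Python standard library. This is a hedged illustration: `parse_traceparent` and `partial_traces` are hypothetical helpers, the parser only accepts the version-00 layout of the W3C traceparent header, and the span input is assumed to be (trace_id, service_name) pairs exported from your trace store.

```python
import re
from collections import defaultdict

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex), lowercase hex.
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is missing or malformed."""
    match = TRACEPARENT_RE.match(header.strip()) if header else None
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def partial_traces(spans, expected_services):
    """Map trace_id -> set of expected services that produced no span."""
    seen = defaultdict(set)
    for trace_id, service in spans:
        seen[trace_id].add(service)
    expected = set(expected_services)
    return {t: expected - services for t, services in seen.items() if expected - services}
```

A trace that shows up in `partial_traces` with the consumer missing, while the broker message carries a valid traceparent, points at the consumer's sampler rather than at broken context propagation.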

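Tail-based sampling resolves this, provided both services export every span (no head-based drop in the SDK) and all spans for a trace_id reach the same collector instance. The collector then evaluates producer and consumer spans together once the decision_wait window closes. The async gap is transparent to the retention policy as long as consumer lag stays inside that window.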

Decision rules

Use tail-based sampling when:

  • Your service error rate is below 1 % and head-based probabilistic rates routinely discard those failures before storage.
  • You need guaranteed capture of P99 latency outliers for SLO accountability across service boundaries.
  • You operate async consumers (Kafka, SQS, RabbitMQ) at independent deployment cadences, making consistent head-based rates across the trace impossible to enforce.
  • Compliance or audit requirements mandate 100 % retention of traces carrying specific tenant or transaction identifiers.

Continue with head-based sampling when:

  • Your primary goal is cost control on background health-check or metrics-scrape traffic where errors are irrelevant.
  • Your infrastructure cannot accommodate a stateful collector buffer (edge deployments, memory-constrained containers).
  • Your error rate is high enough (> 5 %) that probabilistic sampling already captures a statistically useful sample of failures.

Common pitfalls

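  • Relying on policy order. The tail_sampling processor evaluates every policy and keeps a trace when any one of them votes to sample, so moving the probabilistic fallback above or below the deterministic rules does not change retention. The real risk is a missing or mis-scoped deterministic policy (for example, a status_code policy that omits ERROR), which leaves failures covered only by the probabilistic rate.
  • Undersizing the cache. Setting num_traces too low forces the processor to evict the oldest in-flight traces during spikes, before their decision window closes. The evicted traces are silently dropped, which is exactly the data loss tail-based sampling is meant to prevent. Use the sizing formula above and monitor the processor's own metrics, such as otelcol_processor_tail_sampling_sampling_trace_dropped_too_early and otelcol_processor_tail_sampling_sampling_traces_on_memory (exact metric names vary by Collector version).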
  • Matching on PII or high-cardinality attributes. The string_attribute policy evaluates raw span metadata. Avoid user_id, email, or IP address fields; use hashed tenant identifiers or coarse-grained service tags to prevent memory bloat and comply with data-governance boundaries. See security boundaries in distributed tracing for attribute-level PII controls.

Troubleshooting FAQ

Why are errors still missing after enabling tail-based sampling? Check that a status_code policy for ERROR is present in the policies list, that your SDK sets the span status to ERROR (recording an error log alone does not), and that the SDK is not still head-sampling spans away before they reach the collector. The otelcol_processor_tail_sampling_sampling_policy_evaluation_error metric (the exact name may vary by Collector version) reveals policy evaluation failures.

How do I confirm the decision window is long enough? Inject a synthetic request that traverses every service hop and measure the time from the first span start to the last span end using your Jaeger or Tempo UI. Set decision_wait to at least that duration plus 10 s of network jitter margin.
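
As a rough aid for that measurement, the calculation can be scripted once the span timings of the synthetic request are exported. The helper below is a sketch with a hypothetical name; it assumes spans are given as (start_sec, end_sec) pairs on a common clock and rounds the result up to whole seconds.

```python
import math


def recommended_decision_wait(spans, margin_sec: float = 10.0) -> int:
    """Trace duration (first span start to last span end) plus a jitter margin."""
    spans = list(spans)
    first_start = min(start for start, _ in spans)
    last_end = max(end for _, end in spans)
    return math.ceil(last_end - first_start + margin_sec)


if __name__ == "__main__":
    # Hypothetical synthetic request: three hops, the last one after an async delay.
    timings = [(0.0, 0.4), (0.3, 12.5), (12.6, 18.2)]
    print(f"decision_wait: {recommended_decision_wait(timings)}s")
```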

What happens during a collector restart? In-memory buffered traces are lost. During rolling restarts, route traffic back to a head-based sampler at the SDK layer until the new collector pod is ready to accept spans. Document this in your SRE runbook alongside the tail-sampling circuit-breaker fallback.

