Configuring Jaeger Retention Policies for Compliance
Jaeger’s --es.num-days and --cassandra.ttl flags are configuration hints, not enforcement mechanisms — actual data deletion is controlled entirely by the storage backend’s own lifecycle engine, and misalignment between the two produces silent compliance violations.
Context and When It Matters
Regulatory frameworks like GDPR, SOC2, and HIPAA mandate strict data lifecycle controls. Distributed tracing pipelines are particularly exposed because span payloads frequently carry internal routing metadata, masked PII embedded in URL parameters, and authentication headers that compliance auditors classify as personal data. When retention windows are misconfigured, automated audit scanners query the storage backend directly and find lingering data — even when Jaeger’s configuration appears correct. The result is compliance drift: a state where Jaeger reports that a 7-day retention policy is active, while the underlying index or table has never had a single row deleted.
This page covers every step needed to close the gap: diagnosing where the drift occurs, aligning backend lifecycle policies with Jaeger’s flags, and wiring automated verification into CI/CD and monitoring pipelines.
How Jaeger Delegates Retention to the Storage Backend
The diagram below shows the decision path from a Jaeger retention flag to an actual physical deletion, and where the chain can silently break.
Jaeger does not implement native row-level or index-level deletion. It delegates retention enforcement to the storage backend: Elasticsearch Index Lifecycle Management (ILM), Cassandra’s default_time_to_live (TTL), or Badger’s value log garbage collection. A frequent operational mismatch occurs when operators configure --es.index-cleaner.schedule expecting it to force immediate deletion, while Elasticsearch’s default ILM policy keeps indices in the hot phase indefinitely. The cleaner merely attempts to drop old indices based on naming conventions; if ILM prevents rollover or blocks deletion due to missing phases, the flags become ineffective.
Selecting a storage backend fundamentally dictates your retention mechanics and compliance overhead — a reality thoroughly explored in the Trace Storage Backend Comparison: Jaeger vs Tempo overview.
The Retention Drift Scenario
The following values.yaml snippet represents the most common misconfiguration seen in Kubernetes Helm deployments:
# values.yaml: this configuration produces silent retention drift
jaeger:
  storage:
    type: elasticsearch
    options:
      es:
        num-days: 7
        index-cleaner:
          enabled: true
          schedule: "0 0 * * *"
# Missing: explicit ILM policy attachment and rollover alias configuration
What goes wrong: Elasticsearch rolls over indices daily but never deletes them because no ILM delete phase exists. After 30 days, automated compliance scanners querying GET /jaeger-span-*/_search still receive payloads containing internal service endpoints and masked user identifiers. The audit system flags these as retention violations, and the findings are genuine rather than false positives: the data physically exists despite Jaeger’s configuration claiming a 7-day window.
Step-by-Step Diagnostics for Retention Failures
Work through these four checks in order. The first two confirm whether the failure is in Jaeger’s scheduler or the backend; the third and fourth locate the exact scope of accumulated data.
Step 1 — Verify Cleaner Execution Logs
Inspect Jaeger collector or ingester logs for cleaner execution status, skipped cycles, or permission errors:
kubectl logs -l app=jaeger-collector -c jaeger-collector \
| grep -iE "cleaner|retention|delete|failed"
Step 2 — Query Backend Lifecycle Directly
Bypass Jaeger and inspect the storage engine’s actual retention configuration.
Elasticsearch:
curl -s -X GET "localhost:9200/_ilm/policy/jaeger-service" \
| jq '.["jaeger-service"].policy.phases.delete'
A null result means the policy has no delete phase, so ILM will never delete the indices it manages.
Cassandra:
SELECT keyspace_name, table_name, default_time_to_live
FROM system_schema.tables
WHERE keyspace_name = 'jaeger_v1_dc1';
A value of 0 means TTL is disabled and rows persist indefinitely.
Step 3 — Cross-Reference Ingestion vs. Rollover Dates
Compare span ingestion timestamps against actual index creation and rollover dates to measure accumulated drift:
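# The _stats API does not return creation dates, so read them from _cat/indices instead
curl -s -X GET "localhost:9200/_cat/indices/jaeger-span-*?h=index,creation.date.string,docs.count&s=creation.date&format=json" | \
  jq '.[] | {
    index: .index,
    creation_date: .["creation.date.string"],
    docs_count: .["docs.count"]
  }'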
Step 4 — Identify Orphaned Indices and Timezone Skew
Check for indices matching jaeger-span-YYYY-MM-DD that exceed num-days. Then verify the index cleaner container’s TZ environment variable matches the storage cluster’s timezone. A UTC vs. local timezone mismatch causes the cron scheduler to skip execution windows, leaving indices intact for days beyond the compliance threshold.
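The orphaned-index check in Step 4 can be sketched as a small shell filter. This is a minimal sketch, not part of Jaeger: the function names days_between and expired_indices are hypothetical, it assumes index names end in a YYYY-MM-DD suffix, and it relies on GNU date for parsing. All date arithmetic is done in UTC so that the result does not depend on the local timezone of the machine running it.

```shell
#!/usr/bin/env bash
# Sketch: list daily indices whose date suffix is older than the retention window.
# Assumptions: names end in YYYY-MM-DD; GNU date is available.

# days_between <older YYYY-MM-DD> <newer YYYY-MM-DD> -> whole days, computed in UTC
days_between() {
  local older newer
  older=$(date -u -d "$1" +%s) || return 1
  newer=$(date -u -d "$2" +%s) || return 1
  echo $(( (newer - older) / 86400 ))
}

# expired_indices <num_days> <today YYYY-MM-DD>; reads index names on stdin
expired_indices() {
  local num_days=$1 today=$2 name suffix age
  while IFS= read -r name; do
    suffix=${name: -10}                                   # trailing YYYY-MM-DD
    [[ $suffix =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || continue   # skip aliases etc.
    age=$(days_between "$suffix" "$today") || continue
    (( age > num_days )) && echo "$name"
  done
  return 0
}
```

A possible invocation, assuming the cluster is reachable on localhost:9200 as in the earlier commands: `curl -s "localhost:9200/_cat/indices/jaeger-span-*?h=index" | expired_indices 7 "$(date -u +%F)"`. Any name it prints is an index that should already have been removed.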
Aligning Jaeger Flags with Backend Lifecycle
Elasticsearch: ILM Policy Alignment
Create an ILM policy that explicitly maps the delete phase to Jaeger’s --es.num-days value. The min_age in the delete phase must match.
# Create the ILM policy with an explicit delete phase.
# The delete phase min_age (7d) must match --es.num-days.
# JSON does not allow comments, so keep notes out of the request body.
curl -X PUT "localhost:9200/_ilm/policy/jaeger-retention-policy" \
  -H "Content-Type: application/json" -d '{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}'
Attach the policy to the index template and configure the rollover alias — this step is mandatory, not optional:
# Attach the policy to an index template
curl -X PUT "localhost:9200/_index_template/jaeger-span-template" \
  -H "Content-Type: application/json" -d '{
  "index_patterns": ["jaeger-span-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "jaeger-retention-policy",
      "index.lifecycle.rollover_alias": "jaeger-span-write"
    }
  }
}'
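One detail worth checking when choosing min_age: for indices managed through rollover, ILM counts the delete phase's min_age from the rollover time rather than from index creation. With a 1d rollover and a 7d delete phase, an index can therefore still hold spans roughly eight days old when it is finally deleted. The sketch below works through that arithmetic. The helper names to_days and worst_case_age_days are hypothetical, and only the d (days) unit is handled.

```shell
#!/usr/bin/env bash
# Sketch: worst-case age of the oldest span in an index at deletion time.

# to_days "7d" -> 7 ; only the day unit is supported in this sketch
to_days() {
  [[ $1 =~ ^([0-9]+)d$ ]] || { echo "unsupported duration: $1" >&2; return 1; }
  echo $(( 10#${BASH_REMATCH[1]} ))
}

# worst_case_age_days <rollover max_age> <delete min_age>
# The oldest span is written when the index is created, the index rolls over
# after max_age, and deletion happens min_age after the rollover.
worst_case_age_days() {
  local roll del
  roll=$(to_days "$1") || return 1
  del=$(to_days "$2") || return 1
  echo $(( roll + del ))
}
```

For example, `worst_case_age_days 1d 7d` prints 8. If the compliance window is a hard seven days measured from span ingestion, the delete phase would need a correspondingly shorter min_age.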
Cassandra: TTL Enforcement
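Match Jaeger’s --cassandra.ttl (in seconds) to the schema-level default TTL. A value of 604800 enforces a 7-day window. Changing default_time_to_live only affects rows written after the change; rows that already exist keep the TTL they were written with: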
ALTER TABLE jaeger_v1_dc1.span_index
WITH default_time_to_live = 604800; -- 7 days in seconds
ALTER TABLE jaeger_v1_dc1.span
WITH default_time_to_live = 604800;
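To avoid hand-computing the seconds value, the conversion can be scripted. This is a minimal sketch with hypothetical helper names (ttl_seconds, alter_ttl_cql); it only prints the statement and leaves execution to the operator.

```shell
#!/usr/bin/env bash
# Sketch: derive the TTL in seconds from a retention window in days
# and render the matching ALTER TABLE statement.

# ttl_seconds <days> -> seconds
ttl_seconds() {
  [[ $1 =~ ^[0-9]+$ ]] || { echo "days must be a non-negative integer" >&2; return 1; }
  echo $(( 10#$1 * 86400 ))
}

# alter_ttl_cql <keyspace> <table> <days> -> prints an ALTER TABLE statement
alter_ttl_cql() {
  local secs
  secs=$(ttl_seconds "$3") || return 1
  printf 'ALTER TABLE %s.%s WITH default_time_to_live = %s;\n' "$1" "$2" "$secs"
}
```

For example, `alter_ttl_cql jaeger_v1_dc1 span 7` prints the second statement shown above, and the output can be passed to `cqlsh -e` once reviewed.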
Tombstone overhead: Aggressive TTLs on high-throughput clusters cause tombstone accumulation. Watch for tombstone warnings in the Cassandra logs and the tombstones-per-slice figures reported by nodetool tablestats, and schedule nodetool compact during maintenance windows to prevent read latency spikes and compaction backpressure.
Badger: Manual Compaction and Limits
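Badger has no equivalent of ILM. Set the span TTL with --badger.span-store-ttl (for example 168h for 7 days), and note that expired entries are only reclaimed physically when compaction and value log garbage collection run. Back up the store and inspect it before forcing any cleanup: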
# Trigger value log backup before forced cleanup
badger backup \
--dir /var/lib/jaeger/badger \
--backup-file /tmp/jaeger-bak.db
# Inspect manifest to verify GC eligibility
badger info \
--dir /var/lib/jaeger/badger \
--show-manifest
Decision Criteria: Which Approach to Apply
- Use Elasticsearch ILM when your deployment uses the Elasticsearch backend and requires index-level audit trails. The delete phase provides timestamped deletion events in Elasticsearch audit logs, which regulatory reviewers accept as evidence.
- Use Cassandra TTL when row-level expiry precision matters (e.g., HIPAA’s requirement to track deletion of individual patient-linked records). TTL deletes are deterministic per row, but require active compaction to physically reclaim storage.
- Use Badger only for development or low-volume deployments. Its GC is not deterministic enough for compliance contexts requiring verifiable deletion SLAs.
- Verify timezone alignment first before diagnosing any other retention failure. A misconfigured TZ environment variable is responsible for a disproportionately large share of “retention is configured but not working” incidents.
Compliance Validation and Automated Audit Workflows
Implement a verification script to audit actual data age against the configured policy. Run it daily in a cron job or CI pipeline:
#!/usr/bin/env bash
# retention_audit.sh: flags indices that exceed the retention window
MAX_AGE_DAYS=7
CURRENT_EPOCH=$(date +%s)
# _cat/indices reports creation.date in epoch milliseconds, so the cutoff must be in ms too
CUTOFF_MS=$(( (CURRENT_EPOCH - MAX_AGE_DAYS * 86400) * 1000 ))
# Query all jaeger-span indices and compare creation date against cutoff
curl -s -X GET "localhost:9200/_cat/indices/jaeger-span-*?h=index,creation.date&format=json" | \
  jq -r --arg cutoff "$CUTOFF_MS" \
    '.[] | select((.["creation.date"] | tonumber) < ($cutoff | tonumber))
     | "\(.index) violates retention policy"'
Wire retention checks into monitoring by scraping Jaeger’s jaeger_index_cleaner_total metric alongside custom Elasticsearch or CQL exporters. Configure a Prometheus alert for immediate SRE notification when the cleaner stalls:
# A counter never returns to 0 after its first increment, so alert on
# the absence of new successes over the last 24 hours instead.
- alert: JaegerRetentionDrift
  expr: increase(jaeger_index_cleaner_total{status="success"}[24h]) == 0
  for: 1h
  labels:
    severity: critical
    compliance: "SOC2-HIPAA"
  annotations:
    summary: "Index cleaner execution stalled; retention compliance at risk."
    runbook: "Check ILM policy delete phase and rollover alias configuration."
For regulatory reviewers, generate a tamper-evident deletion audit trail by logging cleaner execution timestamps, ILM phase transitions, and final index deletion confirmations to a WORM (Write Once, Read Many) storage bucket. Because records in such a bucket cannot be modified or removed after they are written, the log serves as verifiable proof of data lifecycle enforcement without exposing raw trace payloads or compromising internal service topology.
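WORM storage protects records after upload. If reviewers also want to detect tampering before the upload, one option is to hash-chain the records locally. The sketch below is an illustration of that idea, not a Jaeger or Elasticsearch feature: the function names, the tab-separated record layout, and the event name index_deleted are all hypothetical, and it assumes sha256sum from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: append-only audit log where each record's SHA-256 covers the previous
# record's hash, so editing or removing any line breaks verification.
# Record layout (tab-separated): hash, prev_hash, timestamp, event, index

# append_audit_record <log file> <event> <index> <ISO-8601 timestamp>
append_audit_record() {
  local log=$1 event=$2 index=$3 ts=$4 prev line hash
  if [[ -s $log ]]; then
    prev=$(tail -n 1 "$log" | cut -f 1)      # hash of the last record
  else
    prev=$(printf '%064d' 0)                 # genesis value: 64 zeros
  fi
  line=$(printf '%s\t%s\t%s\t%s' "$prev" "$ts" "$event" "$index")
  hash=$(printf '%s' "$line" | sha256sum | cut -d ' ' -f 1)
  printf '%s\t%s\n' "$hash" "$line" >> "$log"
}

# verify_audit_log <log file> -> exit 0 only if every hash and chain link is intact
verify_audit_log() {
  local expected_prev hash prev ts event index actual
  expected_prev=$(printf '%064d' 0)
  while IFS=$'\t' read -r hash prev ts event index; do
    [[ $prev == "$expected_prev" ]] || return 1
    actual=$(printf '%s\t%s\t%s\t%s' "$prev" "$ts" "$event" "$index" | sha256sum | cut -d ' ' -f 1)
    [[ $actual == "$hash" ]] || return 1
    expected_prev=$hash
  done < "$1"
}
```

A cleanup job could call `append_audit_record /var/log/jaeger-audit.tsv index_deleted jaeger-span-2024-01-01 "$(date -u +%FT%TZ)"` after each confirmed deletion and upload the file to the WORM bucket; the path shown is only an example. The records contain index names and timestamps only, never span payloads.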
Common Pitfalls
- Missing index.lifecycle.rollover_alias on the initial write index. ILM policy attachment failures frequently stem from this omission. Always verify alias routing before relying on automated deletion: a policy attached to a template but without an alias on the bootstrap index will silently never transition.
- Timezone mismatch between cron scheduler and storage cluster. The index cleaner container and the Elasticsearch cluster must share the same TZ setting. A UTC vs. local mismatch shifts the cron window by hours, causing daily cycles to be skipped entirely.
- Relying on --es.num-days without a corresponding ILM min_age. The two values must match. Setting --es.num-days=7 with an ILM min_age of 30d means data survives for 30 days regardless of Jaeger’s configuration, because ILM governs the actual deletion.
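The last pitfall lends itself to a pre-deployment check. The following is a minimal sketch with a hypothetical function name, check_alignment, that compares the Jaeger value against the delete-phase min_age; it accepts only whole-day values such as 7d.

```shell
#!/usr/bin/env bash
# Sketch: fail fast when --es.num-days and the ILM delete min_age disagree.

# check_alignment <num-days> <ILM delete min_age, e.g. 30d>
# Exit status: 0 aligned, 1 mismatch, 2 invalid input
check_alignment() {
  local num_days=$1 min_age=$2 ilm_days
  [[ $num_days =~ ^[0-9]+$ ]] || { echo "invalid num-days: $num_days" >&2; return 2; }
  [[ $min_age =~ ^([0-9]+)d$ ]] || { echo "unsupported min_age: $min_age" >&2; return 2; }
  ilm_days=$(( 10#${BASH_REMATCH[1]} ))
  if (( 10#$num_days == ilm_days )); then
    echo "aligned: ${ilm_days}d"
  else
    echo "mismatch: num-days=$(( 10#$num_days )) but ILM deletes after ${ilm_days}d"
    return 1
  fi
}
```

In a CI job, the second argument could come from the jq query in Step 2 with `.min_age` appended to the filter, so the pipeline fails before a mismatched policy reaches production.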
Troubleshooting FAQ
Why does Jaeger’s --es.num-days flag not delete data when the ILM policy has no delete phase?
Jaeger’s index cleaner attempts to drop indices based on naming conventions, but Elasticsearch ILM controls whether deletion is permitted. If the ILM policy has no delete phase, the cleaner call is silently rejected and data persists until the policy is updated.
What causes tombstone accumulation in Cassandra when using Jaeger TTLs?
Cassandra marks expired rows with tombstones before compaction removes them. On high-throughput Jaeger clusters with aggressive TTLs, tombstone density can spike between compaction cycles, increasing read latency. Schedule nodetool compact during low-traffic windows and monitor the tombstones-per-slice figures reported by nodetool tablestats.
How do I generate a verifiable deletion audit trail for regulatory reviewers?
Log cleaner execution timestamps, ILM phase transitions, and index deletion confirmations to a WORM storage bucket. Because the bucket rejects modification and deletion of stored records, the log provides tamper-resistant proof of data lifecycle enforcement without exposing raw trace payloads or compromising internal topology.
Related
- Trace Storage Backend Comparison: Jaeger vs Tempo — storage engine trade-offs that determine which retention mechanism applies
- Security Boundaries in Distributed Tracing — PII in span attributes, trust zones, and data classification
- Encrypting Trace Payloads at Rest and in Transit — complementary control for trace data that must survive its retention window securely