- Actively troubleshooting? Start with Diagnose by symptom to map what you are observing to a likely cause.
- Planning proactively? Start with Common trace patterns to understand the patterns and their effects.
Common trace patterns
These patterns are the most common structural causes of Refinery performance problems. Each one places stress on Refinery in a different way, and each requires a different response.
| Pattern | Primary effect |
|---|---|
| Traces with a high span count | Increased per-trace memory usage; delayed sampling decisions |
| Fields with a high number of attributes or attribute values | Increased per-span memory usage; slower processing |
| Long-lived traces | Trace cache growth; sustained memory pressure |
| High-cardinality sampling keys | Unstable sampling rates |
Diagnose by symptom
Use the symptom table to map what you are observing in Refinery’s metrics to one or more likely causes. Follow the links to the pattern sections for identification steps and recommendations.
| What you are observing | Likely cause |
|---|---|
| `memory_inuse` rising with `trace_span_count` | Traces with a high span count |
| `memory_inuse` rising without a change in `incoming_router_span` | Fields with a high number of attributes or attribute values and/or Long-lived traces |
| `collector_collect_loop_duration_ms` increasing | Traces with a high span count |
| `collect_cache_entries_*` growing without bound | Long-lived traces |
| `trace_duration_ms_*` max values surpass `TraceTimeout` | Long-lived traces |
| `<sampler>_keyspace_size` growing or fluctuating | High-cardinality sampling keys |
| Stress Relief activating frequently | Any of the above; check all patterns |
Traces with a high span count
A single trace that contains an unusually large number of spans, such as 10,000 or more, places an outsized memory burden on Refinery. This pattern is common in services that instrument batching processes or polling loops.
How to identify
Check these metrics in your Refinery monitoring to confirm this pattern is the cause of your performance issue:
- `trace_span_count_*` shows spikes in spans per trace.
- `collector_collect_loop_duration_ms` is elevated.
- `memory_inuse` rises in correlation with `trace_span_count_*`.
Why this affects Refinery
Refinery buffers all spans belonging to a single trace in memory before making a sampling decision. A high span count increases per-trace memory usage and delays sampling decisions, which reduces overall throughput.
Recommendations
These changes address the pattern at both the Refinery configuration level and the instrumentation level:
- Lower `SpanLimit` in Refinery to trigger a trace decision sooner.
- Modify your application instrumentation to reduce spans per trace. Visit Exotic Trace Shapes on the Honeycomb Blog to learn more.
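
The buffering behavior described above can be sketched as a toy model. This is a conceptual illustration only, not Refinery's implementation: the `TraceBuffer` class, its methods, and the `span_limit` parameter (a stand-in for `SpanLimit`) are hypothetical names invented for this example, and the sketch assumes that spans arriving after a decision are not buffered again.

```python
from collections import defaultdict


class TraceBuffer:
    """Toy model of per-trace span buffering; not Refinery's actual code."""

    def __init__(self, span_limit):
        self.span_limit = span_limit      # hypothetical stand-in for SpanLimit
        self.pending = defaultdict(list)  # trace_id -> spans held in memory
        self.decision_made = set()        # trace ids that already have a decision
        self.decided = []                 # (trace_id, spans_at_decision, reason)

    def add_span(self, trace_id, span):
        """Buffer a span; decide early once the trace reaches span_limit."""
        if trace_id in self.decision_made:
            return  # assumption: late spans follow the earlier decision
        self.pending[trace_id].append(span)
        if len(self.pending[trace_id]) >= self.span_limit:
            self._decide(trace_id, "span limit reached")

    def finish(self, trace_id):
        """Decide on a trace that completed before reaching the limit."""
        if trace_id in self.pending:
            self._decide(trace_id, "trace complete")

    def _decide(self, trace_id, reason):
        spans = self.pending.pop(trace_id)  # releases the buffered spans
        self.decision_made.add(trace_id)
        self.decided.append((trace_id, len(spans), reason))

    def buffered_span_count(self):
        """Total spans currently held across all undecided traces."""
        return sum(len(spans) for spans in self.pending.values())


buf = TraceBuffer(span_limit=1000)
for i in range(10_000):
    buf.add_span("batch-job", {"name": f"item-{i}"})
print(buf.decided)                # [('batch-job', 1000, 'span limit reached')]
print(buf.buffered_span_count())  # 0
```

In this model, a lower limit means a 10,000-span trace never holds more than `span_limit` spans in memory before a decision is made, which is the intuition behind lowering `SpanLimit`.
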
Fields with a high number of attributes or attribute values
Spans that carry a large number of attributes, or attributes with unusually large values such as request and response bodies or full stack traces, can degrade Refinery performance. This pattern often results from verbose or unstructured instrumentation.
How to identify
These metrics confirm that payload size, rather than span volume, is driving memory pressure:
- `incoming_router_event_bytes_*` shows elevated per-event payload sizes even when span count is low or stable. This metric is available in Refinery v2.9.4 and later.
- `memory_inuse` increases even when span count is low or stable.
Why this affects Refinery
Large span payloads increase per-span memory usage, which leads to slower processing, higher memory pressure, and more frequent activation of Stress Relief.
Recommendations
Addressing this pattern requires reducing payload size at the source before data reaches Refinery:
- Normalize and trim unnecessary fields.
- Avoid storing large payloads, such as stack traces or full request bodies, as span attributes.
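
One way to apply these recommendations in application code is to trim attributes before attaching them to a span. The following is a minimal sketch under stated assumptions: `trim_attributes`, the 256-character limit, and the field names in `DROP_FIELDS` are all hypothetical choices for illustration, not part of Refinery or any Honeycomb SDK.

```python
MAX_VALUE_CHARS = 256  # hypothetical limit; pick one that suits your data
DROP_FIELDS = {"http.request.body", "http.response.body"}  # hypothetical names


def trim_attributes(attributes, max_chars=MAX_VALUE_CHARS, drop_fields=DROP_FIELDS):
    """Return a copy with bulky fields dropped and long strings truncated."""
    trimmed = {}
    for key, value in attributes.items():
        if key in drop_fields:
            continue  # drop fields known to carry large payloads
        if isinstance(value, str) and len(value) > max_chars:
            value = value[:max_chars] + "...[truncated]"
        trimmed[key] = value
    return trimmed


attrs = {
    "http.route": "/checkout",
    "http.request.body": "x" * 50_000,
    "exception.stacktrace": "e" * 300,
}
small = trim_attributes(attrs)
print(sorted(small))                       # ['exception.stacktrace', 'http.route']
print(len(small["exception.stacktrace"]))  # 270
```

Truncating rather than dropping the stack trace keeps the most useful leading frames while bounding the per-span payload size.
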
Long-lived traces
When spans for a single trace arrive over an extended period of time, Refinery must hold the incomplete trace in memory for longer than usual. This pattern is common in background jobs, retry logic, or workflows that span multiple asynchronous stages.
How to identify
These metrics indicate that traces are accumulating in the cache faster than they are being processed:
- `trace_duration_ms_*` values approach the configured `TraceTimeout`.
- `trace_send_expired` is rising.
- `collect_cache_entries_*` increases gradually without a corresponding drop.
- `memory_inuse` grows without a change in span rate.
Why this affects Refinery
Accumulating long-lived traces increases trace cache size and memory pressure over time. When enough long traces accumulate, throughput drops and Stress Relief may activate. If a Dynamic Sampler is in use, this pattern can also affect sampling accuracy: the sampler’s keyspace may update faster than spans arrive for a trace, causing a sampling decision based on incomplete information. The effect compounds over time, which makes early detection important.
Recommendations
These adjustments give Refinery an earlier opportunity to resolve traces and free memory:
- Adjust `TraceTimeout` to set an upper bound on how long Refinery holds an incomplete trace. Increasing `TraceTimeout` also increases Refinery’s memory requirements.
- Where possible, break long workflows across trace boundaries at async or retry handoffs.
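
To see why the timeout bounds cache size, consider this toy cache. It is a conceptual model only: `TraceCache`, `trace_timeout_s` (a stand-in for `TraceTimeout`), and the explicit `now` arguments are hypothetical, and the sketch assumes the timeout is measured from the arrival of a trace's first span.

```python
class TraceCache:
    """Toy trace cache with a hold-time limit; not Refinery's implementation."""

    def __init__(self, trace_timeout_s):
        self.trace_timeout_s = trace_timeout_s  # stand-in for TraceTimeout
        self.first_seen = {}  # trace_id -> time the first span arrived
        self.span_counts = {}  # trace_id -> spans held so far

    def add_span(self, trace_id, now):
        """Record a span arriving at time `now` (seconds)."""
        self.first_seen.setdefault(trace_id, now)
        self.span_counts[trace_id] = self.span_counts.get(trace_id, 0) + 1

    def expire(self, now):
        """Release traces held for at least trace_timeout_s; return their ids."""
        expired = [
            trace_id
            for trace_id, seen in self.first_seen.items()
            if now - seen >= self.trace_timeout_s
        ]
        for trace_id in expired:
            del self.first_seen[trace_id]
            del self.span_counts[trace_id]
        return expired


cache = TraceCache(trace_timeout_s=60)
cache.add_span("job-a", now=0)
cache.add_span("job-b", now=50)
cache.add_span("job-a", now=55)
print(cache.expire(now=60))    # ['job-a']
print(len(cache.span_counts))  # 1
```

A longer timeout lets more traces overlap in the cache at once, which is why raising `TraceTimeout` raises memory requirements.
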
High-cardinality sampling keys
Dynamic sampling based on high-cardinality fields, such as `http.url` or `user.id`, can result in a large and unstable sampling keyspace.
Refinery maintains a key-to-sample-rate map for each Dynamic Sampler; when that map grows too large or fluctuates frequently, sampling rates become unpredictable.
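
A quick way to build intuition for keyspace size is to count the distinct keys that a given field combination produces. The sketch below is illustrative: `keyspace_size` is a hypothetical helper, the span data is synthetic, and `http.route` is used as an assumed example of a lower-cardinality alternative to `http.url`.

```python
def keyspace_size(spans, key_fields):
    """Count distinct sampling keys formed from the chosen field values."""
    return len({tuple(span.get(field) for field in key_fields) for span in spans})


# Synthetic spans: 1,000 requests to the same route with different user ids.
spans = [
    {"http.route": "/users/:id", "http.url": f"/users/{i}", "http.status_code": 200}
    for i in range(1000)
]

print(keyspace_size(spans, ["http.url", "http.status_code"]))    # 1000
print(keyspace_size(spans, ["http.route", "http.status_code"]))  # 1
```

Keying on the raw URL yields one map entry per request, while keying on the route template collapses them into a single entry.
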
How to identify
These signals reveal whether your sampling keyspace is growing beyond a stable, manageable size:
- `<dynamic-sampler-name>_keyspace_size` increases rapidly or fluctuates.
- When querying your sampled trace data in Honeycomb, `COUNT_DISTINCT(meta.refinery.sample_key)` is large or increases across different time granularities. Before running this query, enable `AddRuleReasonToTrace` in your Refinery configuration to attach `meta.refinery.sample_key` to your sampled traces.
Why this affects Refinery
An unstable or excessively large keyspace causes erratic sampling rates and increases the likelihood of triggering Stress Relief. The instability affects both the quality of sampling decisions and overall cluster health.
Recommendations
These changes stabilize the keyspace and restore predictable sampling behavior:
- Use lower-cardinality fields as sampling keys.
- Enable `MaxKeys` to cap the keyspace size.
- Visit Refinery EMA Sampling for more on EMA Dynamic Samplers, or read Refinery and EMA Sampling on the Honeycomb blog.
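
The effect of capping the keyspace can be sketched with a simple counter. This is a simplified model, not the sampler's actual algorithm: `CappedKeyCounter` and `max_keys` are hypothetical names, and the sketch assumes that once the cap is reached, new keys are left untracked while existing keys keep counting. Check the Refinery sampler documentation for the exact `MaxKeys` behavior.

```python
from collections import Counter


class CappedKeyCounter:
    """Toy key counter with a cap on distinct keys (simplified assumption)."""

    def __init__(self, max_keys):
        self.max_keys = max_keys  # hypothetical stand-in for MaxKeys
        self.counts = Counter()   # tracked key -> observed count
        self.untracked = 0        # observations of keys beyond the cap

    def observe(self, key):
        """Count a key if it is already tracked or there is room for it."""
        if key in self.counts or len(self.counts) < self.max_keys:
            self.counts[key] += 1
        else:
            self.untracked += 1


counter = CappedKeyCounter(max_keys=2)
for key in ["a", "b", "c", "a", "d"]:
    counter.observe(key)
print(dict(counter.counts))  # {'a': 2, 'b': 1}
print(counter.untracked)     # 2
```

The map size stays bounded no matter how many distinct keys arrive, at the cost of not computing dedicated rates for the overflow keys.
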