Refinery’s performance is shaped not just by traffic volume, but by the structure and characteristics of your trace data. Certain trace patterns, such as traces with thousands of spans, spans with large payloads, long-lived traces, and high-cardinality sampling keys, can cause memory pressure, throughput degradation, and Stress Relief activation. Use this page to diagnose an active performance problem, or to understand these patterns before they affect your cluster.
  • Actively troubleshooting? Start with Diagnose by symptom to map what you are observing to a likely cause.
  • Planning proactively? Start with Common trace patterns to understand the patterns and their effects.

Common trace patterns

These patterns are the most common structural causes of Refinery performance problems. Each one places stress on Refinery in a different way, and each requires a different response.
Pattern | Primary effect
Traces with a high span count | Increased per-trace memory usage; delayed sampling decisions
Fields with a high number of attributes or attribute values | Increased per-span memory usage; slower processing
Long-lived traces | Trace cache growth; sustained memory pressure
High-cardinality sampling keys | Unstable sampling rates

Diagnose by symptom

Use the symptom table to map what you are observing in Refinery’s metrics to one or more likely causes. Follow the links to the pattern sections for identification steps and recommendations.
What you are observing | Likely cause
memory_inuse rising with trace_span_count | Traces with a high span count
memory_inuse rising without a change in incoming_router_span | Fields with a high number of attributes or attribute values and/or Long-lived traces
collector_collect_loop_duration_ms increasing | Traces with a high span count
collect_cache_entries_* growing without bound | Long-lived traces
trace_duration_ms_* max values surpass TraceTimeout | Long-lived traces
<sampler>_keyspace_size growing or fluctuating | High-cardinality sampling keys
Stress Relief activating frequently | Any of the above; check all patterns
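
If you triage these symptoms in a script or runbook, the symptom table can be encoded as a simple lookup. The following is a minimal sketch only: the dictionary and the likely_causes function are hypothetical helpers written for illustration, not part of Refinery.

```python
# Hypothetical lookup built from the symptom table; not a Refinery API.
HIGH_SPAN_COUNT = "Traces with a high span count"
LARGE_PAYLOADS = "Fields with a high number of attributes or attribute values"
LONG_LIVED = "Long-lived traces"
HIGH_CARDINALITY = "High-cardinality sampling keys"
ALL_PATTERNS = [HIGH_SPAN_COUNT, LARGE_PAYLOADS, LONG_LIVED, HIGH_CARDINALITY]

SYMPTOM_TO_PATTERNS = {
    "memory_inuse rising with trace_span_count": [HIGH_SPAN_COUNT],
    "memory_inuse rising without a change in incoming_router_span": [
        LARGE_PAYLOADS,
        LONG_LIVED,
    ],
    "collector_collect_loop_duration_ms increasing": [HIGH_SPAN_COUNT],
    "collect_cache_entries_* growing without bound": [LONG_LIVED],
    "trace_duration_ms_* max values surpass TraceTimeout": [LONG_LIVED],
    "<sampler>_keyspace_size growing or fluctuating": [HIGH_CARDINALITY],
    "Stress Relief activating frequently": ALL_PATTERNS,
}


def likely_causes(observed_symptoms):
    """Return the patterns to investigate, de-duplicated, in first-seen order."""
    causes = []
    for symptom in observed_symptoms:
        # Unknown symptoms contribute nothing rather than raising an error.
        for pattern in SYMPTOM_TO_PATTERNS.get(symptom, []):
            if pattern not in causes:
                causes.append(pattern)
    return causes
```

Passing several observed symptoms at once returns the union of their likely causes, which is useful when more than one metric is misbehaving.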

Traces with a high span count

A single trace that contains an unusually large number of spans, such as 10,000 or more, places an outsized memory burden on Refinery. This pattern is common in services that instrument batching processes or polling loops.

How to identify

Check these metrics in your Refinery monitoring to confirm this pattern is the cause of your performance issue:
  • trace_span_count_* shows spikes in spans per trace.
  • collector_collect_loop_duration_ms is elevated.
  • memory_inuse rises in correlation with trace_span_count_*.
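
One way to check the correlation described in the last bullet is to export both metrics over the same time window and compute a correlation coefficient. The sketch below assumes you have already exported two equal-length lists of samples from your metrics backend; the pearson function is an illustrative helper, not something Refinery provides.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 samples")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat series has no variation to correlate with.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A coefficient close to 1 for memory_inuse against a trace_span_count_* series suggests span count is driving memory. A value near 0 points toward one of the other patterns on this page.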

Why this affects Refinery

Refinery buffers all spans belonging to a single trace in memory before making a sampling decision. A high span count increases per-trace memory usage and delays sampling decisions, which reduces overall throughput.

Recommendations

These changes address the pattern at both the Refinery configuration level and the instrumentation level:
  • Lower SpanLimit in Refinery to trigger a trace decision sooner.
  • Modify your application instrumentation to reduce spans per trace. Visit Exotic Trace Shapes on the Honeycomb Blog to learn more.
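
To see why a lower SpanLimit reduces memory held per trace, consider the toy model below. It is a simplified sketch, not Refinery's implementation: the ToyTraceBuffer class is hypothetical, and it assumes that a trace's buffered spans are released as soon as the span count reaches the limit and that later spans for a decided trace are not buffered.

```python
from collections import defaultdict


class ToyTraceBuffer:
    """Toy model of per-trace span buffering with a span limit (illustrative only)."""

    def __init__(self, span_limit):
        self.span_limit = span_limit
        self.spans = defaultdict(list)  # trace_id -> buffered spans
        self.decided = set()            # traces that already have a decision
        self.buffered = 0               # spans currently held in memory
        self.peak_buffered = 0          # high-water mark of buffered spans

    def add_span(self, trace_id, span):
        """Buffer a span; return True if this span triggered a decision."""
        if trace_id in self.decided:
            # Assumption: late spans follow the earlier decision without buffering.
            return False
        self.spans[trace_id].append(span)
        self.buffered += 1
        self.peak_buffered = max(self.peak_buffered, self.buffered)
        if len(self.spans[trace_id]) >= self.span_limit:
            # Limit reached: decide now and release the buffered spans.
            self.buffered -= len(self.spans.pop(trace_id))
            self.decided.add(trace_id)
            return True
        return False
```

Feeding a single 10,000-span trace through this model with a limit of 1,000 holds at most 1,000 spans at once, while a limit above 10,000 holds all of them until the trace otherwise completes.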

Fields with a high number of attributes or attribute values

Spans that carry a large number of attributes, or attributes with unusually large values such as request and response bodies or full stack traces, can degrade Refinery performance. This pattern often results from verbose or unstructured instrumentation.

How to identify

These metrics confirm that payload size, rather than span volume, is driving memory pressure:
  • incoming_router_event_bytes_* shows elevated per-event payload sizes even when span count is low or stable. This metric is available in Refinery v2.9.4 and later.
  • memory_inuse increases even when span count is low or stable.

Why this affects Refinery

Large span payloads increase per-span memory usage, which leads to slower processing, higher memory pressure, and more frequent activation of Stress Relief.

Recommendations

Addressing this pattern requires reducing payload size at the source before data reaches Refinery:
  • Normalize and trim unnecessary fields.
  • Avoid storing large payloads, such as stack traces or full request bodies, as span attributes.
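
As an illustration of trimming at the source, the sketch below truncates oversized string values and drops selected keys before attributes are attached to a span. The trim_attributes function, its default limit, and the attribute names in the usage note are assumptions chosen for the example; adapt them to your own instrumentation.

```python
def trim_attributes(attributes, max_value_len=1024, drop_keys=()):
    """Return a copy of attributes with large string values truncated.

    max_value_len and drop_keys are illustrative knobs, not Refinery settings.
    """
    trimmed = {}
    for key, value in attributes.items():
        if key in drop_keys:
            # Drop fields that should never be sent, such as full request bodies.
            continue
        if isinstance(value, str) and len(value) > max_value_len:
            # Keep a prefix so the field stays useful for debugging.
            value = value[:max_value_len] + "...[truncated]"
        trimmed[key] = value
    return trimmed
```

For example, calling trim_attributes(attrs, max_value_len=256, drop_keys=("http.request.body",)) removes the body field entirely and caps every other string value at 256 characters plus a truncation marker.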

Long-lived traces

When spans for a single trace arrive over an extended period of time, Refinery must hold the incomplete trace in memory for longer than usual. This pattern is common in background jobs, retry logic, or workflows that span multiple asynchronous stages.

How to identify

These metrics indicate that traces are accumulating in the cache faster than they are being processed:
  • trace_duration_ms_* values approach the configured TraceTimeout.
  • trace_send_expired is rising.
  • collect_cache_entries_* increases gradually without a corresponding drop.
  • memory_inuse grows without a change in span rate.

Why this affects Refinery

Accumulating long-lived traces increases trace cache size and memory pressure over time. When enough long traces accumulate, throughput drops and Stress Relief may activate. If a Dynamic Sampler is in use, this pattern can also affect sampling accuracy: the sampler’s keyspace may update faster than spans arrive for a trace, causing a sampling decision based on incomplete information. The effect compounds over time, which makes early detection important.
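
A rough way to reason about cache growth is Little's law: the average number of items in a system equals the arrival rate multiplied by the average time each item stays. The sketch below applies that to the trace cache as a back-of-envelope model. It is an approximation under stated assumptions, not Refinery's internal accounting, and the function names and example figures are hypothetical.

```python
def estimated_cache_entries(traces_per_second, hold_seconds):
    """Little's law: entries in cache = arrival rate x average hold time."""
    return traces_per_second * hold_seconds


def estimated_memory_bytes(entries, spans_per_trace, bytes_per_span):
    """Rough memory estimate; ignores per-trace overhead and indexing costs."""
    return entries * spans_per_trace * bytes_per_span


# Hypothetical workload: 500 new traces per second, 20 spans each, 2,000 bytes per span.
short_hold = estimated_cache_entries(500, 10)  # traces resolved after about 10 seconds
long_hold = estimated_cache_entries(500, 60)   # traces held for about 60 seconds
short_bytes = estimated_memory_bytes(short_hold, 20, 2000)
long_bytes = estimated_memory_bytes(long_hold, 20, 2000)
```

In this model, holding traces six times longer requires six times the cache entries and memory at the same span rate, which matches the symptom of memory_inuse growing without a change in incoming traffic.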

Recommendations

These adjustments give Refinery an earlier opportunity to resolve traces and free memory:
  • Adjust TraceTimeout to set an upper bound on how long Refinery holds an incomplete trace.
    A lower value releases incomplete traces sooner; increasing TraceTimeout also increases Refinery’s memory requirements.
  • Where possible, break long workflows across trace boundaries at async or retry handoffs.

High-cardinality sampling keys

Dynamic sampling based on high-cardinality fields, such as http.url or user.id, can result in a large and unstable sampling keyspace. Refinery maintains a key-to-sample-rate map for each Dynamic Sampler; when that map grows too large or fluctuates frequently, sampling rates become unpredictable.
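
The effect of field choice on keyspace size can be estimated offline from a sample of spans. The sketch below assumes the sampling key is the combination of the chosen fields' values; keyspace_size is a hypothetical helper, and the synthetic spans and the http.route field are used only for comparison.

```python
def keyspace_size(spans, key_fields):
    """Count distinct key combinations for the given fields across spans."""
    return len({tuple(span.get(field) for field in key_fields) for span in spans})


# Synthetic sample: 1,000 requests to per-user URLs that share one route template.
sample_spans = [
    {"http.url": f"/users/{i}", "http.route": "/users/:id", "http.status_code": 200}
    for i in range(1000)
]

by_url = keyspace_size(sample_spans, ["http.url", "http.status_code"])
by_route = keyspace_size(sample_spans, ["http.route", "http.status_code"])
```

Here the key built on http.url yields one entry per request, while the key built on the route template collapses to a single entry, so the map the sampler must maintain differs by three orders of magnitude.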

How to identify

These signals reveal whether your sampling keyspace is growing beyond a stable, manageable size:
  • <dynamic-sampler-name>_keyspace_size increases rapidly or fluctuates.
  • When querying your sampled trace data in Honeycomb, COUNT_DISTINCT(meta.refinery.sample_key) is large or increases across different time granularities. Before running this query, enable AddRuleReasonToTrace in your Refinery configuration to attach meta.refinery.sample_key to your sampled traces.

Why this affects Refinery

An unstable or excessively large keyspace causes erratic sampling rates and increases the likelihood of triggering Stress Relief. The instability affects both the quality of sampling decisions and overall cluster health.

Recommendations

These changes stabilize the keyspace and restore predictable sampling behavior: