When scaling Refinery, use Refinery’s metrics to determine whether your general configuration and sampling rules need adjustment.
Refinery emits a number of metrics that indicate its health as well as its trace throughput and sampling statistics.
These metrics can be exposed to Prometheus or sent to Honeycomb; either option requires configuration in config.yaml.
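For example, a minimal config.yaml sketch that enables both destinations could look like the following. The field names assume the Refinery 2.x configuration format and are illustrative; check the Refinery configuration reference for your version:

LegacyMetrics:
  Enabled: true
  APIHost: https://api.honeycomb.io
  APIKey: YOUR_API_KEY          # hypothetical placeholder key
  Dataset: Refinery Metrics     # Honeycomb dataset that receives the metrics
  ReportingInterval: 30s
PrometheusMetrics:
  Enabled: true
  ListenAddr: localhost:2112    # address Prometheus can scrape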
Below is a summary of the key recorded metrics by type. For a complete list of available metrics, refer to the Honeycomb Refinery Metrics Documentation.
Refinery’s system metrics include memory_inuse, num_goroutines, hostname, and process_uptime_seconds.
We recommend monitoring process_uptime_seconds alongside memory_inuse. If you see unexpected restarts, this could indicate that the process is hitting memory constraints.
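If you scrape these metrics with Prometheus, a simple alerting rule can surface restarts. This is a sketch only: it assumes the gauge is scraped under the name process_uptime_seconds, and your exporter setup may add a prefix or different labels:

groups:
  - name: refinery-health
    rules:
      - alert: RefineryRecentRestart
        # Uptime below 10 minutes means the process started recently.
        expr: process_uptime_seconds < 600
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Refinery process on {{ $labels.instance }} restarted recently"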
is_ready
is_alive
The collector refers to Refinery’s mechanism that intercepts and collects traces in a buffer. Ideally, it holds onto each trace until the root span has arrived; at that point, Refinery sends the trace to the sampler, which decides whether to keep or drop it. In some cases, Refinery may have to make a sampling decision before the root span arrives.
collect_cache_entries_*
collector_incoming_queue_*
collector_peer_queue_*
collector_collect_loop_duration_ms
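If these collector metrics show the cache or queues consistently near their limits, the collector buffer can be resized in config.yaml. A minimal sketch, assuming the Refinery 2.x Collection settings (field names and defaults may differ in your version):

Collection:
  CacheCapacity: 10000          # number of traces the buffer can hold at once
  AvailableMemory: 4Gb
  MaxMemoryPercentage: 75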
Sampler metrics vary with the type of sampler you have configured. Generally, there will be metrics for the number of traces dropped, the number of traces kept, and the sample rate. The metrics below are an example of what is emitted when a dynamic sampler is configured:
dynsampler_num_dropped
dynsampler_num_kept
dynsampler_sample_rate_*
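These dynsampler_* metrics correspond to a dynamic sampler in the rules file. A minimal sketch, assuming the Refinery 2.x rules format; the sampler type, goal rate, and field list shown here are illustrative:

RulesVersion: 2
Samplers:
  __default__:
    DynamicSampler:
      SampleRate: 10            # goal sample rate: keep roughly 1 in 10 traces
      FieldList:
        - service.name
        - http.status_code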
A Refinery host may receive spans both from outside Refinery and from other hosts within the Refinery cluster.
In the following fields, incoming refers to the process that is listening for events from outside Refinery, peer refers to the process that is listening for events redirected from a peer, and upstream refers to the Honeycomb API.
incoming_router_batch, peer_router_batch
incoming_router_event, peer_router_event
incoming_router_dropped, peer_router_dropped
incoming_router_span, peer_router_span
incoming_router_nonspan, peer_router_nonspan
The following fields can be used to get a better idea of the traffic that is flowing from incoming sources vs. from peer sources, and to track any errors from the Honeycomb API:
incoming_router_peer, peer_router_peer
incoming_router_proxied, peer_router_proxied
peer_enqueue_errors, upstream_enqueue_errors
peer_response_20x, upstream_response_20x
peer_response_errors, upstream_response_errors
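As a sketch, a Prometheus alerting rule over the upstream error counter might look like the following. It assumes the counter is scraped as upstream_response_errors; adjust the name for any prefix your exporter adds:

groups:
  - name: refinery-upstream
    rules:
      - alert: RefineryUpstreamErrors
        # Fires when Refinery keeps receiving error responses from the Honeycomb API.
        expr: rate(upstream_response_errors[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Refinery is receiving errors from the Honeycomb API"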
Trace metrics describe the traces held in the collector and the decisions made about them:
trace_accepted
trace_duration_ms_*: how long traces were held before being sent, which is influenced by the size of the cache, CacheCapacity. For more information, see collect_cache_buffer_overrun.
trace_send_dropped: the number of traces dropped by sampling. In dry run mode, these traces are still sent, with the decision recorded in the field named by DryRunFieldName.
trace_send_kept: the number of traces kept by sampling. In dry run mode, kept traces are likewise marked using DryRunFieldName.
trace_send_has_root: the number of traces sent after their root span arrived.
trace_send_no_root: the number of traces sent before their root span arrived. This can happen when the cache overruns; see collect_cache_buffer_overrun. Another reason why this could happen is if a node shuts down unexpectedly and sends the traces it currently has in its cache.
trace_sent_cache_hit
trace_span_count_*: if you see a high number of trace_send_no_root, the trace_span_count_* values may be undercounting, since this indicates that traces were not fully complete before they were sent.
The Stress Relief system monitors the following metrics to calculate the current stress level of the Refinery cluster:
collector_peer_queue_length
collector_incoming_queue_length
libhoney_peer_queue_length
libhoney_upstream_queue_length
memory_heap_allocation
The stress level is calculated and represented as the following two metrics:
stress_level: a gauge from 0 to 100, where 0 is no stress and 100 is maximum stress. By default, Stress Relief activates when stress_level reaches 90 and deactivates once it drops back to 75. These values are configurable as ActivationLevel and DeactivationLevel in the Refinery configuration file (see the sketch after this list).
stress_relief_activated: a gauge at 0 or 1.
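A minimal sketch of the corresponding configuration, assuming the Refinery 2.x StressRelief settings (field names may differ in your version):

StressRelief:
  Mode: monitor                 # activate automatically based on stress_level
  ActivationLevel: 90
  DeactivationLevel: 75
  SamplingRate: 100             # deterministic rate applied while Stress Relief is active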