Monitor Honeycomb Refinery

When scaling Refinery, use Refinery’s metrics to determine whether you need to adjust your general configuration and sampling rules.

Understanding Refinery’s Metrics 

Refinery emits a number of metrics that indicate its health, trace throughput, and sampling statistics. These metrics can be exposed to Prometheus or sent to Honeycomb; either option requires configuration in config.yaml. Below is a summary of the key recorded metrics by type. For a complete list of available metrics, please refer to the Honeycomb Refinery Metrics documentation.
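
As a rough sketch, a config.yaml that sends Refinery’s own metrics to Honeycomb and also exposes them for Prometheus scraping might look like the following. The option names assume Refinery 2.x (the LegacyMetrics and PrometheusMetrics groups); the API key, dataset name, and addresses are placeholders:

    LegacyMetrics:                      # send Refinery's metrics directly to Honeycomb
      Enabled: true
      APIHost: https://api.honeycomb.io
      APIKey: YOUR_API_KEY              # placeholder
      Dataset: Refinery Metrics
      ReportingInterval: 30s
    PrometheusMetrics:                  # expose a scrape endpoint for Prometheus
      Enabled: true
      ListenAddr: localhost:2112        # placeholder address for the metrics listener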

Note
Refinery exports a number of histogram metrics, as described in the next section. Querying histograms in Honeycomb is straightforward. Histograms in Prometheus, however, have a bit of a learning curve if you are not familiar with them. Refer to the Prometheus histogram documentation if you need a refresher.

Refinery System Metrics 

Refinery’s system metrics include memory_inuse, num_goroutines, hostname, and process_uptime_seconds. We recommend monitoring process_uptime_seconds alongside memory_inuse. If you see unexpected restarts, this could indicate that the process is hitting memory constraints.

Note
Refinery’s system metrics are only available when metrics are sent directly to Honeycomb. If metrics are scraped by Prometheus instead, Refinery’s system metrics are not available.

Refinery Health Check Metrics 

is_ready
This field indicates whether the system is ready to receive traffic. The value is either 0 or 1. 1 means the system is ready to receive and process traffic. 0 means the system is not ready to receive traffic.
is_alive
This field indicates whether the system is operational and reporting its status. The value is either 0 or 1. 1 means the system is alive and actively reporting its health status. 0 means the system is not alive, potentially indicating a failure.

Collector Metrics 

The collector refers to Refinery’s mechanism that intercepts and collects traces in a buffer. Ideally, it holds onto each trace until the root span has arrived. At that point, Refinery sends the trace to the sampler to make a decision whether to keep or drop the trace. In some cases, Refinery may have to make a sampling decision on the trace before the root span arrives.

collect_cache_entries_*
Records avg, max, min, p50, p95, and p99. Indicates how full the cache is over time.
collector_incoming_queue_*
Records avg, max, min, p50, p95, and p99. Indicates how full the queue is that holds spans received from outside Refinery and not yet processed by the collector.
collector_peer_queue_*
Records avg, max, min, p50, p95, and p99. Indicates how full the queue is that holds spans received from other Refinery peers and not yet processed by the collector.
collector_collect_loop_duration_ms
Records avg, max, min, p50, p95, and p99. Indicates the duration of each iteration of Refinery’s primary event processing loop.
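
If collect_cache_entries_* stays close to your configured cache size, or the incoming and peer queues remain persistently full, the collector may need a larger buffer. A minimal sketch of the relevant config.yaml setting, assuming the Refinery 2.x Collection group (the value is a placeholder to tune against your span volume and available memory):

    Collection:
      CacheCapacity: 10000   # number of traces the collector can buffer at once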

Sampler Metrics 

Sampler metrics vary with the type of sampler you have configured. Generally, there will be metrics for the number of traces dropped, the number of traces kept, and the sample rate. The fields below are an example of the metrics emitted when a dynamic sampler is configured:

dynsampler_num_dropped
The number of traces dropped by the sampler.
dynsampler_num_kept
The number of traces kept by the sampler.
dynsampler_sample_rate_*
Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured sampler.
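
For context, here is a minimal sketch of a dynamic sampler rule, which is what produces the dynsampler_* metrics above. The structure assumes Refinery 2.x rules syntax; the sample rate and field list are placeholders:

    Samplers:
      __default__:               # applies where no more specific rule is defined
        DynamicSampler:
          SampleRate: 10         # target: keep roughly 1 in 10 traces per key
          FieldList:             # span fields whose combination defines a sampling key
            - http.status_code
            - http.route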

Incoming and Peer Router Metrics 

A Refinery host may receive spans both from outside Refinery and from other hosts within the Refinery cluster. In the following fields, incoming refers to the process that is listening for incoming events from outside Refinery, and peer refers to the process that is listening for events redirected from a peer; upstream refers to the Honeycomb API.
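
These three terms map onto the listener and upstream settings in config.yaml. A sketch assuming the Refinery 2.x Network group (addresses are placeholders):

    Network:
      ListenAddr: 0.0.0.0:8080                  # incoming: events arriving from outside Refinery
      PeerListenAddr: 0.0.0.0:8081              # peer: events redirected from other Refinery nodes
      HoneycombAPI: https://api.honeycomb.io    # upstream: the Honeycomb API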

incoming_router_batch, peer_router_batch
These values increment when Refinery’s batch event processing endpoint is hit.
incoming_router_event, peer_router_event
These values increment when Refinery’s single event processing endpoint is hit.
incoming_router_dropped, peer_router_dropped
These values increment when Refinery fails to add new spans to a receive buffer when processing new events. These values should be monitored closely as they indicate that spans are being dropped.
incoming_router_span, peer_router_span
These values increment when Refinery accepts events that are part of a trace, also known as spans.
incoming_router_nonspan, peer_router_nonspan
These values increment when Refinery accepts other non-span events that are not part of a trace.

The following fields can be used to get a better idea of the traffic that is flowing from incoming sources vs. from peer sources, and to track any errors from the Honeycomb API:

  • incoming_router_peer, peer_router_peer
  • incoming_router_proxied, peer_router_proxied
  • peer_enqueue_errors, upstream_enqueue_errors
  • peer_response_20x, upstream_response_20x
  • peer_response_errors, upstream_response_errors

Trace Metrics 

trace_accepted
This field indicates that a new trace has been added to the collector’s cache.
trace_duration_ms_*
Records avg, max, min, p50, p95, and p99. This value can help determine the appropriate configuration for CacheCapacity. For more information, see collect_cache_buffer_overrun.
trace_send_dropped
Indicates the number of traces that were dropped by the sampler. When dry run mode is enabled, this metric will remain 0, reflecting that all traces are being sent to Honeycomb.
trace_send_kept
Indicates the number of traces that were kept by the sampler. When dry run mode is enabled, this metric will increment for each trace. In either case, you can still see the result of each sampling decision by filtering on the field configured by DryRunFieldName (see the dry run sketch after this list).
trace_send_has_root
Indicates that the trace was complete, with its root span present, when it was sent. This is generally what you want: if a trace is sent before it is complete, the sampling decision may have been made on incomplete data and so may not match your criteria.
trace_send_no_root
Indicates that traces are being sent before they are completed. This field often correlates with collect_cache_buffer_overrun. Another reason why this could happen is if a node shuts down unexpectedly and sends the traces it currently has in its cache.
trace_sent_cache_hit
Indicates that Refinery received a span belonging to a trace that had already been sent. In this case, Refinery checks the sampling decision for the trace and either sends the span along to Honeycomb immediately, or drops the span.
trace_span_count_*
Records avg, max, min, p50, p95, and p99. Use this field as an indication of how large your traces are. Note that if you are seeing a high number of trace_send_no_root, the trace_span_count_* values may be undercounting, since this indicates that traces were not fully complete before they were sent.
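
Dry run mode, referenced under trace_send_dropped and trace_send_kept above, makes sampling decisions but still sends every trace to Honeycomb. A minimal sketch of enabling it, assuming the Refinery 2.x Debugging group (in Refinery 1.x, DryRun and DryRunFieldName are top-level options):

    Debugging:
      DryRun: true   # record what the sampler would have decided, but send every trace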

Stress Relief Metrics 

The Stress Relief system monitors these metrics to calculate the current stress level of the Refinery cluster:

  • collector_peer_queue_length
  • collector_incoming_queue_length
  • libhoney_peer_queue_length
  • libhoney_upstream_queue_length
  • memory_heap_allocation

The stress level is calculated and represented as the following two metrics:

stress_level: a gauge from 0 to 100, where 0 is no stress and 100 is maximum stress. By default, Stress Relief activates when stress_level reaches 90 and deactivates once it falls back to 75. These thresholds are configurable as ActivationLevel and DeactivationLevel in the Refinery configuration file.

stress_relief_activated: a gauge that is either 0 or 1, indicating whether Stress Relief is currently active.
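
Both thresholds are set in the StressRelief section of the configuration. A sketch using the ActivationLevel and DeactivationLevel options mentioned above; the Mode value is an assumption based on Refinery 2.x and should be checked against your version’s configuration reference:

    StressRelief:
      Mode: monitor            # assumed option: let Refinery activate Stress Relief on its own
      ActivationLevel: 90      # activate when stress_level reaches 90
      DeactivationLevel: 75    # deactivate once stress_level falls back to 75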