
Scale and Troubleshoot

When scaling and troubleshooting Refinery, use its metrics to determine whether your general configuration and sampling rules need adjustment.

When troubleshooting, changing the logging level to debug in pre-production can be helpful. Also, use the health check API endpoints to test whether your Refinery instance is bootstrapped.

Scaling and Sizing

Refinery provides configuration options that allow operators to tune it to handle different volumes and shapes of telemetry data. This section walks through tuning that configuration based on the metrics that Refinery exports.

Refinery is a stateful service and is not optimized for dynamic auto-scaling. Changes in cluster membership can result in temporary inconsistent sampling decisions and dropped traces. As such, we recommend provisioning Refinery for your anticipated peak load.

Sizing The Cache

To size the cache, set MaxAlloc to 80% of the system’s RAM in bytes and set CacheCapacity to the MaxAlloc value divided by 1000.
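
For example, on a host with 16 GB of RAM dedicated to Refinery, a starting point might look like the following sketch. The [InMemCollector] section name is an assumption taken from Refinery’s sample configuration; verify the exact section and field names against your version’s config.toml reference, and treat the values as illustrative rather than as recommendations.

[InMemCollector]
# 80% of a 16 GB (16,000,000,000-byte) host, in bytes
MaxAlloc = 12800000000
# MaxAlloc divided by 1000
CacheCapacity = 12800000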

Monitor process_uptime_seconds and look for restarts. If Refinery restarts due to Out Of Memory exceptions or due to the host’s Out Of Memory Killer, decrease MaxAlloc to give Refinery more headroom on the system.

Monitor collect_cache_capacity. Refinery will adjust collect_cache_capacity down to fit the MaxAlloc configuration. Once Refinery reaches a steady state, update CacheCapacity to match collect_cache_capacity.

Sizing The Receive Buffers

Monitor incoming_router_dropped and peer_router_dropped, and look for values above 0. If either metric is consistently above 0, increase CacheCapacity. The receive buffers are sized at three times CacheCapacity.

Sizing The Send Buffers

Monitor libhoney_peer_queue_overflow and look for values above 0. If it is consistently above 0, increase PeerBufferSize. The default PeerBufferSize is 10,000.

Monitor libhoney_upstream_queue_length and check that its values stay under UpstreamBufferSize. If the queue reaches UpstreamBufferSize, Refinery will block while waiting to send upstream to the Honeycomb API. Adjust UpstreamBufferSize as needed. The default UpstreamBufferSize is 10,000.
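
As a rough sketch, raising these buffers in config.toml might look like the following. The assumption that PeerBufferSize and UpstreamBufferSize are top-level fields comes from Refinery’s sample configuration, and the values are illustrative only:

# Defaults for both are 10,000; the values here are illustrative, not recommendations.
# Raise PeerBufferSize if libhoney_peer_queue_overflow is consistently above 0.
PeerBufferSize = 30000
# Raise UpstreamBufferSize if libhoney_upstream_queue_length keeps reaching the configured limit.
UpstreamBufferSize = 30000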

Scaling The CPU

Monitor CPU usage on the host(s), and target 80% CPU usage. Spiking to 90% is acceptable, but avoid spiking to 100%. If CPU utilization is too high, add more cores or more hosts as needed.

Scaling The RAM

Monitor collect_cache_buffer_overrun and look for values above 0. If it is consistently above 0, add more RAM or more hosts as needed. Note that occasional blips are acceptable (see collector metrics). If you add more RAM, do not forget to re-size the cache.

Understanding Refinery’s Metrics

Refinery emits a number of metrics that indicate its health as well as its trace throughput and sampling statistics. These metrics can be exposed to Prometheus or sent to Honeycomb; either option requires configuration in config.toml. Below is a summary of recorded metrics by type.
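
As a sketch, the relevant config.toml settings might look like one of the two options below. The field and section names are assumptions based on Refinery’s sample configuration; check them against your version’s configuration reference.

# Option 1: expose metrics for Prometheus to scrape.
Metrics = "prometheus"

[PrometheusMetrics]
# Address on which to serve Prometheus metrics (illustrative value).
MetricsListenAddr = "localhost:2112"

# Option 2: send metrics to a Honeycomb dataset instead.
# Metrics = "honeycomb"
#
# [HoneycombMetrics]
# MetricsHoneycombAPI = "https://api.honeycomb.io"
# MetricsAPIKey = "YOUR_API_KEY"
# MetricsDataset = "refinery-metrics"
# MetricsReportingInterval = 3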

System Metrics

Refinery’s system metrics include memory_inuse, num_goroutines, hostname, and process_uptime_seconds. We recommend monitoring process_uptime_seconds alongside memory_inuse. If you see unexpected restarts, this could indicate that the process is hitting memory constraints.

Collector Metrics

The collector refers to Refinery’s mechanism that intercepts and collects traces in a circular buffer. Ideally, it holds onto each trace until the root span has arrived. At that point, Refinery sends the trace to the sampler to make a decision whether to keep or drop the trace. In some cases, Refinery may have to make a sampling decision on the trace before the root span arrives.

collect_cache_buffer_overrun
This value should remain zero; a positive value could indicate the need to grow the size of the collector’s circular buffer. (The size of the circular buffer is set via the configuration field CacheCapacity.) Note that an increasing collect_cache_buffer_overrun does not necessarily mean that the cache is full: you may see this value increase while collect_cache_entries remains low compared to collect_cache_capacity. This is due to the circular nature of the buffer, and can occur when traces stay unfinished for a long time under high-throughput traffic. Any time a trace persists for longer than it takes to accept as many traces as collect_cache_capacity (that is, to make a full circle around the ring), a cache buffer overrun is triggered. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration, both of which are tracked via other metrics. When a cache buffer overrun is triggered, a trace has been sent to Honeycomb before it completed. Depending on your tracing strategy, this could result in an incorrect sampling decision for that trace: if all the fields your sampling rules depend on had already been received, the decision could still be correct, but if some of those fields had not been received yet, the decision could be incorrect.
collect_cache_capacity
Equivalent to the value set in your configuration for CacheCapacity. Use this value in conjunction with collect_cache_entries to see how full the cache is getting over time.
collect_cache_entries_*
Records avg, max, min, p50, p95, and p99. Indicates how full the cache is over time.
collector_incoming_queue_*
Records avg, max, min, p50, p95, and p99. Indicates the fullness of the queue of spans that were received from outside of Refinery and are waiting to be processed by the collector.
collector_peer_queue_*
Records avg, max, min, p50, p95, and p99. Indicates the fullness of the queue of spans that were received from other Refinery peers and are waiting to be processed by the collector.

Sampler Metrics

Sampler metrics vary with the type of sampler you have configured. Generally, there will be metrics for the number of traces dropped, the number of traces kept, and the sample rate. The fields below are an example of the metrics emitted when the dynamic sampler is configured; a configuration sketch follows the list:

dynsampler_num_dropped
The number of traces dropped by the sampler.
dynsampler_num_kept
The number of traces kept by the sampler.
dynsampler_sample_rate_*
Records avg, max, min, p50, p95, and p99 of the sample rate reported by the configured sampler.
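
For reference, a minimal dynamic sampler configuration might look like the sketch below. This assumes, based on Refinery’s sample rules file, that sampler settings live in a separate rules file (rules.toml) keyed by dataset name; my-dataset is a placeholder, and the field names should be verified against your version:

[my-dataset]
# Enable the dynamic sampler for this dataset (placeholder name).
Sampler = "DynamicSampler"
# Target sample rate for common traffic.
SampleRate = 10
# Fields used to group traces when computing per-key sample rates.
FieldList = ["request.method", "response.status_code"]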

Incoming and Peer Router Metrics

A Refinery host may receive spans both from outside Refinery and from other hosts within the Refinery cluster. In the following fields, incoming refers to the process that is listening for incoming events from outside Refinery and peer refers to the process that is listening for events redirected from a peer. upstream refers to the Honeycomb API.

incoming_router_batch, peer_router_batch
These values increment when Refinery’s batch event processing endpoint is hit.
incoming_router_event, peer_router_event
These values increment when Refinery’s single event processing endpoint is hit.
incoming_router_dropped, peer_router_dropped
These values increment when Refinery fails to add new spans to a receive buffer when processing new events. These values should be monitored closely as they indicate that spans are being dropped.
incoming_router_span, peer_router_span
These values increment when Refinery accepts events that are part of a trace, also known as spans.
incoming_router_nonspan, peer_router_nonspan
These values increment when Refinery accepts other non-span events that are not part of a trace.

The following fields can be used to get a better idea of the traffic that is flowing from incoming sources vs. from peer sources, and to track any errors from the Honeycomb API:

  • incoming_router_peer, peer_router_peer
  • incoming_router_proxied, peer_router_proxied
  • peer_enqueue_errors, upstream_enqueue_errors
  • peer_response_20x, upstream_response_20x
  • peer_response_errors, upstream_response_errors

Trace Metrics

trace_accepted
This field indicates that a new trace has been added to the collector’s cache.
trace_duration_ms_*
Records avg, max, min, p50, p95, and p99. This value can help determine the appropriate configuration for CacheCapacity. For more information, see collect_cache_buffer_overrun.
trace_send_dropped
Indicates the number of traces that were dropped by the sampler. When dry run mode is enabled, this metric will increment for each trace. In this case, you can still see the result of sampling decisions by filtering by the configured field for DryRunFieldName.
trace_send_kept
Indicates the number of traces that were kept by the sampler. When dry run mode is enabled, this metric will remain 0, reflecting that all traces are being sent to Honeycomb. In this case, you can still see the result of sampling decisions by filtering on the configured field for DryRunFieldName; see the configuration sketch after this list.
trace_send_has_root
Indicates that the trace was fully finished when it was sent. This is generally what you want to happen, since if the trace was not complete when it was sent, this could indicate an incorrect sampling decision based on your criteria.
trace_send_no_root
Indicates that traces are being sent before they are completed. This field often correlates with collect_cache_buffer_overrun. Another reason why this could happen is if a node shuts down unexpectedly and sends the traces it currently has in its cache.
trace_sent_cache_hit
Indicates that Refinery received a span belonging to a trace that had already been sent. In this case, Refinery checks the sampling decision for the trace and either sends the span along to Honeycomb immediately, or drops the span.
trace_span_count_*
Records avg, max, min, p50, p95, and p99. Use this field as an indication of how large your traces are. Note that if you are seeing a high number of trace_send_no_root, the trace_span_count_* values may be undercounting, since this indicates that traces were not fully complete before they were sent.
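
If you want to evaluate sampling decisions without dropping any traces, dry run mode can be enabled. The sketch below assumes, based on Refinery’s sample rules file, that DryRun and DryRunFieldName live at the top level of the rules file; treat the field placement and the example field name as assumptions to verify:

# Send every trace to Honeycomb while still recording what the sampling
# decision would have been on each span.
DryRun = true
# Span field that records whether the trace would have been kept.
DryRunFieldName = "refinery_kept"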

Troubleshooting

Debug Logs

The default logging level of warn is almost entirely silent. The debug level emits too much data to be used in production, but contains excellent information in a pre-production environment. Setting the logging level to debug during initial configuration will help you understand what’s working and what’s not; once traffic volumes increase, set it back to warn.
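
A minimal sketch of the relevant setting, assuming the LoggingLevel field from Refinery’s sample config.toml:

# Use "debug" in pre-production to see what Refinery is doing;
# switch back to "warn" before traffic volumes increase.
LoggingLevel = "debug"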

Restarts

Refinery does not yet buffer traces or sampling decisions to disk. When you restart the process, all in-flight traces will be flushed and sent upstream to Honeycomb, but you will lose the record of past trace decisions. When started back up, Refinery will start with a clean slate.

Health Checks

Use the health check API endpoints to determine whether an instance is bootstrapped. The Refinery cluster machines will respond to two different health check endpoints.

/alive

This /alive API call returns a 200 JSON response. It does not perform any checks beyond confirming that the web server responds to requests.

{
  "source": "refinery",
  "alive": "yes"
}

/x/alive

This /x/alive API call returns a 200 JSON response that has been proxied from the Honeycomb API. It can be used to determine whether the instance is able to communicate with Honeycomb.

{
  "alive": "yes"
}