When scaling and troubleshooting Refinery, use Refinery’s metrics to determine if adjustment is needed in your general configuration and sampling rules.
When troubleshooting, setting the logging level to debug in pre-production may be helpful. Also, use the health check API endpoints to test whether your Refinery instance is bootstrapped.
Refinery provides a range of configuration options that allow operators to tune it to handle different volumes and shapes of telemetry data. In this section, we walk through tuning that configuration based on the various metrics that Refinery exports.
In an ideal world with consistent, steady traffic and no traffic bursts, the proper Refinery cache configuration would be a MaxAlloc of 100% of the system’s RAM in bytes and a CacheCapacity equal to MaxAlloc divided by the average number of bytes in a trace.
Unfortunately, we do not live in an ideal world.
Instead, we provide an exploratory approach to sizing Refinery based on experimentation using your actual traffic pattern and volume.
As a rough starting point, set MaxAlloc to 80% of the system’s RAM in bytes and set CacheCapacity to the MaxAlloc value divided by 10,000.
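As a sketch of what that starting point could look like for a host with 16 GiB of RAM, assuming MaxAlloc and CacheCapacity live under the Collection section of config.yaml (check the configuration reference for your Refinery version):

Collection:
  MaxAlloc: 13743895347      # 80% of 16 GiB, in bytes
  CacheCapacity: 1374389     # MaxAlloc divided by 10,000, rounded down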
To tune the MaxAlloc value, monitor process_uptime_seconds and look for restarts. If Refinery restarts due to Out of Memory exceptions or due to the host’s Out of Memory Killer, decrease MaxAlloc to give Refinery more headroom on the system.
To tune the CacheCapacity value, monitor collect_cache_capacity. Refinery will adjust collect_cache_capacity down to fit the MaxAlloc configuration. Once Refinery reaches a steady state, update CacheCapacity to match collect_cache_capacity.
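As a hypothetical example, if collect_cache_capacity settles around 900,000 once Refinery reaches a steady state, update the configuration to match (again assuming the Collection section of config.yaml):

Collection:
  CacheCapacity: 900000      # hypothetical steady-state collect_cache_capacity reading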
Monitor incoming_router_dropped and peer_router_dropped, and look for values above 0. If either metric is consistently above 0, increase CacheCapacity. The receive buffers are always sized at three times the CacheCapacity.
Monitor libhoney_peer_queue_overflow and look for values above 0. If it is consistently above 0, increase PeerBufferSize. The default PeerBufferSize is 10,000.
Monitor libhoney_upstream_queue_length and look for values to stay under the UpstreamBufferSize value. If it hits UpstreamBufferSize, then Refinery will block waiting to send upstream to the Honeycomb API. Adjust UpstreamBufferSize as needed. The default UpstreamBufferSize is 10,000.
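If either buffer needs to grow, the change is a small configuration edit. The sketch below uses illustrative values and assumes these options live under a BufferSizes section in config.yaml (verify the section name for your Refinery version):

BufferSizes:
  PeerBufferSize: 30000       # raise from the 10,000 default if libhoney_peer_queue_overflow stays above 0
  UpstreamBufferSize: 30000   # raise from the 10,000 default if libhoney_upstream_queue_length keeps hitting the limit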
Monitor CPU usage on the host(s) and target 80% CPU usage. Spiking to 90% is acceptable, but avoid spiking to 100%. If CPU utilization is too high, add more cores or more hosts as needed.
Monitor collect_cache_buffer_overrun and look for values above 0. If it is consistently above 0, add more RAM or more hosts as needed. Note that occasional blips are acceptable (see the collector metrics below). If you add more RAM, do not forget to re-size the cache.
Refinery emits a number of metrics to give indications about its health, as well as its trace throughput and sampling statistics. These metrics can be exposed to Prometheus or sent to Honeycomb; either destination requires configuration in config.yaml.
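As a rough sketch, enabling both destinations might look like the following; the section and option names here are assumptions based on a recent Refinery configuration layout, so check the configuration reference for your version:

PrometheusMetrics:
  Enabled: true
  ListenAddr: localhost:2112   # scrape endpoint for Prometheus
LegacyMetrics:
  Enabled: true                # send Refinery's own metrics to Honeycomb
  APIKey: YOUR_API_KEY
  Dataset: refinery-metrics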
Below is a summary of recorded metrics by type.
Refinery’s system metrics include memory_inuse, num_goroutines, hostname, and process_uptime_seconds. We recommend monitoring process_uptime_seconds alongside memory_inuse. If you see unexpected restarts, this could indicate that the process is hitting memory constraints.
The collector refers to Refinery’s mechanism that intercepts and collects traces in a circular buffer. Ideally, it holds onto each trace until the root span has arrived. At that point, Refinery sends the trace to the sampler to make a decision whether to keep or drop the trace. In some cases, Refinery may have to make a sampling decision on the trace before the root span arrives.
collect_cache_buffer_overrun: Incremented each time a cache buffer overrun occurs. (If this value grows consistently, increase CacheCapacity.)

Note that if collect_cache_buffer_overrun is increasing, it does not necessarily mean that the cache is full. You may see this value increasing while collect_cache_entries values remain low in comparison to collect_cache_capacity. This is due to the circular nature of the buffer, and can occur when traces stay unfinished for a long time in the face of high-throughput traffic. Any time a trace persists for longer than the time it takes to accept the same number of traces as collect_cache_capacity (that is, the time for the buffer to make a full circle around the ring), a cache buffer overrun is triggered. Setting CacheCapacity therefore depends not only on trace throughput but also on trace duration (both of which are tracked via other metrics). When a cache buffer overrun is triggered, it means that a trace has been sent to Honeycomb before it is complete. Depending on your tracing strategy, this could result in an incorrect sampling decision for the trace. For example, if all the fields referenced by your sampling rules have already been received, the decision could be correct. However, if some of those fields have not arrived yet, the sampling decision could be incorrect.

collect_cache_capacity: The current cache capacity, which normally matches the configured CacheCapacity (Refinery may adjust it down to fit MaxAlloc). Use this value in conjunction with collect_cache_entries to see how full the cache is getting over time.
collect_cache_entries_*: The number of entries in the trace cache, reported as several aggregate values.
collector_incoming_queue_*: The length of the collector's incoming queue, reported as several aggregate values.
collector_peer_queue_*: The length of the collector's peer queue, reported as several aggregate values.
Sampler metrics will vary with the type of sampler you have configured. Generally, there will be metrics on the number of traces dropped, the number of traces kept, and the sample rate. The fields below are an example of the metrics when the dynamic sampler is configured:
dynsampler_num_dropped
dynsampler_num_kept
dynsampler_sample_rate_*
A Refinery host may receive spans both from outside Refinery and from other hosts within the Refinery cluster.
In the following fields, incoming refers to the process that is listening for incoming events from outside Refinery, peer refers to the process that is listening for events redirected from a peer, and upstream refers to the Honeycomb API.
incoming_router_batch, peer_router_batch
incoming_router_event, peer_router_event
incoming_router_dropped, peer_router_dropped
incoming_router_span, peer_router_span
incoming_router_nonspan, peer_router_nonspan
The following fields can be used to get a better idea of the traffic that is flowing from incoming sources vs. from peer sources, and to track any errors from the Honeycomb API:
incoming_router_peer, peer_router_peer
incoming_router_proxied, peer_router_proxied
peer_enqueue_errors, upstream_enqueue_errors
peer_response_20x, upstream_response_20x
peer_response_errors, upstream_response_errors
trace_accepted
trace_duration_ms_*: Trace duration, together with throughput, determines how large CacheCapacity needs to be. For more information, see collect_cache_buffer_overrun.
trace_send_dropped: The number of traces dropped. In dry run mode, the decision is recorded in the field named by DryRunFieldName rather than the trace being dropped.
trace_send_kept: The number of traces kept. In dry run mode, the decision is also recorded in the field named by DryRunFieldName.
trace_send_has_root
trace_send_no_root: The number of traces sent before their root span arrived. One cause is a cache buffer overrun; see collect_cache_buffer_overrun. Another reason this can happen is a node shutting down unexpectedly and sending the traces it currently has in its cache.
trace_sent_cache_hit
trace_span_count_*: If you see a lot of trace_send_no_root, the trace_span_count_* values may be undercounting, since this indicates that traces were not fully complete before they were sent.

The Stress Relief system monitors these metrics to calculate the current stress level of the Refinery cluster:
collector_peer_queue_length
collector_incoming_queue_length
libhoney_peer_queue_length
libhoney_upstream_queue_length
memory_heap_allocation
The stress level is calculated and represented as the following two metrics:
stress_level: a gauge from 0 to 100, where 0 is no stress and 100 is maximum stress. By default, Stress Relief activates when stress_level reaches 90 and deactivates once it falls back to 75. These values are configurable as ActivationLevel and DeactivationLevel in the Refinery configuration file.
stress_relief_activated: a gauge at 0 or 1, indicating whether Stress Relief is currently active.
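A minimal sketch of adjusting those thresholds, assuming a StressRelief section in config.yaml:

StressRelief:
  ActivationLevel: 90      # stress_level at which Stress Relief turns on
  DeactivationLevel: 75    # stress_level at which it turns off again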
The default logging level is warn. The debug level emits too much data to be used in production, but contains excellent information in a pre-production environment. Setting the logging level to debug during initial configuration will help you understand what is working and what is not, but once traffic volumes increase, it should be set back to warn.
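For example, the logging level could be raised along these lines; the Logger section shown here is an assumption, and older TOML-based configurations expose the level through a different key:

Logger:
  Level: debug    # switch back to warn before production traffic volumes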
Refinery does not yet buffer traces or sampling decisions to disk. When you restart the process, all in-flight traces will be flushed and sent upstream to Honeycomb, but you will lose the record of past trace decisions. When started back up, Refinery will start with a clean slate.
Configuration file formats (TOML and YAML) can be confusing to read and write.
You can check the loaded configuration by calling one of the /query endpoints from the command line on a server that can access a Refinery host.
The /query endpoints are protected and can be enabled by specifying QueryAuthToken in the configuration file or by setting REFINERY_QUERY_AUTH_TOKEN in Refinery’s environment.
All requests to any /query endpoint must include the header X-Honeycomb-Refinery-Query set to the value of the specified token.
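For example, to protect the /query endpoints with the token used in the commands below, the configuration might look like this (the Debugging section is an assumption; the REFINERY_QUERY_AUTH_TOKEN environment variable is an alternative):

Debugging:
  QueryAuthToken: my-local-token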
Retrieve the entire rules configuration in the desired format from Refinery:
curl --get $REFINERY_HOST/query/allrules/$FORMAT --header "x-honeycomb-refinery-query: my-local-token"
where:
$REFINERY_HOST should be the URL of your Refinery.
$FORMAT can be one of json, yaml, or toml.

Retrieve the rule set that Refinery uses for the environment (or dataset in Classic Mode) defined in the variable $ENVIRON:
curl --get $REFINERY_HOST/query/rules/$FORMAT/$ENVIRON --header "x-honeycomb-refinery-query: my-local-token"
where:
$REFINERY_HOST should be the URL of your Refinery.
$FORMAT can be one of json, yaml, or toml.
$ENVIRON is the name of the environment (or dataset, in Classic Mode).

The response contains a map of the sampler type to its rule set.
Retrieve information about the configurations currently in use, including the timestamp when the configuration was last loaded:
curl --include --get $REFINERY_HOST/query/configmetadata --header "x-honeycomb-refinery-query: my-local-token"
where:
$REFINERY_HOST should be the URL of your Refinery.

The response contains a JSON blob of information about Refinery’s configurations. It will look something like this:
[
{
"type": "config",
"id": "tools/loadtest/config.yaml",
"hash": "1047bb6140b487ecdb0745f3335b6bc3",
"loaded_at": "2022-11-08T22:24:18-05:00"
},
{
"type": "rules",
"id": "tools/loadtest/rules.yaml",
"hash": "2d88389e1ff6530fba53466973e591e0",
"loaded_at": "2022-11-08T22:24:18-05:00"
}
]
For file-based configurations (the only type currently supported), the hash value is identical to the value generated by the md5sum command available in major operating systems.
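For example, to verify that a host is running the configuration file you expect, compare the hash from /query/configmetadata with a locally computed one, using the example path and hash from the response above:

md5sum tools/loadtest/config.yaml
# expected output: 1047bb6140b487ecdb0745f3335b6bc3  tools/loadtest/config.yaml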
Refinery can send telemetry that includes information that can help debug the sampling decisions that are made.
To enable it, set AddRuleReasonToTrace to true in the configuration file. Traces sent to Honeycomb will then include a field, meta.refinery.reason, containing text that indicates which evaluated rule caused the trace to be included.
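A minimal sketch of the setting, assuming a recent configuration layout where it lives under RefineryTelemetry (older configurations may expect it at the top level):

RefineryTelemetry:
  AddRuleReasonToTrace: true   # adds meta.refinery.reason to traces sent to Honeycomb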
The rules comparisons in Refinery’s Rules-Based Sampler take the datatype of the fields into account.
In particular, a rule that compares status_code to 200 (an integer) will fail if the status code is actually "200" (a string), and vice versa. In a mixed environment where either datatype may be included in the telemetry, you should create a separate rule for each case.
This situation can be hard to diagnose, because Honeycomb’s backend converts all the values of a given field to the datatype specified in the dataset schema. Inspection of the data in Honeycomb will not give any indication that this has happened. If you see rules that appear to not execute when they should have, please consider this possibility of incorrect datatype.
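To illustrate the separate-rule-per-datatype advice, a rules file might contain two otherwise identical rules; the structure below approximates a rules-based sampler configuration and the rule names are illustrative, so adapt it to your rules file format:

Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        - Name: status code sent as an integer
          SampleRate: 1
          Conditions:
            - Field: status_code
              Operator: "="
              Value: 200
        - Name: status code sent as a string
          SampleRate: 1
          Conditions:
            - Field: status_code
              Operator: "="
              Value: "200"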
Use health check API endpoints to determine if an instance is bootstrapped. The Refinery cluster machines respond to two different health check endpoints via HTTP:
/alive
This /alive API call will return a 200 JSON response.
It does not perform any checks beyond the web server’s response to requests.
{
"source": "refinery",
"alive": "yes"
}
/x/alive
This /x/alive API call will return a 200 JSON response that has been proxied from the Honeycomb API.
This can be used to determine if the instance is able to communicate with Honeycomb.
{
"alive": "yes"
}
If gRPC is configured, Refinery also responds to a standard gRPC Health Probe.
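For example, the HTTP endpoints can be exercised from the command line, reusing the $REFINERY_HOST variable from the earlier examples; the gRPC address shown is an assumption to replace with your own gRPC ListenAddr:

curl $REFINERY_HOST/alive
curl $REFINERY_HOST/x/alive
# if gRPC is enabled, a standard probe tool such as grpc_health_probe also works:
grpc_health_probe -addr=localhost:4317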