Refinery provides configuration options that let operators tune it for different volumes and shapes of telemetry data.
We recommend increasing the amount of RAM and the number of cores after your initial setup. Use the scaling documentation below and the troubleshooting documentation to learn more.
In an ideal world with consistent, steady traffic and no bursts, the proper Refinery cache configuration would be to set AvailableMemory to use all of the available system RAM.
Unfortunately, we do not live in an ideal world.
Instead, we provide an exploratory approach to sizing Refinery based on experimentation using your actual traffic pattern and volume.
As a rough starting point, set MaxMemoryPercentage to 75 to use 75% of AvailableMemory, and set CacheCapacity to that same percentage of the AvailableMemory value, in bytes, divided by 10,000.
For example, if the system’s RAM is 4 GB and 75% of that value is approximately 3 GB, then set CacheCapacity to 300_000.
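The arithmetic behind this starting point can be sketched in a few lines of Python. This is only an illustration of the rule of thumb described above: the function name `starting_cache_capacity` is hypothetical and not part of Refinery, and the sketch assumes decimal gigabytes (1 GB = 1,000,000,000 bytes) so that it matches the 4 GB example.

```python
def starting_cache_capacity(ram_bytes: int, max_memory_percentage: int = 75) -> int:
    """Return a rough starting CacheCapacity for a host with ram_bytes of RAM.

    Hypothetical helper: takes the configured percentage of RAM, in bytes,
    and divides it by 10,000, as described in the text.
    """
    usable_bytes = ram_bytes * max_memory_percentage // 100
    return usable_bytes // 10_000


# 4 GB of RAM, using decimal gigabytes to match the example in the text.
capacity = starting_cache_capacity(4_000_000_000)
print(capacity)  # 300000
```

Treat the result as a first guess to refine through the monitoring steps that follow, not as a final value.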
To tune the MaxMemoryPercentage value, monitor process_uptime_seconds and look for restarts.
If Refinery restarts due to out-of-memory exceptions or the host’s out-of-memory killer, decrease MaxMemoryPercentage to give Refinery more headroom on the system.
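One way to spot restarts in process_uptime_seconds is to look for points where the value goes backwards between consecutive samples. The sketch below assumes you have already exported the metric as a time-ordered list of numbers; the function name `count_restarts` and the sample values are made up for illustration.

```python
from typing import Sequence


def count_restarts(uptime_samples: Sequence[float]) -> int:
    """Count how many times the uptime value dropped between samples.

    An uptime counter only goes down when the process has restarted,
    so each drop is treated as one restart.
    """
    restarts = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:
            restarts += 1
    return restarts


# Illustrative samples taken one minute apart (not real data).
samples = [120.0, 180.0, 240.0, 15.0, 75.0, 135.0, 10.0]
print(count_restarts(samples))  # 2
```

A drop in uptime only shows that a restart happened. Confirming that memory pressure caused it still requires checking the process and host logs.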
Monitor incoming_router_dropped and peer_router_dropped, and look for values above 0.
If either metric is consistently above 0, increase CacheCapacity.
The receive buffers are always three times the size of CacheCapacity, so increasing CacheCapacity also enlarges them.
Monitor libhoney_peer_queue_overflow and look for values above 0.
If it is consistently above 0, increase PeerBufferSize.
The default PeerBufferSize is 100,000.
Monitor libhoney_upstream_queue_length and check that its values stay under the UpstreamBufferSize value.
If it reaches UpstreamBufferSize, Refinery will block while waiting to send upstream to the Honeycomb API.
Adjust UpstreamBufferSize as needed.
The default UpstreamBufferSize is 10,000.
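To judge how close the upstream queue comes to its limit, it can help to express the peak queue length as a fraction of UpstreamBufferSize. The following is a minimal sketch, assuming the libhoney_upstream_queue_length samples are available as a list of integers; both function names are hypothetical and the sample values are invented.

```python
from typing import Sequence

DEFAULT_UPSTREAM_BUFFER_SIZE = 10_000  # default value stated in the text


def upstream_queue_utilization(
    queue_lengths: Sequence[int],
    buffer_size: int = DEFAULT_UPSTREAM_BUFFER_SIZE,
) -> float:
    """Return the peak queue length as a fraction of the buffer size."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    if not queue_lengths:
        return 0.0
    return max(queue_lengths) / buffer_size


def upstream_buffer_was_full(
    queue_lengths: Sequence[int],
    buffer_size: int = DEFAULT_UPSTREAM_BUFFER_SIZE,
) -> bool:
    """Return True if any sample reached the buffer size."""
    return any(length >= buffer_size for length in queue_lengths)


observed = [2_000, 7_500, 4_000]  # illustrative samples, not real data
print(upstream_queue_utilization(observed))  # 0.75
print(upstream_buffer_was_full(observed))    # False
```

A utilization that regularly approaches 1.0 suggests the buffer is worth enlarging before Refinery starts blocking.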
Monitor CPU usage on the host(s), and target 80% CPU usage. Spikes to 90% are acceptable, but avoid spikes to 100%. If CPU utilization is too high, add more cores or more hosts as needed.
Monitor collect_cache_buffer_overrun and look for values above 0.
If it is consistently above 0, add more RAM or more hosts as needed.
Note that occasional blips are acceptable (see collector metrics).
If you add more RAM, do not forget to resize the cache.
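Several of the checks above hinge on telling a metric that is "consistently above 0" apart from an occasional blip. The text does not define a threshold, so the sketch below picks one purely as an assumption: a metric counts as consistently above 0 when at least half of the samples in the window are non-zero. The function name `is_consistently_above_zero` and the `min_fraction` parameter are hypothetical.

```python
from typing import Sequence


def is_consistently_above_zero(
    samples: Sequence[float],
    min_fraction: float = 0.5,  # assumed threshold, not a Refinery setting
) -> bool:
    """Return True if at least min_fraction of the samples are above 0."""
    if not samples:
        return False
    nonzero = sum(1 for value in samples if value > 0)
    return nonzero / len(samples) >= min_fraction


# A single blip in an otherwise quiet window (illustrative values).
print(is_consistently_above_zero([0, 0, 3, 0, 0, 0, 0, 0]))  # False

# Overruns in most samples.
print(is_consistently_above_zero([2, 5, 0, 4, 1, 3]))  # True
```

Adjust the window length and `min_fraction` to match how much noise your own traffic produces.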