Scale and Size Honeycomb Refinery

Tip
Use the Refinery Board Template to create Boards that provide an overview of your sampling operations.

Refinery offers a range of configuration options to help operators tune it for varying volumes and shapes of telemetry data.

After your initial setup, we recommend increasing RAM and CPU cores as needed. Use the guidance on this page for scaling, and consult our troubleshooting documentation for additional support.

Note
Refinery is a stateful service and is not optimized for dynamic auto-scaling. Changes in cluster membership can temporarily cause inconsistent sampling decisions or dropped traces. We recommend provisioning Refinery for your anticipated peak load.

Understanding Stress Relief 

Refinery includes a built-in mechanism called Stress Relief that activates when the system is under heavy load. Frequent or prolonged activations indicate that Refinery is under-provisioned for the current load. You can monitor activations with the stress_relief_activated field in Refinery's internal metrics.

Identifying Resources to Adjust 

To determine which resources need to be increased, check the activation reasons in Refinery logs:

  1. Look for log messages like StressRelief has been activated.
  2. Check the reason field in the log message to understand what triggered the activation. For example, a reason of MaxAlloc indicates a sudden memory usage spike.
  3. Use this information to determine which resources need to be increased, such as memory, CPU, or queue sizes.
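
The steps above can be sketched as a small script that tallies activation reasons. This is only an illustration: it assumes logs are written as one JSON object per line, and the field name `msg` for the message text is hypothetical. Your log format and field names depend on how the Refinery logger is configured, so adjust them to match your output.

```python
import json
from collections import Counter

# Message text quoted in the steps above.
ACTIVATION_TEXT = "StressRelief has been activated"


def count_activation_reasons(lines, message_key="msg", reason_key="reason"):
    """Tally the reason values of Stress Relief activation log entries.

    Assumes one JSON object per line. message_key is a hypothetical
    field name and may differ in your log output.
    """
    reasons = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if ACTIVATION_TEXT in str(entry.get(message_key, "")):
            reasons[entry.get(reason_key, "unknown")] += 1
    return reasons


# Made-up sample lines, for illustration only.
sample = [
    '{"msg": "StressRelief has been activated", "reason": "MaxAlloc"}',
    '{"msg": "StressRelief has been activated", "reason": "MaxAlloc"}',
    '{"msg": "unrelated message"}',
    "not json",
]
print(count_activation_reasons(sample).most_common())
```

A reason that dominates the tally suggests which resource to look at first.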

Scaling Refinery 

Scaling Refinery effectively involves choosing the right balance between vertical and horizontal scaling.

Vertical vs. Horizontal Scaling 

We recommend prioritizing vertical scaling (adding resources to existing nodes) over horizontal scaling (adding more nodes) whenever possible. This approach:

  • Reduces cluster size
  • Decreases the amount of peer-to-peer communication traffic
  • Simplifies management by having fewer nodes to maintain

Make sure a small number of nodes can handle your peak load before you consider adding more instances.

Important
Refinery’s maximum throughput is limited by single-thread CPU performance. If Refinery is not using all of its allocated CPU but is still falling behind on processing incoming traffic, adding more CPU to a single host will not increase throughput. In this case, increase the cluster size to add parallel Refinery instances.

Managing Queues and Mapping Resources 

Queues control how spans are buffered before sampling. Proper queue configuration ensures that Refinery can handle peak load efficiently.

Configuring IncomingQueueSize 

The IncomingQueueSize value sets the maximum number of spans that a Refinery host can receive and queue for sampling. Monitor the current queue size using the collector_incoming_queue_length metric and watch for incoming_router_dropped values above 0.

Interpreting Queue Length Metrics 

Understand what queue behavior tells you about Refinery’s ability to handle incoming traffic.

  • Temporary increases: Normal during traffic spikes, when Refinery briefly cannot process incoming data as fast as it arrives.
  • Rising trend: Indicates Refinery is gradually falling behind the incoming load.
  • Queue at maximum: Indicates Refinery cannot handle peak load and is dropping data.
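
As a rough sketch, the three patterns above could be distinguished from a series of collector_incoming_queue_length samples as follows. The function name, the labels, and the `trend_fraction` threshold are all hypothetical choices for illustration, not Refinery settings.

```python
def classify_queue(samples, max_size, trend_fraction=0.8):
    """Classify queue-length samples (oldest first) into one of the patterns.

    max_size is the configured IncomingQueueSize. trend_fraction is an
    assumed threshold: the share of consecutive steps that must increase
    before the series is called a rising trend.
    """
    if not samples:
        return "no data"
    if max(samples) >= max_size:
        return "at maximum"  # queue filled up; data is likely being dropped
    steps = list(zip(samples, samples[1:]))
    if steps:
        rising = sum(1 for prev, cur in steps if cur > prev)
        if rising / len(steps) >= trend_fraction:
            return "rising trend"  # gradually falling behind
    if max(samples) > samples[-1] and max(samples) > samples[0]:
        return "temporary increase"  # spike that has since drained
    return "steady"


print(classify_queue([10, 200, 900, 150, 12], max_size=1000))
print(classify_queue([100, 200, 300, 450, 600], max_size=1000))
print(classify_queue([100, 500, 1000, 1000], max_size=1000))
```

In practice you would feed this from whatever system stores your Refinery metrics and tune the threshold to your sampling interval.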

Scaling Guidance 

Use these steps to decide how to adjust queues, CPU, and cluster size for optimal performance.

  1. Memory check: If memory_inuse stays below 80% of allocated memory, there is room to increase IncomingQueueSize to absorb load.
  2. Queue size limitation: Increasing queue size delays failure but does not increase overall throughput.
  3. CPU scaling: To increase throughput, identify whether CPU is the bottleneck and scale CPU resources accordingly.
  4. Horizontal scaling: Add instances only if vertical scaling is insufficient.
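
The decision order above can be expressed as a short function. This is a sketch under stated assumptions: it reads the memory check as "usage at or below 80% of allocated memory", and the two boolean inputs (`cpu_is_bottleneck`, `vertical_scaling_possible`) are hypothetical judgments you would make from your own metrics and infrastructure limits.

```python
def scaling_steps(memory_inuse, memory_allocated,
                  cpu_is_bottleneck, vertical_scaling_possible):
    """Return suggested actions, in the order given by the guidance above."""
    steps = []
    # Step 1: with memory headroom, a larger queue can absorb bursts.
    if memory_allocated > 0 and memory_inuse / memory_allocated <= 0.80:
        # Step 2 caveat: this delays failure but does not raise throughput.
        steps.append("increase IncomingQueueSize")
    # Steps 3 and 4: throughput needs CPU; prefer vertical over horizontal.
    if cpu_is_bottleneck:
        if vertical_scaling_possible:
            steps.append("add CPU to existing nodes")
        else:
            steps.append("add Refinery instances")
    return steps


print(scaling_steps(2.0, 4.0, cpu_is_bottleneck=True,
                    vertical_scaling_possible=True))
```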

Configuring PeerQueueSize 

The PeerQueueSize value sets the maximum number of spans that can be received from peer Refinery hosts and queued for sampling.

Apply the same scaling strategy as IncomingQueueSize, but note that adding instances to reduce peer queue length has diminishing returns: more peers increase overall cluster communication overhead, reinforcing the preference for vertical scaling.

Configuring AvailableMemory 

The AvailableMemory value sets the maximum amount of RAM that Refinery can use for processing and queues.

Setting Initial Memory Values 

Set memory values to ensure Refinery has enough headroom for normal operation.

Example
For a 4GB system, set AvailableMemory to ~3.4GB and MaxMemoryPercentage to 75% (~2.5GB usable).
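
The arithmetic behind this example is simply AvailableMemory multiplied by MaxMemoryPercentage. The helper below is a hypothetical name used only to show the calculation.

```python
def usable_memory_gb(available_memory_gb, max_memory_percentage):
    """Memory Refinery may use: AvailableMemory scaled by MaxMemoryPercentage."""
    return available_memory_gb * max_memory_percentage / 100


# Values from the example above: 3.4 GB at 75%.
print(round(usable_memory_gb(3.4, 75), 2))  # 2.55
```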

Tuning Memory for Stability 

Adjust memory allocations to prevent restarts and handle peak load safely.

Monitor process_uptime_seconds for unexpected restarts. If Refinery restarts due to Out-of-Memory exceptions or the host’s Out-of-Memory Killer, either increase the memory made available to the Refinery host or reduce MaxMemoryPercentage to provide more headroom.
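
One way to spot restarts is to look for points where process_uptime_seconds goes down between consecutive samples. The sketch below assumes you have already exported the samples in time order for a single host; the function name is hypothetical.

```python
def count_restarts(uptime_samples):
    """Count restarts in a series of process_uptime_seconds samples.

    A restart shows up as uptime dropping from one sample to the next.
    Samples must be in time order and come from a single host.
    """
    return sum(
        1
        for prev, cur in zip(uptime_samples, uptime_samples[1:])
        if cur < prev
    )


# Made-up samples taken every 60 seconds; uptime resets twice.
print(count_restarts([100, 160, 220, 15, 75, 135, 5]))  # 2
```

If restarts line up with memory peaks, that points toward the memory adjustments described above.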