Refinery is a stateful service and is not optimized for dynamic auto-scaling.
Changes in cluster membership can temporarily cause inconsistent sampling decisions or dropped traces.
We recommend provisioning Refinery for your anticipated peak load.
Understanding Stress Relief
Refinery includes a built-in mechanism called Stress Relief that activates when the system is under heavy load. Frequent or prolonged activations indicate that Refinery is under-provisioned for the current load. You can monitor this via the stress_relief_activated field in Refinery internal metrics.
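To make "frequent or prolonged" concrete, the following is a minimal sketch that summarizes a series of stress_relief_activated samples. It assumes the samples have already been exported from your metrics backend as 0/1 values ordered by time; the function name and the returned keys are illustrative and not part of Refinery.

```python
def summarize_activations(samples):
    """Summarize 0/1 stress_relief_activated samples, oldest first.

    Assumption: each sample is 1 while Stress Relief is active and 0 otherwise.
    """
    activations = 0   # number of separate activation episodes
    longest_run = 0   # longest streak of consecutive active samples
    current_run = 0
    previous_active = False
    for sample in samples:
        active = bool(sample)
        if active:
            current_run += 1
            if not previous_active:
                activations += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0
        previous_active = active
    active_count = sum(1 for sample in samples if sample)
    active_fraction = active_count / len(samples) if samples else 0.0
    return {
        "activations": activations,
        "longest_run": longest_run,
        "active_fraction": active_fraction,
    }
```

A high activation count points to frequent activations, and a long run points to prolonged ones. What counts as "too many" depends on your sampling interval and is left to the operator.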
Identifying Resources to Adjust
To determine which resources need to be increased, check the activation reasons in Refinery logs:
- Look for log messages like "StressRelief has been activated".
- Check the reason field in the log message to understand what triggered the activation. For example, a reason of MaxAlloc indicates a sudden memory usage spike.
- Use this information to determine which resources need to be increased, such as memory, CPU, or queue sizes.
Scaling Refinery
Scaling Refinery effectively involves choosing the right balance between vertical and horizontal scaling.
Vertical vs. Horizontal Scaling
We recommend prioritizing vertical scaling (adding resources to existing nodes) over horizontal scaling (adding more nodes) whenever possible. This approach:
- Reduces cluster size
- Decreases the amount of peer-to-peer communication traffic
- Simplifies management by having fewer nodes to maintain
Refinery’s maximum throughput is limited by single-thread CPU performance.
If Refinery is not using all of its allocated CPU but is still falling behind on processing incoming traffic, adding more CPU to a single host will not increase throughput.
In this case, increase cluster size to add parallel Refinery instances.
Managing Queues and Mapping Resources
Queues control how spans are buffered before sampling. Proper queue configuration ensures that Refinery can handle peak load efficiently.
Configuring IncomingQueueSize
The IncomingQueueSize value sets the maximum number of spans that a Refinery host can receive and queue for sampling.
Monitor the current queue size using the collector_incoming_queue_length metric and watch for incoming_router_dropped values above 0.
Interpreting Queue Length Metrics
Understand what queue behavior tells you about Refinery’s ability to handle incoming traffic.
- Temporary increases: Normal during traffic spikes when Refinery temporarily cannot process incoming data at arrival rate.
- Rising trend: Indicates Refinery is gradually falling behind the incoming load.
- Queue at maximum: Indicates Refinery cannot handle peak load and is dropping data.
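The three patterns above can be told apart programmatically. The sketch below classifies a window of collector_incoming_queue_length samples against the configured IncomingQueueSize. The function names, the least-squares trend test, and the 5% tolerance are assumptions chosen for illustration, not Refinery behavior.

```python
def fitted_growth(lengths):
    """Estimated change in queue length across the window, from a least-squares line."""
    n = len(lengths)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(lengths) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(lengths))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance * (n - 1)


def classify_queue(lengths, max_size, tolerance=0.05):
    """Classify queue length samples (oldest first) against the maximum queue size."""
    if not lengths:
        return "no data"
    if max(lengths) >= max_size:
        return "at maximum"          # queue filled up; data is likely being dropped
    if fitted_growth(lengths) > tolerance * max_size:
        return "rising trend"        # Refinery is gradually falling behind
    if max(lengths) - lengths[-1] > tolerance * max_size:
        return "temporary increase"  # spike that has already drained
    return "steady"
```

A fitted line is used instead of comparing the first and last samples so that a single spike in the middle of the window is not mistaken for a trend.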
Scaling Guidance
Use these steps to decide how to adjust queues, CPU, and cluster size for optimal performance.
- Memory check: If memory_inuse is within 80% of allocated memory, try increasing IncomingQueueSize to absorb load.
- Queue size limitation: Increasing queue size delays failure but does not increase overall throughput.
- CPU scaling: To increase throughput, identify whether CPU is the bottleneck and scale CPU resources accordingly.
- Horizontal scaling: Add instances only if vertical scaling is insufficient.
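The guidance above can be sketched as a small decision helper. This is an illustration only: the Observation fields and the suggest_steps function are hypothetical names, and "within 80%" is read here as memory_inuse being at or below 80% of allocated memory.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    memory_inuse: float       # value of the memory_inuse metric, in bytes
    allocated_memory: float   # memory allocated to Refinery, in bytes
    falling_behind: bool      # queue length rising or incoming_router_dropped above 0
    cpu_saturated: bool       # Refinery is using all of its allocated CPU
    can_add_cpu: bool         # the host can still be given more CPU


def suggest_steps(obs, memory_ceiling=0.8):
    """Return suggestions in the order the guidance lists them."""
    if not obs.falling_behind:
        return []
    steps = []
    # Assumption: "within 80%" means memory_inuse <= 80% of allocated memory.
    if obs.memory_inuse <= memory_ceiling * obs.allocated_memory:
        steps.append("increase IncomingQueueSize (delays failure, does not add throughput)")
    if obs.cpu_saturated and obs.can_add_cpu:
        steps.append("add CPU to existing hosts")
    else:
        # Either CPU is not the limit on this host, or vertical scaling is exhausted.
        steps.append("add Refinery instances")
    return steps
```

The helper always pairs a queue size increase with a throughput change, because a larger queue alone only postpones the point at which data is dropped.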
Configuring PeerQueueSize
The PeerQueueSize value sets the maximum number of spans that can be received from peer Refinery hosts and queued for sampling.
Apply the same scaling strategy as IncomingQueueSize, but note that adding instances to reduce peer queue length has diminishing returns: more peers increase overall cluster communication overhead, reinforcing the preference for vertical scaling.
Configuring AvailableMemory
The AvailableMemory value sets the maximum amount of RAM that Refinery can use for processing and queues.
Setting Initial Memory Values
Set memory values to ensure Refinery has enough headroom for normal operation.
- Set AvailableMemory to roughly 85% of total system memory.
- Set MaxMemoryPercentage to 75, indicating that Refinery can use up to 75% of AvailableMemory.
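The arithmetic behind these two settings can be sketched as follows. The function name and its parameters are illustrative; only the 85% and 75 figures come from the recommendations above.

```python
def memory_settings(total_system_gb, available_fraction=0.85, max_memory_percentage=75):
    """Return (AvailableMemory, usable memory) in GB for a host of the given size."""
    available_gb = total_system_gb * available_fraction
    usable_gb = available_gb * max_memory_percentage / 100
    return available_gb, usable_gb


# Example: an 8 GB host.
available, usable = memory_settings(8)
print(f"AvailableMemory: {available:.2f} GB, usable: {usable:.2f} GB")
# prints "AvailableMemory: 6.80 GB, usable: 5.10 GB"
```

The usable figure is what Refinery can consume before it approaches its own limit; the remainder is headroom for the operating system and for short-lived spikes.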
For a 4GB system, set AvailableMemory to ~3.4GB and MaxMemoryPercentage to 75% (~2.5GB usable).
Tuning Memory for Stability
Adjust memory allocations to prevent restarts and handle peak load safely. Monitor process_uptime_seconds for unexpected restarts.
If Refinery restarts due to Out-of-Memory exceptions or the host’s Out-of-Memory Killer, either increase the memory made available to the Refinery host or reduce MaxMemoryPercentage to provide more headroom.