Refinery offers a range of configuration options to help operators tune it for varying volumes and shapes of telemetry data.
After your initial setup, we recommend increasing RAM and CPU cores as needed. Use the guidance on this page for scaling, and consult our troubleshooting documentation for additional support.
Refinery includes a built-in mechanism called Stress Relief that activates when the system is under heavy load.
Frequent or prolonged activations indicate that Refinery is under-provisioned for the current load.
You can monitor this via the stress_relief_activated field in Refinery internal metrics.
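One way to judge whether activations are "frequent or prolonged" is to summarize exported samples of stress_relief_activated. The following is a minimal sketch, not part of Refinery. It assumes the metric has already been pulled from your metrics backend as time-ordered (timestamp, value) pairs, with a value of 1 while Stress Relief is active and 0 otherwise. The function name summarize_activations is hypothetical.

```python
from typing import List, Optional, Tuple


def summarize_activations(samples: List[Tuple[float, int]]) -> Tuple[int, float]:
    """Return (activation count, longest activation in seconds).

    Assumption: `samples` is a time-ordered list of (unix_timestamp, value)
    pairs, where value is 1 while Stress Relief is active and 0 otherwise.
    """
    count = 0
    longest = 0.0
    started_at: Optional[float] = None  # start time of the activation in progress
    last_ts: Optional[float] = None

    for ts, value in samples:
        if value and started_at is None:
            # Transition from inactive to active: a new activation begins.
            started_at = ts
            count += 1
        elif not value and started_at is not None:
            # Transition from active to inactive: close out the activation.
            longest = max(longest, ts - started_at)
            started_at = None
        last_ts = ts

    # An activation still running at the end of the window is measured
    # up to the last sample seen.
    if started_at is not None and last_ts is not None:
        longest = max(longest, last_ts - started_at)

    return count, longest
```

Which count or duration should be treated as a provisioning problem depends on your traffic; this page does not define a threshold.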
To determine which resources need to be increased, check the activation reasons in Refinery logs:
Look for the StressRelief has been activated log message.
Check the .reason field in the log message to understand what triggered the activation.
For example, a reason of MaxAlloc indicates a sudden memory usage spike.
Scaling Refinery effectively involves choosing the right balance between vertical and horizontal scaling.
We recommend prioritizing vertical scaling (adding resources to existing nodes) over horizontal scaling (adding more nodes) whenever possible.
Focus on ensuring that fewer nodes can handle your peak load before you consider adding instances.
Queues control how spans are buffered before sampling. Proper queue configuration ensures that Refinery can handle peak load efficiently.
IncomingQueueSize
The IncomingQueueSize value sets the maximum number of spans that a Refinery host can receive and queue for sampling.
Monitor the current queue size using the collector_incoming_queue_length metric and watch for incoming_router_dropped values above 0.
Understand what queue behavior tells you about Refinery’s ability to handle incoming traffic.
Use these steps to decide how to adjust queues, CPU, and cluster size for optimal performance.
If memory_inuse is within 80% of allocated memory, try increasing IncomingQueueSize to absorb load.
PeerQueueSize
The PeerQueueSize value sets the maximum number of spans that can be received from peer Refinery hosts and queued for sampling.
Apply the same scaling strategy as for IncomingQueueSize, but note that adding instances to reduce peer queue length has diminishing returns: more peers increase overall cluster communication overhead, reinforcing the preference for vertical scaling.
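The queue guidance above can be sketched as a small decision helper. This is an illustration only: the function queue_advice and its return strings are hypothetical, it reads "within 80% of allocated memory" as "at or below 80%", and the final branch is a placeholder added for this sketch rather than a step taken from this page.

```python
def queue_advice(
    dropped: int,
    memory_inuse: float,
    allocated_memory: float,
    setting: str = "IncomingQueueSize",
) -> str:
    """Suggest a next step from a dropped-span count and memory usage.

    dropped          -- e.g. the incoming_router_dropped value for the period
    memory_inuse     -- current memory in use, in the same unit as allocated_memory
    allocated_memory -- memory allocated to Refinery
    setting          -- the queue setting being tuned (pass "PeerQueueSize"
                        to apply the same strategy to the peer queue)
    """
    if dropped <= 0:
        # Nothing is being dropped, so the queue is keeping up.
        return "ok"
    if memory_inuse <= 0.8 * allocated_memory:
        # Spans are being dropped but there is memory headroom:
        # a larger queue can absorb the load.
        return f"increase {setting}"
    # Assumption for this sketch: without memory headroom, a larger queue
    # is not the first thing to try.
    return "memory-constrained: add memory or CPU before growing the queue"
```

For example, queue_advice(12, 6.0, 10.0) suggests increasing IncomingQueueSize, while queue_advice(12, 9.5, 10.0) falls through to the memory-constrained branch.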
AvailableMemory
The AvailableMemory value sets the maximum amount of RAM that Refinery can use for processing and queues.
Set memory values to ensure Refinery has enough headroom for normal operation.
Set AvailableMemory to roughly 85% of total system memory.
Set MaxMemoryPercentage to 75, indicating that Refinery can use up to 75% of AvailableMemory.
For example, set AvailableMemory to ~3.4GB and MaxMemoryPercentage to 75% (~2.5GB usable).
Adjust memory allocations to prevent restarts and handle peak load safely.
Monitor process_uptime_seconds for unexpected restarts.
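A restart shows up as process_uptime_seconds dropping instead of increasing. A minimal sketch of that check follows, assuming the metric has been exported as a time-ordered list of readings for a single Refinery host. The helper name count_restarts is hypothetical.

```python
from typing import List


def count_restarts(uptime_samples: List[float]) -> int:
    """Count how often uptime went backwards between consecutive samples.

    Assumption: `uptime_samples` holds process_uptime_seconds readings for
    one host, in time order. Uptime only decreases when the process restarts.
    """
    restarts = 0
    for previous, current in zip(uptime_samples, uptime_samples[1:]):
        if current < previous:
            restarts += 1
    return restarts
```

A restart that happens and recovers between two samples can still be detected this way, as long as the new uptime is lower than the previous reading.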
If Refinery restarts due to Out-of-Memory exceptions or the host's Out-of-Memory Killer, either increase the memory made available to the Refinery host or reduce MaxMemoryPercentage to provide more headroom.
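The sizing arithmetic for AvailableMemory and MaxMemoryPercentage can be worked through with a short sketch. The function memory_budget is hypothetical, and the 4GB host is an assumed example chosen because 85% of 4GB matches the ~3.4GB figure above.

```python
from typing import Tuple


def memory_budget(
    total_memory_gb: float,
    available_fraction: float = 0.85,
    max_memory_percentage: float = 75,
) -> Tuple[float, float]:
    """Return (AvailableMemory, usable memory) in GB.

    available_fraction    -- share of system memory given to AvailableMemory
                             (roughly 85% per the guidance above)
    max_memory_percentage -- MaxMemoryPercentage, the share of AvailableMemory
                             that Refinery may use
    """
    available = total_memory_gb * available_fraction
    usable = available * max_memory_percentage / 100
    return available, usable


# Assumed 4GB host with the recommended defaults.
available, usable = memory_budget(4.0)
print(f"AvailableMemory ~{available:.1f}GB, usable ~{usable:.2f}GB")

# Lowering MaxMemoryPercentage leaves more headroom on the same host.
_, usable_lower = memory_budget(4.0, max_memory_percentage=65)
print(f"usable at MaxMemoryPercentage 65: ~{usable_lower:.2f}GB")
```

On the assumed 4GB host this gives about 3.4GB for AvailableMemory and about 2.55GB usable, in line with the ~2.5GB figure above. Dropping MaxMemoryPercentage to 65 lowers the usable amount to about 2.21GB.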