
Refinery Configuration

Update the fields in config.toml to customize your configuration. The default configuration at installation contains the minimum configuration needed to run Refinery.

Supported sampling methods and their configuration are set in rules.toml.

When running Refinery within Docker, be sure to mount the directory containing configuration and rules files. This is because the configuration component, Viper, monitors the directory containing the files, not the files themselves.

Default Configuration 

The default Refinery configuration uses a hardcoded peer list for file-based peer management. It uses the DeterministicSampler Sampling Method and a SampleRate of 1, meaning that no traffic will be dropped.

For the full set of default fields, see the full configuration file on GitHub.

See GitHub for an example configuration file.

General Configuration 

Use the fields below to customize your configuration file.

ListenAddr
The IP and port on which to listen for incoming events. Incoming traffic is expected to be HTTP. If using SSL, put something like nginx in front to do the decryption. Should be in 0.0.0.0:8080 form. HTTP endpoints support both Honeycomb JSON and OpenTelemetry OTLP binary formatted data.
GRPCListenAddr
OPTIONAL The IP and port on which to listen for incoming events over gRPC. A gRPC server will only be started if a non-empty value is provided. Incoming traffic is expected to be unencrypted. If using SSL, put something like nginx in front to do the decryption. Should be in 0.0.0.0:9090 form. Refinery can be configured to receive OpenTelemetry OTLP traffic over gRPC with GRPCListenAddr. If the environment variable REFINERY_GRPC_LISTEN_ADDRESS is set, REFINERY_GRPC_LISTEN_ADDRESS takes precedence and this GRPCListenAddr value is ignored.
PeerListenAddr
The IP and port on which to listen for traffic being rerouted from a peer. Peer traffic is expected to be HTTP. If using SSL, put something like nginx in front to do the decryption. Must be different from the ListenAddr setting. Should be in 0.0.0.0:8081 form.
CompressPeerCommunication
Determines whether Refinery will compress span data it forwards to peers. If it costs money to transmit data between Refinery instances, such as when instances are spread across AWS availability zones, then you almost certainly want compression enabled to reduce your bill. The option to disable compression is provided as an escape hatch for deployments that value lower CPU utilization over data transfer costs.
APIKeys
A list of Honeycomb API keys that the proxy will accept. This list only applies to events as other Honeycomb API actions will fall through to the upstream API directly. Adding keys here causes events arriving with API keys not in this list to be rejected with an HTTP 401 error. If an API key that is a literal ‘*’ is in the list, all API keys are accepted.
HoneycombAPI
The URL for the upstream Honeycomb API.
SendDelay
A short timer that will be triggered when a trace is complete. Refinery will wait this duration before actually sending the trace. The reason for this short delay is to allow for small network delays or clock jitters to elapse and any final spans to arrive before sending the trace. This supports duration strings with supplied units. Set to 0 for immediate sends.
BatchTimeout
BatchTimeout defines how often unfilled batches are sent. Default is the value of DefaultBatchTimeout in libhoney (100ms). Eligible for live reload.
TraceTimeout
A long timer that represents the outside boundary of how long to wait before sending an incomplete trace. Normally traces are sent when the root span arrives. Sometimes the root span never arrives (due to crashes, for example), and this timer will send a trace even without having received the root span. If you have particularly long-lived traces, you should increase this timer. This supports duration strings with supplied units. It should be set higher than the duration of the longest expected trace, perhaps double. If all of your traces complete in under 10 seconds, 30 seconds is a good value here. If you have traces that can last minutes, it should be raised accordingly. Note that the trace does not have to complete before this timer expires, but the sampling decision will be made at that time, so any spans that contain fields that you want to use to compute the sample rate should arrive before this timer expires. Additional spans that arrive after the timer has expired will be sent or dropped according to the sampling decision made when the timer expired.
MaxBatchSize
The number of events to be included in the batch to send. The default value is 500.
SendTicker
A short timer that determines how often Refinery checks for traces to send.
LoggingLevel
The minimum level at which Refinery logs. “debug” is very verbose and should only be used in pre-production environments. “error” is the recommended level. Valid options are “debug”, “info”, “error”, and “panic”.
UpstreamBufferSize and PeerBufferSize
Control the size of the event queues used to buffer events that will be forwarded to the upstream API (UpstreamBufferSize) or to peers (PeerBufferSize).
DebugServiceAddr
Sets the IP and port the debug service will run on. The debug service will only run if the command line flag -d is specified. The debug service runs on the first open port between localhost:6060 and localhost:6069 by default.
AddHostMetadataToTrace
Determines whether to add information about the host that Refinery is running on to the spans that it processes. If enabled, host metadata will be added to each span using field names prefixed with meta.refinery. (for example, meta.refinery.local_hostname).
EnvironmentCacheTTL
The amount of time a cache entry will live that associates an API key with an environment name. Cache misses look up the environment name using the HoneycombAPI config value. Default is 1 hour (1h). Not eligible for live reload.

The EnvironmentCacheTTL configuration option is not valid for Honeycomb Classic.

AddRuleReasonToTrace
Causes traces that are sent to Honeycomb to include the field meta.refinery.reason. This field contains text indicating which rule was evaluated that caused the trace to be included. Spans arriving after the trace’s sampling decision has already been made will have their meta.refinery.reason set to late before sending to Honeycomb. Default is false. Eligible for live reload.
AdditionalErrorFields
A list of span fields to include when logging errors that happen during ingestion of events (for example, the span too large error). Used to track down misbehaving senders in a large installation. The fields dataset, apihost, and environment are always included. Fields not present in the span do not appear in the error log. Default is [“trace.span_id”]. Eligible for live reload.
AddSpanCountToRoot
Adds a new metadata field, meta.span_count, to root spans, which indicates the number of child spans on the trace at the time that the sampling decision was made. This value is available to the rules-based sampler, making it possible to write rules that are dependent upon the number of spans in the trace. Default is false. Eligible for live reload.
CacheOverrunStrategy
Controls the cache management behavior under memory pressure. Setting CacheOverrunStrategy to resize means that when a cache overrun occurs, the cache shrinks and never grows again, which is generally not helpful unless occurring because of a permanent change in traffic patterns. Setting CacheOverrunStrategy to impact means that the items having the most impact on the cache size are ejected from the cache earlier than normal, but the cache is not resized. In both cases, CacheOverrunStrategy only applies if MaxAlloc is nonzero. Default is resize for backwards compatibility but impact is recommended for most installations. Eligible for live reload.
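
As a rough illustration of how several of the fields above fit together, here is a minimal sketch of the top-level portion of config.toml. The values are illustrative assumptions rather than recommendations (the SendDelay value in particular is hypothetical), and the HoneycombAPI URL assumes the standard Honeycomb endpoint.

```toml
# Illustrative values only; adjust for your deployment.
ListenAddr = "0.0.0.0:8080"         # HTTP listener for incoming events
GRPCListenAddr = "0.0.0.0:9090"     # optional; leave unset to skip the gRPC server
PeerListenAddr = "0.0.0.0:8081"     # must differ from ListenAddr
CompressPeerCommunication = true    # reduces data transfer between instances
APIKeys = ["*"]                     # a literal "*" accepts all API keys
HoneycombAPI = "https://api.honeycomb.io"  # assumed upstream URL
SendDelay = "2s"                    # hypothetical value; duration string with units
TraceTimeout = "30s"                # suits traces that complete in under 10 seconds
MaxBatchSize = 500                  # the documented default
LoggingLevel = "error"              # the recommended level
AddRuleReasonToTrace = true
AddSpanCountToRoot = true
CacheOverrunStrategy = "impact"     # only applies if MaxAlloc is nonzero
AdditionalErrorFields = ["trace.span_id"]
```

Top-level keys such as these must appear before any bracketed section header in a TOML file; otherwise they are parsed as part of that section.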

Field Name Configuration 

The names that Refinery uses for trace ID and parent span ID are configurable. This can be helpful if you are using a tracing system with a non-standard naming scheme for these fields.

By default, Refinery recognizes the following field names in incoming data:

  • trace.trace_id
  • trace.parent_id
  • traceId
  • parentId

You can add field names to the list by adding them to the TraceIdFieldNames and ParentIdFieldNames lists in the configuration file:

# Custom field names for trace ID
TraceIdFieldNames = [
  "trace.my_trace_id",
  "trace_id"
]
# Custom field names for parent span ID
ParentIdFieldNames = [
  "trace.my_parent_id",
  "parent_id"
]

Sample Cache 

Sample Cache Configuration controls the sample cache used to retain information about trace status after the sampling decision has been made.

Sample Cache Types 

legacy: “legacy” is a strategy where both keep and drop decisions are stored in a circular buffer that is 5x the size of the trace cache. This is Refinery’s original sample cache strategy. It is the default. Not eligible for live reload (you cannot change the type of cache with reload).

cuckoo: “cuckoo” is a strategy where dropped traces are preserved in a “Cuckoo Filter”, which can remember a much larger number of dropped traces, leaving capacity to retain a much larger number of kept traces. It is also more configurable (see below). The cuckoo filter is recommended for most installations. Not eligible for live reload as you cannot change the type of cache with reload.

Sample Cache Configurations 

KeptSize: Controls the number of traces preserved in the kept traces cache. Refinery keeps a record of each trace that was kept and sent to Honeycomb, along with some statistical information. This is most useful in cases where the trace was sent before sending the root span, so that the root span can be decorated with accurate metadata. Default is 10_000 traces (each trace in this cache consumes roughly 200 bytes). Does not apply to the “legacy” type of cache. Eligible for live reload.

DroppedSize: Controls the size of the cuckoo dropped traces cache. This cache consumes 4-6 bytes per trace at a scale of millions of traces. Changing its size with live reload sets a future limit, but does not have an immediate effect. Default is 1_000_000 traces. Does not apply to the “legacy” type of cache. Eligible for live reload.
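
A sample cache configuration using the cuckoo strategy might look like the sketch below. The section name [SampleCache] and the key name Type are assumptions, since the text above does not spell them out; check the full configuration file on GitHub for the exact names. The sizes shown are the documented defaults.

```toml
# Section and "Type" key names are assumptions; verify against the full config file.
[SampleCache]
Type = "cuckoo"          # cannot be changed with a live reload
KeptSize = 10_000        # roughly 200 bytes per kept trace
DroppedSize = 1_000_000  # roughly 4-6 bytes per dropped trace
```

At these defaults, the kept cache would use on the order of 2 MB (10,000 × 200 bytes) and the dropped cache roughly 4 to 6 MB.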

Stress Relief 

Controls the parameters of the stress relief system. There is a metric called stress_level that is emitted as part of Refinery metrics. It is a measure of Refinery’s throughput rate relative to its processing rate, combined with the amount of room in its internal queues, and ranges from 0 to 100. It is generally expected to be low except under heavy load. When stress levels reach 100, there is an increased chance that Refinery will become unstable.

To avoid this problem, the Stress Relief system can do deterministic sampling on new trace traffic based solely on TraceID, without having to store traces in the cache or take the time to process sampling rules. Existing traces in flight will be processed normally, but when Stress Relief is active, trace decisions are made deterministically on a per-span basis; all spans will be sampled according to the StressSamplingRate specified here.

Once Stress Relief activates (by exceeding the ActivationLevel), it will not deactivate until stress_level falls below the DeactivationLevel. When it deactivates, normal trace decisions are made – and any additional spans that arrive for traces that were active during Stress Relief will respect those decisions.

The measurement of stress is a lagging indicator and is highly dependent on Refinery configuration and scaling. Other configuration values should be well tuned first, before adjusting the Stress Relief Activation parameters.

Stress Relief Configuration 

Mode: a string indicating how to use Stress Relief. "never" means that Stress Relief will never activate. "monitor" is the recommended setting, and means that Stress Relief will monitor the status of Refinery and activate according to the levels set below. "always" means that Stress Relief is always on, which may be useful in an emergency situation. Default is "never". Eligible for live reload.

ActivationLevel: The stress_level (from 0-100) at which Stress Relief is triggered. Default value is 75. Eligible for live reload.

DeactivationLevel: The stress_level (from 0-100) at which Stress Relief is turned off (subject to MinimumActivationDuration). Under normal circumstances, it should be well below ActivationLevel to avoid oscillations. Default value is 25. Eligible for live reload.

StressSamplingRate: The sampling rate to use when Stress Relief is activated. All new traces will be deterministically sampled at this rate based only on the traceID. Default value is 100. Eligible for live reload.

MinimumActivationDuration: The minimum time that stress relief will stay enabled, once activated. This prevents oscillations. Default value is 10s. Eligible for live reload.

MinimumStartupDuration: Used when switching into Monitor mode. When stress monitoring is enabled, it will start up in stressed mode for at least this amount of time to try to make sure that Refinery can handle the load before it begins processing it in earnest. This is to help address the problem of trying to bring a new node into an already-overloaded cluster. If this duration is 0, Refinery will not start in stressed mode. This can provide faster startup at the possible cost of startup instability. Default value is “3s”.
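
Putting the Stress Relief fields together, a configuration in the recommended monitor mode could be sketched as follows. The section name [StressRelief] is an assumption; every value other than Mode is the documented default.

```toml
# Section name is an assumption; verify against the full config file.
[StressRelief]
Mode = "monitor"                   # recommended; the default is "never"
ActivationLevel = 75               # stress_level at which Stress Relief triggers
DeactivationLevel = 25             # keep well below ActivationLevel to avoid oscillation
StressSamplingRate = 100           # sampling rate applied to new traces while active
MinimumActivationDuration = "10s"
MinimumStartupDuration = "3s"      # "0" would skip starting in stressed mode
```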

Peer Management 

For proper data distribution, each Refinery process needs to know how to identify and communicate with its peers, the other Refinery processes participating in the cluster. The list of peer identifiers can be referenced dynamically through Redis (Redis-based peer management, recommended) or set explicitly in a hard-coded list in the config file (file-based peer management).

All of the peer management options are set within the [PeerManagement] section of the Refinery config file.

General Peer Management Configuration 

Strategy
Strategy controls the way that traces are assigned to Refinery nodes. When the number of nodes Refinery uses changes, Refinery may need to distribute traces differently. This can impact Refinery’s throughput when sampling. Setting Strategy to "hash" is highly recommended. With the "hash" strategy, only 1/N traces (where N is the number of nodes) get redistributed. The “legacy” strategy, which is the default, uses a simple algorithm that causes 1/2 of the in-flight traces to be assigned to a different node whenever the number of nodes changes. The legacy strategy is deprecated and is intended to be removed in a future release. Not eligible for live reload.

Redis-Based Peer Management 

Configuring Refinery for peer management with Redis requires more configuration information than the default file-based peer management, but is recommended so that as a Refinery cluster scales up with new instances, existing instances learn of their new peers without further intervention.

Refinery needs to know the Redis hostname and port, which can be specified in one of two ways:

  1. set the REFINERY_REDIS_HOST environment variable or
  2. set the RedisHost field in the config file

Similarly, a password for Redis can be specified:

  1. set the REFINERY_REDIS_PASSWORD environment variable or
  2. set the RedisPassword field in the config file

To customize Redis-based Peer Management for your environment, the following fields can be set under the [PeerManagement] section of config.toml:

Type
Set to redis to use Redis for managing the peer registry.
RedisHost
Used to connect to Redis for peer cluster membership management. If the environment variable REFINERY_REDIS_HOST is set, REFINERY_REDIS_HOST takes precedence and this RedisHost value is ignored. Not eligible for live reload. The Redis host should be a hostname and a port. For example: redis.mydomain.com:6379. The example config file has localhost:6379, which will not work with more than one host.
RedisUsername
RedisUsername is the username used to connect to Redis for peer cluster membership management. If the environment variable REFINERY_REDIS_USERNAME is set, REFINERY_REDIS_USERNAME takes precedence and this RedisUsername value is ignored. Not eligible for live reload.
RedisPassword
The password used to connect to Redis for peer cluster membership management. If the environment variable REFINERY_REDIS_PASSWORD is set, REFINERY_REDIS_PASSWORD takes precedence and this RedisPassword value is ignored. Not eligible for live reload.
UseTLS
Enables TLS when connecting to Redis for peer cluster membership management, and sets the MinVersion to 1.2. Not eligible for live reload.
IdentifierInterfaceName
OPTIONAL The name of the network interface to bind to for peer communications. By default, a Refinery instance will register itself in Redis using its local hostname as its identifier for peer communications. In environments where domain name resolution is slow or unreliable, override the reliance on name lookups by specifying the name of the peering network interface in this IdentifierInterfaceName field. Refinery will use the first available unicast address on the given interface as its peering identifier to register in Redis. The unicast address will be IPv4 by default or IPv6 if UseIPV6Identifier is set to true.
UseIPV6Identifier
OPTIONAL Set to true if the peering network is IPv6 and IdentifierInterfaceName is set. Refinery will use the first IPv6 unicast address found instead of IPv4.
RedisIdentifier
OPTIONAL Explicitly set the peering identifier with which a Refinery instance will register itself in Redis. Overrides any automatic use of local hostname and any unicast addresses determined through the use of IdentifierInterfaceName.
Timeout
OPTIONAL The duration after which Refinery will time out when communicating with Redis. Default is “5s”.
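
The Redis-based fields above could be combined as in the following sketch. The section header is written as [PeerManagement], matching the PeerManagement.RedisHost form used for the environment variable mappings. The hostname comes from the example above; the interface name eth0 is a hypothetical placeholder.

```toml
[PeerManagement]
Type = "redis"
Strategy = "hash"                        # recommended over the default "legacy"
RedisHost = "redis.mydomain.com:6379"    # ignored if REFINERY_REDIS_HOST is set
UseTLS = true
Timeout = "5s"                           # the documented default

# Optional: register by interface address instead of local hostname.
# "eth0" is a hypothetical interface name.
# IdentifierInterfaceName = "eth0"
# UseIPV6Identifier = false
```

Credentials are left out of the sketch; RedisUsername and RedisPassword can be set in the file, or supplied through REFINERY_REDIS_USERNAME and REFINERY_REDIS_PASSWORD, which take precedence.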

File-Based Peer Management 

File-based peer management is the default behavior. This option is not recommended if you expect to add Refinery instances, because updating the configuration files whenever the peer list changes is an intensive process.

To use file-based Peer Management, configure the following fields in the [PeerManagement] section of config.toml:

Type
Set to file to use the Refinery configuration file to list Refinery peers.
Peers
The list of all servers participating in this proxy cluster. Events will be sharded evenly across all peers based on the Trace ID. Values here should be the base URL used to access the peer, and should include scheme, hostname (or ip address), and port. All servers in the cluster should be in this list, including this host.
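
A two-node file-based setup might be sketched like this. The section header is written as [PeerManagement], matching the PeerManagement.RedisHost form used for the environment variable mappings. The hostnames are hypothetical, and the port assumes each peer's PeerListenAddr uses the 8081 form shown earlier.

```toml
[PeerManagement]
Type = "file"
Strategy = "hash"
# List every server in the cluster, including this host.
# Each entry includes scheme, hostname (or IP address), and port.
Peers = [
  "http://refinery-1.example.com:8081",
  "http://refinery-2.example.com:8081"
]
```

The same Peers list would need to be present in the configuration file of every instance in the cluster.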

Environment Variables 

Refinery supports the following environment variables. Environment variables take precedence over file configuration.

Environment Variable                                               Configuration Field
REFINERY_GRPC_LISTEN_ADDRESS                                       GRPCListenAddr
REFINERY_REDIS_HOST                                                PeerManagement.RedisHost
REFINERY_REDIS_USERNAME                                            PeerManagement.RedisUsername
REFINERY_REDIS_PASSWORD                                            PeerManagement.RedisPassword
REFINERY_HONEYCOMB_API_KEY                                         HoneycombLogger.LoggerAPIKey
REFINERY_HONEYCOMB_METRICS_API_KEY or REFINERY_HONEYCOMB_API_KEY   HoneycombMetrics.MetricsAPIKey
REFINERY_QUERY_AUTH_TOKEN                                          QueryAuthToken

REFINERY_HONEYCOMB_METRICS_API_KEY takes precedence over REFINERY_HONEYCOMB_API_KEY for the HoneycombMetrics.MetricsAPIKey configuration.

Implementation Choices 

There are a few components of Refinery with multiple implementations; the config file lets you choose your desired implementation. For example, there are two logging implementations: one that uses logrus and sends logs to STDOUT, and a honeycomb implementation that sends the log messages to a Honeycomb dataset instead.

Components with multiple implementations have one top level config item that lets you choose which implementation to use and then a section further down with additional config options for that choice. For example, the Honeycomb logger requires an API key.

Changing implementation choices requires a process restart; these changes will not be picked up by a live configuration reload. (Individual configuration options for a given implementation may be eligible for live reload).

Collector 

Collector describes which collector to use for collecting traces. The only current valid option is InMemCollector. More can be added by adding implementations of the Collector interface. Use the fields below to modify your Collector settings.

CacheCapacity
The cache is used to collect all spans into a trace. Its capacity is configured via the CacheCapacity value. For guidance on how to best configure the CacheCapacity value, please refer to the Scale and Troubleshoot documentation. In addition, a cache remembers the sampling decision for any spans that might come in after the trace has been marked “complete” (either by timing out or seeing the root span); that capacity will be 5x this value. This setting is eligible for live reload; growing the cache capacity with a live config reload is fine. Avoid shrinking it with a live reload (you can, but it may cause temporary odd sampling decisions). If the cache capacity is too low, the collect_cache_buffer_overrun metric will increment. If this indicator occurs, you should increase the CacheCapacity value.
MaxAlloc
An optional field that, if set, must be an integer >= 0. 64-bit values are supported. If set to a non-zero value, once per tick (see SendTicker) the collector will compare total allocated bytes to this value. If allocation is too high, cache capacity will be reduced and an error will be logged. Useful values for this setting are generally in the range of 75%-90% of available system memory.
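
A collector configuration could be sketched as below. The section name [InMemCollector] is an assumption based on the pattern of a top-level choice followed by a section of options, and both values are illustrative: the MaxAlloc figure is roughly 80% of a hypothetical host with 4 GiB of memory.

```toml
Collector = "InMemCollector"   # the only currently valid option

# Section name is an assumption; verify against the full config file.
[InMemCollector]
CacheCapacity = 1000           # illustrative; raise if collect_cache_buffer_overrun increments
MaxAlloc = 3_400_000_000       # bytes; about 80% of a hypothetical 4 GiB host
```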

Logger 

Logger describes which logger to use for Refinery logs. Valid options are logrus and honeycomb. Set where log events go in this section. Use honeycomb to send logs to the Honeycomb API. Use logrus to send logs to STDOUT.

Honeycomb Logger 

LoggerHoneycombAPI
The URL for the upstream Honeycomb API. Eligible for live reload.
LoggerAPIKey
The API key to use to send log events to the Honeycomb logging dataset. This is separate from the APIKeys used to authenticate regular traffic. Eligible for live reload.
LoggerDataset
The name of the dataset to which to send Refinery logs. Eligible for live reload.
LoggerSamplerEnabled
Enables a dynamic sampler for log messages. This will sample log messages based on a [log level:message] key on a per-second throughput basis. Not eligible for live reload.
LoggerSamplerThroughput
The per second throughput for each unique log message, when the logger sampler is enabled. Not eligible for live reload.
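
Selecting and configuring the Honeycomb logger might look like the following sketch. The section name follows the HoneycombLogger.LoggerAPIKey form used for the environment variable mappings; the API URL, key placeholder, dataset name, and throughput value are all illustrative assumptions.

```toml
Logger = "honeycomb"   # use "logrus" to send logs to STDOUT instead

[HoneycombLogger]
LoggerHoneycombAPI = "https://api.honeycomb.io"  # assumed upstream URL
LoggerAPIKey = "YOUR_API_KEY"                    # placeholder; REFINERY_HONEYCOMB_API_KEY takes precedence
LoggerDataset = "refinery-logs"                  # hypothetical dataset name
LoggerSamplerEnabled = true
LoggerSamplerThroughput = 10                     # hypothetical per-second rate per unique message
```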

Logrus Logger 

There are no configurable options for the logrus logger yet.

Metrics 

Metrics describes which service to use for Refinery metrics. Valid options are prometheus and honeycomb. The prometheus option starts a listener that will reply to a request for /metrics. The honeycomb option will send summary metrics to the Honeycomb dataset you specify.

Honeycomb Metrics 

Refinery emits metrics as a Honeycomb event containing all values at each reporting interval. This configuration does not send OTLP Metrics to Honeycomb.

MetricsHoneycombAPI
The URL for the upstream Honeycomb API. Eligible for live reload.
MetricsAPIKey
The API key used to send metrics events to the Honeycomb metrics dataset. This API key is separate from the APIKeys used to authenticate regular traffic. Eligible for live reload.
MetricsDataset
The name of the dataset to which to send Refinery metrics events. Eligible for live reload.
MetricsReportingInterval
The frequency (in seconds) to send metrics events to Honeycomb. Between 1 and 60 is recommended. The default is 1s, but 10s is reasonable. Not eligible for live reload.
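
A sketch of the Honeycomb metrics option follows. The section name follows the HoneycombMetrics.MetricsAPIKey form used for the environment variable mappings. The API URL, key placeholder, and dataset name are illustrative, and writing MetricsReportingInterval as a plain number of seconds is an assumption based on the "in seconds" wording above.

```toml
Metrics = "honeycomb"

[HoneycombMetrics]
MetricsHoneycombAPI = "https://api.honeycomb.io"  # assumed upstream URL
MetricsAPIKey = "YOUR_API_KEY"                    # placeholder; can be supplied via environment variable
MetricsDataset = "refinery-metrics"               # hypothetical dataset name
MetricsReportingInterval = 10                     # assumed to be seconds; 1 to 60 recommended
```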

Prometheus Metrics 

MetricsListenAddr
Determines the interface and port on which Refinery will listen for Prometheus requests for /metrics. Must be different from the main Refinery listener. Not eligible for live reload.
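
For the Prometheus option, a minimal sketch might be the following. The section name [PrometheusMetrics] is an assumption, and the address is an illustrative value chosen only so that it differs from the ListenAddr and PeerListenAddr examples.

```toml
Metrics = "prometheus"

# Section name is an assumption; verify against the full config file.
[PrometheusMetrics]
MetricsListenAddr = "localhost:2112"   # illustrative; must differ from the main listener
```

A Prometheus server would then scrape /metrics at that address.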

Here ends the list of the general Refinery configuration options. Remember to customize your Sampling Methods configuration to complete your Refinery set-up.
