Managing Metrics Event Volume

Several factors affect the exact number of events created by your metric data: the number of metrics captured, the capture interval, and the number and cardinality of the labels you apply to your metrics. Because metrics are captured at regular intervals, you can expect metric event volume to be predictable over time, and it can be controlled.

To control event volume, you can change:

  • how OTLP metrics requests are grouped
  • the distinct attributes in any individual metrics request
  • the number of captured timeseries
  • the capture interval

In addition, Honeycomb automatically compacts events for some metrics based on their data point attributes, including system.cpu.time.

Events Generated by Each Metric Capture 

Every metric data point is associated with a resource, which represents the system it describes, and any number of attributes, which provide additional context about the meaning of that data point. Honeycomb stores these data points and all associated metadata (the resources and attributes) in events within our columnar data store.

Honeycomb will combine data points into the same event if:

  • they were received as part of the same OTLP request
  • their timestamps are equivalent when truncated to the second (we truncate metric timestamps to the second for improved compaction)
  • they have the same set of resource attribute keys and values
  • they have the same set of data point attribute keys and values (sometimes these are also called “tags” or “labels”)

See some examples of metric-to-event mapping.
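
As an illustrative sketch (hypothetical metric names and values): if data points for system.cpu.utilization and system.memory.utilization arrive in the same OTLP request, truncate to the same second, and carry identical resource and data point attributes, they land in a single event:

  - host.name: host1, system.cpu.utilization: 0.42, system.memory.utilization: 0.67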

Grouping OTLP Metrics Requests 

Any system that produces OpenTelemetry metrics sends repeated OTLP metrics requests. The more metrics contained in a single OTLP request to Honeycomb, the greater the opportunity Honeycomb has to combine their data points into the same set of events.

Requests can be grouped by time or size using OpenTelemetry Collector’s Batch Processor, which can be added to any preexisting OpenTelemetry Collector pipeline.
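
A minimal Batch Processor configuration might look like the following sketch (the timeout and batch size shown are illustrative values, not recommendations):

processors:
  batch:
    timeout: 5s           # flush a batch after at most 5 seconds
    send_batch_size: 8192 # or as soon as 8192 items have accumulated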

Requests can also be grouped across hosts by sending them through a single OpenTelemetry Collector before forwarding them to Honeycomb. (OpenTelemetry Collector can receive OTLP requests from other servers using its OTLP Receiver.)
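
Sketched end to end, such a pipeline might look like this (the endpoint and API key variable are assumptions to adapt to your environment):

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # accept OTLP metrics from other hosts

processors:
  batch: {}

exporters:
  otlp:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}   # your Honeycomb API key

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]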

Adjusting the Distinct Attributes in Any Individual Metrics Request 

For any metrics request, data points from distinct metrics can be combined into the same event if they share the same complete set of attributes (both keys and values) across all resources and data points. For this reason, it is generally good practice to share sets of attribute values across as many metrics as possible. For instance, if two distinct metrics are broken out by process.pid, their data points can share the same events. But if one metric has a process.pid attribute and the other does not, each data point will end up in a distinct event, as the sketch below illustrates.
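
For example (hypothetical metrics and values), if metric.a and metric.b are both broken out by process.pid, a capture produces events like:

  - process.pid: 1234, metric.a: 10, metric.b: 20

But if only metric.a carries process.pid, the same capture produces two events instead of one:

  - process.pid: 1234, metric.a: 10
  - metric.b: 20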

Resource attributes can be set or changed using the OpenTelemetry SDK, or by using the OpenTelemetry Collector Resource Processor. Labels can be set or changed using the OpenTelemetry SDK, or by using the OpenTelemetry Collector Metrics Transform Processor. Note that this processor lives in the “contrib” build of OpenTelemetry Collector.
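
As a sketch, the following collector configuration upserts a shared resource attribute and renames a label consistently across all metrics (the attribute names and values here are hypothetical):

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: production          # hypothetical value
        action: upsert
  metricstransform:
    transforms:
      - include: .*
        match_type: regexp
        action: update
        operations:
          - action: update_label
            label: pod             # hypothetical existing label
            new_label: k8s.pod.name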

Adjusting the Number of Captured Timeseries 

Metrics instrumentation can separate any individual metric (for example, http.server.active_requests) into any number of distinct timeseries, distinguished from one another by resource attributes (for example, host.name) or data point attributes (for example, http.method).

The larger the cardinality of any of these attributes, the more distinct timeseries the system will be capturing. (Cardinality is the number of distinct values that exist for any individual attribute. For example, if http.method is sometimes GET and sometimes POST, the cardinality of this attribute would be 2.)

Timeseries counts can grow combinatorially. For example, if a system had 100 distinct host.name values, 2 distinct http.method values, and 4 distinct http.host values, it could produce up to 100 × 2 × 4 = 800 distinct timeseries just for the http.server.active_requests metric. (And given that all of these would have distinct sets of attributes, Honeycomb would create 800 events at every capture interval for this metric.)

Here is an example of what this kind of combinatoric cardinality explosion can look like:

host.name  measurements (for http.server.active_requests, measured every 60s for 10 minutes)
---------  ---------------------------------------------------------------------------------
host1      46, 20, 36, 11, 38, 25,  5, 32, 57, 14
host2      16, 48,  1, 46, 29, 15, 53, 49, 33, 40

cardinality of host.name = 2
2 timeseries, generating 20 events over 10 minutes
at minute 1, your dataset would contain the following 2 events:
  - host.name: host1, http.server.active_requests: 46
  - host.name: host2, http.server.active_requests: 16

host.name  http.method  measurements (for http.server.active_requests, measured every 60s for 10 minutes)
---------  -----------  ---------------------------------------------------------------------------------
host1      GET           9,  4, 15,  6, 26, 11,  5,  4, 19,  9
host1      POST         37, 16, 21,  5, 12, 14,  0, 28, 38,  5
host2      GET          15, 33,  1, 45, 17,  6, 19, 12, 14, 19
host2      POST          1, 15,  0,  1, 12,  9, 34, 37, 19, 21

cardinality of host.name = 2
cardinality of http.method = 2
2 × 2 = 4 timeseries, generating 40 events over 10 minutes
at minute 1, your dataset would contain the following 4 events:
  - host.name: host1, http.method: GET,  http.server.active_requests: 9
  - host.name: host1, http.method: POST, http.server.active_requests: 37
  - host.name: host2, http.method: GET,  http.server.active_requests: 15
  - host.name: host2, http.method: POST, http.server.active_requests: 1

host.name  http.method  http.host  measurements (for http.server.active_requests, measured every 60s for 10 minutes)
---------  -----------  ---------  ---------------------------------------------------------------------------------
host1      GET          public      8,  2, 14,  5, 25,  9,  3,  3, 18,  8
host1      GET          internal    1,  2,  1,  1,  1,  2,  2,  1,  1,  1
host1      POST         public     37, 16, 20,  5, 11, 13,  0, 27, 37,  4
host1      POST         internal    0,  0,  1,  0,  1,  1,  0,  1,  1,  1
host2      GET          public     14, 31,  0, 44, 14,  5, 18, 11, 13, 18
host2      GET          internal    1,  2,  1,  1,  3,  1,  1,  1,  1,  1
host2      POST         public      1, 14,  0,  1, 11,  8, 33, 37, 19, 20
host2      POST         internal    0,  1,  0,  0,  1,  1,  1,  0,  0,  1

cardinality of host.name = 2
cardinality of http.method = 2
cardinality of http.host = 2
2 × 2 × 2 = 8 timeseries, generating 80 events over 10 minutes
at minute 1, your dataset would contain the following 8 events:
  - host.name: host1, http.method: GET,  http.host: public,   http.server.active_requests: 8
  - host.name: host1, http.method: GET,  http.host: internal, http.server.active_requests: 1
  - host.name: host1, http.method: POST, http.host: public,   http.server.active_requests: 37
  - host.name: host1, http.method: POST, http.host: internal, http.server.active_requests: 0
  - host.name: host2, http.method: GET,  http.host: public,   http.server.active_requests: 14
  - host.name: host2, http.method: GET,  http.host: internal, http.server.active_requests: 1
  - host.name: host2, http.method: POST, http.host: public,   http.server.active_requests: 1
  - host.name: host2, http.method: POST, http.host: internal, http.server.active_requests: 0

The set of captured timeseries can be adjusted using the OpenTelemetry SDK, or by using OpenTelemetry Collector’s Filter Processor or Metrics Transform Processor. Note that the Metrics Transform Processor lives in the “contrib” build of OpenTelemetry Collector.
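
As a sketch, the following configuration drops data points for a low-value timeseries and then sums away a high-cardinality label entirely (the metric, attribute, and value names are hypothetical, taken from the example above):

processors:
  filter:
    error_mode: ignore
    metrics:
      datapoint:
        - 'attributes["http.host"] == "internal"'   # drop these data points
  metricstransform:
    transforms:
      - include: http.server.active_requests
        action: update
        operations:
          - action: aggregate_labels
            label_set: [http.method]   # keep only http.method; sum over removed labels
            aggregation_type: sum

Applied to the example above, this would reduce the 8 timeseries to 4 (host.name is a resource attribute and is unaffected).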

Modifying Capture Interval 

Every metrics stream is configured with a capture interval, which determines how frequently individual data points are captured. Shorter capture intervals allow finer granularity in any timeseries graph; longer capture intervals generate proportionally fewer events. The capture interval can be modified directly at the point of capture, generally in the OpenTelemetry SDK or in an OpenTelemetry Collector receiver.
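
For instance, the collector’s Host Metrics Receiver exposes this setting as collection_interval (a sketch; the 60s value and scraper list are illustrative):

receivers:
  hostmetrics:
    collection_interval: 60s   # one data point per metric per minute
    scrapers:
      cpu:
      memory: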

Data Point Attribute Compaction 

As noted above, metric events normally include all data point attributes as key-value pairs. However, Honeycomb has found that certain standard attributes from the OpenTelemetry Semantic Conventions can be combined, or compacted, even when they are not identical, because only a small number of distinct values exist for these attributes. This compaction occurs automatically.

For example, the metric system.disk.io has an attribute called direction. The only two values of direction are read and write, so Honeycomb distributes these two values into a single event with two fields: system.disk.io.read and system.disk.io.write.
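
In effect (hypothetical values), two data points become one event:

  - host.name: host1, system.disk.io.read: 1048576, system.disk.io.write: 524288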

The full set of metric names and data point attributes that are distributed in this way is:

Metric Name                     Data Point Attribute Name
------------------------------  -------------------------
system.disk.io                  direction
system.filesystem.usage         state
system.processes.count          status
system.network.connections      protocol
system.network.dropped          direction
system.network.dropped_packets  direction
system.network.errors           direction
system.network.io               direction
k8s.node.network.errors         direction
k8s.node.network.io             direction
k8s.pod.network.errors          direction
k8s.pod.network.io              direction

Compaction of system.cpu.time 

There is one more metric that is treated specially: system.cpu.time.

This metric has two key data point attributes that are compacted automatically: state and logical_number. The state attribute is distributed as above, generating values like system.cpu.time.idle. In addition, the logical_number attribute, an indication of which CPU core is used on a multi-core CPU, is dropped, and its different values are summed into the appropriate state. Thus, system.cpu.time.idle is the sum of the idle value of the state attribute over all values of logical_number.
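
For example (hypothetical values), a capture from a 2-core host might produce a single event like:

  - host.name: host1, system.cpu.time.idle: 512.3, system.cpu.time.user: 48.7, system.cpu.time.system: 21.5

where each field sums that state’s value across both logical cores.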

The result of this compaction is that up to 128 individual data points (for example, 8 distinct state values across 16 logical cores) can be combined into a single Honeycomb event.