# Group Logs by Pattern with the OpenTelemetry Drain Processor

> Use the OpenTelemetry Collector drain processor to derive a template attribute on every log record, then group, filter, and sample by recurring patterns in Honeycomb.

The OpenTelemetry Collector [`drainprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/drainprocessor) applies the [Drain](https://arxiv.org/pdf/1806.04356) online log parsing algorithm to each log record and attaches the derived template as an attribute.
A template captures the invariant skeleton of a log message, with variable parts replaced by `<*>` wildcards.
For example, the messages `user 1234 logged in from 10.0.0.1` and `user 5678 logged in from 10.0.0.2` both reduce to `user <*> logged in from <*>`.
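
To make the reduction concrete, here is a minimal Python sketch of the masking step alone. It assumes whitespace tokenization and two messages with the same token count. The function name `merge_to_template` is hypothetical and is not part of the processor.

```python
WILDCARD = "<*>"


def merge_to_template(a: str, b: str) -> str:
    """Keep tokens that are identical in both messages; mask the rest."""
    tokens_a, tokens_b = a.split(), b.split()
    if len(tokens_a) != len(tokens_b):
        raise ValueError("this sketch only merges messages with the same token count")
    merged = [x if x == y else WILDCARD for x, y in zip(tokens_a, tokens_b)]
    return " ".join(merged)


template = merge_to_template(
    "user 1234 logged in from 10.0.0.1",
    "user 5678 logged in from 10.0.0.2",
)
print(template)  # user <*> logged in from <*>
```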

Deriving templates at the collector means a stable template attribute, defaulting to `log.record.template`, lands on every log record in your Honeycomb dataset.
You can `GROUP BY` it to find dominant patterns, build SLOs or Triggers against specific templates, and feed template-based conditions into Refinery sampling rules.

<Warning>
  The drain processor is at **alpha** stability for logs and is distributed in the OpenTelemetry Collector Contrib and Kubernetes builds.
  The processor operates on log records only; it does not affect traces or metrics.
  Configuration options and behavior may change between releases.
  Pin a specific collector version, test in a non-production environment, and read the upstream release notes before upgrading.
</Warning>

## When to use the drain processor

The drain processor is the right choice when you need to discover and stabilize log patterns at the collector rather than define them manually.
Use it when:

* You have high-volume or noisy logs and want to identify the dominant message patterns without manually defining them.
* You want a stable template attribute on every event so that SLOs, Triggers, and Refinery rules can reference it directly.
* You want to make log volume reduction decisions (sampling, dropping, or routing) at the collector based on pattern rather than message content.

Reach for an alternative when:

* You only need to template logs for a single ad-hoc query. A [Calculated Field](/configure/environments/calculated-fields/) using regex or string functions is enough and doesn't require collector changes. Calculated Fields are per-query and can't be referenced by SLOs, Triggers, or Refinery.
* You already know the exact patterns and just need to tag them. The [`transform`](/send-data/opentelemetry/collector/handle-sensitive-information#transform-processor) processor with OTTL is more precise than a learned algorithm.
* You need a stable, generally available processor and can't adopt alpha components.

## How drain works

Drain incrementally builds a fixed-depth parse tree of tokens seen across log messages.
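Each leaf holds a small set of templates. An incoming message walks the tree to the matching leaf and is compared with the templates there: if the fraction of matching tokens for the closest template meets `merge_threshold`, the message merges into that template and any differing tokens become `<*>`; otherwise, the message starts a new template.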
This approach is fast, runs online without retraining, and converges to a stable set of templates after a short warmup.

The two parameters that most affect output quality are `tree_depth` and `merge_threshold`.
To learn more, refer to the [Tune the Algorithm](#tune-the-algorithm) section.
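
The following Python sketch illustrates the idea in a heavily simplified form. It is not the processor's implementation. The class `MiniDrain` and its methods are hypothetical, and the sketch assumes whitespace tokenization, a leaf chosen by token count plus the first `tree_depth - 2` tokens, and tokens containing digits treated as wildcards when choosing the leaf.

```python
from dataclasses import dataclass, field

WILDCARD = "<*>"


@dataclass
class MiniDrain:
    tree_depth: int = 4
    merge_threshold: float = 0.4
    # Leaf key -> list of templates, each stored as a list of tokens.
    leaves: dict = field(default_factory=dict)

    def _leaf_key(self, tokens):
        # Assumption: the root and leaf layers use two levels of depth,
        # so the remaining levels split on the leading tokens.
        prefix_len = max(self.tree_depth - 2, 0)
        prefix = tuple(
            WILDCARD if any(c.isdigit() for c in tok) else tok
            for tok in tokens[:prefix_len]
        )
        return (len(tokens), prefix)

    @staticmethod
    def _similarity(template, tokens):
        # Fraction of positions where the template token equals the message token.
        same = sum(1 for t, tok in zip(template, tokens) if t == tok)
        return same / len(tokens)

    def add(self, message: str) -> str:
        """Learn from one message and return the template assigned to it."""
        tokens = message.split()
        if not tokens:
            return ""
        templates = self.leaves.setdefault(self._leaf_key(tokens), [])
        best = max(templates, key=lambda t: self._similarity(t, tokens), default=None)
        if best is not None and self._similarity(best, tokens) >= self.merge_threshold:
            # Merge in place: differing positions become wildcards.
            best[:] = [t if t == tok else WILDCARD for t, tok in zip(best, tokens)]
            return " ".join(best)
        templates.append(tokens)
        return " ".join(tokens)


drain = MiniDrain()
print(drain.add("user 1234 logged in from 10.0.0.1"))  # user 1234 logged in from 10.0.0.1
print(drain.add("user 5678 logged in from 10.0.0.2"))  # user <*> logged in from <*>
print(drain.add("request 42 completed in 17 ms"))      # request 42 completed in 17 ms
```

In this sketch, the first message of each pattern is returned verbatim because nothing exists yet to merge it with. Wildcards appear only once a second, similar message arrives, which is why early output can look noisier than output after warmup.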

## Configuration reference

This table covers the most common fields.
For the authoritative option list at the version of the collector you run, refer to the [upstream `drainprocessor` README](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/drainprocessor).

| Field                 | Type       | Required | Default               | Purpose                                                                                                                                                                        |
| --------------------- | ---------- | -------- | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `tree_depth`          | int        | No       | `4`                   | Depth of the parse tree. Minimum is `3`. Higher values increase precision but use more memory.                                                                                 |
| `merge_threshold`     | float      | No       | `0.4`                 | Fraction of tokens that must match an existing template (range `0.0`–`1.0`). Lower values merge more aggressively into fewer templates.                                        |
| `max_node_children`   | int        | No       | `100`                 | Maximum children per internal tree node.                                                                                                                                       |
| `max_clusters`        | int        | No       | `0`                   | Maximum number of templates retained. `0` means unlimited. When the cap is hit the least-recently-used template is evicted.                                                    |
| `extra_delimiters`    | `[]string` | No       | `[]`                  | Additional characters to treat as token delimiters, in addition to whitespace.                                                                                                 |
| `body_field`          | string     | No       | `""`                  | When the log body is a map, the key whose value should be templated. Empty means template the stringified body.                                                                |
| `template_attribute`  | string     | No       | `log.record.template` | Attribute name written to each log record.                                                                                                                                     |
| `seed_templates`      | `[]string` | No       | `[]`                  | Pre-loaded templates the processor starts with, using `<*>` for wildcards.                                                                                                     |
| `seed_logs`           | `[]string` | No       | `[]`                  | Example log lines used to seed templates at startup.                                                                                                                           |
| `warmup_min_clusters` | int        | No       | `0`                   | Minimum templates that must exist before the processor begins annotating records. Useful to avoid noisy templates during a cold start.                                         |
| `storage`             | string     | No       | in-memory             | ID of a [storage extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage) for persisting templates across collector restarts. |
| `save_interval`       | duration   | No       | `0s`                  | How often to snapshot template state to the storage extension. `0s` disables periodic snapshots.                                                                               |
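
To see why `extra_delimiters` matters, consider how tokenization changes what the algorithm can compare. The Python sketch below is an illustration only: the `tokenize` helper is hypothetical, and it assumes delimiter characters are discarded, which the table above does not confirm.

```python
import re


def tokenize(message: str, extra_delimiters=()) -> list:
    """Split on whitespace plus any extra delimiter characters, dropping empty tokens."""
    pattern = "|".join([r"\s+"] + [re.escape(d) for d in extra_delimiters])
    return [tok for tok in re.split(pattern, message) if tok]


line = "GET /api/users/42?id=7"
print(tokenize(line))                   # ['GET', '/api/users/42?id=7']
print(tokenize(line, ["/", "?", "="]))  # ['GET', 'api', 'users', '42', 'id', '7']
```

With whitespace alone, the whole URL is one token, so every distinct URL looks like a different value. Splitting on `/`, `?`, and `=` exposes the stable parts (`api`, `users`, `id`) as separate tokens that can match across messages.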

## Basic configuration

Wire the processor into your logs pipeline.
Trace and metric pipelines should not include it.

```yaml
processors:
  drain:
    tree_depth: 4
    merge_threshold: 0.4
    max_clusters: 500

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [drain]
      exporters: [otlp/honeycomb]
```

After deployment, every log record arriving at Honeycomb has a `log.record.template` attribute set to its derived template.

To start with a curated set of templates rather than learning from scratch, seed the processor:

```yaml
processors:
  drain:
    seed_templates:
      - "user <*> logged in from <*>"
      - "request <*> completed in <*> ms"
    warmup_min_clusters: 20
```

`warmup_min_clusters` holds back annotation until the processor has learned the specified number of templates, preventing a noisy first few minutes of data.

## Tune the algorithm

The defaults work well for general application logs but produce poor results for very high-volume logs with many distinct patterns, or very low-volume logs where the algorithm has little data to learn from.
Two parameters do most of the work:

* **`tree_depth`**: Controls how many leading tokens the tree splits on. Increase it (5–6) for logs where the first few tokens carry less signal, such as logs prefixed with a long timestamp or request ID. Decrease it (3) for short, structured logs.
* **`merge_threshold`**: Controls how readily new messages merge into existing templates. Lower values (0.2–0.3) produce fewer, broader templates. Higher values (0.5–0.7) produce more, narrower templates. If you see hundreds of near-duplicate templates, lower the threshold.
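
As a rough illustration of the `merge_threshold` trade-off, the Python sketch below scores one message against one existing template and shows the decision at several thresholds. It assumes similarity is the fraction of positions with identical tokens. The `similarity` helper and the sample messages are hypothetical.

```python
def similarity(template: str, message: str) -> float:
    """Fraction of positions where template and message tokens are identical."""
    t_tokens, m_tokens = template.split(), message.split()
    if not m_tokens or len(t_tokens) != len(m_tokens):
        return 0.0
    matches = sum(a == b for a, b in zip(t_tokens, m_tokens))
    return matches / len(m_tokens)


template = "connection to db-1 failed after 3 retries"
message = "connection to cache-7 timed out 5 retries"
score = similarity(template, message)  # 3 of 7 tokens match, about 0.43

for threshold in (0.3, 0.4, 0.5, 0.7):
    action = "merge" if score >= threshold else "new template"
    print(f"merge_threshold={threshold}: {action}")
# merge_threshold=0.3: merge
# merge_threshold=0.4: merge
# merge_threshold=0.5: new template
# merge_threshold=0.7: new template
```

At the lower thresholds, a failure and a timeout collapse into one broad template. At the higher thresholds, they stay separate.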

`max_clusters` is a memory ceiling, not a quality control.
Set it to roughly twice the number of stable templates you expect; LRU eviction handles the rest.
For most services this is in the low hundreds.
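
The eviction behavior can be pictured as a small least-recently-used cache. The Python sketch below is an assumption-laden illustration rather than the processor's data structure. The `TemplateCache` class is hypothetical.

```python
from collections import OrderedDict


class TemplateCache:
    """Tracks templates in least-recently-used order with an optional cap."""

    def __init__(self, max_clusters: int = 0):
        self.max_clusters = max_clusters  # 0 means unlimited
        self._counts = OrderedDict()

    def touch(self, template: str):
        """Record a match; return the evicted template, if any."""
        evicted = None
        if template in self._counts:
            # A match makes the template the most recently used.
            self._counts.move_to_end(template)
        elif self.max_clusters and len(self._counts) >= self.max_clusters:
            # At the cap: drop the least recently used template first.
            evicted, _ = self._counts.popitem(last=False)
        self._counts[template] = self._counts.get(template, 0) + 1
        return evicted

    def templates(self) -> list:
        return list(self._counts)


cache = TemplateCache(max_clusters=2)
cache.touch("A")
cache.touch("B")
cache.touch("A")         # A becomes most recently used
print(cache.touch("C"))  # B
print(cache.templates()) # ['A', 'C']
```

If the cap is too close to the real number of patterns, frequently seen templates can be evicted and relearned repeatedly, which is why leaving headroom helps.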

If a source emits structured logs whose body is a map, set `body_field` to the key that holds the human-readable message so the processor templates that value rather than the stringified map.

## Use templates in Honeycomb

Once log records carry `log.record.template`, you can use it like any other attribute:

* **Find dominant patterns:** Run a query over your logs dataset with `COUNT` grouped by `log.record.template`, ordered descending. The top rows are your noisiest patterns.
* **Build SLOs and Triggers:** Reference `log.record.template = "<specific template>"` in the SLI or Trigger condition. A stable template attribute makes these conditions resilient to changes in variable values that would otherwise require regex.
* **Drive Refinery sampling:** Configure Refinery rules to drop or downsample high-volume templates while keeping a representative sample of each pattern.
* **Identify reduction candidates:** A small number of templates accounting for a large fraction of log volume usually means easy reductions are available, either by dropping at the collector, by pairing with the [log deduplication processor](#reduce-volume-by-pairing-with-the-log-deduplication-processor) downstream, or by addressing the source.
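
The "dominant patterns" and "reduction candidates" checks above amount to counting records per template and looking at each template's share of the total. The Python sketch below shows that arithmetic on made-up data. The `volume_share` helper and the sample templates are hypothetical, and in practice you would run the equivalent query in Honeycomb.

```python
from collections import Counter


def volume_share(templates):
    """Return (template, count, share of total) tuples, largest first."""
    counts = Counter(templates)
    total = sum(counts.values())
    return [(t, n, n / total) for t, n in counts.most_common()]


# Hypothetical values of log.record.template from ten log records.
observed = (
    ["GET /healthz <*>"] * 6
    + ["user <*> logged in from <*>"] * 3
    + ["payment <*> failed"]
)

for template, count, share in volume_share(observed):
    print(f"{share:.0%}  {count:>2}  {template}")
# 60%   6  GET /healthz <*>
# 30%   3  user <*> logged in from <*>
# 10%   1  payment <*> failed
```

Here a single health-check template accounts for most of the volume, which makes it the first candidate for sampling or deduplication.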

## Reduce volume by pairing with the log deduplication processor

A common and effective pattern is to put the [`logdedupprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/logdedupprocessor) immediately downstream of the drain processor.
The drain processor attaches a stable template attribute to every record; the log deduplication processor then collapses records that share the same template within a time window into a single record annotated with a count.
For high-volume, repetitive logs, this can reduce egress to Honeycomb by an order of magnitude or more while preserving the information needed to reason about pattern frequency.

By default, the log deduplication processor matches on the full body, attributes, severity, and resource, which prevents two records with the same template but different variable values from collapsing.
To deduplicate by template, restrict matching to the template attribute and any resource attributes you care about with `include_fields`.

```yaml
processors:
  drain:
    tree_depth: 4
    merge_threshold: 0.4

  log_dedup:
    interval: 60s
    log_count_attribute: log.record.count
    include_fields:
      - attributes["log.record.template"]
      - resource.attributes["service.name"]

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [drain, log_dedup]
      exporters: [otlp/honeycomb]
```

Order matters.
The drain processor must run before the log deduplication processor so the template attribute exists when deduplication evaluates its match fields.

In Honeycomb, deduplicated records carry the count attribute (renamed to `log.record.count` in the example above) and `first_observed_timestamp`/`last_observed_timestamp` attributes that the log deduplication processor adds.
Use `SUM(log.record.count)` rather than `COUNT` when measuring volume across templates, because each deduplicated record stands in for one or more original records.
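
The difference between the two aggregations is easy to see in a small simulation. The Python sketch below is a simplified model of deduplication by template, not the processor's logic. It assumes fixed, aligned time windows, and the `dedup_by_template` helper and sample records are hypothetical.

```python
def dedup_by_template(records, interval_s=60):
    """Collapse (timestamp_s, service, template) records per window into counted records."""
    buckets = {}
    for ts, service, template in records:
        key = (ts // interval_s, service, template)
        if key not in buckets:
            buckets[key] = {
                "service.name": service,
                "log.record.template": template,
                "log.record.count": 0,
                "first_observed_timestamp": ts,
                "last_observed_timestamp": ts,
            }
        bucket = buckets[key]
        bucket["log.record.count"] += 1
        bucket["first_observed_timestamp"] = min(bucket["first_observed_timestamp"], ts)
        bucket["last_observed_timestamp"] = max(bucket["last_observed_timestamp"], ts)
    return list(buckets.values())


login = "user <*> logged in from <*>"
request = "request <*> completed in <*> ms"
records = [(0, "api", login), (5, "api", login), (30, "api", login),
           (40, "api", request), (65, "api", login)]

deduped = dedup_by_template(records)
print(len(deduped))                                 # 3  (what COUNT sees)
print(sum(r["log.record.count"] for r in deduped))  # 5  (what SUM(log.record.count) sees)
```

Five original records become three exported records. `COUNT` reports the three that were exported, while the sum of the count attribute recovers the original five.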

<Note>
  The log deduplication processor is also at **alpha** stability for logs.
  The same upgrade caveats apply.
</Note>

## Persist templates across restarts

By default, the processor's learned templates are in-memory and are lost on collector restart.
After a restart, the processor relearns templates and may produce different output until the algorithm converges again.

To persist templates, configure a [storage extension](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage) and reference it from the processor:

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/drain

processors:
  drain:
    storage: file_storage
    save_interval: 5m

service:
  extensions: [file_storage]
  pipelines:
    logs:
      receivers: [otlp]
      processors: [drain]
      exporters: [otlp/honeycomb]
```

`save_interval` controls how often state is snapshotted.
Shorter intervals reduce how much learned template state is lost if the collector crashes, at the cost of more disk activity.

## Limitations

Keep these constraints in mind when planning your pipeline configuration:

* The processor operates on logs only. Trace and metric pipelines should not include it.
* Drain is an online clustering algorithm; output is deterministic only when seeded with `seed_templates` or `seed_logs`. Two collectors without shared state will produce slightly different template inventories.
* Without a `storage` extension, templates do not survive collector restarts.
* Alpha stability means option names and defaults may change between collector releases. Validate after every upgrade.
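
The order dependence behind the second limitation can be reproduced with a toy model. The Python sketch below omits the parse tree and clusters on token similarity alone, so it is an illustration of the effect rather than of the processor. The `learn` helper and the sample messages are hypothetical.

```python
WILDCARD = "<*>"


def learn(messages, merge_threshold=0.4):
    """Cluster messages in arrival order and return the resulting templates, sorted."""
    templates = []  # each template is a list of tokens
    for message in messages:
        tokens = message.split()
        if not tokens:
            continue
        best, best_score = None, 0.0
        for template in templates:
            if len(template) != len(tokens):
                continue
            score = sum(a == b for a, b in zip(template, tokens)) / len(tokens)
            if score > best_score:
                best, best_score = template, score
        if best is not None and best_score >= merge_threshold:
            # Merge in place: differing positions become wildcards.
            best[:] = [a if a == b else WILDCARD for a, b in zip(best, tokens)]
        else:
            templates.append(tokens)
    return sorted(" ".join(t) for t in templates)


messages = [
    "task upload done in 5s",
    "task upload done with warnings",
    "task cleanup skipped with warnings",
]

print(learn(messages))
# ['task cleanup skipped with warnings', 'task upload done <*> <*>']
print(learn(list(reversed(messages))))
# ['task <*> <*> with warnings', 'task upload done in 5s']
```

The same three messages produce two different template inventories depending on arrival order, which is what two collectors seeing different slices of traffic would experience.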

## Related topics

These resources cover the tools and approaches that work alongside or instead of the drain processor:

* [Handle Sensitive Information with the OpenTelemetry Collector](/send-data/opentelemetry/collector/handle-sensitive-information): Covers the `transform` and `redaction` processors, which are the right tools when you know the patterns up front rather than learning them.
* [Calculated Fields](/configure/environments/calculated-fields/): Honeycomb-side, per-query templating when collector changes are not an option.
* [Refinery](/manage-data-volume/sample/honeycomb-refinery/): Apply template-based sampling rules using the attribute the drain processor produces.
* [Upstream `drainprocessor` README](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/drainprocessor)
* [Upstream `logdedupprocessor` README](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/logdedupprocessor)
* [Drain: An Online Log Parsing Approach with Fixed Depth Tree (paper)](https://arxiv.org/pdf/1806.04356)
