High Cardinality | Honeycomb

High Cardinality

One thing that sets Honeycomb apart is its ability to query on high cardinality and high dimensionality data. What do “high cardinality” and “high dimensionality” mean, and why are they important for observability?

Remember that an event is a collection of information about what it took to do a unit of work. Honeycomb receives each event as a set of key-value pairs, each of which is an attribute of the event. A collection of events is a dataset. Each attribute of an event is a column, or equivalently, a dimension of the dataset.

The term “high cardinality” means that there can be many possible values for a single attribute; the term “high dimensionality” means that there can be many different attributes attached to events. For many database architectures, high-cardinality and high-dimensionality data is prohibitively expensive to store and query. Honeycomb, however, is designed to let users query freely: a user may group or filter on any attribute, no matter how high its cardinality.

Let us take a closer look at these.

What is Dimensionality? 

The dimensionality of a dataset is the number of different attributes that it has. A high-dimensionality dataset, then, has many different attributes. In Honeycomb, it is not unusual to have a dataset with many hundreds of dimensions, exploring every possible facet of data. A single event does not need to have a defined value for every possible attribute; an event that describes database operations might not have attributes about HTTP requests, for example.
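As a rough illustration, events can be modeled as plain Python dictionaries. The sketch below uses made-up attribute names (they are not a Honeycomb schema) and counts the distinct attributes across a small dataset. Note that the database event carries no HTTP attributes, yet it still belongs to the same dataset.

```python
# Hypothetical events; attribute names are illustrative, not a Honeycomb schema.
events = [
    {"service": "api", "http.status_code": 200, "http.route": "/cart", "duration_ms": 12.5},
    {"service": "api", "http.status_code": 500, "http.route": "/checkout", "duration_ms": 950.0},
    # A database event has no HTTP attributes at all.
    {"service": "db", "db.statement": "SELECT 1", "duration_ms": 3.1},
]


def dimensions(dataset):
    """Return every attribute name that appears in at least one event."""
    return {name for event in dataset for name in event}


def dimensionality(dataset):
    """The dimensionality of a dataset is its number of distinct attributes."""
    return len(dimensions(dataset))


print(sorted(dimensions(events)))
print(dimensionality(events))  # 5
```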

Honeycomb datasets can have very high dimensionality. Any single event has a generous individual size limit; a dataset can have thousands of columns. (Experience has shown that when a dataset approaches that very high number, it is often caused by a programming error.)

What is Cardinality? 

The cardinality of a data attribute refers to the number of distinct values that it can have. A boolean column, which can only have the values true or false, has a cardinality of 2. HTTP status codes (200, 301, 302, 404, 500, and so on) might have a cardinality of a few dozen at most. These low-cardinality fields are useful for tracking broad trends in your system: separating your service out by AWS Zone, by current build of your code, or by endpoint.

High cardinality refers to a column that can have many possible values. For an online shopping system, fields like userId, shoppingCartId, and orderId are often high-cardinality columns that can take hundreds of thousands of distinct values. Similarly, requestId might be in the millions. request.URL can be a high-cardinality column if you have many different combinations of GET query parameters.

A high-cardinality field can help uniquely identify a request, which lets you narrow down precisely what caused something to go wrong.
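To make the definition concrete, here is a minimal sketch that counts the distinct values observed for a column. The events and field names are invented for illustration, and the function measures observed cardinality in a sample, which can be lower than the number of values the field could take.

```python
def cardinality(dataset, column):
    """Count the distinct values observed for one column, skipping events that lack it."""
    return len({event[column] for event in dataset if column in event})


# Hypothetical events for an online shop.
events = [
    {"request_id": "r-1", "user_id": "u-1", "status": 200, "cache_hit": True},
    {"request_id": "r-2", "user_id": "u-2", "status": 200, "cache_hit": False},
    {"request_id": "r-3", "user_id": "u-1", "status": 404, "cache_hit": False},
    {"request_id": "r-4", "user_id": "u-3", "status": 200, "cache_hit": True},
]

for column in ("cache_hit", "status", "user_id", "request_id"):
    print(column, cardinality(events, column))
```

Even in this tiny sample, request_id has one distinct value per event, while cache_hit can never exceed two.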

High Cardinality and High Dimensionality are Critical for Observability 

The ability to rapidly look at high cardinality fields is a key aspect of observability. Consider, for example, the “Debug your first issue with Honeycomb” Sandbox tour. The issue can be traced to a single user’s actions in the dataset. That one unlucky user managed to find a particular API endpoint that responded extremely slowly. Being able to identify the specific endpoint and user made it easy to see both how badly the endpoint had behaved, and who it had affected.

Honeycomb allows you to query on high-cardinality fields. If a user reports an error, it is possible to look only at events generated to service that user’s requests, and to examine what is going wrong. It is even possible to query on dimensions like duration in order to find only fast-running events.
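The following sketch shows the idea of filtering on a high-cardinality field in plain Python. It is not the Honeycomb query API; the events, the field names, and the `filter_events` helper are assumptions made for the example.

```python
# Hypothetical events; field names are illustrative.
events = [
    {"user_id": "u-17", "endpoint": "/cart", "duration_ms": 35.0},
    {"user_id": "u-42", "endpoint": "/export", "duration_ms": 4200.0},
    {"user_id": "u-42", "endpoint": "/cart", "duration_ms": 28.0},
    {"user_id": "u-99", "endpoint": "/cart", "duration_ms": 8.0},
]


def filter_events(dataset, predicate):
    """Return only the events for which predicate(event) is true."""
    return [event for event in dataset if predicate(event)]


# Only the events generated while serving one user's requests.
user_events = filter_events(events, lambda e: e.get("user_id") == "u-42")

# Only fast-running events (here, under 30 ms); events without a duration are excluded.
fast_events = filter_events(events, lambda e: e.get("duration_ms", float("inf")) < 30.0)

print(len(user_events), len(fast_events))
```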

Honeycomb BubbleUp is Honeycomb’s tool for identifying attributes that stand out from others. It looks at all dimensions at once to find which fields stand out.

BubbleUp can help identify that a particular latency problem affected a single user, even if the cardinality of the user field is in the millions.
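The sketch below is not how BubbleUp is implemented. It is a simplified illustration of the general idea of comparing a selection of events against a baseline across every attribute at once. The scoring rule (difference in the share of events holding a value), the helper names, and the data are all assumptions made for the example.

```python
from collections import Counter


def value_shares(dataset, column):
    """Fraction of events holding each value of a column."""
    counts = Counter(event[column] for event in dataset if column in event)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def standout_values(selection, baseline, top=2):
    """Rank (column, value) pairs by how much more common they are in the selection."""
    columns = sorted({name for event in selection for name in event})
    scored = []
    for column in columns:
        selected = value_shares(selection, column)
        base = value_shares(baseline, column)
        for value, share in selected.items():
            scored.append((share - base.get(value, 0.0), column, value))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top]


# Hypothetical events: two slow requests share a user and an endpoint.
events = [
    {"user_id": "u-1", "endpoint": "/cart", "region": "us-east", "duration_ms": 20},
    {"user_id": "u-2", "endpoint": "/cart", "region": "us-west", "duration_ms": 25},
    {"user_id": "u-3", "endpoint": "/home", "region": "us-east", "duration_ms": 15},
    {"user_id": "u-42", "endpoint": "/export", "region": "us-east", "duration_ms": 4100},
    {"user_id": "u-42", "endpoint": "/export", "region": "us-west", "duration_ms": 3900},
    {"user_id": "u-4", "endpoint": "/home", "region": "us-west", "duration_ms": 18},
]
slow = [e for e in events if e["duration_ms"] >= 1000]
rest = [e for e in events if e["duration_ms"] < 1000]

top = standout_values(slow, rest)
print(top)
```

In this toy data, the user and the endpoint score highest because every slow event shares them, while region scores zero because it is split evenly in both groups.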

It can also be helpful to store fields with lower cardinality (in the hundreds), such as error messages, in Honeycomb. Storing an “error message” field helps if you later need to find out how many error messages contain the text “cannot connect”.
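As a small sketch of that kind of question, the snippet below counts events whose error message contains a given piece of text. The `error` field name and the messages are hypothetical, and this is ordinary Python rather than a Honeycomb query.

```python
# Hypothetical events; only some of them carry an "error" attribute.
events = [
    {"error": "cannot connect to db-3: timeout"},
    {"error": "invalid cart id"},
    {"error": "upstream failure: cannot connect to payments"},
    {"status": 200},
]


def count_containing(dataset, column, text):
    """Count events whose value for column contains text; events missing the column are skipped."""
    return sum(1 for event in dataset if text in str(event.get(column, "")))


matches = count_containing(events, "error", "cannot connect")
print(matches)  # 2
```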

To learn more about the role that high cardinality plays in observability, visit our blog post: Understanding High Cardinality and Its Role in Observability.

The Curse of Dimensionality 

Why are high dimensionality and high cardinality a concern?

Some metrics analytics systems are built around the idea of tags, or attributes. A data value can be associated with one or more tags. For example, one tag might represent version: 21.3 and another might correspond to action: save_shopping_cart. These systems store a time series for every tag and for every combination of tags, which allows them to query any of these time series rapidly.

The cost model for metrics tools is often based on the number of distinct tag combinations. As a user, this means that adding even a single extra value on an attribute can be very expensive: it might double the number of tag combinations in use. Adding a high-cardinality attribute, like user-id, causes tag costs to explode.

In statistics, this is referred to as the “curse of dimensionality”: the cost of storing data can grow exponentially with the number of dimensions.
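A back-of-the-envelope calculation shows why. Assuming a system that keeps one time series per distinct tag combination, and assuming in the worst case that every combination actually occurs, the number of series is the product of the per-tag cardinalities. The tag names and counts below are made up for illustration.

```python
from math import prod


def max_time_series(tag_cardinalities):
    """Upper bound on distinct tag combinations: the product of per-tag cardinalities."""
    return prod(tag_cardinalities.values())


# Hypothetical tags with their number of distinct values.
tags = {"version": 5, "action": 20, "region": 4}
before = max_time_series(tags)
print(before)  # 400

# Adding one high-cardinality tag multiplies the total.
tags["user_id"] = 100_000
after = max_time_series(tags)
print(after)  # 40000000
```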

Honeycomb’s internal storage engine is designed to store each event and its data independently. For non-event data, Honeycomb Metrics stores a number of events corresponding in size to the number of data points sent. What this means is that you can send Honeycomb events that have rich context, complex attributes, and contain data that you do not have to think about managing. They are stored inexpensively in a simple format.

The query engine then aggregates that data as you analyze it, which makes working with high cardinality and high dimensionality data easy, inexpensive, and fast.
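The contrast with pre-aggregated time series can be sketched in a few lines. This is not Honeycomb's storage or query engine; it only illustrates the approach of keeping raw events and aggregating at query time, so that any column can be chosen as the grouping key after the fact. The `group_by` helper and the data are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean


def group_by(dataset, column, measure):
    """Aggregate raw events at query time: count and average of measure per value of column."""
    groups = defaultdict(list)
    for event in dataset:
        if column in event and measure in event:
            groups[event[column]].append(event[measure])
    return {key: {"count": len(values), "avg": mean(values)} for key, values in groups.items()}


# Hypothetical raw events, stored independently of one another.
events = [
    {"endpoint": "/cart", "user_id": "u-1", "duration_ms": 20.0},
    {"endpoint": "/cart", "user_id": "u-2", "duration_ms": 40.0},
    {"endpoint": "/export", "user_id": "u-42", "duration_ms": 4000.0},
]

# The same stored events answer both questions; nothing was pre-computed per tag.
by_endpoint = group_by(events, "endpoint", "duration_ms")
by_user = group_by(events, "user_id", "duration_ms")
print(by_endpoint)
print(by_user)
```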