Best Practices for Organizing Data

We recommend that you follow certain best practices when organizing data.

Note

This page gives recommendations based on the current state of the Honeycomb data model.

If you are seeking guidelines for Honeycomb Classic datasets, go to Dataset Best Practices. Find out more about Honeycomb versus Honeycomb Classic.

Honeycomb divides your data into Environments and, within them, Datasets. An Environment represents the context for your events. A Dataset represents a collection of related events that come from the same source, or are related to the same source.

Use Environments to group Datasets based on a theme 

Group events in the same Environment when you expect to analyze them as part of the same query or see them in the same trace. For example, if you create a separate Environment for each of your deployment environments (“Production”, “Development”, and “Testing”), you can maintain focused datasets accompanied by relevant fields and values for each scenario.

Separate events into different Environments when you cannot establish a relationship between them and want to reinforce that the data is differentiated. For example, you likely would not want to issue a single query against both “Production” and “Development”.

We do not recommend mixing unrelated data in the same Environment. Events from different contexts often appear similar to one another and can be easily confused. Relying on the consistent application of a filter to separate them is tedious, error-prone, and likely to create misleading query results when the filter is forgotten.

Consolidate Traces in an Environment 

For tracing, all the events in a trace must be within the same Environment to render correctly in the UI. To ensure distributed tracing works across a number of services, send events from all of those services to a single Environment.
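
With OpenTelemetry, the Environment an event lands in is determined by the Honeycomb API key the exporter sends, so pointing every service at the same key keeps their traces together. The sketch below shows one way to do this with the standard OpenTelemetry environment variables; the endpoint, placeholder key, and service name are assumptions, not values from this guide.

import os

# Sketch: point an OpenTelemetry SDK at a single Honeycomb Environment.
# The API key identifies the Environment; every service participating in a
# trace should use the same one. The values below are placeholders.
os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "https://api.honeycomb.io")
os.environ.setdefault("OTEL_EXPORTER_OTLP_HEADERS", "x-honeycomb-team=YOUR_API_KEY")
os.environ.setdefault("OTEL_SERVICE_NAME", "authservice")  # becomes the Service Dataset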

Examples of Environments 

The general guideline when creating Environments is to consider to what degree you will need to query data in the same place. There are a few common patterns for deciding when to create distinct Environments.

Environments for Release Workflow 

You can use an Environment to represent different instances of an application as it moves through the release workflow, separating events that will be used in production from those in staging. You might also use a separate Environment for CI concerns, such as tracking build events or test suite automation.

Individual Development Environments 

Many Honeycomb users find it helpful to send data from their local development environment. In this case, it can make sense for each team member to have an Environment with their name, for example: dev-alice. A development Environment can help you check that instrumentation is working correctly and understand how a new service behaves.

In some cases, you might prefer to have all developers use one shared dev environment. In those situations, every developer would send their events to the same dataset. Consider tagging each event with a field that specifies the developer whose local environment sent that event. That would allow a developer to query only their events by adding the filter developer.name = name to their queries.
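
As a sketch of that tagging, with the OpenTelemetry Python SDK you could stamp every span from a local run with the developer's name as a resource attribute; the developer.name field and the local username lookup are illustrative choices, not part of this guide.

import getpass

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Sketch: tag every span from a local run with the developer who sent it.
# The developer.name field and the username lookup are illustrative.
resource = Resource.create({
    "service.name": "authservice",        # still one shared service
    "developer.name": getpass.getuser(),  # for example, "alice"
})
trace.set_tracer_provider(TracerProvider(resource=resource))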

Environments for Regulatory Purposes 

Some Honeycomb users are subject to regulatory regimes, such as GDPR, and choose to create different environments to represent data that is affected by different regulations. Even though the events represent similar underlying activity, they may choose to send less-identifiable data to one environment than another.

Data That Does Not Seem to Fit an Environment 

You might have general data that does not specifically fit into one Environment. For example, you might keep infrastructure metrics that support both test and production systems. We suggest you choose one well-known Environment to put these into. One way to do this is to create a separate infra Environment. Alternatively, you can put those events into production.

Datasets Group Data Together 

Each Environment consists of a number of different Datasets. You can query within a single dataset, or across the entire Environment.

Datasets are separated into two types: Service Datasets and General Datasets.

Service Datasets 

The events in a Service Dataset represent distributed tracing spans. Each service is distinguished by its service.name field or serviceName configuration.

A single trace can cross a number of different Service Datasets. When you look at a Trace, the query engine will find all spans — across all the Services in the Environment — that share the same trace id field. This will allow you to see the entire trace from any entry point.

General Datasets 

General datasets consist of any data that does not participate in traces. They may include data from deployments, data from log sources, or metrics. We recommend that you send events into the Environment that best corresponds to those events. For example, a load balancer dataset might go into the prod Environment.
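
As an illustration, a non-trace event such as a deploy record could be sent to a general dataset with Honeycomb's libhoney library; the write key, the deploys dataset name, and the fields below are hypothetical.

import libhoney

# Sketch: send a non-trace event to a general dataset in the prod Environment.
# The write key, dataset name, and field values are placeholders.
libhoney.init(writekey="YOUR_API_KEY", dataset="deploys")

event = libhoney.new_event()
event.add_field("app.version", "1.4.2")
event.add_field("app.deployer", "alice")
event.send()

libhoney.close()  # flush any pending events before exit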

While you can query across multiple datasets in Honeycomb, you may benefit from partitioning portions of your non-trace data into multiple general datasets. We recommend this approach to setting up your non-trace data:

  • Think of some of the questions that you want to ask of your data.
  • Think about what data you need to collect to answer those questions.
  • Think about the query that you need to run across your data to answer each question, and in particular, how to make sure that you can filter it to only the data that you want.

Manage the Number of Services 

Honeycomb sees Service Datasets as a way to group data: any data that is closely related should go in the same Dataset. Information about a particular instance of a service should be sent as fields on the event, rather than encoded in the name of the service itself. For example, rather than sending data to two separate services, authservice-host-01 and authservice-host-02, consider sending the data to one service named authservice and using an additional field, host, to contain 01 and 02. If you combine this data into a single service, you can more easily query across hosts, and tools like BubbleUp can compare and contrast fields on those events.
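
A minimal sketch of that layout with the OpenTelemetry Python SDK follows; the authservice name comes from the example above, while the use of host.name and the hostname lookup are assumptions.

import socket

from opentelemetry.sdk.resources import Resource

# Sketch: one service name for every instance, with the host recorded as a
# field on the events rather than encoded into the service name.
resource = Resource.create({
    "service.name": "authservice",      # one Service Dataset for all hosts
    "host.name": socket.gethostname(),  # for example, "authservice-host-01"
})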

Querying Across Trace and Non-trace data 

In order to easily query across trace and non-trace data - for example, correlating load metrics with trace information on API calls - it is important that you maintain consistency between your trace data and non-trace data. For example, if you are using OpenTelemetry to ingest your data, use the suggested OpenTelemetry schema for your service name, hostname, and other fields that you may decide to use.
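
One way to keep those fields aligned, sketched below with the OpenTelemetry Python SDK, is to build a single resource and hand it to both the tracing and metrics providers; the service and host values are placeholders.

from opentelemetry.sdk.resources import Resource

# Sketch: one shared resource so trace and non-trace data carry identical
# service.name and host.name fields. The values are placeholders.
shared_resource = Resource.create({
    "service.name": "api",
    "host.name": "api-host-01",
})
# Pass shared_resource to both TracerProvider(resource=shared_resource) and
# MeterProvider(resource=shared_resource) so spans and metrics can be
# correlated on the same fields.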

Managing your Data 

There are some general best practices that are not tied to Environments or Datasets in particular.

Namespace Custom Fields 

To keep your fields organized, we recommend using dot-delimited namespaces to group related fields in the incoming events. The automatic instrumentation in OpenTelemetry and Beelines follows this convention. For example:

  • Tracing data is identified with trace.trace_id, trace.parent_id, and so on.
  • HTTP requests use fields like request.url and request.user-agent.
  • Database spans include fields such as db.query, db.query_args, and db.rows_affected.

Refer to the OpenTelemetry Semantic Conventions to learn more about conventional names for fields.

Consider putting manual instrumentation under the app. namespace. Use as many layers of hierarchy as makes sense: app.shopping_cart.subtotal and app.shopping_cart.items; app.user.email and app.user.id.
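
With manual instrumentation, these namespaced fields might be set on the current span as in the sketch below; the values are made up, and the field names mirror the shopping-cart examples above.

from opentelemetry import trace

# Sketch: namespaced manual instrumentation on the current span.
span = trace.get_current_span()
span.set_attribute("app.shopping_cart.subtotal", 104.50)
span.set_attribute("app.shopping_cart.items", 3)
span.set_attribute("app.user.id", "tallen")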

Tip

In general, it is a best practice not to dynamically set a field’s name from your instrumentation code, or to generate field names on the fly. This can lead to runaway schemas, which can make the dataset difficult to navigate, and to Honeycomb throttling the creation of new columns.

It is a common error to accidentally send a timestamp as a key, rather than as a value. It is particularly dangerous to send unsanitized user input as a field name.
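
The sketch below contrasts the two approaches; the field names are illustrative.

from opentelemetry import trace

span = trace.get_current_span()
user_id = "tallen"

# Risky: interpolating a value into the field name creates a new column for
# every distinct user, which can lead to a runaway schema.
# span.set_attribute(f"app.login.{user_id}", True)

# Better: keep one stable column and put the variable part in the value.
span.set_attribute("app.login.user_id", user_id)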

Ensure Schema Consistency 

In a dataset that encompasses multiple services, it can be distressingly easy to create inconsistent field names.

{
    "service.name": "web",
    "app.customer_login": "tallen",
    "http.url": "/home",
    ...
}
{
    "service.name": "s3",
    "app.user_id": "tallen",
    "s3.bucket": 5484,
    "s3.size": 518324,
    ...
}

When you look at a full trace that contains spans from both services, the inconsistent user name fields (app.customer_login versus app.user_id) are annoying. Even though this is not a significant problem when querying within one dataset, the inconsistency can lead to problems when you query across the Environment.

One way to help ensure consistency is to use a shared library of constants, or shared functions that add instrumentation.
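
One lightweight version of that idea is a small shared module of field-name constants and a helper that every service imports; the names below are illustrative, not an established library.

from opentelemetry import trace

# Sketch: shared field-name constants plus a helper, imported by every
# service so the user fields keep the same names across the schema.
FIELD_USER_ID = "app.user_id"
FIELD_USER_LOGIN = "app.user_login"

def add_user_fields(user_id, login):
    span = trace.get_current_span()
    span.set_attribute(FIELD_USER_ID, user_id)
    span.set_attribute(FIELD_USER_LOGIN, login)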

Ensure Appropriate Field Data Types 

Check that your data is a good match for the type Honeycomb infers for it. For example, a field that looks like an integer might actually be a user ID. It would not make sense to round a user ID, which could happen with a large integer value. Send these explicitly as string data. Alternatively, you can set the field type from the Dataset Settings Schema page.
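
For instance, an identifier that happens to be numeric can be sent explicitly as a string, as in the sketch below; the field name and value are made up.

from opentelemetry import trace

# Sketch: send numeric-looking identifiers explicitly as strings so large
# values are never treated, and possibly rounded, as numbers.
user_id = 90071992547409921  # wider than a 64-bit float can represent exactly
trace.get_current_span().set_attribute("app.user.id", str(user_id))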

Limits 

Limits for Environments 

Team owners can create any number of Environments in Honeycomb. You can send up to 100 datasets to a single Environment. If you are on a Honeycomb Enterprise plan, you can send up to 300 datasets to a single Environment. When you are approaching 90% of your limit, Honeycomb will warn the team via email. When you have hit your limit, Honeycomb will stop accepting new datasets for the environment but will continue to accept events for existing datasets.

View the current number of datasets for your Environment from the Manage Environments page. If your team needs more datasets, please contact our Support team via support.honeycomb.io, or email at support@honeycomb.io to raise these limits.

Limits for Events 

Each event allows a maximum of 2,000 distinct fields. Each entire event must contain less than 100 KB of uncompressed JSON data. Each string field has a maximum length of 64 KB. For number fields, integers and floats are both 64-bit. If you exceed the maximum limit for any of these values, then the event is rejected and an error is returned.

Query Assistant 

In addition to your schema, Query Assistant uses the following as context:

When modifying the current query, Query Assistant translates your prompts into additional query clauses. For example, when given a query that displays overall latency and the prompt “only show errors”, Query Assistant usually adds a WHERE clause to return spans with an error field present and set to true.

When a dataset has Suggested Queries configured, Query Assistant analyzes its fields and generates better results. We recommend configuring your own Suggested Queries for datasets that do not conform to common standards defined by OpenTelemetry instrumentation.

Query Assistant uses the fields defined in Dataset Definitions. For example, OpenTelemetry (and Honeycomb, by default) recognizes any error as a boolean value in the error field. When a Dataset Definition overrides this default with a string value in the app.error field, Query Assistant uses the app.error field instead of the error field when it evaluates prompts.