Best Practices for Organizing Data | Honeycomb


Best Practices for Organizing Data

This page gives recommendations based on the current state of the Honeycomb data model.

If seeking guidelines for Honeycomb Classic datasets, go to Dataset Best Practices. Find out more about Honeycomb versus Honeycomb Classic.

Honeycomb divides your data into Environments and, within them, Datasets. An Environment represents the context for your events. A Dataset represents a collection of related events that come from the same source, or are related to the same source.

Environments Separate Data 

When you first add Honeycomb to your application, you will be asked to specify an environment. All events in an environment are related to each other. For example, you can use environments to represent stages of the release process, and put production and test into separate environments.

Environments are used to partition your data. Send events to the same environment when you might expect to analyze them as part of the same query, or to see them in the same trace. Each query can only be run within a single environment; there is no cross-environment querying.

You should separate events into different environments when you cannot establish a relationship between them. For example, it is unlikely that you would want to issue a single query across both prod and dev. Putting events into separate environments helps reinforce that the data is differentiated.

We discourage the practice of putting all the data in the same environment. Events from different environments can be easily confused because they will appear very similar to each other. Relying on consistent application of a filter is tedious and error-prone, potentially creating misleading query results if forgotten.

Consolidate Traces in an Environment 

For tracing, all the events in a trace must be within the same environment in order to render correctly in the UI. To ensure distributed tracing works across a number of services, send events from all of these services to a single environment.

Examples of Environments 

The general guideline when creating environments is to consider to what degree you will need to query data in the same place. There are a few common patterns for choosing when to create distinct environments.

Environments for Release Workflow 

You can use an Environment to represent different instances of an application as it moves through the release workflow, separating events that will be used in production from those in staging. You might also use a separate Environment for CI concerns, such as tracking build events or test suite automation.

Individual Development Environments 

Many Honeycomb users find it helpful to send data from their local development environment. In this case, it can make sense for each team member to have an environment with their name, for example: dev-alice. A development environment can help both to check that instrumentation is working correctly and to understand how a new service behaves.

In some cases, you might prefer to have all developers use one shared dev environment. In those situations, every developer would send their events to the same environment. Consider tagging each event with a field that specifies the developer whose local environment sent that event. That would allow a developer to query only their events by adding the filter developer.name = name to their queries.
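The tagging described above can be sketched as a small helper that attaches the developer's name before an event is sent. This is a minimal sketch, assuming events are plain dictionaries of field names to values; the developer.name field matches the filter mentioned above, and only_mine mirrors what that filter would do server-side.

```python
import getpass

def tag_developer(event, developer=None):
    """Attach the sending developer's name so events in a shared
    dev environment can be filtered per person."""
    tagged = dict(event)
    # Fall back to the local OS username when no name is given.
    tagged["developer.name"] = developer or getpass.getuser()
    return tagged

def only_mine(events, developer):
    """Local analogue of the query filter developer.name = <name>."""
    return [e for e in events if e.get("developer.name") == developer]
```

A developer named alice could then run only their own events through local tooling the same way they would filter in the Honeycomb UI.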

Environments for Regulatory Purposes 

Some Honeycomb users are subject to regulatory regimes, such as GDPR, and choose to create different environments to represent data that is affected by different regulations. Even though the events represent similar underlying activity, they may choose to send less-identifiable data to one environment than another.

Data That Does Not Seem to Fit an Environment 

You might have general data that does not specifically fit into one environment. For example, you might keep infrastructure metrics that support both test and production data. We suggest you choose one well-known environment to put these into. One way to do this is to create a separate infra environment. Alternately, you can put those events into production.

Datasets Group Data Together 

Each Environment consists of a number of different Datasets. You can query within a single dataset, or across the entire Environment.

Datasets are separated into two types: Service Datasets and General Datasets.

Service Datasets 

The events in a Service Dataset represent distributed tracing spans. Each service is distinguished by its service.name field or serviceName configuration.

A single trace can cross a number of different Service Datasets. When you look at a Trace, the query engine will find all spans — across all the Services in the Environment — that share the same trace id field. This will allow you to see the entire trace from any entry point.
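The trace-assembly behavior described above can be illustrated with a short sketch: spans from different services are grouped by their shared trace id. This is a simplified model, assuming spans are dictionaries carrying the trace.trace_id field from the convention described later on this page; the real query engine does this across all Service Datasets in the Environment.

```python
from collections import defaultdict

def assemble_traces(spans):
    """Group spans from any number of services by their shared trace id,
    mimicking how a trace is stitched together across Service Datasets."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace.trace_id"]].append(span)
    return dict(traces)
```

Spans from web and auth with the same trace id end up in one trace, regardless of which Service Dataset they were written to.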

General Datasets 

General datasets consist of any data that does not participate in traces. They may include data from deployments, data from log sources, or metrics. We recommend that you send events into the Environment that best corresponds to those events. For example, a load balancer dataset might go into the prod Environment.

While you can query across multiple datasets in Honeycomb, you may benefit from partitioning portions of your non-trace data into multiple general datasets. We recommend this approach to setting up your non-trace data:

  • Think of the questions that you want to ask of your data.
  • Think about what data you need to collect in order to answer those questions.
  • Think about the queries you need to run across your data in order to answer them, and in particular, how to make sure that you can filter down to only the data that you want.
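The steps above can be made concrete with a sketch. Suppose the question is "how long do production deploys take, per service?" The field names here (deploy.environment, deploy.duration_s) are hypothetical illustrations, not Honeycomb-defined fields; the point is that each field exists because the question needs it, and one field exists purely so the query can filter.

```python
def deploy_event(service, environment, duration_s, succeeded):
    """A deploy event shaped by the question we plan to ask later:
    'how long do production deploys take, per service?'"""
    return {
        "service.name": service,
        "deploy.environment": environment,  # exists so queries can filter to prod
        "deploy.duration_s": duration_s,
        "deploy.succeeded": succeeded,
    }

def prod_deploy_durations(events):
    """Local analogue of a query filtered to deploy.environment = prod."""
    return [e["deploy.duration_s"] for e in events
            if e.get("deploy.environment") == "prod"]
```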

Manage the Number of Services 

Honeycomb sees Service Datasets as a way to group data: any data that is closely related should go in the same Dataset. Information about a particular instance of a service should be sent as fields on the event, rather than encoded in the name of the service itself. For example, rather than sending data to two separate services, authservice-host-01 and authservice-host-02, consider instead sending the data to one service, named authservice; use an additional field, host, to contain 01 and 02. If you combine this data into a single service, you will be able to query across it more easily. This also lets tools like BubbleUp compare and contrast fields on those events.
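The renaming above can be done at instrumentation time. This is a minimal sketch, assuming per-host service names follow a hypothetical <service>-host-<nn> pattern; the same idea applies to any instance identifier baked into a service name.

```python
import re

# Hypothetical pattern for names like "authservice-host-01".
HOST_SUFFIX = re.compile(r"^(?P<service>.+)-host-(?P<host>\d+)$")

def normalize_service(event):
    """Fold a per-instance service name into one service plus a host field,
    so queries and BubbleUp can compare hosts within a single service."""
    normalized = dict(event)
    match = HOST_SUFFIX.match(normalized.get("service.name", ""))
    if match:
        normalized["service.name"] = match.group("service")
        normalized["host"] = match.group("host")
    return normalized
```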

Querying Across Trace and Non-trace data 

In order to easily query across trace and non-trace data - for example, correlating load metrics with trace information on API calls - it is important that you maintain consistency between your trace data and non-trace data. For example, if you are using OpenTelemetry to ingest your data, use the suggested OpenTelemetry schema for your service name, hostname, and other fields that you may decide to use.

Managing your Data 

There are some general best practices that are not tied to Environments and Datasets in particular.

Namespace Custom Fields 

To keep your fields organized, structure them in the incoming events. We recommend using dot-separated namespaces to group related fields together. The automatic instrumentation in OpenTelemetry and Beelines follows this convention. For example:

  • tracing data is identified with trace.trace_id, trace.parent_id, and so on.
  • HTTP requests use fields like request.url and request.user-agent
  • database spans include fields such as db.query, db.query_args, and db.rows_affected

Refer to the OpenTelemetry Semantic Conventions to learn more about conventional names for fields.

Consider putting manual instrumentation under the app. namespace. Use as many layers of hierarchy as makes sense: app.shopping_cart.subtotal and app.shopping_cart.items; app.user.email and app.user.id.

In general, it is a best practice not to dynamically set a field’s name from your instrumentation code, or to generate field names on the fly. This can lead to runaway schemas, which can make the dataset difficult to navigate, and to Honeycomb throttling the creation of new columns.

It is a common error to accidentally send a timestamp as a key, rather than as a value. It is particularly dangerous to send unsanitized user input as a field name.
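The timestamp-as-key mistake is easiest to see side by side. This sketch contrasts a buggy helper that mints a new column on every call with the corrected version that keeps a stable field name and sends the timestamp as a value (the app.* field names here are illustrative):

```python
import time

def add_measurement_buggy(event, value):
    # BUG: the timestamp becomes the field NAME, so every call creates a
    # brand-new column and the schema grows without bound.
    event[str(int(time.time()))] = value

def add_measurement(event, value):
    # Correct: stable field names; the timestamp is a VALUE.
    event["app.measured_at"] = int(time.time())
    event["app.value"] = value
```

The same reasoning applies to user input: a raw string used as a field name creates one column per distinct input.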

Ensure Schema Consistency 

In a dataset that encompasses multiple services, it can be distressingly easy to create inconsistent field names.

{
    "service.name": "web",
    "app.customer_login": "tallen",
    "http.url": "/home",
    ...
}
{
    "service.name": "s3",
    "app.user_id": "tallen",
    "s3.bucket": 5484,
    "s3.size": 518324,
    ...
}

When you look at a full trace that contains both of these spans, the inconsistent user name fields (app.customer_login versus app.user_id) are annoying. Within a single dataset this is not a significant problem, but across the environment the inconsistency can lead to incomplete or misleading query results.

One way to help ensure consistency is to use a shared library of constants, or shared functions that add instrumentation.
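A shared-constants approach can be as small as one module that every service imports. This is a minimal sketch with hypothetical field names; the point is that no service ever types "app.user_id" by hand, so the web and s3 events above would automatically agree.

```python
# fields.py: a tiny shared vocabulary of field names.
SERVICE_NAME = "service.name"
USER_ID = "app.user_id"
HTTP_URL = "http.url"

def base_event(service, user_id):
    """Shared constructor: every service builds events through this helper,
    so the user id lands under the same field name everywhere."""
    return {SERVICE_NAME: service, USER_ID: user_id}
```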

Ensure Appropriate Field Data Types 

Check that your data is a good match for the type Honeycomb thinks it is. For example, a field that looks like an integer might actually be a user ID. It would not make sense to round a user ID, which could happen with a large integer value. You should explicitly send these as string data. Alternately, you can set the field type from the Dataset Settings Schema page.
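The rounding risk is concrete: a 64-bit float cannot represent every large integer exactly, so two distinct numeric IDs can silently collide. This sketch demonstrates the collision and the string fix:

```python
def as_user_id(value):
    """Send identifiers as strings so they are never rounded as numbers."""
    return str(value)

# 2**53 + 1 is the first integer a 64-bit float cannot represent exactly:
big_id = 9007199254740993
assert float(big_id) == float(big_id - 1)            # two IDs collide as floats
assert as_user_id(big_id) != as_user_id(big_id - 1)  # strings stay distinct
```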

Limits for Environments 

Team owners can create any number of Environments in Honeycomb. You can send up to 100 datasets to a single Environment. When you are approaching 90% of your limit, Honeycomb will warn the team via email. When you have hit your limit, Honeycomb will stop accepting new datasets for the environment but will continue to accept events for existing datasets.

View the current number of datasets for your Environment from the Manage Environments page. If your team needs more datasets, please contact our support team (via chat or e-mail support@honeycomb.io) to raise these limits.
