Dataset Best Practices

When you first add Honeycomb to your application, you’ll be asked to specify a dataset. Datasets are high-level buckets for your data.

Use Datasets to Group Data

Datasets are used to partition your data into queryable sets. Each query can only be run within a single dataset; there is no cross-dataset querying. Send events to a single dataset when you expect to analyze them as part of the same query.

For tracing, all the events in a trace must be within the same dataset in order to render correctly in the UI. To support this, the default tracing configurations for Honeycomb’s OpenTelemetry and Beeline instrumentation send their data to a single dataset. To ensure distributed tracing works across a number of services, send events from all of these services to a single dataset.

Separate by Environment

You should separate events into different datasets when you cannot establish a relationship between them. For example, it’s unlikely that you would want to issue a single query across both prod and dev. Changing datasets to query different environments helps reinforce that the data is differentiated.

We discourage resolving this by adding a single environment attribute and putting all the data in the same dataset. Events from different environments are easily confused because they look very similar to each other. Relying on every query to apply an environment filter consistently is tedious and error-prone, and a forgotten filter can produce misleading results.

Use Development Datasets

Many Honeycomb users find it helpful to send data from their local development environment. In this case, it can make sense for each team member to have a dataset with their name, for example: dev-alice. A development dataset can help you check that instrumentation is working correctly and understand how a new service behaves.

Separate by Service

If you do not plan to use distributed tracing, it may make more sense to use one dataset for each individual service. This can be done while also separating environments. For example: frontend-test, frontend-prod, api-test, api-prod.

Consistent naming conventions can help your team find a specific dataset faster.
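One way to keep names consistent is to generate them from a single helper instead of typing them by hand. The following is a minimal sketch, not part of any Honeycomb library: the function names, the KNOWN_ENVIRONMENTS set, and the slug rules are all assumptions chosen for illustration.

```python
import re

# Hypothetical set of environments; adjust to match your own deployment stages.
KNOWN_ENVIRONMENTS = {"test", "prod"}


def _slug(text: str) -> str:
    """Lowercase the text and collapse anything that is not a-z or 0-9 into '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    if not slug:
        raise ValueError("name must contain at least one letter or digit")
    return slug


def dataset_name(service: str, environment: str) -> str:
    """Build a '<service>-<environment>' dataset name, such as 'frontend-prod'."""
    if environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return f"{_slug(service)}-{environment}"


def dev_dataset_name(username: str) -> str:
    """Build a per-developer dataset name, such as 'dev-alice'."""
    return f"dev-{_slug(username)}"


names = [dataset_name(s, e) for s in ("frontend", "api") for e in ("test", "prod")]
```

With the inputs above, names holds the four dataset names from the example: frontend-test, frontend-prod, api-test, and api-prod.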

Example of Separating Datasets

To see how this might work, consider a team that is setting up a system with two sets of endpoints: users can reach them from web pages and from a separate REST API, which is handled by a separate service. The team has also instrumented a back-end SQL database, but not in a way that corresponds to these front ends.

This team decides to separate db from the other datasets, because they never expect to write a single query that looks at both the db data and the endpoint data. They could then separate api and web into distinct datasets. This would allow each dataset to focus on a single topic and a single type of data.

The team decides, however, to create a single requests dataset. They realize that they consider API requests and web requests to be very similar: an API request is just a type of web request that happens to have more fields. This dataset can include the extra API request fields even though the web requests don’t have them. Conversely, web requests will have browser information that the API requests do not. With this unified requests dataset, they can run queries across the entire dataset to look at the amount of traffic users are sending to Honeycomb. They can use filters in the Query Builder, such as WHERE name = api, to separate web requests from API requests. They also benefit from trace visualizations showing web requests that call out to the API.

Manage Schema Complexity

A dataset in Honeycomb represents all of the events that have been sent into it, each of which may have had many different fields. Honeycomb gives you access to all of those fields, as columns in your dataset. If you combine lots of different types of data together in a single dataset, this may result in a sparse table, one in which many columns do not have data values. It can be easy to accidentally issue nonsensical queries that return disappointing results.

For example, imagine a dataset that stores distributed traces that touch both S3 storage and HTTP requests. It might send in S3 events that look something like this:

{
    "service_name": "s3",
    "duration_ms": 13,
    "s3.bucket": 5484,
    "s3.size": 518324,
    ...
}

As well as web requests that look like this:

{
    "service_name": "web",
    "duration_ms": 64,
    "http.url": "/home",
    "http.browser_os": "chrome",
    ...
}

This would lead to a sparse data table.

service_name  duration_ms  s3.bucket  s3.size  http.url  http.browser_os
s3            13           5484       518324   -         -
s3            23           8595       177484   -         -
web           64           -          -        /home     chrome
web           237          -          -        /login    iOS

A query for COUNT WHERE s3.bucket > 200 AND http.browser_os = iOS will return no results: some events carry S3 fields and others carry end-user request fields, but no single event has both.
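The effect can be reproduced with a small in-memory model of the four rows in the table. This is only a sketch of the logic: count_where and the two predicate functions are hypothetical helpers, and the code does not call Honeycomb or imitate its query engine.

```python
# The four sample events from the sparse table, as plain dictionaries.
events = [
    {"service_name": "s3", "duration_ms": 13, "s3.bucket": 5484, "s3.size": 518324},
    {"service_name": "s3", "duration_ms": 23, "s3.bucket": 8595, "s3.size": 177484},
    {"service_name": "web", "duration_ms": 64, "http.url": "/home", "http.browser_os": "chrome"},
    {"service_name": "web", "duration_ms": 237, "http.url": "/login", "http.browser_os": "iOS"},
]


def count_where(rows, *predicates):
    """Count the rows for which every predicate is true (a COUNT with ANDed filters)."""
    return sum(1 for row in rows if all(p(row) for p in predicates))


def bucket_over_200(row):
    # A missing field never matches, mirroring an empty table cell.
    value = row.get("s3.bucket")
    return value is not None and value > 200


def browser_is_ios(row):
    return row.get("http.browser_os") == "iOS"


both = count_where(events, bucket_over_200, browser_is_ios)
s3_only = count_where(events, bucket_over_200)
ios_only = count_where(events, browser_is_ios)
```

Each filter matches some rows on its own (s3_only is 2 and ios_only is 1), but both is 0 because the two fields never appear on the same event.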

Namespace Custom Fields

To help keep fields straight, it can be helpful to organize the fields in incoming events. We recommend grouping related fields into namespaces separated by dots. The automatic instrumentation in OpenTelemetry and the Beelines follows this convention.

Consider putting custom instrumentation under the app. prefix. Use as many layers of hierarchy as make sense: app.shopping_cart.subtotal and app.shopping_cart.items; app.user.email and app.user.id.
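If your application already holds this data in nested structures, a small helper can produce the dotted names for you. The sketch below assumes plain nested dictionaries; namespaced_fields is a hypothetical function, not part of OpenTelemetry or the Beelines, and the sample values are made up.

```python
def namespaced_fields(data: dict, prefix: str = "app") -> dict:
    """Flatten nested dictionaries into dotted field names under the given prefix."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse so each level of nesting becomes one more dotted segment.
            flat.update(namespaced_fields(value, name))
        else:
            flat[name] = value
    return flat


# Illustrative values only.
fields = namespaced_fields({
    "shopping_cart": {"subtotal": 42.5, "items": 3},
    "user": {"email": "alice@example.com", "id": "u-1001"},
})
```

Here fields contains app.shopping_cart.subtotal, app.shopping_cart.items, app.user.email, and app.user.id, ready to be attached to an event or span by whatever instrumentation you use.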

In general, it’s a best practice not to dynamically set a field’s name from your instrumentation code, or to generate field names on the fly. This can lead to runaway schemas, which can make the dataset difficult to navigate, and to Honeycomb throttling the creation of new columns.

It is a common error to accidentally send a timestamp as a key, rather than as a value. It is particularly dangerous to send unsanitized user input as a field name.
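One possible guard is to check field names against a fixed allow-list before an event is sent. The sketch below is an assumption about how you might do this in your own code: ALLOWED_FIELDS, the field names in it, and unknown_field_names are all hypothetical.

```python
# Hypothetical allow-list: the only field names this service is expected to emit.
ALLOWED_FIELDS = frozenset({
    "app.user.id",
    "app.request.timestamp",
    "app.search.query",
})


def unknown_field_names(fields: dict) -> list:
    """Return, in sorted order, the field names that are not on the allow-list."""
    return sorted(name for name in fields if name not in ALLOWED_FIELDS)


# Anti-pattern: the timestamp is the key, so every event creates a new column.
bad = {"2024-01-01T00:00:00Z": "login"}

# Preferred: a fixed field name, with the timestamp as the value.
good = {"app.request.timestamp": "2024-01-01T00:00:00Z"}

# User input belongs in a value under a fixed name, never in the field name.
user_input = "blue suede shoes"
search = {"app.search.query": user_input}
```

Calling unknown_field_names(bad) returns the offending timestamp key, while the other two dictionaries pass with an empty list. A check like this could run in tests or log a warning in production.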

Ensure Schema Consistency

In a dataset that encompasses multiple services, it can be distressingly easy to create inconsistent field names.

{
    "service_name": "web",
    "app.customer_login": "tallen",
    "http.url": "/home",
    ...
}
{
    "service_name": "s3",
    "app.user_id": "tallen",
    "s3.bucket": 5484,
    "s3.size": 518324,
    ...
}

In this example, it becomes important to remember which events are associated with which name. A query for WHERE s3.bucket exists AND app.customer_login = tallen will return no results; a query for WHERE s3.bucket exists AND app.user_id = tallen will be more satisfying.

Look for methods to help enforce consistency across the schema.
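One such method, sketched here under the assumption that every service passes its events through a shared helper before sending them, is to map known aliases onto a single canonical name. FIELD_ALIASES and canonicalize are hypothetical names, and the choice of app.user_id as the canonical field is arbitrary.

```python
# Hypothetical alias table: alternate field name -> canonical field name.
FIELD_ALIASES = {
    "app.customer_login": "app.user_id",
}


def canonicalize(event: dict) -> dict:
    """Return a copy of the event with aliased field names replaced by canonical ones."""
    result = {}
    for name, value in event.items():
        canonical = FIELD_ALIASES.get(name, name)
        if canonical in result and result[canonical] != value:
            # The same event carried two conflicting values for one logical field.
            raise ValueError(f"conflicting values for {canonical!r}")
        result[canonical] = value
    return result


web_event = canonicalize({
    "service_name": "web",
    "app.customer_login": "tallen",
    "http.url": "/home",
})
```

After this step the web event carries app.user_id like the S3 event does, so one filter on app.user_id matches both services.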

Ensure Appropriate Field Data Types

Check that your data is a good match for the type Honeycomb thinks it is. For example, a field that looks like an integer might actually be a user ID. It wouldn’t make sense to round a user ID, which could happen with a large integer value. You should explicitly set a field like this to string, either while instrumenting your code or from the Dataset Settings Schema page.
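The rounding risk is easy to demonstrate with a 64-bit float, which cannot represent every integer above 2**53. The sketch below only illustrates the arithmetic and the fix on the instrumentation side; it does not describe how Honeycomb stores numbers internally, and the field name app.user_id and the helper id_field are assumptions.

```python
def id_field(value: int) -> str:
    """Convert a numeric identifier to a string so it is never treated as a number."""
    return str(value)


# 2**53 + 1 is the smallest positive integer a 64-bit float cannot hold exactly.
user_id = 9007199254740993

# Passing the ID through a float silently changes it to a neighbouring value.
rounded = int(float(user_id))

# Sending the ID as a string preserves every digit.
event = {"app.user_id": id_field(user_id)}
```

Here rounded no longer equals user_id, while the string in event still matches the original digits exactly.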