This page gives recommendations based on the Honeycomb Classic data model. If you are looking for guidelines based on the current Honeycomb data model, see the best practices for data in Honeycomb.
Alternatively, learn more about Honeycomb versus Honeycomb Classic.
Use Datasets to Group Data
Datasets are used to partition your data into queryable sets. Each query can only be run within a single dataset; there is no cross-dataset querying. Send events to a single dataset when you expect to analyze them as part of the same query. For tracing, all the events in a trace must be within the same dataset in order to render correctly in the UI. To support this, the default tracing configurations for Honeycomb's OpenTelemetry and Beeline instrumentation send their data to a single dataset. To ensure distributed tracing works across a number of services, send events from all of these services to a single dataset.
Separate by Environment
You should separate events into different datasets when you cannot establish a relationship between them. For example, it is unlikely that you would want to issue a single query across both prod and dev.
Changing datasets to query different environments helps reinforce that the data is differentiated.
We discourage resolving this by creating a single attribute, environment, and then putting all the data in the same dataset.
Events from different environments can be easily confused because they will appear very similar to each other.
Relying on consistent application of some environment filter is tedious and error-prone, potentially creating misleading query results if forgotten.
Use Development Datasets
Many Honeycomb users find it helpful to send data from their local development environment. In this case, it can make sense for each team member to have a dataset with their name, for example: dev-alice.
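One way to follow the per-developer naming convention is to derive the dataset name from the local username. This is a minimal sketch; the helper name and the use of the local login name are illustrative, not part of any Honeycomb API:

```python
import getpass

# Hypothetical helper: derive a per-developer dataset name from the
# local username, e.g. "dev-alice" for the user "alice".
def dev_dataset_name(username: str) -> str:
    return f"dev-{username}"

# Pass the result to whatever configuration mechanism your
# instrumentation reads for the dataset name.
print(dev_dataset_name(getpass.getuser()))
```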
A development dataset can help you check that instrumentation is working correctly, and can also help you understand how a new service behaves.
Separate by Service
If you do not plan to use distributed tracing, it may make more sense to use one dataset for each individual service. This can be done while also separating environments. For example: frontend-test, frontend-prod, api-test, api-prod.
Consistent naming conventions can help your team find a specific dataset faster.
Example of Separating Datasets
To discuss how this might work, consider a team that is setting up a system with two different sets of endpoints: users can access them from web pages, and from a separate REST API, which is served by a separate service. Separately, the team has also instrumented a back-end database. This SQL database is not instrumented in a way that corresponds to these front ends. This team decides to separate db from the other datasets, because they never expect to write a single query that looks at both the data in db and the endpoints.
They could then separate api and web into distinct datasets.
This would allow each dataset to focus on a single topic, and a single type of data.
The team decides, however, to create a single requests dataset.
They realize that they consider API requests and HTTP requests to be very similar: an API request is just a type of web request that happens to have more fields.
This dataset can include the extra API requests fields even though the web requests do not have them.
Conversely, HTTP requests will have browser information that the API requests do not.
With this unified requests dataset, they can do queries across the entire dataset to look at the amount of traffic users are sending to Honeycomb.
They can use filters in the Query Builder, such as where name=api, to separate out web requests from API requests.
They also get to benefit from trace visualizations showing web requests that call out to the API.
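The filtering behavior described above can be sketched in plain Python. The events and field values here are invented for illustration; the filter mirrors the effect of where name=api in the Query Builder:

```python
# A toy version of the unified "requests" dataset: API requests and
# web requests share common fields, and each kind has some extras.
requests = [
    {"name": "api", "duration_ms": 21},
    {"name": "web", "duration_ms": 103, "http.browser_os": "chrome"},
    {"name": "web", "duration_ms": 64, "http.browser_os": "iOS"},
]

# Equivalent of filtering "where name=api": keep only API requests.
api_only = [r for r in requests if r["name"] == "api"]
print(len(api_only))  # 1
```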
Manage Schema Complexity
A dataset in Honeycomb represents all of the events that have been sent into it, each of which may have had many different fields. Honeycomb gives you access to all of those fields, as columns in your dataset. If you combine lots of different types of data together in a single dataset, this may result in a sparse table, one in which many columns do not have data values. It can be easy to accidentally issue nonsensical queries that return disappointing results. For example, imagine a dataset that stores distributed traces that touch both S3 storage and HTTP requests. It might send in S3 events that look something like this:

| service.name | duration_ms | s3.bucket | s3.size | http.url | http.browser_os | … |
|---|---|---|---|---|---|---|
| s3 | 13 | 5484 | 518324 | - | - | |
| s3 | 23 | 8595 | 177484 | - | - | |
| web | 64 | - | - | /home | chrome | |
| web | 237 | - | - | /login | iOS | |
A query for COUNT WHERE s3.bucket > 200 AND http.browser_os = iOS will return no results: many events carry S3 fields, others carry end-user request fields, but no single event has both.
Namespace Custom Fields
To help keep fields straight, it can be helpful to organize the fields in the incoming events. We recommend using dotted namespaces to bring related fields together. The automatic instrumentation in OpenTelemetry and the Beelines follows this convention. For example:

- tracing data is identified with trace.trace_id, trace.parent_id, and so on
- HTTP requests use fields like request.url and request.user-agent
- database spans include fields such as db.query, db.query_args, and db.rows_affected

For fields specific to your own application, use the namespace app..
Use as many layers of hierarchy as makes sense: app.shopping_cart.subtotal and app.shopping_cart.items; app.user.email and app.user.id.
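An event following these conventions might look like the dictionary below. The field values, and the specific app. fields, are invented for illustration:

```python
# A sketch of one event's fields using dotted namespaces: standard
# trace./request./db. fields plus custom fields under app..
event = {
    "trace.trace_id": "abc123",
    "request.url": "/cart/checkout",
    "db.query": "SELECT subtotal FROM carts WHERE user_id = ?",
    "app.shopping_cart.subtotal": 49.90,
    "app.shopping_cart.items": 3,
    "app.user.id": "tallen",
}

# Grouping by the first dotted segment shows how namespaces cluster
# related fields together.
namespaces = sorted({name.split(".", 1)[0] for name in event})
print(namespaces)  # ['app', 'db', 'request', 'trace']
```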
Ensure Schema Consistency
In a dataset that encompasses multiple services, it can be distressingly easy to create inconsistent field names. WHERE s3.bucket exists AND app.customer_login = tallen will return no results; a query for WHERE s3.bucket exists AND app.user_id = tallen will be more satisfying.
Look for methods to help enforce consistency across the schema.
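One such method is a small pre-send check against an agreed list of field names. This is a minimal sketch; the allowed set and the drifted field name are invented for illustration:

```python
# Hypothetical agreed schema: the field names your team has settled on.
ALLOWED_FIELDS = {"s3.bucket", "s3.size", "app.user_id", "duration_ms"}

def unknown_fields(event: dict) -> set:
    """Return field names that are not in the agreed schema."""
    return set(event) - ALLOWED_FIELDS

# "app.customer_login" is a drifted synonym for "app.user_id", so it
# gets flagged before it fragments the schema.
print(unknown_fields({"app.customer_login": "tallen", "duration_ms": 12}))
```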
Ensure Appropriate Field Data Types
Check that your data is a good match for the type Honeycomb thinks it is. For example, a field that looks like an integer might actually be a user ID. It would not make sense to round a user ID, which could happen with a large integer value. You should explicitly set a field like this to string, either while instrumenting your code or from the Dataset Settings Schema page.
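On the instrumentation side, the fix can be as simple as coercing the value before it is sent. This is a sketch; the helper name is invented, and the example ID is chosen to be larger than 2^53, the point at which double-precision floats start rounding integers:

```python
# Hypothetical helper: always send ID-like values as strings so they
# are never parsed as numbers and silently rounded.
def normalize_user_id(raw) -> str:
    return str(raw)

# 9007199254740993 (2**53 + 1) cannot be represented exactly as a
# 64-bit float; as a string it survives intact.
event = {"app.user.id": normalize_user_id(9007199254740993)}
print(event["app.user.id"])  # 9007199254740993
```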