Like a well-organized closet, a little Dataset planning up front can make a big difference later. The recommendations below are practices we have found make day-to-day use easier.
Datasets partition your data into separate, queryable sets. Storage plans also let you define storage limits per Dataset, so separating data into different Datasets allows you to, say, retain one low-volume Dataset for a long period of time while keeping a high-volume Dataset around for only a short period.
In general, all events in the same Dataset should be considered equivalent either in their frequency and scope, or in the system layer in which they occur. You should separate events into different Datasets when you cannot establish equivalency between them (e.g. data gathered from a dev environment vs prod).
You may, for example, find it useful to capture API and batch-processing events in the same Dataset if they share some request_id field. By contrast, events from two different environments with only one differentiator (like the value of some “environment” column) can appear highly similar and, as a result, be easily confused. Relying on the consistent application of an “environment” filter is risky and can produce misleading results.
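To make the first case concrete, here is a minimal sketch of API and batch-processing events living in the same Dataset and correlated by a shared request_id field. All field names and values are illustrative assumptions, not a prescribed Honeycomb schema.

```python
# Hypothetical events in one shared Dataset, correlated by request_id.
# Field names here are illustrative, not a required schema.
api_event = {
    "request_id": "req-1234",   # shared correlation field
    "source": "api",
    "endpoint": "/v1/orders",
    "duration_ms": 42,
}

batch_event = {
    "request_id": "req-1234",   # same request_id as the API event
    "source": "batch",
    "job": "order-fulfillment",
    "duration_ms": 870,
}

def correlate(events, request_id):
    """Return every event in the Dataset that shares a request_id."""
    return [e for e in events if e.get("request_id") == request_id]

related = correlate([api_event, batch_event], "req-1234")
```

Because both event types carry the same request_id, a single query over the Dataset can follow one request across the API layer and the batch layer.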
Here is another example from one of our customers. They put API and web requests in the same Dataset because, for them, an API request is really one type of web request that has more fields. They add the extra API fields (even though web requests do not have them) because Honeycomb supports sparse data and provides filters that let them look at only web requests, only API requests, and so on. When looking at something like overall traffic, however, they do not want to filter out web requests.
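The sparse-data pattern above can be sketched as follows. The field names (api_key_id and so on) are hypothetical; the point is that web events simply omit the API-only fields, and a filter selects by field presence while an unfiltered query sees all traffic.

```python
# Hypothetical sparse events in one Dataset: web requests lack the
# API-only fields, which only API requests carry.
events = [
    {"type": "web", "path": "/home", "status": 200},
    {"type": "web", "path": "/about", "status": 200},
    {"type": "api", "path": "/v1/users", "status": 201,
     "api_key_id": "key-7", "rate_limit_remaining": 99},  # extra API fields
]

# Filter to API requests by the presence of an API-only field...
api_only = [e for e in events if "api_key_id" in e]

# ...while an "overall traffic" query skips the filter entirely.
total_traffic = len(events)
```

Filtering by field presence means no event ever needs a placeholder value, and the unfiltered view still counts every request.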
For this same company, SQL queries reside in a different Dataset because SQL queries are not in any way equivalent to API data: there can be multiple (or no) SQL queries for a single API request, for instance.
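The fan-out that breaks the equivalence can be sketched like this; the Dataset contents and identifiers are hypothetical. One API request maps to zero or more SQL query events, so the two kinds of events cannot be counted or averaged as if they were the same thing.

```python
# Hypothetical contents of two separate Datasets: API requests in one,
# SQL query events in another, linked only by request_id.
api_requests = [
    {"request_id": "req-1", "endpoint": "/v1/orders"},
    {"request_id": "req-2", "endpoint": "/health"},
]

sql_events = [  # a separate Dataset from the API requests
    {"request_id": "req-1", "statement": "SELECT id FROM orders"},
    {"request_id": "req-1", "statement": "SELECT id FROM customers"},
    # req-2 issued no SQL queries at all
]

def sql_for(request_id):
    """All SQL query events triggered by one API request."""
    return [q for q in sql_events if q["request_id"] == request_id]
```

Here req-1 fans out to two SQL queries while req-2 fans out to none, which is exactly why mixing the two event types in one Dataset would distort any per-event aggregate.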