Dataset best practices

Like a well-organized closet, a little dataset planning up front makes a big difference in day-to-day use. The recommendations below are practices we have found make datasets easier to work with.

Use Datasets to group data and control retention

Datasets partition your data into separate, independently queryable sets. Storage plans also let you define storage limits per Dataset, so separating data into different Datasets lets you, for example, retain a low-volume Dataset for a long time while keeping a high-volume Dataset for only a short time.

In general, all events in the same Dataset should be considered equivalent either in their frequency and scope, or in the system layer in which they occur. You should separate events into different Datasets when you cannot establish equivalency between them (e.g. data gathered from a dev environment vs prod).

You may, for example, find it useful to capture API and batch-processing events in the same Dataset if they share some request_id field. By contrast, events from two different environments with only one differentiator (like the value of some “environment” column) might appear highly similar and, as a result, be more easily confused. Relying on consistent application of some “environment” filter is risky and can create misleading results.
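As a minimal sketch of this advice, in plain Python with no Honeycomb client involved: choose the Dataset name from the environment when an event is sent, so that dev and prod events never share a Dataset and no query depends on someone remembering an “environment” filter. API and batch events that share a request_id still land together. The function names, the `requests-<environment>` naming pattern, and the event fields are all hypothetical.

```python
from collections import defaultdict


def dataset_for(environment: str) -> str:
    """Hypothetical naming scheme: one Dataset per environment."""
    return f"requests-{environment}"


def route(events):
    """Group events by target Dataset name rather than relying on a column."""
    datasets = defaultdict(list)
    for event in events:
        datasets[dataset_for(event["environment"])].append(event)
    return dict(datasets)


# Illustrative events only; field names are assumptions.
events = [
    {"source": "api", "request_id": "r1", "environment": "prod"},
    {"source": "batch", "request_id": "r1", "environment": "prod"},
    {"source": "api", "request_id": "r9", "environment": "dev"},
]

routed = route(events)
# The API and batch events for request r1 share the prod Dataset;
# the dev event is kept apart.
```

With this split, a query against the prod Dataset cannot accidentally include dev traffic, whichever filters are applied.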

Here is another example from one of our customers. They put API and web requests in the same Dataset because, for them, an API request is really a web request with more fields. The customer adds the extra API fields (even though the web requests don’t have them) because Honeycomb supports sparse data and provides filters that let them look at only web requests or only API requests. When looking at something like overall traffic, however, they do not want the web requests filtered out.
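The customer's layout can be pictured with a small plain-Python sketch. It only imitates the idea of sparse events and filtering, not Honeycomb's query engine, and the field names (`api_version`, `api_key_id`) are hypothetical.

```python
# Web and API requests live in one collection; API events simply carry
# extra fields that web events leave out (sparse data).
requests = [
    {"path": "/home", "status": 200},
    {"path": "/v1/users", "status": 200, "api_version": "v1", "api_key_id": "k1"},
    {"path": "/v1/orders", "status": 500, "api_version": "v1", "api_key_id": "k2"},
]


def is_api(event) -> bool:
    """Treat an event as an API request if it has the API-only field."""
    return "api_version" in event


api_requests = [e for e in requests if is_api(e)]
web_requests = [e for e in requests if not is_api(e)]

# Overall traffic needs no filter, because everything is in one place.
overall_traffic = len(requests)
```

Because both request types share a Dataset, the unfiltered total is a single count, while the presence or absence of an API-only field is enough to separate the two views.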

For this same company, SQL queries reside in a different Dataset because SQL queries are not equivalent to API data: a single API request can trigger multiple SQL queries, or none at all.