Dataset best practices

Like a well-organized closet, a little Dataset planning up front can make a big difference in day-to-day use later. The recommendations below are practices we have found make that easier.

Use Datasets to group data and control retention

Datasets partition your data into separate, queryable sets. In addition, storage plans let you define storage limits per Dataset, so separating data into different Datasets allows you to, say, retain one low-volume Dataset for a long period of time while keeping a high-volume Dataset around only briefly.

In general, all events in the same Dataset should be considered equivalent either in their frequency and scope, or in the system layer in which they occur. You should separate events into different Datasets when you cannot establish equivalency between them (e.g. data gathered from a dev environment vs prod).

You may, for example, find it useful to capture API and batch-processing events in the same Dataset if they share some request_id field. By contrast, events from two different environments with only one differentiator (like the value of some “environment” column) might appear highly similar and, as a result, be more easily confused. Relying on consistent application of some “environment” filter is risky and can create misleading results.
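As a rough sketch of that idea in Python (send_event here is a hypothetical stand-in for a Honeycomb SDK call or an Events API request, and the Dataset naming scheme is only an illustration), you can choose the Dataset once, at instrumentation time, instead of relying on every query to filter on an environment field:

```python
import json
import os

ENVIRONMENT = os.environ.get("APP_ENV", "dev")  # e.g. "dev" or "prod"

def send_event(dataset, fields):
    # Hypothetical stand-in for a Honeycomb SDK call or an Events API
    # request; here it only shows which Dataset the event would land in.
    print(f"-> {dataset}: {json.dumps(fields)}")

def dataset_for(base_name):
    # Illustrative naming scheme: one Dataset per environment,
    # e.g. "api-requests-prod" vs. "api-requests-dev".
    return f"{base_name}-{ENVIRONMENT}"

send_event(dataset_for("api-requests"), {
    "request_id": "req-123",
    "http.method": "GET",
    "duration_ms": 42,
})
```

With this approach, dev and prod events cannot be mixed up by a forgotten filter, and each Dataset's retention can be set independently.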

Here is another example from one of our customers. They put API and web requests in the same Dataset because, for them, an API request is really just a type of web request with more fields. They add the extra API fields (even though the web requests don't have them) because Honeycomb supports sparse data, and filters let them look at only web requests or only API requests when they want to. They do not, however, want to filter out web requests when looking at something like overall traffic.
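A sketch of what that can look like as event payloads (the field names are illustrative, not a required schema): both kinds of request go to the same Dataset, and API requests simply carry a few extra fields that web requests omit.

```python
# Both events are sent to the same Dataset; the web request simply omits
# the API-specific fields, which Honeycomb handles as sparse data.
web_request_event = {
    "request_id": "req-124",
    "http.method": "GET",
    "http.url": "/pricing",
    "http.status": 200,
    "duration_ms": 18,
}

api_request_event = {
    "request_id": "req-125",
    "http.method": "POST",
    "http.url": "/api/v2/orders",
    "http.status": 201,
    "duration_ms": 87,
    # Extra API-only fields, absent from web request events.
    "api.client_id": "mobile-app",
    "api.version": "v2",
}
```

A query over all events shows overall traffic, while filtering on a field like api.version narrows the view to API requests only.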

For this same company, SQL queries reside in a different Dataset because SQL queries are not in any way equivalent to API data: There can be multiple (or no) SQL queries for a single API query, for instance.

Categorize columns with good names

Large Datasets can quickly come to feel unwieldy. Once you have a Dataset with more than 40 columns or so, use naming conventions to categorize your columns: http.method, http.status, and http.url, for example, or server.hostname, server.buildnumber, and so on.

This practice makes columns easier to find in the Honeycomb UI. It also helps everyone on your team build a shared understanding of a Dataset sooner. If a column is labeled “status code,” for instance, you may know what that means, but the next person may not. We recommend choosing good column names as early as possible, such as when you are instrumenting your app and deciding which fields to send with your events.
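For example, a request event instrumented with namespaced field names might look like the sketch below (the names are illustrative); the http. and server. prefixes group related columns together and make their meaning clear at a glance.

```python
event_fields = {
    # Request-level fields grouped under the "http." prefix.
    "http.method": "GET",
    "http.url": "/api/v2/orders",
    "http.status": 200,
    # Host-level fields grouped under the "server." prefix.
    "server.hostname": "web-07",
    "server.buildnumber": "2024.06.1",
    # An unprefixed name like "status code" would be ambiguous;
    # "http.status" is not.
}
```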

Confirm your field data types match your purpose

Check that your data is a good match for the type Honeycomb thinks it is. For example, a field that looks like an integer might actually be a user ID. It would not make sense to round a user ID, yet that could happen with a large integer value. Explicitly set a field like this to string, either while instrumenting your code or from the Schema page.
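One way to guard against this at instrumentation time, sketched below with illustrative field names, is to convert identifier-like values to strings before sending them, so they are never treated as numbers.

```python
user_id = 9007199254740993  # looks numeric, but it is an identifier, not a quantity

event_fields = {
    # Send the ID as a string so it is stored and displayed exactly,
    # rather than as a large integer that could be rounded.
    "user.id": str(user_id),
    # A duration really is a number, so leave it numeric.
    "duration_ms": 42,
}
```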