Observability for modern distributed infrastructure is deeply rooted in instrumentation. A well-instrumented system provides all the data needed to determine what’s happening inside it, and we should all aim to ship well-instrumented systems. That said, it is worth the effort to do even a basic job of instrumenting your code. Starting at the edge and just collecting timing information will give you much more insight into where to start looking for the source of a problem than you had before.
Here are some best practices for observability, most cribbed directly from this blog post: Best Practices for Observability.
Generate unique request IDs at the edge of your infrastructure, and propagate them through the entire request lifecycle (including to your databases, in the comments field).
Start at the top of your stack (usually a load balancer) and track all requests plus their status code and duration. Successful requests under your promised response time are your first approximation of uptime. Failures here will usually detract from your overall uptime, and successes will usually, but not always, contribute to your success rate.
Generate one event per service/hop/query/etc. A single API request should generate, for example, a log line or event at the edge (ELB/ALB), the load balancer (nginx), the API service, each microservice it gets passed off to, and each query it generates on each storage layer. There are other sources of information and events that may be relevant when debugging (for example, your database likely generates events that report queue length and internal statistics, and you may have system stats), but one event per hop is currently the easiest and best practice.
Wrap any call out to any other service/data store as a timing event. In Honeycomb, store that value in a header as well as a key/value pair in your service event. Finding where the system has gotten slow will often involve comparing the view from multiple directions. For example, a database may report that a query took 100ms, but the service may argue that it actually took 10 seconds. Both can be right: the database may not start counting time until it begins executing the query, and it may have a long queue of queries waiting ahead of that one.
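The practices above can be sketched together in a few lines of Python. This is a minimal illustration, not Honeycomb's API: the `EVENTS` list stands in for whatever event sink you actually use, and every name here (`emit_event`, `timed_call`, `query_database`, `handle_request`, `first_pass_uptime`) is hypothetical. It shows a request ID generated at the edge and propagated into a SQL comment, one event per hop, a downstream call wrapped as a timing event, and a first approximation of uptime computed from edge events.

```python
import time
import uuid
from contextlib import contextmanager

# Stand-in for a real event sink (log file, event API, etc.); an assumption.
EVENTS = []


def emit_event(service, request_id, **fields):
    """Record one event for one hop of one request."""
    event = {"service": service, "request_id": request_id, **fields}
    EVENTS.append(event)
    return event


@contextmanager
def timed_call(service, request_id, target):
    """Wrap a call to another service or data store as a timing event."""
    start = time.monotonic()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        duration_ms = (time.monotonic() - start) * 1000
        emit_event(service, request_id, target=target,
                   status=status, duration_ms=duration_ms)


def query_database(request_id, sql):
    """Hypothetical data-store call; tags the query with the request ID."""
    # Propagating the ID as a SQL comment makes it visible in database logs.
    tagged_sql = f"/* request_id={request_id} */ {sql}"
    time.sleep(0.01)  # simulate the query taking some time
    return tagged_sql


def handle_request():
    """Simulate one request entering at the edge and hitting a database."""
    request_id = uuid.uuid4().hex  # unique ID generated at the edge
    start = time.monotonic()
    # The caller's view of the query time, including any queueing delay.
    with timed_call("api", request_id, target="database"):
        tagged_sql = query_database(request_id, "SELECT 1")
    emit_event("edge", request_id, status_code=200,
               duration_ms=(time.monotonic() - start) * 1000)
    return request_id, tagged_sql


def first_pass_uptime(edge_events, promised_ms):
    """Fraction of edge requests that succeeded within the promised time."""
    if not edge_events:
        return None
    good = sum(
        1 for e in edge_events
        if e["status_code"] < 500 and e["duration_ms"] <= promised_ms
    )
    return good / len(edge_events)


if __name__ == "__main__":
    handle_request()
    for event in EVENTS:
        print(event)
```

Because the `api` event records the duration as seen by the caller, it can be compared against whatever the data store reports for the same request ID, which is the "view from multiple directions" described above. Treating any status code below 500 as a success is a simplification chosen for this sketch; your own definition of a successful request may differ.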
The high-level instrumentation advice given above can be broken down into a number of different instrumentation patterns, depending on the architecture of the software being instrumented. Check out our recommendations for a number of common patterns listed below:
Our Beelines understand the standard packages you’re using, then instrument them to send useful events to Honeycomb. No custom instrumentation is required to generate basic events, but with a little optional configuration, you can include your own fields too.
Our Examples repository contains a wide range of instrumented sample applications that illustrate how to generate custom events and send them to Honeycomb.