Honeycomb for LLMs

Observability is a critical part of building and maintaining applications that use large language models (LLMs). Learn how Observability can help you understand how your LLMs are behaving in production, and how to get started with Honeycomb.

Why Observability for LLMs 

Converting user input (and some contextual data for that user) into a useful output is something that LLMs are great at. However, it can often be difficult to debug them when they fail at their task. The reasons why they fail are varied:

  • Natural language inputs can be ill-specified, ambiguous, or simply unexpected.
  • You have little to no hope of predicting what users will input.
  • Similarly, you have little to no hope of predicting how the LLM will respond to a given input, let alone how useful that output is for users.
  • When users are presented with a natural language input, they may try things they would not otherwise have thought to try with other systems.
  • Small changes to the prompt can have a large impact on the output, making it easy to introduce regressions.
  • Depending on the model you’re using or its settings, outputs can be nondeterministic, and this may be by design.

All of the above reasons make it difficult to understand how your LLM is behaving in production, and it’s not tractable to predict all of the ways it might fail before you go live. So, how do you debug your LLMs in production? The best way today is to observe them and systematically analyze which user inputs, and which pieces of contextual data combined with those inputs, lead to particular outputs.

LLM Observability Needs Traces 

Trace data is essential to understanding the lifecycle of a system end-to-end as requests flow through it.

For LLMs, it’s critical to use OpenTelemetry traces for two reasons:

  1. Traces let you represent several operations that perform meaningful work before or after a call to an LLM.
  2. OpenTelemetry lets you correlate the behavior tracked in an LLM request with all other behavior in your application.

Most production applications that use LLMs perform several meaningful operations before and after a call to an LLM, particularly to retrieve relevant context for an input, a technique known as Retrieval-Augmented Generation (RAG). For example, it’s common to use an embedding model to calculate vector embeddings for a user’s input, and then use those embeddings to query an index or database of other embeddings for semantically relevant context. Retrieval can sometimes be quite complex, and it’s important to understand how it behaves in production, because the results of retrieval directly impact how an LLM behaves.
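As a rough illustration of the retrieval flow described above, the following sketch wraps each step (embedding, retrieval, and the LLM call) in its own span under a single request span. To stay dependency-free, it uses a small stand-in `span` helper rather than OpenTelemetry itself; in a real application you would typically use a tracer's `start_as_current_span` and `set_attribute` instead. The `embed`, `retrieve`, and `call_llm` functions, the span names, and the `app.*` attribute names are all hypothetical placeholders.

```python
import time
from contextlib import contextmanager

FINISHED_SPANS = []   # stand-in for a trace exporter
_open_spans = []      # names of spans currently in progress, innermost last


@contextmanager
def span(name, attributes=None):
    """Hypothetical stand-in for a tracer: times a block and records its parent."""
    record = {
        "name": name,
        "parent": _open_spans[-1] if _open_spans else None,
        "attributes": dict(attributes or {}),
    }
    _open_spans.append(name)
    start = time.perf_counter()
    try:
        yield record["attributes"]  # callers add attributes as they learn them
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _open_spans.pop()
        FINISHED_SPANS.append(record)


def embed(text):
    """Toy embedding (vowel counts); a placeholder for a real embedding model."""
    return [text.lower().count(vowel) for vowel in "aeiou"]


def retrieve(query_vec, index, top_k=1):
    """Return the top_k documents ranked by dot product with the query vector."""
    def score(doc):
        return sum(q * d for q, d in zip(query_vec, doc["vec"]))
    return sorted(index, key=score, reverse=True)[:top_k]


def call_llm(prompt):
    """Placeholder for a real LLM client call."""
    return f"(model output for a {len(prompt)}-character prompt)"


def answer(user_input, index):
    """Run retrieval and the LLM call, each as a child span of one request span."""
    with span("answer_question", {"app.user_input": user_input}):
        with span("embed_input") as attrs:
            query_vec = embed(user_input)
            attrs["app.embedding_dims"] = len(query_vec)
        with span("retrieve_context") as attrs:
            docs = retrieve(query_vec, index)
            attrs["app.retrieved_ids"] = [doc["id"] for doc in docs]
        with span("call_llm") as attrs:
            prompt = "\n".join(doc["text"] for doc in docs) + "\n" + user_input
            attrs["app.prompt_length"] = len(prompt)
            output = call_llm(prompt)
        return output


# Example usage with a two-document index.
index = [
    {"id": "doc-1", "text": "Bananas are yellow.", "vec": [4, 2, 0, 1, 0]},
    {"id": "doc-2", "text": "Roofs stop snow.", "vec": [0, 0, 0, 5, 0]},
]
answer("banana", index)
for finished in FINISHED_SPANS:
    print(finished["name"], "<-", finished["parent"], finished["attributes"])
```

Because every step records which documents were retrieved and how long it took, a surprising LLM output can be traced back to the context that produced it.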

Similarly, it’s also common to perform several operations on an LLM’s output rather than returning that output to the user directly. For example, you can validate that the output has a particular structure or contents, and you can programmatically correct some invalid outputs based on other known information. Understanding how validation fails, and how many outputs had to be programmatically corrected (and which corrections were involved), can inform how you improve your prompts or fine-tuning, if applicable.
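One way to capture validation failures and corrections is to have the post-processing step return, alongside the result, a set of attributes that can be attached to the current span. The sketch below assumes the LLM is asked to return a JSON object; the required keys, the list of known columns, the case-fixing correction, and the `app.llm.*` attribute names are all hypothetical and chosen only for illustration.

```python
import json

REQUIRED_KEYS = {"calculation", "column"}        # hypothetical output schema
KNOWN_COLUMNS = {"duration_ms", "status_code"}   # hypothetical known information


def validate_and_correct(raw_output):
    """Parse LLM output, fix what can be fixed, and describe what happened.

    Returns (parsed_or_None, attributes), where attributes is a flat dict
    suitable for recording on a span.
    """
    attrs = {"app.llm.valid_json": False, "app.llm.corrections": []}

    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as err:
        attrs["app.llm.validation_error"] = f"invalid JSON: {err.msg}"
        return None, attrs
    if not isinstance(parsed, dict):
        attrs["app.llm.validation_error"] = "output is not a JSON object"
        return None, attrs
    attrs["app.llm.valid_json"] = True

    missing = REQUIRED_KEYS - set(parsed)
    if missing:
        attrs["app.llm.validation_error"] = "missing keys: " + ", ".join(sorted(missing))
        return None, attrs

    # Correction: accept a column whose name differs from a known one only by case.
    column = parsed["column"]
    if column not in KNOWN_COLUMNS:
        by_lowercase = {known.lower(): known for known in KNOWN_COLUMNS}
        fixed = by_lowercase.get(str(column).lower())
        if fixed is None:
            attrs["app.llm.validation_error"] = f"unknown column: {column}"
            return None, attrs
        parsed["column"] = fixed
        attrs["app.llm.corrections"].append("column_case")

    return parsed, attrs


result, span_attrs = validate_and_correct('{"calculation": "AVG", "column": "Duration_MS"}')
print(result, span_attrs)
```

Recording the specific correction applied, rather than only a pass/fail flag, makes it possible to count how often each kind of fix is needed and to adjust the prompt accordingly.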

Tracing is what ties all of these things together, and it’s also what ties the behavior of an LLM to the rest of your application. By tying the LLM request behavior to the rest of an application, you can get a complete picture of how your application is behaving for your users. For example, you may notice a spike in latency in a part of your system that involves a request to the LLM. Is this because the LLM is taking longer to respond, or is it because some other network calls you’re making are taking longer? By looking at trace data that correlates the rest of your request flow with the LLM request, you can understand exactly what’s causing a poor user experience.
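To make the latency question concrete, here is a minimal sketch of the kind of analysis trace data enables: given the spans of one request, it computes what share of the request's duration each direct child operation accounts for. The span shape (a dict with `name`, `parent`, and `duration_ms`), the span names, and the numbers are illustrative assumptions, not output from a real system or a Honeycomb API.

```python
def latency_breakdown(spans, root_name):
    """Map each direct child of the root span to its share of the root's duration."""
    root = next(s for s in spans if s["name"] == root_name)
    return {
        s["name"]: s["duration_ms"] / root["duration_ms"]
        for s in spans
        if s["parent"] == root_name
    }


# Illustrative numbers only, not measurements from a real system.
trace = [
    {"name": "handle_request", "parent": None, "duration_ms": 2000.0},
    {"name": "fetch_user_context", "parent": "handle_request", "duration_ms": 1500.0},
    {"name": "call_llm", "parent": "handle_request", "duration_ms": 400.0},
]

shares = latency_breakdown(trace, "handle_request")
slowest = max(shares, key=shares.get)
print(slowest, shares[slowest])
```

In this made-up trace, the LLM call is not the main contributor to the slow request; another network call is, which is the kind of distinction that is hard to draw without correlated trace data.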

Getting Started 

To explore Observability for LLMs, try our Text to JSON Quick Start guide, which requires an application that uses an LLM.