Observability is a critical part of building and maintaining applications that use large language models (LLMs). Learn how Observability can help you understand how your LLMs are behaving in production, and how to get started with Honeycomb.
Converting user input (and some contextual data for that user) into a useful output is something that LLMs are great at. However, it can often be difficult to debug them when they fail at their task, and the reasons they fail are varied.
This variety of failure modes makes it difficult to understand how your LLM is behaving in production, and it’s not tractable to try to predict every way it might fail before you go live. So, how do you debug your LLMs in production? The best approach today is to observe them and systematically analyze which user inputs, combined with which pieces of contextual data, lead to particular outputs.
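One way to start is to attach the user input, the contextual data, and the model’s output to a span as attributes, so you can group and filter by them in Honeycomb. The sketch below assumes an OpenTelemetry tracer and exporter are already configured for your service; the attribute names and the `call_llm` helper are illustrative placeholders, not a required schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")


def call_llm(prompt: str) -> str:
    """Stand-in for your real LLM client call (OpenAI, Anthropic, and so on)."""
    return '{"answer": "example"}'


def answer_question(user_input: str, context: str) -> str:
    # One span per LLM call; attributes make the input, the contextual data,
    # and the output queryable fields on the trace.
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("app.user_input", user_input)
        span.set_attribute("app.context", context)

        output = call_llm(f"{context}\n\n{user_input}")

        span.set_attribute("app.llm.output", output)
        return output
```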
Trace data is essential to understanding the lifecycle of a system end-to-end as requests flow through it.
For LLMs, it’s critical to use OpenTelemetry traces for two reasons:
Most production applications that use LLMs perform several meaningful operations before and after a call to an LLM, particularly to retrieve relevant context for an input, a pattern known as Retrieval-Augmented Generation (RAG). For example, it’s common to use an embedding model to calculate vector embeddings for a user’s input, and then use those embeddings to query an index or database of other embeddings for semantically relevant context. Retrieval can sometimes be quite complex, and it’s important to understand how it behaves in production, because the results of retrieval directly impact how an LLM behaves.
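Here is a minimal sketch of what instrumenting that retrieval step might look like, again assuming a configured OpenTelemetry tracer. The `embed` and `vector_search` functions stand in for whatever embedding model and vector index you actually use, and the attribute names are only examples.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")


def embed(text: str) -> list[float]:
    """Stand-in for a real embedding model call."""
    return [0.0] * 384


def vector_search(embedding: list[float], top_k: int) -> list[str]:
    """Stand-in for a query against a real vector index or database."""
    return ["relevant document 1", "relevant document 2"]


def retrieve_context(user_input: str) -> list[str]:
    # Each retrieval step gets its own span, so slow embeddings or empty
    # search results show up next to the LLM call that consumes them.
    with tracer.start_as_current_span("retrieval.embed") as span:
        embedding = embed(user_input)
        span.set_attribute("app.embedding.dimensions", len(embedding))

    with tracer.start_as_current_span("retrieval.search") as span:
        documents = vector_search(embedding, top_k=5)
        span.set_attribute("app.retrieval.top_k", 5)
        span.set_attribute("app.retrieval.result_count", len(documents))

    return documents
```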
Similarly, it’s also common to perform several operations on an LLM’s output rather than returning that output to the user directly. For example, you may not only validate that the output has a particular structure or content, but also programmatically correct some outputs based on other known information. Understanding how validation can fail, or how many outputs had to be programmatically corrected (and which corrections were applied), can inform how to improve your prompts or fine-tuning, if applicable.
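One way to capture this is to record whether validation passed, and which correction (if any) was applied, as span attributes, so those questions become simple queries. The sketch below assumes the LLM is expected to return JSON; the correction shown (trimming extra prose around the JSON object) is just one illustrative example.

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")


def validate_output(raw_output: str) -> dict:
    # Record validation results and any programmatic correction as attributes,
    # so you can count how often outputs needed fixing and which fix was used.
    with tracer.start_as_current_span("llm.validate_output") as span:
        try:
            parsed = json.loads(raw_output)
            span.set_attribute("app.validation.passed", True)
        except json.JSONDecodeError:
            span.set_attribute("app.validation.passed", False)
            # Example correction: keep only the outermost JSON object,
            # dropping any extra prose the model wrapped around it.
            corrected = raw_output[raw_output.find("{") : raw_output.rfind("}") + 1]
            parsed = json.loads(corrected)
            span.set_attribute("app.validation.correction", "trimmed_to_json_object")
        return parsed
```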
Tracing is what ties all of these operations together, and it also ties the behavior of an LLM to the rest of your application. By correlating LLM request behavior with the rest of an application, you get a complete picture of how your application is behaving for your users. For example, you may notice a spike in latency for requests that involve a call to the LLM. Is the LLM taking longer to respond, or are other network calls you’re making the culprit? Trace data that correlates the rest of the request flow with the LLM request shows you exactly what’s degrading the user experience.
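As a rough sketch of that end-to-end picture: a parent span covers the whole request, and child spans separate the LLM call from the other work around it, so a latency spike can be attributed to the right dependency. The `fetch_user_profile` and `call_llm` functions here are placeholders for your own code.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")


def fetch_user_profile(user_id: str) -> dict:
    """Stand-in for another network call your request handler makes."""
    return {"plan": "pro"}


def call_llm(prompt: str) -> str:
    """Stand-in for the actual LLM request."""
    return "example output"


def handle_request(user_id: str, user_input: str) -> str:
    # The parent span covers the whole request; its child spans show whether
    # time went to the LLM or to the other calls around it.
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("fetch_user_profile"):
            profile = fetch_user_profile(user_id)

        with tracer.start_as_current_span("llm.completion") as span:
            output = call_llm(f"User plan: {profile['plan']}\n\n{user_input}")
            span.set_attribute("app.llm.output", output)

        return output
```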
To explore Observability for LLMs, try our Text to JSON Quick Start guide, which requires an application that uses an LLM.