Investigate Kubernetes Data in Honeycomb

Once your Kubernetes data is in Honeycomb, you can use it to analyze the performance of your Kubernetes applications in production. For example, you can trace application issues to infrastructure causes or pinpoint the users affected by an identified infrastructure issue.

This guide will walk you through the steps required to answer questions like:

  • How do resource limits compare to container resource use?
  • How does application performance vary with container resource limits?
  • Are application errors happening on specific nodes or across the fleet?

Before You Begin 

Before beginning this guide, you should have:

  • Created a running Kubernetes cluster.
  • Deployed an application to Kubernetes.
  • Completed the Kubernetes Quick Start.

Explore Your Data 

In Honeycomb, you can slice and dice your data by Kubernetes attributes from your Home view, or create Boards to save relevant queries and visualizations. Either way, you can then create Triggers and detect anomalies using BubbleUp and Correlations.

Slice and Dice Your Data 

Once you have Kubernetes data in Honeycomb, navigate to the Home view and select your Kubernetes dataset to begin exploring your data.

At a minimum, you will see event data, which you can group by various Kubernetes attributes. If you have instrumented your code, you will also see trace data.

Home view for Kubernetes dataset

Create a Board 

For quick reference over time, you should create a Board that you can customize to show Kubernetes-specific items of interest.

When creating your Board, we recommend that you use one of our customized Board Templates for Kubernetes data, which will get you started with queries and visualizations of particular interest to Kubernetes users. You can locate Board Templates by selecting Explore Templates from the Home view.

Our Kubernetes Board Templates include:

Kubernetes Pod Metrics: Queries and visualizations that help you investigate pod performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Pod Metrics Board Template to determine if a pod uses too many resources. Queries include:

Query Name | Query Description | Fields Required
Pod CPU Usage | The amount of CPU used by each pod in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers, and 1 hyper-thread on bare-metal Intel processors. | k8s.pod.cpu.utilization; k8s.pod.name
Pod Memory Usage | The amount of memory being used by each Kubernetes pod. | k8s.pod.memory.usage; k8s.pod.name
Pod Uptime Smokestacks | Because pod uptime only ever increases, this query uses the smokestack method, which applies LOG10 to the Pod Uptime metric. Newly started or restarted pods stand out more than pods that have been running a long time, which eventually settle into a straight line. | LOG10($k8s.pod.uptime); k8s.pod.name; k8s.pod.uptime
Unhealthy Pods | This query shows trouble that pods may be experiencing during their operating lifecycle. Many of these events are present during start-up and get resolved, so the presence of a count isn’t necessarily bad. | k8s.namespace.name; k8s.pod.name; reason
Pod CPU Utilization vs. Limit | When a CPU Limit is present in a pod configuration, this query shows how much CPU each pod uses as a percentage of that limit. | k8s.pod.cpu_limit_utilization; k8s.pod.name
Pod CPU Utilization vs. Request | When a CPU Request is present in a pod configuration, this query shows how much CPU each pod uses as a percentage of that request value. | k8s.pod.cpu_request_utilization; k8s.pod.name
Pod Memory Utilization vs. Limit | When a Memory Limit is present in a pod configuration, this query shows how much memory each pod uses as a percentage of that limit value. | k8s.pod.memory_limit_utilization; k8s.pod.name
Pod Memory Utilization vs. Request | When a Memory Request is present in a pod configuration, this query shows how much memory each pod uses as a percentage of that request value. | k8s.pod.memory_request_utilization; k8s.pod.name
Pod Network IO Rates | Displays Network IO RATE_MAX for Transmit and Receive network traffic (in bytes) as a stacked graph, and gives the overall network rate and the individual rate for each pod. | k8s.pod.name; k8s.pod.network.io.receive; k8s.pod.network.io.transmit
Pods With Low Filesystem Availability | Shows any pods where filesystem availability is below 5 GB. | k8s.pod.filesystem.available; k8s.pod.name
Pod Filesystem Usage | Shows the amount of filesystem usage per Kubernetes pod, displayed in a stacked graph to show total filesystem usage of all pods. | k8s.pod.filesystem.usage; k8s.pod.name
Pods Per Namespace | Shows the number of pods currently running in each Kubernetes namespace. | k8s.namespace.name; k8s.pod.name
Pods Per Node | Shows the number of pods currently running on each Kubernetes node. | k8s.node.name; k8s.pod.name
Pod Network Errors | Shows network errors in receive and transmit, grouped by pod. | k8s.pod.name; k8s.pod.network.errors.receive; k8s.pod.network.errors.transmit
Pods Per Deployment | The number of pods currently deployed in different Kubernetes deployments. | k8s.deployment.name; k8s.pod.name
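
To see why the smokestack method makes fresh restarts stand out, the following minimal sketch applies a base-10 logarithm to a few uptime values. It assumes uptime is reported in seconds, which the table does not state, and the sample values are made up for illustration.

```python
import math


def smokestack(uptime_seconds: float) -> float:
    """Return log10 of an uptime value, mirroring LOG10($k8s.pod.uptime).

    The input must be positive; log10 is undefined for zero.
    """
    return math.log10(uptime_seconds)


# Hypothetical uptimes: 1 s, 1 min, 1 h, 1 day, 2 days.
samples = [1, 60, 3_600, 86_400, 172_800]
for seconds in samples:
    print(f"{seconds:>7} s -> {smokestack(seconds):.2f}")
```

On this scale, the first minute of a pod's life covers more vertical distance than its entire second day, which is why long-running pods flatten out while restarts show up as sharp new stacks.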

Kubernetes Node Metrics: Queries and visualizations that help you investigate node performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Node Metrics Board Template to monitor if your nodes are functioning as expected. Queries include:

Query Name | Query Description | Fields Required
Node CPU Usage | The amount of CPU used on each node in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers, and 1 hyper-thread on bare-metal Intel processors. | k8s.node.cpu.utilization; k8s.node.name
Node Memory Utilization | Shows percent of memory used on each Kubernetes node. | IF(EXISTS($k8s.node.memory.available), MUL(DIV($k8s.node.memory.working_set, $k8s.node.memory.available), 100)); k8s.node.memory.available; k8s.node.memory.usage; k8s.node.name
Node Network IO Rates | Displays Network IO RATE_MAX for Transmit and Receive network traffic as a stacked graph, and gives the overall network rate and the individual rate for each node. | k8s.node.name; k8s.node.network.io.receive; k8s.node.network.io.transmit
Unhealthy Nodes | This query shows errors that Kubernetes nodes are experiencing. | k8s.namespace.name; k8s.node.name; reason; severity_text
Node Filesystem Utilization | Shows percent of filesystem used on each node. | IF(EXISTS($k8s.node.filesystem.usage),MUL(DIV($k8s.node.filesystem.usage,$k8s.node.filesystem.capacity), 100)); k8s.node.filesystem.capacity; k8s.node.filesystem.usage; k8s.node.name
Node Uptime Smokestack | Because node uptime only ever increases, this query uses the smokestack method, which applies LOG10 to the Node Uptime metric. Newly started or restarted nodes stand out more than nodes that have been running a long time, which eventually settle into a straight line. | LOG10($k8s.node.uptime); k8s.node.name; k8s.node.uptime
Node Network Errors | Shows network transmit and receive errors for each node. | k8s.node.name; k8s.node.network.errors.receive; k8s.node.network.errors.transmit
Pods and Containers per Node | Shows the number of pods and the number of containers per node as stacked graphs, and also shows the total number of pods and containers across the environment. | k8s.container.name; k8s.node.name; k8s.pod.name
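
The Node Filesystem Utilization formula above divides usage by capacity and multiplies by 100, but only when the usage field exists. The following sketch shows the same arithmetic in Python. Treating a missing field as None and guarding against a zero capacity are assumptions of this sketch, not documented Honeycomb behavior.

```python
from typing import Optional


def utilization_percent(used: Optional[float], capacity: Optional[float]) -> Optional[float]:
    """Approximate IF(EXISTS($used), MUL(DIV($used, $capacity), 100)).

    Returns None when either input is missing or capacity is zero
    (an assumption made here so the sketch never divides by zero).
    """
    if used is None or capacity is None or capacity == 0:
        return None
    return used / capacity * 100


# Hypothetical node: 50 GiB used of a 200 GiB filesystem.
print(utilization_percent(50.0, 200.0))
# A node that reported no usage value yields no result.
print(utilization_percent(None, 200.0))
```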

Kubernetes Workload Health: Queries and visualizations that help you investigate Kubernetes-related application problems. For example, you could use the Kubernetes Workload Health Board Template to monitor health at a glance and connect application problems to infra issues.

This Board Template is more about statuses than numbers, such as unhealthy reasons, unscheduled DaemonSets, or numbers of restarts. For the queries on this Board Template, you generally want to see empty results and zeroes as healthy indicators. Queries include:

Query Name | Query Description | Fields Required
Container Restarts | Shows the total number of restarts per pod, and the rate of restarts of pods where the restart count is greater than zero. | k8s.container.name; k8s.container.restarts; k8s.namespace.name; k8s.pod.name
Unhealthy Pods | This query shows trouble that pods may be experiencing during their operating lifecycle. Many of these events are present during start-up and get resolved, so the presence of a count isn’t necessarily bad. | k8s.namespace.name; k8s.pod.name; reason
Pending Pods | Find pods in a “Pending” state. | k8s.pod.name; k8s.pod.phase
Failed Pods | Find pods in a “Failed” or “Unknown” state. | k8s.pod.name; k8s.pod.phase
Unhealthy Nodes | This query shows errors that Kubernetes nodes are experiencing. | k8s.namespace.name; k8s.pod.name; reason; severity_text
Unhealthy Volumes | This query shows volume creation and attachment failures. | k8s.namespace.name; k8s.pod.name; reason; severity_text
Unscheduled Daemonset Pods | Track cases where a pod in a daemonset is not currently running on every node in the cluster as it should be. | SUB($k8s.daemonset.desired_scheduled_nodes, $k8s.daemonset.current_scheduled_nodes); k8s.daemonset.current_scheduled_nodes; k8s.daemonset.desired_scheduled_nodes; k8s.daemonset.name; k8s.namespace.name
Stateful Set Pod Readiness | Track any stateful sets where pods that should be in a ready state are in a non-ready state. | SUB($k8s.statefulset.desired_pods,$k8s.statefulset.ready_pods); k8s.statefulset.desired_pods; k8s.statefulset.name; k8s.statefulset.ready_pods
Deployment Pod Status | Look for Deployments where Pods have not fully deployed. Numbers greater than zero show pods in a deployment that are not yet “ready”. | SUB($k8s.deployment.desired,$k8s.deployment.available); k8s.deployment.available; k8s.deployment.desired; k8s.deployment.name
Job Failures | Track the number of failed pods in Kubernetes jobs. | k8s.job.failed_pods; k8s.job.name
Active Cron Jobs | Track the number of active pods in each Kubernetes cron job. | k8s.cronjob.active_jobs; k8s.cronjob.name
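
Several queries on this Board Template subtract a current count from a desired count, so that zero means healthy. The sketch below illustrates that idea for stateful sets; the workload names and counts are hypothetical.

```python
from typing import Dict, Tuple


def missing_pods(desired: int, ready: int) -> int:
    """Mirror SUB($k8s.statefulset.desired_pods, $k8s.statefulset.ready_pods)."""
    return desired - ready


# Hypothetical stateful sets mapped to (desired_pods, ready_pods).
statefulsets: Dict[str, Tuple[int, int]] = {
    "db": (3, 3),
    "cache": (2, 1),
}

# Keep only the stateful sets with at least one pod that is not ready.
not_ready = {
    name: missing_pods(desired, ready)
    for name, (desired, ready) in statefulsets.items()
    if missing_pods(desired, ready) > 0
}
print(not_ready)
```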

Create Triggers 

Once you have created a Board, you will likely want to configure some Triggers, so you can receive notifications when your data in Honeycomb crosses defined thresholds. Some examples of triggers that may be of interest to Kubernetes users include:

  • CPU Use: Pods or nodes that are reaching set CPU limits
  • Memory Use: Pods experiencing OOMKilled or nodes that are reaching a certain memory usage limit
  • Unhealthy pods: Pods that are experiencing a problematic status in Kubernetes Events, such as a reason of BackOff, Failed, Err, or Unhealthy

To create a Trigger:

  1. On your Board, locate a Query from which you want to create a Trigger.
  2. Select the query to open it in the Query Builder display.
  3. Select the three-dot overflow menu, located to the left of Run Query, and select Make Trigger.
  4. Configure the trigger by defining trigger details, an alert threshold, and your notification preferences.
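
Conceptually, a threshold Trigger reruns a query on a schedule and notifies you when the result crosses the threshold you set. The sketch below imitates only that comparison, locally, for a CPU-limit example. It does not call Honeycomb, and the pod names, values, and 90 percent threshold are made up.

```python
from typing import Dict, List


def pods_over_threshold(utilization_by_pod: Dict[str, float], threshold: float) -> List[str]:
    """Return the pods whose value is strictly above the threshold, sorted by name."""
    return sorted(pod for pod, value in utilization_by_pod.items() if value > threshold)


# Hypothetical query result: CPU use as a percentage of each pod's limit.
cpu_limit_utilization = {
    "checkout-7d9f": 96.5,
    "storage-5c2a": 41.0,
    "audit-1b3e": 90.0,
}

# Only pods above 90 percent of their CPU limit would trip this alert.
print(pods_over_threshold(cpu_limit_utilization, threshold=90.0))
```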

Investigate Anomalies 

Follow our guided Kubernetes sandbox scenario to see how you can simplify debugging by using the core analysis loop–Honeycomb BubbleUp, Correlations, and rich queryable data–to link application behavior to underlying infrastructure.

Examples 

Let’s look at some examples to learn how to use Honeycomb to investigate some common issues.

Trace an Application Issue to Infrastructure 

The OpenTelemetry Kubernetes Attributes Processor adds Kubernetes context to your telemetry, allowing for correlation with an application’s traces, metrics, and logs. With this data now on our spans, let’s investigate some slow traces and identify the cause of them.

Find Slow Traces 

To find slow traces:

  1. In Query Builder, enter VISUALIZE HEATMAP(duration_ms).
  2. Select Run Query. This creates a heatmap below the Query Builder. The slowest traces appear towards the top of the heatmap.
  3. In the heatmap, select a slow request towards the top of the chart. In this example, it appears as a teal square with a high duration.
  4. In the menu that appears, select View trace.

Heatmap that displays mostly fast traces with a few slow traces.

The trace’s detailed Trace Waterfall view appears next. In this example, the span is very slow when communicating between the audit and storage services. In addition, a span, displayed in red, contains an error. Select the errored span and use the Trace Sidebar to view the errored span’s fields and attributes. This information can provide some clues, but using BubbleUp would be faster.

Identify the Cause 

To investigate further, return to the previous query and use BubbleUp:

  1. In the top left of the Trace Waterfall view, select the back arrow next to “Query in all datasets”. The previous Query Results page with the heatmap appears.
  2. In the heatmap, draw a box around the slow trace data to define the selection. A menu appears.
  3. Select Detect Anomalies (BubbleUp). The BubbleUp charts appear below the heatmap. BubbleUp creates charts to show differences between the selected slow requests and all other requests returned for the time window.
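
The idea behind comparing a selection against everything else can be illustrated with a small sketch. This is not Honeycomb's BubbleUp algorithm. It is a simplified stand-in that, for one field, compares how often each value appears in the selected events versus the baseline events. The pod names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def value_share(events: List[dict], field: str) -> Dict[str, float]:
    """Return each value's share of the events that carry the field."""
    counts = Counter(event[field] for event in events if field in event)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()} if total else {}


def biggest_difference(selected: List[dict], baseline: List[dict], field: str) -> Tuple[str, float]:
    """Return the value most over-represented in the selection.

    Assumes at least one event in either list carries the field.
    """
    selected_share = value_share(selected, field)
    baseline_share = value_share(baseline, field)
    differences = {
        value: selected_share.get(value, 0.0) - baseline_share.get(value, 0.0)
        for value in set(selected_share) | set(baseline_share)
    }
    return max(differences.items(), key=lambda item: item[1])


# Hypothetical events: slow requests mostly come from one pod.
slow = [{"k8s.pod.name": "storage-1"}] * 3 + [{"k8s.pod.name": "storage-2"}]
rest = [{"k8s.pod.name": "storage-1"}] + [{"k8s.pod.name": "storage-2"}] * 3
print(biggest_difference(slow, rest, "k8s.pod.name"))
```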

In our example BubbleUp results, one pod (k8s.pod.name) looks to be a significant outlier, and audit job (audit_id) 130720 is failing.

BubbleUp Charts with K8s.pod.name chart outlier selected.

BubbleUp Charts with audit_id chart outlier selected.

Find Correlations 

Now, let’s see if there are any correlations between our previously identified application issue and our infrastructure. Within the query results, select the Correlations tab below the heatmap. The dropdown menu allows us to use a pre-existing Board with saved queries to correlate data with our Query Results.

In our example, our Correlations board is Kubernetes Pod Metrics, which is available as a Board Template for your own use. The Correlations results show two indicators: spikes in pod memory and spikes in CPU consumption.

Heatmap and Correlations charts showing correlations with slow traces and low CPU and pod memory.

Hover over the spikes in the Kubernetes Pod Start Events Correlation chart. A line appears on all charts in the display to indicate the same point in time. Hovering reveals that the Storage pod spikes at the same time as the slow requests on the heatmap. Each spike has a Started event, as seen in the Kubernetes Pod Start Events chart, the last chart on the right. This means that Kubernetes is restarting the Storage service’s container.

We can conclude that the Storage service does not have enough resources to process and store audit job 130720, which leads to the application issues we originally noticed.

Pinpoint Users Affected by an Infrastructure Issue 

You can use the Kubernetes Workload Health Board to monitor and investigate infrastructure issues, and in conjunction with Query Builder, identify affected users. (Kubernetes Workload Health Board is available as a Board Template, which you can use and apply to your data.)

In our example, the Unhealthy Nodes Query in our Kubernetes Workload Health Board is showing unhealthy nodes appearing off and on between the 23rd and 27th of October.

Unhealthy nodes query as seen in the Kubernetes Workload Health Board with unhealthy node behavior appearing

To investigate, go to Query Builder to see what the application performance looks like for users during this time range. In Query Builder:

  1. Run a query with:

    VISUALIZE | WHERE | GROUP BY
    HEATMAP(duration_ms) | user.id exists | user.id
  2. Use the time picker to adjust the time window to the Last 7 days.

  3. Select Run Query.
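
The same query can also be written down as data. The sketch below builds a dictionary in the general shape of a Honeycomb query specification, with calculations, filters, breakdowns, and a time range in seconds. Treat the key names and operator spellings as assumptions to check against the Query API reference, and note that the function name is hypothetical.

```python
import json

SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60


def user_latency_query() -> dict:
    """Describe HEATMAP(duration_ms) WHERE user.id exists GROUP BY user.id."""
    return {
        "calculations": [{"op": "HEATMAP", "column": "duration_ms"}],
        "filters": [{"column": "user.id", "op": "exists"}],
        "breakdowns": ["user.id"],
        # "Last 7 days" expressed as a relative time range in seconds.
        "time_range": SEVEN_DAYS_SECONDS,
    }


print(json.dumps(user_latency_query(), indent=2))
```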

In our example’s query results, it looks like there are occasional spikes in slow requests, similar to the spikes in unhealthy nodes previously noticed on the Kubernetes Workload Health board.

Heatmap showing a series of slow requests.

To confirm this:

  1. Select the Correlations tab below the heatmap.
  2. Select the dropdown window that displays the selected data source.
  3. Choose the Kubernetes Workload Health board from the available options.

In our example’s Correlations display, the slow requests from our query align with the spikes in unhealthy nodes from the Unhealthy Nodes chart. Clumps of pending pods also appear in the Pending Pods chart at the same time.

Correlations tab display with 6 charts from the Kubernetes Workload Health board.

To learn more about the users affected:

  • Select the Overview tab to the left of the Correlations tab. The query being used includes GROUP BY user.id, which allows you to group results by that field and see them listed in the Overview tab’s summary table. Hovering over each user.id’s row adjusts the above heatmap display of slow traces, so another correlation between a specific affected user and slow traces can be determined.

  • Try adding more user fields, such as user.email, to the GROUP BY clause of the existing query. The goal is to get more readable and potentially actionable information about the affected users.
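
Grouping by both user.id and user.email amounts to counting slow requests per user. The following sketch does that over a list of event dictionaries. The sample events and the 1000 ms cutoff for "slow" are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def slow_requests_by_user(
    events: List[dict], slow_ms: float
) -> List[Tuple[Tuple[str, Optional[str]], int]]:
    """Count slow events per (user.id, user.email), most affected users first."""
    counts: Counter = Counter()
    for event in events:
        # Mirror WHERE user.id exists, then keep only slow requests.
        if "user.id" in event and event.get("duration_ms", 0) >= slow_ms:
            counts[(event["user.id"], event.get("user.email"))] += 1
    return counts.most_common()


# Hypothetical events.
events = [
    {"user.id": "u1", "user.email": "a@example.com", "duration_ms": 4200},
    {"user.id": "u1", "user.email": "a@example.com", "duration_ms": 3900},
    {"user.id": "u2", "user.email": "b@example.com", "duration_ms": 120},
    {"user.id": "u3", "user.email": "c@example.com", "duration_ms": 2500},
    {"duration_ms": 9000},  # no user.id, so it is excluded
]
print(slow_requests_by_user(events, slow_ms=1000))
```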

Identify Traffic Source 

The Honeycomb Network Agent captures raw network packets from the network interface that is shared by all resources on a Kubernetes node, assembles them into whole payloads, parses them, and then converts them to events.

In conjunction with the Network Agent, you can use Derived Columns to identify the vector of network traffic and differentiate between internal cluster traffic and application traffic.

Create a Derived Column that Analyzes Traffic Vectors 

Network traffic events contain source.* and destination.* fields, which you can use to create a Derived Column that identifies whether traffic came from outside your cluster and whether it crossed namespaces.

Create a Derived Column and enter the display name traffic_vector and the following Function:

IF(NOT(EXISTS($source.k8s.node.name)), "inbound",
    IF(NOT(EXISTS($destination.k8s.node.name)), "outbound",
        IF(EQUALS($destination.k8s.namespace.name, $source.k8s.namespace.name), "cross-pod", "cross-namespace")
    )
)
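
To make the branching easier to follow, here is the same decision logic as a Python sketch. It treats EXISTS as "the key is present in the event dictionary", which is an assumption of this sketch, and the sample events are made up.

```python
def traffic_vector(event: dict) -> str:
    """Classify a network event the way the traffic_vector Derived Column does."""
    # No source node: the traffic originated outside the cluster.
    if "source.k8s.node.name" not in event:
        return "inbound"
    # No destination node: the traffic is leaving the cluster.
    if "destination.k8s.node.name" not in event:
        return "outbound"
    # Both ends are in the cluster, so compare namespaces.
    if event.get("destination.k8s.namespace.name") == event.get("source.k8s.namespace.name"):
        return "cross-pod"
    return "cross-namespace"


# Hypothetical event: both ends in the cluster, different namespaces.
example = {
    "source.k8s.node.name": "node-a",
    "destination.k8s.node.name": "node-b",
    "source.k8s.namespace.name": "shop",
    "destination.k8s.namespace.name": "payments",
}
print(traffic_vector(example))
```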

Create a Derived Column that Differentiates Between Systems 

Inside a Kubernetes cluster, many HTTP requests perform various tasks, such as completing health checks and interacting with the cloud. You can create a Derived Column that identifies this system traffic by examining the user agent, so that you can filter it out.

Create a Derived Column and enter the display name is_system_traffic and the following Function:

IF(STARTS_WITH($user_agent.original, "kube-probe"), true,
    IF(STARTS_WITH($user_agent.original, "aws-sdk-go"), true,
        IF(STARTS_WITH($user_agent.original, "amazon-vpc-cni-k8s"), true, false)
    )
)
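
The nested IF calls amount to checking the user agent against a short list of prefixes. As a sketch of the same check in Python, with a missing user agent treated as non-system traffic (an assumption of this sketch):

```python
# Prefixes taken from the Derived Column above.
SYSTEM_USER_AGENT_PREFIXES = ("kube-probe", "aws-sdk-go", "amazon-vpc-cni-k8s")


def is_system_traffic(event: dict) -> bool:
    """Return True when the event's user agent starts with a known system prefix."""
    user_agent = event.get("user_agent.original") or ""
    # str.startswith accepts a tuple and matches any of its entries.
    return user_agent.startswith(SYSTEM_USER_AGENT_PREFIXES)


# Hypothetical user agent strings.
print(is_system_traffic({"user_agent.original": "kube-probe/1.27"}))
print(is_system_traffic({"user_agent.original": "Mozilla/5.0"}))
```

Keeping the prefixes in one tuple makes it easy to extend the list for other system clients in your own cluster.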

Create a Query Using Your Derived Columns 

Use your new Derived Columns (traffic_vector and is_system_traffic) to create a Query that filters out system traffic and shows application traffic along with its traffic vector.

In Query Builder, run a query with:

VISUALIZE | WHERE | GROUP BY
COUNT | is_system_traffic = false | source.k8s.pod.name; traffic_vector; destination.k8s.namespace.name; destination.k8s.pod.name
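
What this query computes can be sketched locally: drop system traffic, then count events for each combination of the GROUP BY fields. The sketch assumes each event dictionary already carries the two derived values, and the sample events are hypothetical.

```python
from collections import Counter
from typing import List

GROUP_BY = (
    "source.k8s.pod.name",
    "traffic_vector",
    "destination.k8s.namespace.name",
    "destination.k8s.pod.name",
)


def count_application_traffic(events: List[dict]) -> Counter:
    """COUNT events WHERE is_system_traffic = false, grouped by GROUP_BY."""
    counts: Counter = Counter()
    for event in events:
        if event.get("is_system_traffic") is False:
            # Missing fields become None in the group key.
            counts[tuple(event.get(field) for field in GROUP_BY)] += 1
    return counts


# Hypothetical events with the derived values already attached.
events = [
    {"is_system_traffic": False, "source.k8s.pod.name": "web-1", "traffic_vector": "cross-namespace",
     "destination.k8s.namespace.name": "payments", "destination.k8s.pod.name": "pay-1"},
    {"is_system_traffic": False, "source.k8s.pod.name": "web-1", "traffic_vector": "cross-namespace",
     "destination.k8s.namespace.name": "payments", "destination.k8s.pod.name": "pay-1"},
    {"is_system_traffic": True, "source.k8s.pod.name": "web-1", "traffic_vector": "cross-pod",
     "destination.k8s.namespace.name": "shop", "destination.k8s.pod.name": "web-2"},
]
for group, count in count_application_traffic(events).items():
    print(count, group)
```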

In our example’s query results, specific details appear about the network traffic.

Query created from derived columns that filters out Kubernetes system traffic and shows application traffic with vector