
Investigate Kubernetes Data in Honeycomb

Once your Kubernetes data is in Honeycomb, you can use it to analyze the performance of your Kubernetes applications in production. For example, you can trace application issues to infrastructure causes or pinpoint the users affected by an identified infrastructure issue.

This guide will walk you through the steps required to answer questions like:

  • How do resource limits compare to container resource use?
  • How does application performance vary with container resource limits?
  • Are application errors happening on specific nodes or across the fleet?

Before You Begin 

Before beginning this guide, you should have:

  • Created a running Kubernetes cluster.
  • Deployed an application to Kubernetes.
  • Completed the Kubernetes Quick Start.

Explore Your Data 

In Honeycomb, you can slice and dice your data by Kubernetes attributes from your Home view, or create Boards to save relevant queries and visualizations. Either way, you will want to leverage Honeycomb’s features to create Triggers and explore anomalies using BubbleUp and Correlations.

Slice and Dice Your Data 

Once you have Kubernetes data in Honeycomb, navigate to the Home view and select your Kubernetes dataset to begin exploring your data.

At a minimum, you will see event data, which you can group by various Kubernetes attributes. If you have instrumented your code, you will also see trace data.

Home view for Kubernetes dataset
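
For example, a minimal query to start slicing your event data by Kubernetes attributes, assuming your events carry the standard k8s.* resource attributes, might look like this in Query Builder:

VISUALIZE: COUNT
GROUP BY: k8s.namespace.name, k8s.node.name

From there, you can swap in other attributes, such as k8s.pod.name or k8s.deployment.name, to narrow down where events originate.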

Create a Board 

For quick reference over time, you should create a Board that you can customize to show Kubernetes-specific items of interest.

When creating your Board, we recommend that you use one of our customized Board Templates for Kubernetes data, which will get you started with queries and visualizations of particular interest to Kubernetes users. You can locate Board Templates by selecting Explore Templates from the Home view.

Our Kubernetes Board Templates include:

Kubernetes Pod Metrics: Queries and visualizations that help you investigate pod performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Pod Metrics Board Template to determine if a pod uses too many resources. Queries include:

  • Pod CPU Usage: The amount of CPU used by each pod in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyper-thread on bare-metal Intel processors. Fields required: k8s.pod.cpu.utilization, k8s.pod.name
  • Pod Memory Usage: The amount of memory being used by each Kubernetes pod. Fields required: k8s.pod.memory.usage, k8s.pod.name
  • Pod Uptime Smokestacks: Because pod uptime only ever increases, this query uses the smokestack method: it applies LOG10 to the Pod Uptime metric so that newly started or restarted pods stand out, while pods that have been running a long time eventually flatten into a nearly straight line. Fields required: LOG10($k8s.pod.uptime), k8s.pod.name, k8s.pod.uptime
  • Unhealthy Pods: Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events appear during start-up and resolve on their own, so the presence of a count is not necessarily bad. Fields required: k8s.namespace.name, k8s.pod.name, reason
  • Pod CPU Utilization vs. Limit: When a CPU limit is present in a pod configuration, shows how much CPU each pod uses as a percentage of that limit. Fields required: k8s.pod.cpu_limit_utilization, k8s.pod.name
  • Pod CPU Utilization vs. Request: When a CPU request is present in a pod configuration, shows how much CPU each pod uses as a percentage of that request value. Fields required: k8s.pod.cpu_request_utilization, k8s.pod.name
  • Pod Memory Utilization vs. Limit: When a memory limit is present in a pod configuration, shows how much memory each pod uses as a percentage of that limit value. Fields required: k8s.pod.memory_limit_utilization, k8s.pod.name
  • Pod Memory Utilization vs. Request: When a memory request is present in a pod configuration, shows how much memory each pod uses as a percentage of that request value. Fields required: k8s.pod.memory_request_utilization, k8s.pod.name
  • Pod Network IO Rates: Displays Network IO RATE_MAX for transmit and receive network traffic as a stacked graph, giving the overall network rate and the individual rate for each pod. Fields required: k8s.pod.name, k8s.pod.network.io.receive, k8s.pod.network.io.transmit
  • Pods With Low Filesystem Availability: Shows any pods where filesystem availability is below 5 GB. Fields required: k8s.pod.filesystem.available, k8s.pod.name
  • Pod Filesystem Usage: Shows the amount of filesystem usage per Kubernetes pod, displayed in a stacked graph to show total filesystem usage across all pods. Fields required: k8s.pod.filesystem.usage, k8s.pod.name
  • Pods Per Namespace: Shows the number of pods currently running in each Kubernetes namespace. Fields required: k8s.namespace.name, k8s.pod.name
  • Pods Per Node: Shows the number of pods currently running on each Kubernetes node. Fields required: k8s.node.name, k8s.pod.name
  • Pod Network Errors: Shows network errors in receive and transmit, grouped by pod. Fields required: k8s.pod.name, k8s.pod.network.errors.receive, k8s.pod.network.errors.transmit
  • Pods Per Deployment: The number of pods currently deployed in each Kubernetes deployment. Fields required: k8s.deployment.name, k8s.pod.name
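
As a rough sketch of how one of these templated queries maps onto Query Builder clauses, Pod Memory Usage is essentially the following (the exact aggregate the template uses may differ; AVG is an assumption here):

VISUALIZE: AVG(k8s.pod.memory.usage)
GROUP BY: k8s.pod.name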

Kubernetes Node Metrics: Queries and visualizations that help you investigate node performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Node Metrics Board Template to monitor if your nodes are functioning as expected. Queries include:

  • Node CPU Usage: The amount of CPU used on each node in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyper-thread on bare-metal Intel processors. Fields required: k8s.node.cpu.utilization, k8s.node.name
  • Node Memory Utilization: Shows the percent of memory used on each Kubernetes node. Fields required: IF(EXISTS($k8s.node.memory.available), MUL(DIV($k8s.node.memory.working_set, $k8s.node.memory.available), 100)), k8s.node.memory.available, k8s.node.memory.usage, k8s.node.name
  • Node Network IO Rates: Displays Network IO RATE_MAX for transmit and receive network traffic as a stacked graph, giving the overall network rate and the individual rate for each node. Fields required: k8s.node.name, k8s.node.network.io.receive, k8s.node.network.io.transmit
  • Unhealthy Nodes: Shows errors that Kubernetes nodes are experiencing. Fields required: k8s.namespace.name, k8s.node.name, reason, severity_text
  • Node Filesystem Utilization: Shows the percent of filesystem used on each node. Fields required: IF(EXISTS($k8s.node.filesystem.usage), MUL(DIV($k8s.node.filesystem.usage, $k8s.node.filesystem.capacity), 100)), k8s.node.filesystem.capacity, k8s.node.filesystem.usage, k8s.node.name
  • Node Uptime Smokestack: Because node uptime only ever increases, this query uses the smokestack method: it applies LOG10 to the Node Uptime metric so that newly started or restarted nodes stand out, while nodes that have been running a long time eventually flatten into a nearly straight line. Fields required: LOG10($k8s.node.uptime), k8s.node.name, k8s.node.uptime
  • Node Network Errors: Shows network transmit and receive errors for each node. Fields required: k8s.node.name, k8s.node.network.errors.receive, k8s.node.network.errors.transmit
  • Pods and Containers per Node: Shows the number of pods and the number of containers per node as stacked graphs, and also shows the total number of pods and containers across the environment. Fields required: k8s.container.name, k8s.node.name, k8s.pod.name
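
To use a calculated field like the Node Memory Utilization formula above in a query, you would typically save it as a Derived Column and then reference it in the VISUALIZE clause. A sketch, assuming a hypothetical Derived Column named node_memory_utilization_pct that contains the IF/MUL/DIV formula shown above:

VISUALIZE: MAX(node_memory_utilization_pct)
GROUP BY: k8s.node.name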

Kubernetes Workload Health: Queries and visualizations that help you investigate Kubernetes-related application problems. For example, you could use the Kubernetes Workload Health Board Template to monitor health at a glance and connect application problems to infra issues.

This Board Template is more about statuses than numbers, such as unhealthy reasons, unscheduled DaemonSets, or numbers of restarts. For the queries on this Board Template, you generally want to see empty results and zeroes as healthy indicators. Queries include:

  • Container Restarts: Shows the total number of restarts per pod, and the rate of restarts for pods where the restart count is greater than zero. Fields required: k8s.container.name, k8s.container.restarts, k8s.namespace.name, k8s.pod.name
  • Unhealthy Pods: Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events appear during start-up and resolve on their own, so the presence of a count is not necessarily bad. Fields required: k8s.namespace.name, k8s.pod.name, reason
  • Pending Pods: Find pods in a “Pending” state. Fields required: k8s.pod.name, k8s.pod.phase
  • Failed Pods: Find pods in a “Failed” or “Unknown” state. Fields required: k8s.pod.name, k8s.pod.phase
  • Unhealthy Nodes: Shows errors that Kubernetes nodes are experiencing. Fields required: k8s.namespace.name, k8s.pod.name, reason, severity_text
  • Unhealthy Volumes: Shows volume creation and attachment failures. Fields required: k8s.namespace.name, k8s.pod.name, reason, severity_text
  • Unscheduled Daemonset Pods: Track cases where a DaemonSet pod is not currently running on every node in the cluster as it should be. Fields required: SUB($k8s.daemonset.desired_scheduled_nodes, $k8s.daemonset.current_scheduled_nodes), k8s.daemonset.current_scheduled_nodes, k8s.daemonset.desired_scheduled_nodes, k8s.daemonset.name, k8s.namespace.name
  • Stateful Set Pod Readiness: Track any stateful sets where pods that should be ready are in a non-ready state. Fields required: SUB($k8s.statefulset.desired_pods, $k8s.statefulset.ready_pods), k8s.statefulset.desired_pods, k8s.statefulset.name, k8s.statefulset.ready_pods
  • Deployment Pod Status: Look for Deployments where pods have not fully deployed. Numbers greater than zero show pods in a deployment that are not yet “ready”. Fields required: SUB($k8s.deployment.desired, $k8s.deployment.available), k8s.deployment.available, k8s.deployment.desired, k8s.deployment.name
  • Job Failures: Track the number of failed pods in Kubernetes jobs. Fields required: k8s.job.failed_pods, k8s.job.name
  • Active Cron Jobs: Track the number of active pods in each Kubernetes cron job. Fields required: k8s.cronjob.active_jobs, k8s.cronjob.name
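
As an illustration, a sketch of the Pending Pods query might look like the following. How k8s.pod.phase is reported (a string such as Pending or a numeric enum value) depends on your collector configuration, so adjust the WHERE clause to match your data:

VISUALIZE: COUNT
WHERE: k8s.pod.phase = Pending
GROUP BY: k8s.pod.name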

Create Triggers 

Once you have created a Board, you will likely want to configure some Triggers, so you can receive notifications when your data in Honeycomb crosses defined thresholds. Some examples of triggers that may be of interest to Kubernetes users include:

  • CPU Use: Pods or nodes that are reaching set CPU limits
  • Memory Use: Pods experiencing OOMKilled or nodes that are reaching a certain memory usage limit
  • Unhealthy pods: Pods that are experiencing a problematic status in Kubernetes Events, such as a reason of BackOff, Failed, Err, or Unhealthy
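
For example, a trigger for unhealthy pods might be built on a query roughly like the following; the reason values here are illustrative and should match what your Kubernetes events actually report:

VISUALIZE: COUNT
WHERE: reason in BackOff, Failed, Err, Unhealthy
GROUP BY: k8s.namespace.name, k8s.pod.name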

To create a Trigger:

  1. On your Board, locate a Query from which you want to create a Trigger.
  2. Select the query to open it in the Query Builder display.
  3. Select the three-dot overflow menu, located to the left of Run Query, and select Make Trigger.
  4. Configure the trigger by defining trigger details, an alert threshold, and your notification preferences.

Investigate Anomalies 

Follow our guided Kubernetes sandbox scenario to see how you can simplify debugging by using the core analysis loop (Honeycomb BubbleUp, Correlations, and rich queryable data) to link application behavior to underlying infrastructure.

Examples 

Let’s look at some examples to learn how to use Honeycomb to investigate some common issues.

Trace an Application Issue to Infrastructure 

The OpenTelemetry Kubernetes Attributes Processor adds Kubernetes context to your telemetry, allowing for correlation with an application’s traces, metrics, and logs. With this data now on our spans, let’s investigate some slow traces and identify their cause.

Find Slow Traces 

To find slow traces:

  1. In Query Builder, enter VISUALIZE HEATMAP(duration_ms).
  2. Select Run Query. This creates a heatmap below the Query Builder. The slowest traces appear towards the top of the heatmap.
  3. In the heatmap, select a slow request towards the top of the chart. In this example, it appears as a teal square with a high duration.
  4. In the menu that appears, select View trace.

Heatmap that displays mostly fast traces with a few slow traces.

The trace’s detailed Trace Waterfall view appears next. In this example, the span is very slow when communicating between the audit and storage services. In addition, a span, displayed in red, contains an error. Select the errored span and use the Trace Sidebar to view its fields and attributes. This information can provide some clues, but using BubbleUp would be faster.

Identify the Cause 

To investigate further, return to the previous query and use BubbleUp:

  1. In the top left of the Trace Waterfall view, select the back arrow next to “Query in all datasets”. The previous Query Results page with the heatmap appears.
  2. In the heatmap, draw a box around the slow trace data to define the selection. A menu appears.
  3. Select Investigate with BubbleUp. The BubbleUp charts appear below the heatmap. BubbleUp creates charts to show differences between the selected slow requests and all other requests returned for the time window.

In our example BubbleUp results, one pod (k8s.pod.name) looks to be a significant outlier, and audit job (audit_id) 130720 is failing.

BubbleUp Charts with k8s.pod.name chart outlier selected. BubbleUp Charts with audit_id chart outlier selected.

Find Correlations 

Now, let’s see if there are any correlations between our previously identified application issue and our infrastructure. Within the query results, select the Correlations tab below the heatmap. The dropdown window allows us to use a pre-existing Board with saved queries to correlate data with our Query Results.

In our example, our Correlations board is Kubernetes Pod Metrics, which is available as a Board Template for your own use. The Correlations results show two indicators: spikes in pod memory and in CPU consumption.

Heatmap and Correlations charts showing correlations with slow traces and low CPU and pod memory.

Hover over the spikes in the Kubernetes Pod Start Events Correlation chart. A line appears on all charts in the display to indicate the same point in time. Hovering reveals that the Storage pod spikes at the same time as the slow requests on the heatmap. Each spike has a Started event, as seen in the Kubernetes Pod Start Events chart, the last chart on the right. This means that Kubernetes is restarting the Storage service’s container.

We can conclude that the Storage service does not have enough resources to process and store audit job 130720, which leads to the application issues we originally noticed.

Pinpoint Users Affected by an Infrastructure Issue 

You can use the Kubernetes Workload Health Board to monitor and investigate infrastructure issues and, in conjunction with Query Builder, identify affected users. (The Kubernetes Workload Health Board is available as a Board Template, which you can apply to your own data.)

In our example, the Unhealthy Nodes query in our Kubernetes Workload Health Board shows unhealthy nodes appearing off and on between the 23rd and 27th of October.

Unhealthy Nodes query as seen in the Kubernetes Workload Health Board, with unhealthy node behavior appearing.

To investigate, go to Query Builder to see what the application performance looks like for users during this time range. In Query Builder:

  1. Run a query with:
     VISUALIZE: HEATMAP(duration_ms)
     WHERE: user.id exists
     GROUP BY: user.id
  2. Use the time picker to adjust the time window to the Last 7 days.
  3. Select Run Query.

In our example’s query results, it looks like there are occasional spikes in slow requests, similar to the spikes in unhealthy nodes previously noticed on the Kubernetes Workload Health board.

Heatmap showing a series of slow requests.

To confirm this:

  1. Select the Correlations tab below the heatmap.
  2. Select the dropdown window that displays the selected data source.
  3. Choose the Kubernetes Workload Health board from the available options.

In our example’s Correlations display, the slow requests from our query align with the spikes in unhealthy nodes from the Unhealthy Nodes chart. Clumps of pending pods also appear in the Pending Pods chart at the same time.

Correlations tab display with 6 charts from the Kubernetes Workload Health board.

To learn more about the users affected:

  • Select the Overview tab to the left of the Correlations tab. The query being used includes GROUP BY user.id, which allows you to group results by that field and see them listed in the Overview tab’s summary table. Hovering over each user.id’s row adjusts the heatmap display of slow traces above, so you can correlate a specific affected user with the slow traces.

  • Try adding additional fields, such as user information like user.email, to the existing query’s GROUP BY clause, as in the sketch below. The goal is to surface more readable and potentially actionable information about the affected users.
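
A minimal sketch of the extended query, assuming your instrumentation sends a user.email attribute on these events:

VISUALIZE: HEATMAP(duration_ms)
WHERE: user.id exists
GROUP BY: user.id, user.email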

Identify Traffic Source 

The Honeycomb Network Agent captures raw network packets from the network interface that is shared by all resources on a Kubernetes node, assembles them into whole payloads, parses them, and then converts them to events.

In conjunction with the Network Agent, you can use Derived Columns to identify the vector of network traffic and differentiate between internal cluster traffic and application traffic.

Create a Derived Column that Analyzes Traffic Vectors 

Network traffic events contain source.* and destination.* fields, which you can use to create a Derived Column that identifies whether traffic came from outside your cluster and whether it crossed namespaces.

Create a Derived Column and enter the display name traffic_vector and the following Function:

IF(NOT(EXISTS($source.k8s.node.name)), "inbound",
    IF(NOT(EXISTS($destination.k8s.node.name)), "outbound",
        IF(EQUALS($destination.k8s.namespace.name, $source.k8s.namespace.name), "cross-pod", "cross-namespace")
    )
)

Create a Derived Column that Differentiates Between Systems 

Inside a Kubernetes cluster, many HTTP requests perform routine tasks, such as completing health checks and interacting with the cloud provider. You can create a Derived Column that identifies this in-cluster system traffic by examining the user agent, so you can filter it out of your queries.

Create a Derived Column and enter the display name is_system_traffic and the following Function:

IF(STARTS_WITH($user_agent.original, "kube-probe"), true,
    IF(STARTS_WITH($user_agent.original, "aws-sdk-go"), true,
        IF(STARTS_WITH($user_agent.original, "amazon-vpc-cni-k8s"), true, false)
    )
)

Create a Query Using Your Derived Columns 

Use your new Derived Columns (traffic_vector and is_system_traffic) to create a Query that filters out system traffic and shows application traffic along with its traffic vector.

In Query Builder, run a query with:

VISUALIZE: COUNT
WHERE: is_system_traffic = false
GROUP BY: source.k8s.pod.name, traffic_vector, destination.k8s.namespace.name, destination.k8s.pod.name

In our example’s query results, specific details appear about the network traffic.

Query created from derived columns that filters out Kubernetes system traffic and shows application traffic with vector