Once your Kubernetes data is in Honeycomb, you can use it to analyze the performance of your Kubernetes applications in production. For example, you can trace application issues to infrastructure causes or pinpoint the users affected by an identified infrastructure issue.
This guide will walk you through the steps required to answer questions like:
Before beginning this guide, you should have:
In Honeycomb, you can slice and dice your data by Kubernetes attributes from your Home view, or create Boards to save relevant queries and visualizations. Either way, you will want to leverage Honeycomb’s features to create Triggers and explore anomalies using BubbleUp and Correlations.
Once you have Kubernetes data in Honeycomb, navigate to the Home view and select your Kubernetes dataset to begin exploring your data.
At a minimum, you will see event data, which you can group by various Kubernetes attributes. If you have instrumented your code, you will also see trace data.
For quick reference over time, you should create a Board that you can customize to show Kubernetes-specific items of interest.
When creating your Board, we recommend that you use one of our customized Board Templates for Kubernetes data, which will get you started with queries and visualizations of particular interest to Kubernetes users. You can locate Board Templates by selecting Explore Templates from the Home view.
Our Kubernetes Board Templates include:
Kubernetes Pod Metrics: Queries and visualizations that help you investigate pod performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Pod Metrics Board Template to determine if a pod uses too many resources. Queries include:
| Query Name | Query Description | Fields Required |
|---|---|---|
| Pod CPU Usage | The amount of CPU used by each pod in the cluster. CPU is reported as average core usage measured in cpu units. In Kubernetes, one cpu is equivalent to 1 vCPU/core for cloud providers and 1 hyper-thread on bare-metal Intel processors. | `k8s.pod.cpu.utilization`, `k8s.pod.name` |
| Pod Memory Usage | The amount of memory used by each Kubernetes pod. | `k8s.pod.memory.usage`, `k8s.pod.name` |
| Pod Uptime Smokestacks | Because pod uptime only ever increases, this query applies LOG10 to the Pod Uptime metric (the smokestack method). Newly started or restarted pods stand out prominently, while pods that have been running a long time eventually flatten into nearly straight lines. | `LOG10($k8s.pod.uptime)`, `k8s.pod.name`, `k8s.pod.uptime` |
| Unhealthy Pods | Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events appear during start-up and resolve on their own, so the presence of a count is not necessarily bad. | `k8s.namespace.name`, `k8s.pod.name`, `reason` |
| Pod CPU Utilization vs. Limit | When a CPU limit is present in a pod configuration (see the example manifest after this table), shows how much CPU each pod uses as a percentage of that limit. | `k8s.pod.cpu_limit_utilization`, `k8s.pod.name` |
| Pod CPU Utilization vs. Request | When a CPU request is present in a pod configuration, shows how much CPU each pod uses as a percentage of that request value. | `k8s.pod.cpu_request_utilization`, `k8s.pod.name` |
| Pod Memory Utilization vs. Limit | When a memory limit is present in a pod configuration, shows how much memory each pod uses as a percentage of that limit value. | `k8s.pod.memory_limit_utilization`, `k8s.pod.name` |
| Pod Memory Utilization vs. Request | When a memory request is present in a pod configuration, shows how much memory each pod uses as a percentage of that request value. | `k8s.pod.memory_request_utilization`, `k8s.pod.name` |
| Pod Network IO Rates | Displays Network IO RATE_MAX for transmit and receive network traffic as a stacked graph, giving the overall network rate and the individual rate for each pod. | `k8s.pod.name`, `k8s.pod.network.io.receive`, `k8s.pod.network.io.transmit` |
| Pods With Low Filesystem Availability | Shows any pods where filesystem availability is below 5 GB. | `k8s.pod.filesystem.available`, `k8s.pod.name` |
| Pod Filesystem Usage | Shows the amount of filesystem usage per Kubernetes pod, displayed as a stacked graph to show total filesystem usage across all pods. | `k8s.pod.filesystem.usage`, `k8s.pod.name` |
| Pods Per Namespace | Shows the number of pods currently running in each Kubernetes namespace. | `k8s.namespace.name`, `k8s.pod.name` |
| Pods Per Node | Shows the number of pods currently running on each Kubernetes node. | `k8s.node.name`, `k8s.pod.name` |
| Pod Network Errors | Shows network errors in receive and transmit, grouped by pod. | `k8s.pod.name`, `k8s.pod.network.errors.receive`, `k8s.pod.network.errors.transmit` |
| Pods Per Deployment | The number of pods currently deployed in each Kubernetes Deployment. | `k8s.deployment.name`, `k8s.pod.name` |
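The Utilization vs. Limit and Utilization vs. Request queries compare each pod's usage against the resource values declared in its pod spec. As a reminder of where those values come from, here is a minimal, hypothetical pod manifest (the names, image, and values are illustrative, not part of any template):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical pod name
spec:
  containers:
    - name: app
      image: example/app:1.0   # hypothetical image
      resources:
        requests:
          cpu: "250m"          # 0.25 cpu units; the basis for the *_request_utilization fields
          memory: 256Mi
        limits:
          cpu: "500m"          # 0.5 cpu units; the basis for the *_limit_utilization fields
          memory: 512Mi
```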
Kubernetes Node Metrics: Queries and visualizations that help you investigate node performance and resource usage within Kubernetes clusters. For example, you could use the Kubernetes Node Metrics Board Template to monitor if your nodes are functioning as expected. Queries include:
| Query Name | Query Description | Fields Required |
|---|---|---|
| Node CPU Usage | The amount of CPU used on each node in the cluster. CPU is reported as average core usage measured in cpu units. In Kubernetes, one cpu is equivalent to 1 vCPU/core for cloud providers and 1 hyper-thread on bare-metal Intel processors. | `k8s.node.cpu.utilization`, `k8s.node.name` |
| Node Memory Utilization | Shows the percentage of memory used on each Kubernetes node. | `IF(EXISTS($k8s.node.memory.available), MUL(DIV($k8s.node.memory.working_set, $k8s.node.memory.available), 100))`, `k8s.node.memory.available`, `k8s.node.memory.usage`, `k8s.node.name` |
| Node Network IO Rates | Displays Network IO RATE_MAX for transmit and receive network traffic as a stacked graph, giving the overall network rate and the individual rate for each node. | `k8s.node.name`, `k8s.node.network.io.receive`, `k8s.node.network.io.transmit` |
| Unhealthy Nodes | Shows errors that Kubernetes nodes are experiencing. | `k8s.namespace.name`, `k8s.node.name`, `reason`, `severity_text` |
| Node Filesystem Utilization | Shows the percentage of filesystem used on each node. | `IF(EXISTS($k8s.node.filesystem.usage), MUL(DIV($k8s.node.filesystem.usage, $k8s.node.filesystem.capacity), 100))`, `k8s.node.filesystem.capacity`, `k8s.node.filesystem.usage`, `k8s.node.name` |
| Node Uptime Smokestack | Because node uptime only ever increases, this query applies LOG10 to the Node Uptime metric (the smokestack method). Newly started or restarted nodes stand out prominently, while nodes that have been running a long time eventually flatten into nearly straight lines (see the worked example after this table). | `LOG10($k8s.node.uptime)`, `k8s.node.name`, `k8s.node.uptime` |
| Node Network Errors | Shows network transmit and receive errors for each node. | `k8s.node.name`, `k8s.node.network.errors.receive`, `k8s.node.network.errors.transmit` |
| Pods and Containers per Node | Shows the number of pods and the number of containers per node as stacked graphs, and also shows the total number of pods and containers across the environment. | `k8s.container.name`, `k8s.node.name`, `k8s.pod.name` |
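To see why the smokestack method works, consider the numbers involved (illustrative values, assuming uptime is reported in seconds): a node or pod that restarted ten minutes ago has an uptime of about 600 seconds, so LOG10 gives roughly 2.8, while one that has been up for 30 days has an uptime of about 2,592,000 seconds, or roughly 6.4 after LOG10. Recently restarted workloads therefore sit low on the chart and climb steeply, while long-running ones flatten near the top, which makes restarts easy to spot at a glance.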
Kubernetes Workload Health: Queries and visualizations that help you investigate Kubernetes-related application problems. For example, you could use the Kubernetes Workload Health Board Template to monitor health at a glance and connect application problems to infra issues.
This Board Template is more about statuses than numbers: unhealthy reasons, unscheduled DaemonSets, and restart counts. For the queries on this Board Template, empty results and zeroes are generally the healthy indicators you want to see. Queries include:
| Query Name | Query Description | Fields Required |
|---|---|---|
| Container Restarts | Shows the total number of restarts per pod, and the rate of restarts for pods where the restart count is greater than zero. | `k8s.container.name`, `k8s.container.restarts`, `k8s.namespace.name`, `k8s.pod.name` |
| Unhealthy Pods | Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events appear during start-up and resolve on their own, so the presence of a count is not necessarily bad. | `k8s.namespace.name`, `k8s.pod.name`, `reason` |
| Pending Pods | Finds pods in a “Pending” state. | `k8s.pod.name`, `k8s.pod.phase` |
| Failed Pods | Finds pods in a “Failed” or “Unknown” state. | `k8s.pod.name`, `k8s.pod.phase` |
| Unhealthy Nodes | Shows errors that Kubernetes nodes are experiencing. | `k8s.namespace.name`, `k8s.pod.name`, `reason`, `severity_text` |
| Unhealthy Volumes | Shows volume creation and attachment failures. | `k8s.namespace.name`, `k8s.pod.name`, `reason`, `severity_text` |
| Unscheduled Daemonset Pods | Tracks cases where a pod in a DaemonSet is not currently running on every node in the cluster as it should be. | `SUB($k8s.daemonset.desired_scheduled_nodes, $k8s.daemonset.current_scheduled_nodes)`, `k8s.daemonset.current_scheduled_nodes`, `k8s.daemonset.desired_scheduled_nodes`, `k8s.daemonset.name`, `k8s.namespace.name` |
| Stateful Set Pod Readiness | Tracks any StatefulSets where pods that should be in a ready state are not ready. | `SUB($k8s.statefulset.desired_pods, $k8s.statefulset.ready_pods)`, `k8s.statefulset.desired_pods`, `k8s.statefulset.name`, `k8s.statefulset.ready_pods` |
| Deployment Pod Status | Looks for Deployments where pods have not fully deployed. Numbers greater than zero indicate pods in a Deployment that are not yet “ready”. | `SUB($k8s.deployment.desired, $k8s.deployment.available)`, `k8s.deployment.available`, `k8s.deployment.desired`, `k8s.deployment.name` |
| Job Failures | Tracks the number of failed pods in Kubernetes Jobs. | `k8s.job.failed_pods`, `k8s.job.name` |
| Active Cron Jobs | Tracks the number of active jobs for each Kubernetes CronJob. | `k8s.cronjob.active_jobs`, `k8s.cronjob.name` |
Once you have created a Board, you will likely want to configure some Triggers so that you receive notifications when your data in Honeycomb crosses defined thresholds. Some examples of Triggers that may be of interest to Kubernetes users include alerts on events whose `reason` contains `BackOff`, `Failed`, `Err`, or `Unhealthy`.
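For example, you might save the Unhealthy Pods query to your Board and create a Trigger that fires when the count of events whose `reason` contains `BackOff` exceeds a threshold over a recent time window. This particular query and threshold are only a suggestion; tune them to what is normal for your cluster.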
To create a Trigger:
Follow our guided Kubernetes sandbox scenario to see how you can simplify debugging by using the core analysis loop (Honeycomb BubbleUp, Correlations, and rich queryable data) to link application behavior to underlying infrastructure.
Let’s look at some examples to learn how to use Honeycomb to investigate some common issues.
The OpenTelemetry Kubernetes Attributes Processor adds Kubernetes context to your telemetry, allowing for correlation with an application’s traces, metrics, and logs. With this data now on our spans, let’s investigate some slow traces and identify their cause.
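For reference, if you run the OpenTelemetry Collector, the `k8sattributes` processor is enabled in the Collector configuration. The following is a minimal sketch only; the receivers, exporters, and extracted metadata shown here are assumptions, and your own pipeline will differ:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  # Attach Kubernetes context to each span, metric, and log record
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.node.name
        - k8s.deployment.name
  batch: {}

exporters:
  otlp:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Enrich with k8sattributes before batching and exporting
      processors: [k8sattributes, batch]
      exporters: [otlp]
```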
To find slow traces, run a query that visualizes `HEATMAP(duration_ms)` and select one of the slower traces in the results. The trace’s detailed Trace Waterfall view appears next. In this example, a span is very slow when communicating between the audit and storage services. In addition, a span displayed in red contains an error. Select the errored span and use the Trace Sidebar to view its fields and attributes. This information can provide some clues, but using BubbleUp is faster.
To investigate further, return to the previous query and use BubbleUp:
In our example BubbleUp results, one pod (`k8s.pod.name`) looks to be a significant outlier, and audit job (`audit_id`) `130720` is failing.
Now, let’s see if there are any correlations between our previously identified application issue and our infrastructure. Within the query results, select the Correlations tab below the heatmap. The dropdown window lets us use a pre-existing Board with saved queries to correlate data with our query results.
In our example, our Correlations board is Kubernetes Pod Metrics, which is available as a Board Template for your own use. The Correlations results show two indicators: spikes in both pod memory and CPU consumption.
Hover over the spikes in the Kubernetes Pod Start Events Correlation chart. A line appears on all charts in the display to indicate the same point in time. Hovering reveals that the Storage pod spikes at the same time as the slow requests on the heatmap. Each spike has a Started event, as seen in the Kubernetes Pod Start Events chart, the last chart on the right. This means that Kubernetes is restarting the Storage service’s container.
We can conclude that the Storage service does not have enough resources to process and store audit job `130720`, which leads to the application issues we originally noticed.
You can use the Kubernetes Workload Health Board to monitor and investigate infrastructure issues, and in conjunction with Query Builder, identify affected users. (Kubernetes Workload Health Board is available as a Board Template, which you can use and apply to your data.)
In our example, the Unhealthy Nodes query on our Kubernetes Workload Health Board shows unhealthy nodes appearing intermittently between October 23 and 27.
To investigate, go to Query Builder to see what the application performance looks like for users during this time range. In Query Builder:
Run a query with:
| VISUALIZE | WHERE | GROUP BY |
|---|---|---|
| `HEATMAP(duration_ms)` | `user.id` exists | `user.id` |
Use the time picker to adjust the time window to the Last 7 days.
Select Run Query.
In our example’s query results, it looks like there are occasional spikes in slow requests, similar to the spikes in unhealthy nodes previously noticed on the Kubernetes Workload Health board.
To confirm this:
In our example’s Correlations display, the slow requests from our query align with the spikes in unhealthy nodes from the Unhealthy Nodes chart. Clumps of pending pods also appear in the Pending Pods chart at the same time.
To learn more about the users affected:
Select the Overview tab to the left of the Correlations tab.
The query includes `GROUP BY user.id`, which groups results by that field and lists them in the Overview tab’s summary table. Hovering over each `user.id`’s row adjusts the heatmap of slow traces above it, so you can correlate a specific affected user with slow traces.
Try adding more fields to the existing query’s GROUP BY clause, such as user information like `user.email`, to surface more readable and potentially actionable information about the affected users.
The Honeycomb Network Agent captures raw network packets from the network interface that is shared by all resources on a Kubernetes node, assembles them into whole payloads, parses them, and then converts them to events.
In conjunction with the Network Agent, you can use Derived Columns to identify the vector of network traffic and differentiate between internal cluster traffic and application traffic.
Network traffic events contain `source.*` and `destination.*` fields, which you can use to create a Derived Column that identifies whether traffic came from outside your cluster and whether it crossed namespaces.
Create a Derived Column, enter the display name `traffic_vector`, and use the following Function:
IF(NOT(EXISTS($source.k8s.node.name)), "inbound",
IF(NOT(EXISTS($destination.k8s.node.name)), "outbound",
IF(EQUALS($destination.k8s.namespace.name, $source.k8s.namespace.name), "cross-pod", "cross-namespace")
)
)
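Once the Derived Column exists, you can use `traffic_vector` like any other column: for example, GROUP BY `traffic_vector` to see the mix of inbound, outbound, cross-pod, and cross-namespace traffic, or add a WHERE clause such as `traffic_vector = "inbound"` to focus on traffic entering the cluster.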
Inside a Kubernetes cluster, many HTTP requests perform housekeeping tasks, such as completing health checks and interacting with the cloud provider. You can create a Derived Column that identifies this in-cluster system traffic by examining the user agent, so that you can filter it out of your queries.
Create a Derived Column, enter the display name `is_system_traffic`, and use the following Function:
IF(STARTS_WITH($user_agent.original, "kube-probe"), true,
  IF(STARTS_WITH($user_agent.original, "aws-sdk-go"), true,
    IF(STARTS_WITH($user_agent.original, "amazon-vpc-cni-k8s"), true, false)
  )
)
Use your new Derived Columns (`traffic_vector` and `is_system_traffic`) to create a query that filters out system traffic and shows application traffic with its vector.
In Query Builder, run a query with:
| VISUALIZE | WHERE | GROUP BY |
|---|---|---|
| COUNT | `is_system_traffic = false` | `source.k8s.pod.name`, `traffic_vector`, `destination.k8s.namespace.name`, `destination.k8s.pod.name` |
In our example’s query results, specific details appear about the network traffic.