Get instant insights into your system with Board Templates.
Board Templates are pre-configured Boards that come with ready-made queries and visualizations, providing valuable insights with minimal setup. Use a template as a starting point to create a Board.
Templates are designed for specific use cases and built around industry best practices, ensuring effective configurations for tracking key metrics and visualizing data accurately.
Choose from a variety of templates to quickly gain insights across different areas of your system:
The Service Health Board Template offers an overview of your services’ health. It provides insights into request volumes, identifies where the slowest requests are occurring, and more.
The Service Health Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Trace Counts by Service | Shows total trace volume by service. |
|
Trace Counts by HTTP Status Code | Shows total trace volume by status code. |
|
Trace Duration Heatmap | Shows a heatmap of the duration for all traces. |
|
Duration Heatmap | Shows a heatmap of duration across all services. |
|
Duration by Service | Shows key duration percentiles by service. |
|
Duration by Route | Shows duration by route or endpoint. |
|
Duration by Name | Shows duration by function name. |
|
Errors by Service | Shows a count of errors grouped by service. |
|
Errors by Route | Shows a count of errors grouped by route or endpoint. |
|
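To make the Duration by Service query more concrete, here is a minimal sketch of how duration percentiles grouped by service could be computed outside the product. The field names `service.name` and `duration_ms`, the sample spans, and the nearest-rank percentile method are assumptions for illustration; the template's own queries may compute percentiles differently.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(sorted_values, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a pre-sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def duration_percentiles_by_service(spans, percentiles=(50, 95, 99)):
    """Group span durations by service and compute the requested percentiles."""
    by_service = defaultdict(list)
    for span in spans:
        # "service.name" and "duration_ms" are assumed field names.
        by_service[span["service.name"]].append(span["duration_ms"])

    result = {}
    for service, durations in by_service.items():
        durations.sort()
        result[service] = {
            f"p{p}": nearest_rank_percentile(durations, p) for p in percentiles
        }
    return result


# Hypothetical sample data, not taken from any real dataset.
spans = [
    {"service.name": "checkout", "duration_ms": 120.0},
    {"service.name": "checkout", "duration_ms": 80.0},
    {"service.name": "checkout", "duration_ms": 950.0},
    {"service.name": "cart", "duration_ms": 15.0},
    {"service.name": "cart", "duration_ms": 22.0},
]
summary = duration_percentiles_by_service(spans)
```

With only a handful of spans per service, the high percentiles collapse onto the slowest span, which is why percentile queries are most informative over larger time windows.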
The RUM Board Template provides an overview of real user monitoring data from your frontend applications.
The RUM Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Largest Contentful Paint (LCP) | Shows ratings based on the render time for the largest content on a page. |
|
Cumulative Layout Shift (CLS) | Shows ratings based on the stability of content layout on a page. |
|
Interaction to Next Paint (INP) | Shows ratings based on the responsiveness of a page. |
|
LCP P75 | Shows the 75th percentile for LCP. |
|
CLS P75 | Shows the 75th percentile for CLS. |
|
INP P75 | Shows the 75th percentile for INP. |
|
Total Events by Type | Shows event types ranked by occurrence. |
|
Largest Resource Requests | Shows the largest resource requests ranked by the average length of their response content. |
|
Top 5 Endpoints by Request Count | Shows the top 5 endpoints ranked by number of requests. |
|
Slowest Requests by Endpoint | Shows the slowest endpoints based on the 75th percentile of request durations. |
|
Top Landing Pages by Session Count | Shows the most visited landing pages ranked by session count. |
|
Pages With the Most Events | Shows pages with the highest number of events, highlighting the most active pages. |
|
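The LCP, CLS, and INP queries above show ratings rather than raw values. As a rough sketch of how such a rating could be derived, the function below buckets a measurement using the publicly documented Core Web Vitals thresholds. Whether this template uses exactly these cutoffs is an assumption, and the function and constant names are hypothetical.

```python
# (good_max, needs_improvement_max) per metric, following the public
# Core Web Vitals guidance. Assumption: the template may use other cutoffs.
THRESHOLDS = {
    "LCP": (2500.0, 4000.0),  # milliseconds
    "CLS": (0.1, 0.25),       # unitless layout-shift score
    "INP": (200.0, 500.0),    # milliseconds
}


def rate_web_vital(metric, value):
    """Bucket a Web Vitals measurement into good / needs-improvement / poor."""
    good_max, needs_improvement_max = THRESHOLDS[metric]
    if value <= good_max:
        return "good"
    if value <= needs_improvement_max:
        return "needs-improvement"
    return "poor"


ratings = [
    rate_web_vital("LCP", 2400.0),
    rate_web_vital("CLS", 0.3),
    rate_web_vital("INP", 350.0),
]
```

The P75 queries complement these ratings: a page is usually judged by the 75th percentile of its measurements, so a single slow page load does not change the overall rating.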
The MySQL Board Template provides insights into MySQL database operations, including thread count by type, query rate, resource usage, and row/table locks.
The MySQL Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Server Status | Shows server uptime. Use to track server restarts. |
|
Buffer Pool Pages | Shows the number of pages in the InnoDB buffer pool by type. Use to understand buffer pool utilization. |
|
Buffer Pool Data Pages | Shows the number of data pages in the InnoDB buffer pool by status (clean or dirty). Use to track page writes to disk. |
|
Buffer Pool Page Flushes | Shows the rate of page flush requests from the InnoDB buffer pool. Use to help identify input/output pressure. |
|
Buffer Pool Operations | Shows buffer pool operations by type. Use to identify patterns in buffer pool usage. |
|
Row and Page Operations | Shows the rate of InnoDB row and page operations. Use to provide insight into database workload and input/output patterns. |
|
Doublewrite Rate | Shows the rate of writes to the InnoDB doublewrite buffer. Use to understand database durability. |
|
Handler Requests and Thread Status | Shows the rate of requests to various handlers and the state of system threads. Provides insight into how the database is processing queries and allows monitoring of connection usage and thread efficiency. |
|
Row and Table Locks | Shows InnoDB lock statistics and MySQL table locks. Use to help identify lock contention. |
|
Resource Usage | Shows the rate of opened resources and temporary resources. Use to help identify resource utilization, and the usage of temporary tables or files. |
|
Query Rate | Shows query throughput and slow query rates across MySQL instances. Use to pinpoint instances with the highest query load. |
|
Thread Count by Type | Shows thread count by type. Use to indicate operations currently being performed by the set of threads executing within the server. |
|
Table Open Cache Efficiency | Shows Table Cache Efficiency. Use to monitor filesystem input/output within the instances. |
|
The Redis Board Template provides insights into Redis primary and replica nodes, including command activity, latency/volume and execution time, expired keys, and CPU consumption.
This Board Template utilizes the Redis receiver provided by the OpenTelemetry Collector Contrib distribution. View OpenTelemetry documentation for setup instructions.
Note that the Redis receiver does not automatically publish some key server attributes, like address or port.
The visualizations on this Board Template utilize server address to ensure that visualizing across multiple Redis instances is possible.
The Redis Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Cache Connections | Shows connections received and rejected per server. Use to diagnose connectivity issues. |
|
Uptime | Shows the number of seconds since a server start by server. |
|
Server Durability | Shows the number of write operations that have happened since the last successful RDB snapshot. Use to track durability issues per server. |
|
Key Count | Shows the number of keys per database and per server. |
|
Server CPU Time | Shows the CPU consumed by Redis server since server start. |
|
Client Activity | Shows Redis client activity per server address and activity between connected and blocked clients. |
|
Command Activity | Shows the number of commands processed per second and the number of commands processed by the server. Use to track operational load of servers. |
|
Client I/O | Shows the input/output buffers of Redis clients by server. Use to diagnose or troubleshoot input/output issues with clients. |
|
Network Activity | Shows network input/output by server. |
|
P99 Command Latency | Shows the P99 of command latency. Use to identify anomalous commands. |
|
Command Volume and Execution Time | Shows the number of calls for a command and the total time for all executions of a command per server. |
|
Average Command Latency | Shows the average latency of commands by server. Use to understand the baseline latency of a command. |
|
Expired Keys | Shows the total number of key expiration events per server. |
|
Keyspace Hits and Misses | Shows the number of successful and failed key lookups per server. |
|
Memory Profile | Shows memory metrics per server. |
|
Primary Replication | Shows the replication offsets per server. |
|
Follower Replication | Shows the replication offset for follower instances. |
|
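The Command Volume and Execution Time query reports calls and total execution time per command, and Average Command Latency reports the per-command baseline. The relationship between the two can be sketched as total time divided by call count. The record layout and key names below are hypothetical, and the sample values are made up for illustration.

```python
def average_latency_usec(calls, total_usec):
    """Average latency per call in microseconds, or None when there were no calls."""
    if calls == 0:
        return None
    return total_usec / calls


# Hypothetical per-server, per-command counters.
command_stats = [
    {"server": "redis-a:6379", "command": "get", "calls": 4000, "total_usec": 12000},
    {"server": "redis-a:6379", "command": "set", "calls": 1000, "total_usec": 9000},
    {"server": "redis-b:6379", "command": "get", "calls": 0, "total_usec": 0},
]

averages = {
    (row["server"], row["command"]): average_latency_usec(
        row["calls"], row["total_usec"]
    )
    for row in command_stats
}
```

Keying the result by server as well as command mirrors the point made above about grouping by server address when more than one Redis instance reports data.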
The Airflow Board Template gives an overview of data workflow performance. Monitoring Airflow operations can highlight problems which may occur in the process of running data pipelines.
The required fields in the Airflow Board Template are derived from Airflow’s support for OpenTelemetry logs, metrics, and traces.
View our documentation about instrumenting your Python data pipelines and applications.
The Airflow Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
DAG Processing Import Errors | Shows the sum of the number of errors from trying to parse DAG files by host.name. Parsing errors prevent DAGs from being loaded. Tracking these errors helps identify configuration or syntax issues that need immediate attention. |
|
DAG Processing Import Errors by File Path | Shows the sum of the number of errors during import and parse of DAG files, broken out by DAG File Path and host.name. Tracking these errors helps identify configuration or syntax issues with a given file or host. |
|
Duration of Tasks (AVG, P95) | Shows the average and P95 duration of a Task by DAG ID, task ID, and host.name. Execution time helps identify which specific tasks are performance bottlenecks, allowing you to optimize your workflows. Note: Uses trace signal type. |
|
DAG Failed Duration (AVG) | Shows the average duration in milliseconds (ms) taken for a DagRun to reach a failed state by DAG ID and host.name. Failed DAG runs consume valuable resources. Monitoring this metric helps to identify inefficient failure patterns. |
|
DAG Success Duration (AVG) | Shows the average duration in milliseconds (ms) for a DagRun to reach a success state by DAG ID and host.name. Monitoring duration allows you to optimize resource allocation and set appropriate SLAs. |
|
Task Counts | Shows the count of Tasks grouped by DAG ID, task ID, host.name, and state. Use to gauge overall workflow health and the proportion of tasks experiencing issues, which can highlight potential problems with Airflow operations. Note: Uses trace signal type. |
|
DAG Schedule Delay | Shows the average duration in milliseconds (ms) of delay between the scheduled DagRun start date and the actual DagRun start date, grouped by DAG ID and host.name. Use to identify scheduler bottlenecks, resource constraints, or overloaded Airflow instances that prevent timely workflow execution. |
|
Scheduler Tasks | Shows the sum of Airflow Scheduler Tasks that are executing or starving by host ID. Use to understand scheduler load, identify periods when the scheduler might be overwhelmed with too many tasks, and ensure task distribution works as expected. |
|
Executor Tasks | Shows the maximum count of Executor Tasks (queued, running, and open slots), grouped by host.name. Note that Queued reflects the number of queued tasks on the executor, Running reflects the number of running tasks on the executor, and Open Slots reflects the number of open slots on the executor. |
|
Pool Task Slots by Host | Shows the maximum count of Airflow Pool Slots (Deferred, Queued, Open, Running, Starving, and Scheduled) by host. Use to monitor resource allocation, identify when pools are at capacity, and optimize your configuration to match your workflow needs. |
|
The Kafka Board Template provides insight into Kafka brokers, topics, partitions, and consumers.
This Board Template relies on the Kafka Metrics receiver provided by the OpenTelemetry Collector Contrib distribution. View OpenTelemetry documentation for setup instructions.
For relevant Java Virtual Machine (JVM) metrics, the OpenTelemetry Java Agent should be included in Kafka nodes as well.
The Kafka Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Number of Active Brokers | Shows the number of active brokers. |
|
Consumer Group Membership | Shows the number of consumers per broker. |
|
Consumer Progress Lag vs Offset Rate | Shows the average rate of Kafka consumer group lag and offsets over time, grouped by topic partitions. Use to monitor consumer progress and to detect delays by comparing offset increases to lag. |
|
Partition Offset Overview | Shows the rate of change in the oldest and current offsets across Kafka partitions. |
|
Partition Count By Topic | Shows the number of partitions for each topic. Use for capacity planning and ensuring proper topic configuration. |
|
Partition Replication Health | Shows the number of in-sync replicas for each partition compared to total replicas. Use to identify under-replicated partitions. |
|
Consumer Group Lag by Topic | Shows total lag across all partitions for each consumer group and topic combination. |
|
Partition Balance Analysis | Shows distribution of offsets across partitions for each topic. Use to identify potential partition imbalances. |
|
High Consumer Lag | Shows high consumer group lag, which may indicate potential consumer issues. |
|
Message Throughput | Shows the approximate message throughput for each topic by measuring the rate of change in offset over time. |
|
JVM Thread Count by Cluster and State | Shows the total JVM thread count across Kafka clusters, grouped by thread state. Use to identify thread contention or resource leaks. |
|
JVM Garbage Collection Durations | Shows the median and P90 JVM garbage collection durations. Use to understand garbage collection efficiency and memory management health. |
|
Max Recent JVM CPU Utilization | Shows the highest CPU utilization within the JVM over a default 30-minute window. Use to identify potential load spikes or bottlenecks that may affect your cluster. |
|
JVM Memory Usage and Commitment | Shows memory usage patterns in clusters, providing a view into how memory is used and committed in the JVM. Use to track inefficient memory usage. |
|
The Linux Host Board Template provides useful queries for monitoring Linux hosts. It provides insights into CPU, memory, disk, filesystem, and network utilization on the configured hosts.
This Board Template utilizes the Host Metrics receiver provided by the OpenTelemetry Collector Contrib distribution. View OpenTelemetry documentation for setup instructions.
Configuration of the hostmetrics receiver for this Board Template requires specific scrapers to be configured, namely:
The Linux Host Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Process CPU Time Breakdown | Shows the total CPU time consumed by different processes, broken down by process owner and command. Use to identify which processes are consuming the most CPU resources over time. |
|
Memory Consumption Trends | Shows the average memory usage across host, operating system, and state. Use to monitor and diagnose system memory usage trends. |
|
CPU Utilization Trends | Shows the distribution of CPU time spent on user processes, system operations, and idle time. Use to identify which hosts are under load. |
|
Disk I/O | Shows the active Disk input and output based on device. Use to identify high read/write rates. |
|
Memory Usage by Process | Shows Linux processes by memory usage and virtual memory consumption. Use to troubleshoot resource bottlenecks and optimize memory allocation. |
|
Filesystem Usage | Shows filesystem usage across different mount points, devices, and modes. Use for capacity planning and troubleshooting storage issues. |
|
Network Metrics | Shows network operations per network interface. |
|
The Postgres Board Template provides insight into Postgres’s operations, including active connections, database size, table count, and transaction throughput.
The Postgres Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Active Connections | Shows the current number of active connections. |
|
Database Size | Shows the database size over time. Use to help with capacity planning and identifying unexpected growth patterns. |
|
Database and Table Count | Shows the number of databases and tables, which can help identify database sprawl. |
|
Transaction Throughput | Shows the rate of commits and rollbacks per database, which provides insight into transaction throughput and success rates. |
|
Block Read Performance | Shows the sources of block reads and their rates. Use to diagnose input/output performance issues. |
|
Index Usage | Shows the rate of index scans. Use to identify frequently used indexes. |
|
Database Operations | Shows database operations. Use to provide insight into workload patterns. |
|
Background Writer Activity | Shows buffer writes by source. Use to identify potential input/output bottlenecks. |
|
Checkpoint Frequency | Shows the rate of checkpoints by type (requested versus scheduled), which can help identify if checkpoints are occurring too frequently. |
|
Checkpoint Duration | Shows time spent on checkpoint operations across databases and tables. Longer checkpoint durations can negatively impact database performance. |
|
Table Size | Shows the top 10 largest tables, which may identify tables that require optimization or partitioning. |
|
Index Size | Shows the top 10 largest indexes, which may identify indexes that need rebuilding or optimization. |
|
Cache Hit Ratio | Shows the sum of block reads satisfied from the buffer cache. A higher number indicates better performance. |
|
Replication WAL Delay | Shows the time between flushing recent WAL and notification that standby servers have completed operations on it. Use to track replication delays. |
|
Replication Data Delay | Shows the amount of data delayed in replication, which can help identify network or performance issues affecting replication. |
|
Database Locks by Type | Shows the maximum number of database locks per type. Use for situations where multiple concurrent transactions may cause resource contention. |
|
Postgres Memory Utilization | Shows memory usage and the amount of committed memory for Postgres processes. Use to identify inefficient processes. |
|
Postgres CPU Utilization Trends | Shows CPU utilization for PostgreSQL processes. Use to identify inefficient queries, excessive index scanning, and so on. |
|
Number of Postgres Operations | Shows the number of PostgreSQL operations per database and table name. |
|
The Spring Boot Board Template provides insight into application health and performance metrics for Spring Boot microservices.
The Spring Boot Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Database Usage | Shows database performance metrics. Use to help identify slow-performing queries and connection issues. |
|
API Endpoint Latency | Shows a heatmap of API endpoint response times. Use to highlight bottlenecks or anomalies in performance. |
|
Garbage Collection Performance Monitor | Shows maximum, average, and P95 duration of garbage collection metrics. Use to identify memory allocation patterns that cause application slowdowns. |
|
Request Per Minute | Shows requests made per minute. Use to observe traffic patterns and to detect unexpected load or errors. |
|
Heap used vs Heap Max Limit | Shows JVM memory metrics and compares current memory usage against the maximum heap limit. Use to identify out-of-memory errors. |
|
API Errors | Shows error responses with status code >= 400. Use to monitor API health. |
|
Response Size Distribution | Shows response payload size. Use to monitor data transfer efficiency and to identify any unexpectedly large responses. |
|
JVM CPU Time Rate | Shows CPU consumption rate metrics. Use to identify processing-intensive operations and to detect performance decline over time. |
|
The Django Board Template provides insight into application health and performance metrics for a Django application.
This board utilizes the OpenTelemetry Python API for automatic instrumentation via the OpenTelemetry Python SDK.
View the OpenTelemetry Python API documentation and their Django instrumentation instructions.
The Django Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Request Count Per Minute | Shows requests made per minute. Use to observe the traffic patterns and to detect unexpected load or errors. |
|
HTTP Response Duration | Shows the P95 response duration by route, status code and server name. Highlights Django HTTP performance. |
|
HTTP Errors | Shows the count of HTTP errors by route, status code, and host.name. Use to assess the success and error rate of APIs. |
|
Exceptions | Shows exceptions thrown in the service. Use to assess overall health of the application. |
|
AVG and P95 Request Size | Shows the average and P95 HTTP request size to monitor payload efficiency. |
|
AVG and P95 Response Size | Shows the average and P95 HTTP response size to monitor payload efficiency. |
|
P95 and Heatmap of Job Duration | Shows the P95 and a heatmap of Job Duration by messaging destination, messaging system, and server name. Provides insights into the status of async job runners. |
|
Jobs Executed | Shows the count of root traces with messaging system and destination. Can be used to assess overall performance of the async job operations. |
|
DB connection Count Per Min | Shows the connection count per minute where db connection event is “open”. Helps gain visibility into connection pooling efficiency. |
|
The Rails Board Template gives you visibility into Rails behavior, performance, and health. The queries and visualizations help identify slow database queries, inefficient code paths, and other performance bottlenecks.
The required fields in the Rails Board Template are derived from Ruby and Ruby on Rails support for OpenTelemetry logs, metrics, and traces.
View our documentation on instrumenting your Ruby and Ruby on Rails applications.
The Rails Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Requests Served | Shows the count of requests served by Rails by host.name. Use to provide an overview of traffic volume at a glance. |
|
HTTP Response Duration | Shows P95 response duration by route, controller namespace, controller function, status code, and host.name. Use to monitor Rails HTTP performance. |
|
HTTP Duration Heatmap | Shows a heatmap of HTTP response duration by route, status code, and host.name. Use to assess and investigate outliers. |
|
HTTP Errors | Shows the count of HTTP errors by route, controller namespace, status code, and host.name. Use to assess the success and error rate of Rails web endpoints. |
|
DB Statement Duration | Shows a heatmap and the P95 of database duration per database name, operation, statement, and host.name. A heatmap provides more information to help identify outlier DB statements. |
|
P95 and Heatmap of Job Duration | Shows P95 and a heatmap of Job Duration by messaging destination, messaging system, service name, and host.name. Provides insights into the status of Rails async job runners, such as ActiveJob and Sidekiq. |
|
Exceptions | Shows exceptions thrown by type, code namespace, and host.name. Use to assess overall health of your Rails application. |
|
Jobs Executed | Shows count of root traces with messaging system and destination. Use to assess overall performance of Rails async job operations. |
|
The Kubernetes Pod Metrics Board Template includes queries that help you investigate pod performance and resource usage within Kubernetes clusters:
Query Name | Query Description | Required Fields |
---|---|---|
Pod CPU Usage | Shows the amount of CPU used by each pod in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers, and 1 hyper-thread on bare-metal Intel processors. |
|
Pod Memory Usage | Shows the amount of memory being used by each Kubernetes pod. |
|
Pod Uptime Smokestacks | Uses the smokestack method, which applies a LOG10 to the ever-increasing Pod Uptime metric. Newly started or restarted pods stand out more than pods that have been running a long time, which eventually flatten into a straight line. |
|
Unhealthy Pods | Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events are present during start-up and get resolved so the presence of a count isn’t necessarily bad. |
|
Pod CPU Utilization vs. Limit | When a CPU Limit is present in a pod configuration, this query shows how much CPU each pod uses as a percentage of that limit. |
|
Pod CPU Utilization vs. Request | When a CPU Request is present in a pod configuration, this query shows how much CPU each pod uses as a percentage of that request value. |
|
Pod Memory Utilization vs. Limit | When a Memory Limit is present in a pod configuration, this query shows how much memory each pod uses as a percentage of that limit value. |
|
Pod Memory Utilization vs. Request | When a Memory Request is present in a pod configuration, this query shows how much memory each pod uses as a percentage of that request value. |
|
Pod Network IO Rates | Displays Network IO RATE_MAX for Transmit and Receive network traffic (in bytes) as a stacked graph, and gives the overall network rate and the individual rate for each node. |
|
Pods With Low Filesystem Availability | Shows any pods where filesystem availability is below 5 GB. |
|
Pod Filesystem Usage | Shows the amount of filesystem usage per Kubernetes pod, displayed in a stack graph to show total filesystem usage of all pods. |
|
Pods Per Namespace | Shows the number of pods currently running in each Kubernetes namespace. |
|
Pods Per Node | Shows the number of pods currently running in each Kubernetes Node. |
|
Pod Network Errors | Shows network errors in receive and transmit, grouped by pod. |
|
Pods Per Deployment | Shows the number of pods currently deployed in different Kubernetes deployments. |
|
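The utilization queries in this template express usage as a percentage of a configured limit or request. The arithmetic can be sketched as follows. The function name and sample values are hypothetical, and returning `None` when no limit or request is configured is an illustrative choice that mirrors the "when a limit is present" condition in the query descriptions.

```python
def utilization_percent(usage, reference):
    """Usage as a percentage of a limit or request value.

    Returns None when no limit or request is configured for the pod.
    """
    if reference is None or reference <= 0:
        return None
    return usage / reference * 100


# Hypothetical pod: 0.25 CPU cores used against a 0.5 core limit,
# and 256 MiB of memory used against a 1024 MiB request.
cpu_vs_limit = utilization_percent(0.25, 0.5)
memory_vs_request = utilization_percent(256, 1024)
no_limit_configured = utilization_percent(0.25, None)
```

Values above 100 are possible against a request, because a request is a scheduling reservation rather than a hard cap, so sustained readings over 100 percent suggest the request is set too low.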
The Kubernetes Node Metrics Board Template includes queries that help you investigate node performance and resource usage within Kubernetes clusters:
Query Name | Query Description | Required Fields |
---|---|---|
Node CPU Usage | Shows the amount of CPU used on each node in the cluster. CPU is reported as the average core usage measured in cpu units. One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers, and 1 hyper-thread on bare-metal Intel processors. |
|
Node Memory Utilization | Shows percent of memory used on each Kubernetes node. |
|
Node Network IO Rates | Displays Network IO RATE_MAX for Transmit and Receive network traffic as a stacked graph, and gives overall network rate and the individual rate for each node. |
|
Unhealthy Nodes | Shows errors that Kubernetes nodes are experiencing. |
|
Node Filesystem Utilization | Shows percent of filesystem used on each node. |
|
Node Uptime Smokestack | Uses the smokestack method, which applies a LOG10 to the ever-increasing Node Uptime metric. Newly started or restarted nodes stand out more than nodes that have been running a long time, which eventually flatten into a straight line. |
|
Node Network Errors | Shows network transmit and receive errors for each node. |
|
Pods and Containers per Node | Shows the number of pods and the number of containers per node as stacked graphs, and also shows total number of pods and containers across the environment. |
|
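The smokestack method used by the uptime queries applies LOG10 to an ever-increasing uptime value. The short sketch below shows why this makes restarts stand out. The function name and the sample uptimes are hypothetical.

```python
import math


def smokestack(uptime_seconds):
    """Apply log10 to an uptime value; undefined for non-positive uptimes."""
    if uptime_seconds <= 0:
        return None
    return math.log10(uptime_seconds)


# Hypothetical uptimes: 10 seconds, 1 hour, 1 day, and 10 days.
uptimes = [10, 3600, 86400, 864000]
transformed = [smokestack(u) for u in uptimes]

# Growth during the first hour versus growth from day 1 to day 10.
first_hour_rise = transformed[1] - transformed[0]
day_one_to_ten_rise = transformed[3] - transformed[2]
```

On the log scale, the first hour after a restart covers more vertical distance than the nine days between day 1 and day 10, so a fresh start appears as a steep rise while long-running nodes or pods flatten into a nearly straight line.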
The Kubernetes Workload Health Board Template includes queries that help you diagnose Kubernetes-related application issues:
Query Name | Query Description | Required Fields |
---|---|---|
Container Restarts | Shows the total number of restarts per pod, and the rate of restarts of pods where the restart count is greater than zero. |
|
Unhealthy Pods | Shows trouble that pods may be experiencing during their operating lifecycle. Many of these events are present during start-up and get resolved so the presence of a count isn’t necessarily bad. |
|
Pending Pods | Shows pods in a “Pending” state. |
|
Failed Pods | Shows pods in a “Failed” or “Unknown” state. |
|
Unhealthy Nodes | Shows errors that Kubernetes nodes are experiencing. |
|
Unhealthy Volumes | Shows volume creation and attachment failures. |
|
Unscheduled Daemonset Pods | Tracks cases where a pod in a daemonset is not currently running on every node in the cluster as it should be. |
k8s.namespace.name |
Stateful Set Pod Readiness | Tracks any stateful sets where pods that should be in a ready state are in a non-ready state. |
|
Deployment Pod Status | Shows Deployments where Pods have not fully deployed. Numbers greater than zero show pods in a deployment that are not yet “ready”. |
|
Job Failures | Tracks the number of failed pods in Kubernetes jobs. |
|
Active Cron Jobs | Tracks the number of active pods in each Kubernetes cron job. |
|
The OpenTelemetry Collector Operations Board Template includes queries with key metrics emitted by the OpenTelemetry Collector during its operation:
Query Name | Query Description | Required Fields |
---|---|---|
Exporter Span Failures | Shows when errors happen during enqueueing or sending in exporters. |
|
Collector Uptime Smokestacks | Shows the uptime for different pods with a Log10 to make it clearer where restarts are happening. |
|
Exporter Metric Send Failures | Shows when errors happen during sending from exporters. |
|
Exporter Metrics Enqueue Failures | Shows when errors happen during enqueueing in exporters. |
|
Exporter Log Records Failures | Shows when errors happen during enqueueing or sending in exporters. |
|
The OpenTelemetry Java Metrics Board Template includes queries that help you investigate application issues related to the Java Virtual Machine (JVM).
Metrics for Java applications are sourced from the JVM and reported by the OpenTelemetry Java Agent or Honeycomb OpenTelemetry Distribution for Java.
Query Name | Query Description | Required Fields |
---|---|---|
JVM Memory Usage (Young Generation) | Shows memory usage for Eden space on the JVM heap, which is where newly created objects are stored. When it fills, a minor Garbage Collection (GC) occurs, moving all “live” objects to the Survivor space. In addition to current memory usage, committed represents the guaranteed available memory, and limit represents the maximum usable memory. |
|
JVM Memory Usage (Old Generation) | Shows memory usage for tenured Gen JVM heap space, which stores long-lived objects. When a Full or Major GC is performed, it is expensive and may pause app execution. Committed represents guaranteed available memory, and limit represents maximum usable memory. |
|
JVM Garbage Collection (GC) Activity | Shows JVM garbage collection activity. JVM GC actions occur periodically to reclaim memory but consume CPU cycles to do so. In the worst cases, a GC can cause the entire JVM to pause, making the application appear unresponsive. |
|
JVM CPU Utilization | Shows system CPU utilization and 1-minute load average, as captured by the JVM. |
|
JVM Buffer Memory Usage | Shows usage of buffer memory, which is provided by the OS and is outside the JVM’s heap memory allocation. Buffer memory is used by Java NIO to quickly write data to network or disk. |
|
JVM Non-Heap Memory Usage | Shows usage of JVM non-heap memory, which is allocated above and beyond the heap size you’ve configured. JVM non-heap memory is a section of memory in the JVM that stores class information (Metaspace), compiled code cache, thread stack, and so on. It cannot be garbage collected. |
|
The AWS Lambda Health Board Template includes queries that monitor the health of AWS Lambda functions, including metrics for invocations, errors, throttles, and concurrency:
Query Name | Query Description | Required Fields |
---|---|---|
Duration & Execution by ID/Version | Tracks the execution time of Lambda functions, identified by their ID or version. Useful for analyzing the performance and efficiency of different versions or instances of a function over time. |
|
Lambda Invocations by Function | Shows the total number of times each Lambda function is invoked. It helps in tracking the frequency of usage of different functions, enabling a clear understanding of which functions are most or least used. |
|
Latency by Function/Metric | Shows the response time for each Lambda function, broken down by specific metrics. Useful for identifying functions that may be experiencing performance issues due to high latency. |
|
Function Error Count and Rate | Shows two key pieces of information: the total number of errors encountered by each Lambda function and the error rate, calculated as the ratio of errors to total invocations. Useful for pinpointing functions that are failing or experiencing issues. |
|
Lambda Throttles | Shows the instances where Lambda invocations are being throttled, such as when the number of function calls exceeds the concurrency limits. Tracking this helps in managing and optimizing the scalability settings for each function. |
|
Function Concurrency | Monitors the simultaneous execution count of each Lambda function, tracking how many instances of a function are running at the same time. |
|
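
The Function Error Count and Rate query above describes the error rate as the ratio of errors to total invocations. As a rough illustration of that arithmetic only, the sketch below computes the ratio for a few sample functions. The `FunctionStats` type, the function names, and the numbers are hypothetical, and returning 0.0 for a function with no invocations is an assumption of this sketch, not documented behavior of the template.

```python
from dataclasses import dataclass


@dataclass
class FunctionStats:
    """Hypothetical per-function totals for one time window."""
    name: str
    invocations: int
    errors: int


def error_rate(stats: FunctionStats) -> float:
    """Return errors / invocations, or 0.0 when nothing was invoked."""
    if stats.invocations == 0:
        return 0.0  # assumption: treat "no traffic" as a zero error rate
    return stats.errors / stats.invocations


def rank_by_error_rate(all_stats):
    """Sort functions so the highest error rate comes first."""
    return sorted(all_stats, key=error_rate, reverse=True)


# Made-up sample data for illustration.
sample = [
    FunctionStats("resize-image", invocations=2000, errors=10),
    FunctionStats("send-email", invocations=400, errors=20),
    FunctionStats("nightly-report", invocations=0, errors=0),
]

for s in rank_by_error_rate(sample):
    print(f"{s.name}: {s.errors} errors, rate {error_rate(s):.2%}")
```

In this sample, `send-email` has fewer errors in absolute terms than a busier function might, but the highest rate, which is why viewing count and rate together is useful.
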
The AWS EC2 Board Template includes queries that monitor the health of AWS EC2 instances, including status failures, disk read and write operations, and EBS operations.
The AWS EC2 Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
CPU Utilization | Shows CPU utilization per EC2 instance. |
|
Network I/O | Shows network input and output per EC2 instance. |
|
EBS Read Operations | Shows the number of read operations committed by the instance. |
|
EBS Write Operations | Shows the number of write operations committed by the instance. |
|
EBS IO Balance | Shows available input and output per second that attached EBS volumes are utilizing. Use to monitor potential throttling on an EBS volume attached to an instance. |
|
Instance Metadata Service Outliers | Shows the number of instances that are not currently using IMDSv2. Use to identify potential security issues with EC2 instances. |
|
EC2 Disk Read/Write | Shows read and write operations performed by EC2 instances. Use to monitor EBS volume usage. |
|
EC2 Instance Status Failures | Shows any EC2 instances that have failed a status check in the provided time period. |
|
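
To make the idea behind the Instance Metadata Service Outliers query concrete, here is a minimal sketch that flags instances not enforcing IMDSv2. It assumes instance records shaped like the EC2 `DescribeInstances` response, where `MetadataOptions.HttpTokens` is `required` when IMDSv2 is enforced. The instance IDs are made up, and the fields the Board Template actually queries may differ.

```python
def imds_outliers(instances):
    """Return IDs of instances that do not require IMDSv2 session tokens."""
    outliers = []
    for inst in instances:
        options = inst.get("MetadataOptions", {})
        # "required" means only IMDSv2 (token-based) requests are accepted.
        # Anything else, including a missing value, is treated as an outlier.
        if options.get("HttpTokens") != "required":
            outliers.append(inst["InstanceId"])
    return outliers


# Hypothetical instance records for illustration.
instances = [
    {"InstanceId": "i-0aaa", "MetadataOptions": {"HttpTokens": "required"}},
    {"InstanceId": "i-0bbb", "MetadataOptions": {"HttpTokens": "optional"}},
    {"InstanceId": "i-0ccc"},  # no metadata options reported
]

print(imds_outliers(instances))
print(len(imds_outliers(instances)), "instance(s) not enforcing IMDSv2")
```

Treating a missing `MetadataOptions` value as an outlier is a conservative choice made for this sketch.
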
The AWS ALB/ELB Board Template includes queries that monitor the Load Balancer’s health, status codes, active connections, and requests.
The AWS ALB/ELB Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Request Count Per Target | Shows how requests are distributed across targets. Use to diagnose imbalanced traffic in the load balancer. |
|
Healthy vs. Unhealthy Host Count | Shows the number of healthy versus unhealthy hosts per load balancer, which is segmented across target groups and availability zones. Use to quickly spot failing load balancer targets. |
|
Load Balancer Status Codes | Shows status codes per load balancer. Use to identify routing or traffic management issues. |
|
Active Connections | Shows active connections per load balancer. |
|
State Routing | Shows load balancer state routing. Use to identify network configuration errors, unresponsive applications, or health check delays. |
|
Load Balancer Capacity Units | Shows LCUs consumed during a given period of time. Use to optimize load balancer cost and detect bottlenecks. |
|
Anomalous Host Count | Shows the number of hosts behaving abnormally. Use to detect and diagnose excessive error rates, latency issues, or inconsistent health check results. |
|
DNS Target State | Shows load balancer DNS target state resolution. Use to identify failing targets and DNS misconfigurations. |
|
TLS Negotiation Errors | Shows the number of TLS negotiation errors per load balancer. |
|
Connection Error Count | Shows errors on targets. Use to diagnose and troubleshoot misconfigured load balancer targets. |
|
The SQS Board Template provides insight into critical AWS SQS operations.
The SQS Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Request Count Per Minute | Shows requests made per minute. Use to observe the traffic patterns and detect unexpected load or errors. |
|
HTTP Response Duration | Shows the P95 response duration by route, status code, and server name. Use to monitor Django HTTP performance. |
|
HTTP Errors | Shows a count of HTTP errors by route, status code, and host.name. Use to assess success and error rates of APIs. |
|
Exceptions | Shows exceptions thrown in the service. Use to assess the overall health of the application. |
|
AVG and P95 Request Size | Shows the average and P95 HTTP request size. Use to monitor payload efficiency. |
|
AVG and P95 Response Size | Shows the average and P95 HTTP response size. Use to monitor payload efficiency. |
|
P95 and Heatmap of Job Duration | Shows the P95 and a heatmap of job duration by messaging destination, messaging system, and server name. Provides insight into the status of async job runners. |
|
Jobs Executed | Shows count of root traces with messaging system and destination. Use to assess overall performance of the async job operations. |
|
DB connection Count Per Min | Shows the connection count per minute where the database connection event is “open”. Use to gain visibility into connection pooling efficiency. |
|
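
Several queries in the table above pair an average with a P95. The sketch below shows why the two are worth viewing together: a few large payloads barely move the average but dominate the tail. It uses the nearest-rank percentile method, which is an assumption chosen for simplicity and may not match how the query engine computes percentiles. The sample sizes are made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request sizes in bytes: mostly small, with two large outliers.
sizes = [200] * 18 + [4000, 9000]

avg = sum(sizes) / len(sizes)
p95 = percentile(sizes, 95)

print(f"AVG request size: {avg:.0f} bytes")
print(f"P95 request size: {p95} bytes")
```

With this sample, most requests are 200 bytes, yet the P95 lands on one of the large payloads, which the average alone would hide.
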
The RDS Board Template provides insights for monitoring and optimizing the performance of AWS RDS databases.
The RDS Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Number of Connections | Shows the number of connections to RDS instances. |
|
Database Load | Shows the level of session activity on RDS instances. |
|
Disk Queue Depth | Shows the number of outstanding input/output requests waiting to access the disk. A high queue depth can indicate that the workload is generating more read/write requests than the underlying storage can handle. |
|
Freeable Memory | Shows the minimum freeable memory per database instance. Use to identify memory pressure in RDS instances. |
|
Read/Write Operations | Shows the read and write operations per second that the RDS instance is performing. Use to diagnose bottlenecks, optimize workloads, and manage cost. |
|
CPU Utilization | Shows maximum CPU utilization across database instance identifiers. |
|
Free Storage Space | Shows the amount of free storage space per database instance. |
|
Burst Balance | Shows the burst capacity per database instance. Lower burst capacity can affect input/output performance. Use for capacity planning and to optimize database performance. |
|
Read/Write Latency | Visualizes Read/Write latency per database instance. Use for troubleshooting slow queries, inefficient indexes, or locking issues. |
|
Transaction Log Disk Usage | Shows the amount of storage consumed by database transaction logs. Use to prevent storage exhaustion. |
|
Checkpoint Lag | Shows checkpoint lag. Use to determine latency between leader and followers in replication. |
|
Swap Usage | Shows swap activity (from RAM to disk) per RDS instance. Use for identifying performance issues related to memory pressure. |
|
Network Throughput | Shows the rate at which network data is being sent from RDS instances. Use to identify excessive data transfer or increased query latencies. |
|
For teams using Refinery to sample their data, the Refinery Board Template provides an overview of sampling operations.
The Refinery Board Template includes the following queries:
Query Name | Query Description | Required Fields |
---|---|---|
Stress Relief Status | Shows the current stress level on the Refinery cluster. |
|
Dropped From Stress | Shows how many traces are being dropped due to stress on the Refinery cluster. |
|
Stress Relief Log | Shows reasons why Refinery is going into stress relief. |
|
Cache Health | Shows metrics for cache health. |
|
Cache Ejections | Shows number of traces ejected from cache. |
|
Intercommunications | Shows total events from outside Refinery and events redirected from a peer. |
|
Receive Buffers | Shows receive buffer operations. |
|
Peer Send Buffers | Shows metrics for the queue used to buffer spans to send to peer nodes. |
|
Upstream Send Buffers | Shows metrics for the queue used to buffer spans to send to Honeycomb. |
|
EMADynamicSampler Performance | Shows EMADynamicSampler sampling effectiveness. |
|
EMAThroughputSampler Performance | Shows EMAThroughputSampler sampling effectiveness. |
|
WindowedThroughput Performance | Shows WindowedThroughput sampling effectiveness. |
|
TotalThroughputSampler Performance | Shows TotalThroughputSampler sampling effectiveness. |
|
DynamicSampler Performance | Shows DynamicSampler sampling effectiveness. |
|
RulesBasedSampler Performance | Shows RulesBasedSampler sampling effectiveness. |
|
Trace Indicators | Shows total traces sent before completion and spans received for traces already sent. |
|
Sampling Decisions | Shows total traces accepted and sent or dropped. |
|
Refinery Send Event Error Logs | Shows errors that occur when Refinery sends events to its peers or upstream to the Honeycomb API server. |
|
Refinery Handler Event Error Logs | Shows errors when receiving or parsing events being sent to a node. |
|
Refinery Events Exceeding Max Size | Shows errors when events are too large to be sent to Honeycomb. |
|
The Activity Log Security Board Template includes queries that track API Key activity:
Query Name | Query Description | Required Fields |
---|---|---|
API Key Added Permissions | Shows when permissions are added to an existing API key. |
|
API Key Activities by User | Displays the number of changes to API keys broken down by user. |
|
Authentication Type by User | Displays which type of authentication is used for each user. |
|
The Activity Log Leaderboard Board Template includes queries that highlight advanced and frequent usage of Honeycomb by your team:
Query Name | Query Description | Required Fields |
---|---|---|
Queries by User | Shows which users are running queries. |
|
Complex Queries by User | Shows which users frequently use Visualize, Where, and Having clauses. |
|
Top Query Visualizations | Shows the most commonly used visualizations. |
|
Top Tinkerers | Lists which users perform the most updates to SLOs, Triggers, and Calculated Fields. |
|
Queries by Dataset | Shows which datasets are being queried the most. |
|
Queries by Environment | Shows a count of queries run, grouped by environment. |
|
The Activity Log Trigger and SLO Activity Board Template includes queries related to trigger and SLO activations and modifications:
Query Name | Query Description | Required Fields |
---|---|---|
Trigger State Changes | Shows when triggers have fired or resolved. |
|
Trigger Modifications | Shows creations, modifications, and deletions of triggers. |
|
Most Updated Triggers | Shows triggers that received the most changes recently. |
|
Top Updated SLOs by Update Type | Shows creations, modifications, and deletions of SLOs and the supporting SLI (Calculated Field). |
|
SLOs Created and Deleted | Shows creation and deletion of SLOs. |
|
SLI Expression Changes by SLO | Shows when SLIs (Calculated Fields) related to SLOs have been changed. |
|
To explore common issues when working with Board Templates, visit Common Issues with Visualization: Board Templates.