Key Metrics & Alerts

The two metrics that tell you whether your Avalanche L1 is healthy, with recommended healthy, warning, and paging thresholds.

Once you have monitoring set up, the question is what to actually alert on. AvalancheGo exposes hundreds of Prometheus metrics, but two of them carry most of the signal:

  1. Query failure rate — covers almost every consensus, networking, and validator problem; most issues surface here first.
  2. Disk space remaining — an independent failure the query failure rate cannot see.

Page on these two. Most other problems (a misbehaving validator, a network partition, a stalled chain) show up as a rise in the query failure rate, so the secondary metrics below are mainly for diagnosing why it moved. The exception is correctness failures such as bad blocks, which can occur independently and are worth their own alert.

Thresholds are guidelines — most of the secondary values below reflect operational policy, not hard limits in the code. Validate them against your own chain's baseline before paging. Throughout, <chain> is your L1's blockchain ID (or its primary alias), which appears as the chain label on per-chain metrics.


Critical metrics

1. Query failure rate

The single most sensitive indicator of L1 health. A poll is a round of voting where the node asks a sample of validators whether they prefer a block; it succeeds when enough validators respond in time. Because each validator carries a share of stake, the success rate drops by roughly an offline validator's stake share whenever one stops responding, so this one number catches networking faults, down validators, and finalization problems together. The thresholds below are stated as the poll success rate, i.e. 100% minus the failure rate.

Healthy: ~100% successful
Warning: < 95% successful
Paging: < 90% successful

As the success rate falls, finalization slows. Once a node's connected stake drops below AlphaConfidence/K (75% at the defaults), it stops sending queries and the chain stalls for that node. AvalancheGo does not expose a success percentage directly; compute it from the Snowman poll counters (polls_successful, polls_failed). The query below returns a fraction in [0, 1], so multiply by 100 to compare it with the percentage thresholds above:

rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
/
(
  rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
+ rate(avalanche_snowman_polls_failed{chain="<chain>"}[5m])
)

When this degrades, check the secondary metrics below to find the cause.
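
As a rough sketch, the paging threshold could be expressed by comparing that same ratio against 0.90. The 5-minute window is carried over from the query above; treat it as a starting point rather than a recommendation.

```promql
# Fires when fewer than 90% of polls succeeded over the last 5 minutes.
# The ratio is a fraction in [0, 1], so 0.90 corresponds to the 90% paging threshold.
(
  rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
/
  (
    rate(avalanche_snowman_polls_successful{chain="<chain>"}[5m])
  + rate(avalanche_snowman_polls_failed{chain="<chain>"}[5m])
  )
) < 0.90
```

One caveat follows from how PromQL division works: if no polls ran at all in the window, both rates are 0, the division yields NaN, and the comparison does not fire. A node that has stopped polling entirely therefore needs a separate check, such as connected stake or last accepted height.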

2. Disk space remaining

A completely independent failure. The node keeps participating in consensus — with the query failure rate looking perfectly healthy — right up until it runs out of disk and shuts itself down. The failure rate gives you no warning of a disk problem, which is exactly why disk needs its own alert.

Metric: avalanche_resource_tracker_disk_available_percentage
Healthy: > 20% free
Warning: < 20% free
Paging: < 10% free

AvalancheGo tracks free space on its database volume natively and self-governs on it: by default it reports itself unhealthy below 10% free and performs a fatal shutdown below 3% free (--system-tracker-disk-warning-available-space-percentage defaults to 10, --system-tracker-disk-required-available-space-percentage defaults to 3). Page at the 10% mark — the point the node itself flags unhealthy — so you have runway to add storage or prune well before the 3% shutdown. If you run in Kubernetes or on a managed host, alert on the equivalent volume-usage metric too, since the node's own metric stops reporting once the process is down.
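
The warning and paging thresholds above might be written as the two separate expressions below. This sketch assumes the gauge reports free space on a 0 to 100 scale, which matches the percentage flags but is worth confirming against your own node's output.

```promql
# Warning: less than 20% of the database volume is free.
avalanche_resource_tracker_disk_available_percentage < 20

# Paging: less than 10% free, the point where the node reports itself unhealthy by default.
avalanche_resource_tracker_disk_available_percentage < 10
```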


Secondary metrics

The query failure rate usually moves first, so these are your main tools for diagnosing why it rose. Several are still worth paging on in their own right (the paging thresholds are noted per metric), because some failures can occur without the failure rate reacting. The thresholds below match the alert configuration we run for production mainnet L1s.

Bad blocks (EVM L1s)

Metric: avalanche_subnetevm_vm_eth_chain_block_bad_count{chain="<chain>"}
Healthy: 0
Paging: any sustained increase

Blocks that failed validation (state-root mismatch, invalid transactions). A rising count means this node is diverging from the network. It is an independent signal — the counter increments separately from the consensus poll counters, so a node can accumulate bad blocks without the failure rate moving, which is exactly why it gets its own alert.

The metric is namespaced by VM. A Subnet-EVM L1 runs the VM as an out-of-process plugin, so its metric carries a vm_ segment: avalanche_subnetevm_vm_eth_chain_block_bad_count. The in-process C-Chain (Coreth) has no vm_ segment: avalanche_evm_eth_chain_block_bad_count{chain="C"}. There is no built-in alert threshold — badBlockLimit (10) in the source is just an in-memory cache size, so alert on any sustained increase rather than a fixed count.
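
One way to alert on "any increase" rather than a fixed count is to look at the counter's growth over a window. The 15-minute window here is an assumption, not a value from AvalancheGo.

```promql
# Fires if the bad-block counter grew at any point in the last 15 minutes.
increase(avalanche_subnetevm_vm_eth_chain_block_bad_count{chain="<chain>"}[15m]) > 0
```

The "sustained" part would typically come from the `for` duration on the Prometheus alerting rule that wraps this expression.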

Connected stake

Metric: avalanche_stake_percent_connected{chain="<chain>"}
Healthy: ≥ 0.8 (80%)
Paging: < 0.8 (80%)

The fraction of total validator stake the node has live connections to. Note this is a fraction in [0, 1], not a 0–100 value: compare against 0.8, or multiply by 100 to display a percentage. When it falls, the node cannot reach enough stake to complete polls, which directly raises the query failure rate. The node's own health check fails below ~80% (alpha/k plus a buffer, at the defaults); query sending stops below 75% (AlphaConfidence/K).
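
A minimal sketch of both uses, as two separate expressions: the paging comparison on the raw fraction, and the scaled value for a dashboard panel.

```promql
# Paging: the node is connected to less than 80% of stake.
# The gauge is a fraction in [0, 1], so compare against 0.8, not 80.
avalanche_stake_percent_connected{chain="<chain>"} < 0.8

# Dashboard: the same value displayed as a percentage.
100 * avalanche_stake_percent_connected{chain="<chain>"}
```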

Processing blocks

Metric: avalanche_snowman_blks_processing{chain="<chain>"}
Healthy: low and stable
Warning: sustained climb (e.g. > 6 over 5 min)
Paging: sustained spike (e.g. > 15 over 5 min)

Blocks in consensus but not yet finalized. A sustained climb usually means finalization is stalling. The thresholds above are operational policy, not code defaults — AvalancheGo's own consensus health check trips on MaxOutstandingItems (256) and MaxItemProcessingTime (30s), so tune the numbers to your chain's block rate and watch the trend. To confirm whether the chain is genuinely stuck (versus just busy), check that avalanche_snowman_last_accepted_height{chain="<chain>"} is still increasing.
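
The expressions below sketch the paging threshold and the stall check. Reading "> 15 over 5 min" as "above 15 for the whole window" is an assumption; an average over the window would be a looser alternative.

```promql
# Paging sketch: more than 15 blocks were in processing for the entire 5-minute window.
min_over_time(avalanche_snowman_blks_processing{chain="<chain>"}[5m]) > 15

# Stall check: the last accepted height did not change during the same window.
changes(avalanche_snowman_last_accepted_height{chain="<chain>"}[5m]) == 0
```

The two can be joined with `and` so the alert fires only when blocks are piling up and nothing is being accepted. On a chain that produces blocks only when there is traffic, a flat height by itself is not necessarily a stall.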

Benched validators

Metric: avalanche_benchlist_benched_num{chain="<chain>"}
Healthy: 0
Paging: > 1 over 10 min

The count of peers the node has temporarily stopped querying because they keep failing. A non-zero value means at least one validator is unreachable, but the gauge doesn't name which one — you'll need the node's logs to identify it. Also note benchlisting is capped by stake: a high-stake validator can keep failing without ever being benched, so 0 doesn't guarantee every validator is healthy.
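
One possible reading of the "> 1 over 10 min" threshold is that the gauge stayed above 1 for the full window. That interpretation is an assumption; adjust it if your policy is closer to an average.

```promql
# Fires when more than one peer stayed benched for the entire 10-minute window.
min_over_time(avalanche_benchlist_benched_num{chain="<chain>"}[10m]) > 1
```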

Number of validators

Metric: avalanche_stake_num_validators{chain="<chain>"}
Healthy: your expected validator count
Paging: < 1

The validator set the node currently sees. Dropping to 0 means it has lost its view of the set entirely (a P-Chain or L1 manager problem).
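
As a sketch, the paging condition and an optional shortfall check could look like the two expressions below. The expected count of 5 is a hypothetical value; substitute your own.

```promql
# Paging: the node sees no validators at all.
avalanche_stake_num_validators{chain="<chain>"} < 1

# Optional: fewer validators than expected (5 is a hypothetical expected count).
avalanche_stake_num_validators{chain="<chain>"} < 5
```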

Health check failures

Metric: avalanche_health_checks_failing{check="health",tag="all"}
Healthy: 0
Paging: > 0 (sustained)

A catch-all gauge of how many checks are currently failing in the node's health endpoint (networking, router, database, disk, BLS key, pending upgrades, bootstrap status, validation). It carries two labels: check (one of health, liveness, readiness) and tag (all, application, or a specific subnet ID). Use tag="all" for the complete rollup — tag="application" covers only node-wide checks and excludes per-subnet ones, so it is not a true catch-all. To watch one L1 specifically, select that subnet's ID as the tag. Because the gauge reports the current count rather than an event total, any non-zero value already means the node is unhealthy — page on > 0, optionally requiring it to persist a minute or two to avoid flapping on transient checks.
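
The persistence requirement can be sketched inside the query itself by requiring the gauge to stay non-zero across a short window. The 2-minute window is an assumption, and `<subnet-id>` is a placeholder for your subnet's ID.

```promql
# Fires when at least one health check has been failing for the whole 2-minute window.
min_over_time(avalanche_health_checks_failing{check="health",tag="all"}[2m]) > 0

# Same idea scoped to one L1 by selecting its subnet ID as the tag.
min_over_time(avalanche_health_checks_failing{check="health",tag="<subnet-id>"}[2m]) > 0
```

The same effect can usually be had by alerting on the plain gauge and setting a `for` duration on the alerting rule.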

CPU usage

Metric: avalanche_resource_tracker_cpu_usage (and host/container CPU)
Healthy: well below your core count
Paging: sustained saturation

AvalancheGo exposes its own CPU usage as avalanche_resource_tracker_cpu_usage, measured in cores (a value of 2.0 means two full cores), not a percentage. Watch this alongside host- or container-level CPU (which comes from your infrastructure, not AvalancheGo). Sustained saturation slows block verification and message handling, which shows up downstream as a higher failure rate.
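
Because the value is in cores, a saturation check has to be scaled by the machine's core count. The sketch below assumes a hypothetical 8-core host, a 10-minute window, and a 90% cutoff; none of these come from AvalancheGo.

```promql
# Average cores used over 10 minutes, as a fraction of a hypothetical 8-core host.
# Replace 8 with your core count; values near 1 indicate sustained saturation.
avg_over_time(avalanche_resource_tracker_cpu_usage[10m]) / 8 > 0.9
```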


Summary

Metric | Type | Page when
Query failure rate (polls_successful, polls_failed) | Critical | < 90% successful
Disk space remaining (disk_available_percentage) | Critical | < 10% free
Bad blocks (subnetevm_vm_eth_chain_block_bad_count) | Secondary | any sustained increase
Connected stake (stake_percent_connected) | Secondary | < 0.8
Processing blocks (blks_processing) | Secondary | sustained spike
Benched validators (benchlist_benched_num) | Secondary | > 1 over 10 min
Number of validators (stake_num_validators) | Secondary | < 1
Health check failures (health_checks_failing) | Secondary | > 0 sustained
CPU usage (resource_tracker_cpu_usage) | Secondary | sustained saturation

Start with the two critical alerts — query failure rate and disk. Add the secondary metrics to your dashboards so you can quickly find the cause when the failure rate moves.
