
Operations and Maintenance

Day-2 operations for Avalanche infrastructure — AvalancheGo upgrades, monitoring, health checks, staking key backup, database snapshots, and rolling restarts.

This guide covers ongoing operations for infrastructure deployed with avalanche-deploy. Commands are available for both Terraform + Ansible and Kubernetes deployment paths.

Health Checks

Run comprehensive health checks across all nodes:

Terraform + Ansible:

# Basic health checks
make health-checks

# Include L1 chain status
make health-checks CHAIN_ID=$CHAIN_ID

Kubernetes:

# Basic health checks
make k8s-health-checks

# Include L1 chain status
make k8s-health-checks CHAIN_ID=$CHAIN_ID

Health checks verify:

  • AvalancheGo service/pod status
  • NodeID and version consistency
  • P-Chain, X-Chain, and C-Chain bootstrap status
  • L1 block number (if CHAIN_ID is provided)
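Under the hood, the bootstrap checks query AvalancheGo's `info.isBootstrapped` API on each node. Here is a minimal sketch of parsing such a response; the `response` variable is a hard-coded sample standing in for a live `curl` call, and whether the playbook uses exactly this call is an assumption (the API itself is standard AvalancheGo):

```shell
# Sample response shape from AvalancheGo's info.isBootstrapped API.
# On a live node this would come from:
#   curl -s -X POST -H 'content-type:application/json' \
#     --data '{"jsonrpc":"2.0","id":1,"method":"info.isBootstrapped","params":{"chain":"P"}}' \
#     http://<node-ip>:9650/ext/info
response='{"jsonrpc":"2.0","result":{"isBootstrapped":true},"id":1}'

# Check the flag without requiring jq
if echo "$response" | grep -q '"isBootstrapped":true'; then
  echo "P-Chain: bootstrapped"
else
  echo "P-Chain: still bootstrapping"
fi
```

The same call with `"chain":"X"` or `"chain":"C"` covers the other two Primary Network chains.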

Monitoring

Deploy Prometheus and Grafana

make monitoring

Access Grafana: http://<monitoring-ip>:3000 (default credentials: admin/admin)

Pre-Built Dashboards

Dashboard       Metrics
Avalanche L1    Block height, transaction throughput, validator status
L1 EVM          Gas usage, contract calls, pending transactions
P-Chain         Staking metrics, validator set changes
System Health   CPU, memory, disk, network for all nodes

Prometheus is pre-configured to scrape both AvalancheGo metrics and node_exporter system metrics from all hosts.

Viewing Logs

# View logs from all nodes
make logs

Or SSH directly to inspect a specific node:

ssh -i ~/.ssh/avalanche-deploy ubuntu@<node-ip> \
  "sudo journalctl -u avalanchego -f --no-pager -n 100"

Rolling Restart

Restart all nodes one at a time with health checks between each restart. This ensures zero downtime:

make rolling-restart

The playbook:

  1. Stops AvalancheGo on one node
  2. Starts AvalancheGo on that same node
  3. Waits for the node to report healthy
  4. Moves on to the next node
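The loop the playbook implements can be sketched in plain shell. The node names and the `restart_node`/`node_healthy` functions below are placeholders for the real Ansible tasks; a live run would restart the remote service and poll `/ext/health` instead:

```shell
#!/usr/bin/env sh
# Rolling-restart sketch: one node at a time, waiting for health between nodes.
restart_node() { echo "restarting $1"; }   # placeholder: systemctl restart avalanchego
node_healthy() { true; }                   # placeholder: poll http://$1:9650/ext/health

for node in validator-1 validator-2 validator-3; do
  restart_node "$node"
  # Retry the health check until the node recovers, with an upper bound
  tries=0
  until node_healthy "$node"; do
    tries=$((tries + 1))
    [ "$tries" -ge 30 ] && { echo "$node failed to recover" >&2; exit 1; }
    sleep 10
  done
  echo "$node healthy"
done
```

Because only one node is down at any moment, the validator set as a whole keeps serving.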

Upgrading AvalancheGo

Perform a zero-downtime rolling upgrade to a new AvalancheGo version:

make upgrade VERSION=1.14.1

Subnet-EVM is bundled with AvalancheGo v1.12.0+ and updates automatically with each AvalancheGo upgrade. No separate plugin management is needed.

The upgrade playbook follows the same rolling pattern as restarts: one node at a time with health checks between each upgrade.
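After each node comes back, the reported version should match the upgrade target before moving on. A sketch of that comparison, with a hard-coded `reported` string standing in for a live `info.getNodeVersion` call (the exact version-string prefix is an assumption):

```shell
# Compare the node's reported version against the upgrade target.
# `reported` is a placeholder for the live info.getNodeVersion response,
# which returns a string like "avalanche/1.14.1".
reported="avalanche/1.14.1"
target="1.14.1"

if [ "${reported#*/}" = "$target" ]; then
  echo "upgrade to $target confirmed"
else
  echo "version mismatch: got $reported, want $target" >&2
  exit 1
fi
```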

Staking Key Backup and Restore

Backup

Terraform + Ansible:

# Backup all validator keys to S3 (KMS encrypted)
make backup-keys CLOUD=aws

Keys are encrypted with AWS KMS. Validator instances access the S3 bucket via an IAM role, so no credentials are stored on disk.

Kubernetes:

# Deploy a daily backup CronJob
make k8s-backup-keys BACKUP_BUCKET=my-bucket BACKUP_PROVIDER=s3

Supports S3 and GCS. Use IRSA (AWS) or Workload Identity (GCP) for credential-free access on managed Kubernetes.

Restore

# Restore keys from one node to another
make restore-keys CLOUD=aws SOURCE=primary-validator-1 TARGET_IP=10.0.1.50

List Backups

aws s3 ls s3://$(terraform -chdir=terraform/primary-network/aws output -raw staking_keys_bucket)/
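When the bucket holds several dated archives, the newest one can be picked out of the listing. The archive names below are hypothetical samples of `aws s3 ls` output (columns are date, time, size, key); on a live system the listing would be piped straight from the command above:

```shell
# Two hypothetical key backups, as `aws s3 ls` would list them
listing='2025-02-01 03:00:12    1048576 keys-2025-02-01.tar.gz
2025-02-02 03:00:09    1048580 keys-2025-02-02.tar.gz'

# Date-stamped names mean a lexical sort puts the newest last;
# the object key is the fourth column.
latest=$(printf '%s\n' "$listing" | sort | tail -n 1 | awk '{print $4}')
echo "latest backup: $latest"
```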

Database Snapshots

Create lz4-compressed snapshots of node databases for fast bootstrapping:

# Create a snapshot
make create-snapshot CLOUD=aws NODE=primary-validator-1

# Create with custom name
make create-snapshot CLOUD=aws NODE=primary-validator-1 NAME=mainnet-2025-02

# List snapshots
make list-snapshots CLOUD=aws

# Restore a snapshot
make restore-snapshot CLOUD=aws TARGET=migration-target

Snapshots include SHA256 checksums for integrity verification. Enable integrity checking with:

cd ansible && ansible-playbook -i inventory/aws_hosts playbooks/primary-network/restore-snapshot.yml \
  --limit migration-target \
  -e verify_integrity=true

Verified restore mode requires approximately 3x the snapshot size in free disk space (download + extract + verify).
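The integrity check boils down to a standard `sha256sum` comparison: record a checksum when the snapshot is created, verify it before extraction. A self-contained sketch against a throwaway file (the snapshot filename is illustrative, not the playbook's actual naming):

```shell
# Simulate snapshot creation plus checksum verification with a throwaway file
workdir=$(mktemp -d)
printf 'fake snapshot data\n' > "$workdir/snapshot.tar.lz4"

# Record the checksum at snapshot-creation time...
( cd "$workdir" && sha256sum snapshot.tar.lz4 > snapshot.tar.lz4.sha256 )

# ...and verify it at restore time; prints "snapshot.tar.lz4: OK" on success
( cd "$workdir" && sha256sum -c snapshot.tar.lz4.sha256 )

rm -rf "$workdir"
```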

Reset L1 Chain Data

Wipe L1 chain data on all nodes for redeployment. This preserves staking keys and Primary Network data:

# Terraform + Ansible
make reset-l1

# Kubernetes
make k8s-reset-l1

The Kubernetes path scales down pods, cleans chain data from PVCs (preserving staking keys), removes the L1 tracking config, and scales back up.

Tear Down Infrastructure

Permanently destroy all cloud resources:

# Destroy L1 infrastructure
make destroy

# Destroy Primary Network infrastructure
make primary-destroy CLOUD=aws

This permanently deletes all VMs, disks, and networking. Staking keys previously backed up to S3 are preserved, but node databases are permanently lost.

Command Reference

L1 Operations

Command                          Description
make status                      Check node sync status
make health-checks               Run comprehensive health checks
make logs                        View node logs
make rolling-restart             Zero-downtime rolling restart
make upgrade VERSION=x.y.z       Rolling AvalancheGo upgrade
make monitoring                  Deploy Prometheus + Grafana
make reset-l1                    Wipe L1 chain data (keeps keys)
make destroy                     Tear down all infrastructure

Primary Network Operations

Command                                                  Description
make primary-status CLOUD=aws                            Check Primary Network node status
make backup-keys CLOUD=aws                               Backup staking keys to S3
make restore-keys CLOUD=aws SOURCE=... TARGET_IP=...     Restore staking keys
make create-snapshot CLOUD=aws NODE=...                  Create database snapshot
make restore-snapshot CLOUD=aws TARGET=...               Restore database snapshot
make list-snapshots CLOUD=aws                            List available S3 snapshots
make prepare-migration CLOUD=aws NODE=...                Prepare node for migration
make migrate-validator CLOUD=aws SOURCE=... TARGET=...   Execute validator migration
make primary-destroy CLOUD=aws                           Tear down Primary Network infra

Kubernetes Operations

Command                                  Description
make k8s-health-checks                   Run comprehensive health checks
make k8s-backup-keys BACKUP_BUCKET=...   Deploy staking key backup CronJob
make k8s-reset-l1                        Wipe L1 chain data (keeps keys)
make k8s-init-validator-manager          Initialize ValidatorManager contract
make k8s-erpc                            Deploy eRPC load balancer
make k8s-faucet                          Deploy token faucet
make k8s-blockscout                      Deploy Blockscout block explorer
make k8s-graph-node                      Deploy The Graph Node
make k8s-safe                            Deploy Safe multisig infrastructure
make k8s-monitoring                      Deploy Prometheus + Grafana
make k8s-icm-relayer                     Deploy ICM Relayer
make k8s-cleanup                         Remove all Helm releases
