Operations and Maintenance
Day-2 operations for Avalanche infrastructure — AvalancheGo upgrades, monitoring, health checks, staking key backup, database snapshots, and rolling restarts.
This guide covers ongoing operations for infrastructure deployed with avalanche-deploy. Commands are available for both Terraform + Ansible and Kubernetes deployment paths.
Health Checks
Run comprehensive health checks across all nodes:
Terraform + Ansible:

```bash
# Basic health checks
make health-checks

# Include L1 chain status
make health-checks CHAIN_ID=$CHAIN_ID
```

Kubernetes:

```bash
# Basic health checks
make k8s-health-checks

# Include L1 chain status
make k8s-health-checks CHAIN_ID=$CHAIN_ID
```

Health checks verify:
- AvalancheGo service/pod status
- NodeID and version consistency
- P-Chain, X-Chain, and C-Chain bootstrap status
- L1 block number (if `CHAIN_ID` is provided)
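Bootstrap status can also be checked by hand against AvalancheGo's `info.isBootstrapped` API. A minimal sketch, with a sample response inlined so it runs without a live node:

```bash
# info.isBootstrapped is a real AvalancheGo API; the response below is a
# sample inlined for illustration. Against a running node you would fetch it:
#   curl -s -X POST -H 'content-type:application/json' \
#     -d '{"jsonrpc":"2.0","id":1,"method":"info.isBootstrapped","params":{"chain":"P"}}' \
#     http://<node-ip>:9650/ext/info
RESPONSE='{"jsonrpc":"2.0","result":{"isBootstrapped":true},"id":1}'

if echo "$RESPONSE" | grep -q '"isBootstrapped":true'; then
  echo "P-Chain bootstrapped"
else
  echo "P-Chain still bootstrapping"
fi
```

Repeat with `"chain":"X"` and `"chain":"C"` to cover all three Primary Network chains.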
Monitoring
Deploy Prometheus and Grafana
```bash
make monitoring
```

Access Grafana at `http://<monitoring-ip>:3000` (default credentials: admin/admin).
Pre-Built Dashboards
| Dashboard | Metrics |
|---|---|
| Avalanche L1 | Block height, transaction throughput, validator status |
| L1 EVM | Gas usage, contract calls, pending transactions |
| P-Chain | Staking metrics, validator set changes |
| System Health | CPU, memory, disk, network for all nodes |
Prometheus is pre-configured to scrape both AvalancheGo metrics and node_exporter system metrics from all hosts.
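For reference, a scrape configuration with this shape would look like the following sketch. Job names and targets are illustrative, not the deployed values; AvalancheGo serves Prometheus metrics at `/ext/metrics` on its API port, and node_exporter serves them on `:9100`.

```yaml
# Illustrative prometheus.yml fragment -- targets are placeholders.
scrape_configs:
  - job_name: avalanchego
    metrics_path: /ext/metrics          # AvalancheGo's metrics endpoint
    static_configs:
      - targets: ["10.0.1.10:9650", "10.0.1.11:9650"]
  - job_name: node_exporter             # system metrics (CPU, memory, disk)
    static_configs:
      - targets: ["10.0.1.10:9100", "10.0.1.11:9100"]
```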
Viewing Logs
```bash
# View logs from all nodes
make logs
```

Or SSH directly to inspect a specific node:

```bash
ssh -i ~/.ssh/avalanche-deploy ubuntu@<node-ip> \
  "sudo journalctl -u avalanchego -f --no-pager -n 100"
```

Rolling Restart
Restart all nodes one at a time with health checks between each restart. This ensures zero downtime:
```bash
make rolling-restart
```

The playbook:
- Stops AvalancheGo on one node
- Starts AvalancheGo
- Waits for the node to report healthy
- Moves to the next node
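The steps above amount to the loop sketched below. The node names and helper functions are stand-ins for illustration, not the playbook's actual tasks:

```bash
# Hypothetical sketch of the rolling-restart loop. restart_node and
# check_health are stubs standing in for the real systemctl restart
# and a poll of http://<node>:9650/ext/health on each host.
restart_node() { echo "restarting $1"; }
check_health() { echo healthy; }

for node in node-1 node-2 node-3; do
  restart_node "$node"
  until [ "$(check_health "$node")" = "healthy" ]; do
    sleep 5                      # wait before re-polling health
  done
  echo "$node healthy; moving to next node"
done
```

Because only one node is ever down at a time, the validator set as a whole keeps serving.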
Upgrading AvalancheGo
Perform a zero-downtime rolling upgrade to a new AvalancheGo version:
```bash
make upgrade VERSION=1.14.1
```

Subnet-EVM is bundled with AvalancheGo v1.12.0+ and updates automatically with each AvalancheGo upgrade. No separate plugin management is needed.
The upgrade playbook follows the same rolling pattern as restarts: one node at a time with health checks between each upgrade.
Staking Key Backup and Restore
Backup
```bash
# Back up all validator keys to S3 (KMS encrypted)
make backup-keys CLOUD=aws
```

Keys are encrypted with AWS KMS. Validator instances access the S3 bucket via an IAM role; no credentials are stored on disk.
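For reference, an equivalent KMS-encrypted upload with the AWS CLI has the shape below. The bucket, key path, and KMS alias are placeholders, and the sketch only prints the command rather than executing it:

```bash
# Placeholder names for illustration -- not the playbook's actual values.
BUCKET=my-staking-keys-bucket
KMS_KEY=alias/staking-keys

# --sse aws:kms and --sse-kms-key-id are real aws-cli flags requesting
# server-side encryption with a customer-managed KMS key. Printed, not run,
# so this sketch is self-contained.
echo aws s3 cp /home/avalanche/.avalanchego/staking/staker.key \
  "s3://$BUCKET/primary-validator-1/staker.key" \
  --sse aws:kms --sse-kms-key-id "$KMS_KEY"
```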
```bash
# Deploy a daily backup CronJob
make k8s-backup-keys BACKUP_BUCKET=my-bucket BACKUP_PROVIDER=s3
```

Supports S3 and GCS. Use IRSA (AWS) or Workload Identity (GCP) for credential-free access on managed Kubernetes.
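The deployed CronJob has roughly the shape sketched below. All names, the schedule, the image, and the command are illustrative placeholders, not the chart's actual values; the key idea is a service account bound to IRSA or Workload Identity and a read-only mount of the staking-key volume:

```yaml
# Illustrative CronJob shape -- names and values are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: staking-key-backup
spec:
  schedule: "0 3 * * *"                 # daily
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup    # bound to IRSA / Workload Identity
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: amazon/aws-cli
              command: ["sh", "-c", "aws s3 cp /staking s3://my-bucket/keys --recursive"]
              volumeMounts:
                - name: staking
                  mountPath: /staking
                  readOnly: true        # backup never mutates the keys
          volumes:
            - name: staking
              persistentVolumeClaim:
                claimName: avalanchego-staking
```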
Restore
```bash
# Restore keys from one node to another
make restore-keys CLOUD=aws SOURCE=primary-validator-1 TARGET_IP=10.0.1.50
```

List Backups

```bash
aws s3 ls s3://$(terraform -chdir=terraform/primary-network/aws output -raw staking_keys_bucket)/
```

Database Snapshots
Create lz4-compressed snapshots of node databases for fast bootstrapping:
```bash
# Create a snapshot
make create-snapshot CLOUD=aws NODE=primary-validator-1

# Create with a custom name
make create-snapshot CLOUD=aws NODE=primary-validator-1 NAME=mainnet-2025-02

# List snapshots
make list-snapshots CLOUD=aws

# Restore a snapshot
make restore-snapshot CLOUD=aws TARGET=migration-target
```

Snapshots include SHA256 checksums for integrity verification. Enable integrity checking with:
```bash
cd ansible && ansible-playbook -i inventory/aws_hosts playbooks/primary-network/restore-snapshot.yml \
  --limit migration-target \
  -e verify_integrity=true
```

Verified restore mode requires approximately 3x the snapshot size in free disk space (download + extract + verify).
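The checksum verification can also be run by hand. A self-contained sketch with a stand-in file (the checksum file naming here is illustrative):

```bash
# Stand-in file; a real snapshot is a multi-GB lz4-compressed archive.
echo "example snapshot data" > snapshot.tar.lz4

# A SHA256 checksum recorded at snapshot-creation time:
sha256sum snapshot.tar.lz4 > snapshot.tar.lz4.sha256

# Verification, as performed when verify_integrity=true:
sha256sum -c snapshot.tar.lz4.sha256 && echo "integrity OK"
```

A mismatch makes `sha256sum -c` exit non-zero, so a corrupted download fails loudly before it is extracted.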
Reset L1 Chain Data
Wipe L1 chain data on all nodes for redeployment. This preserves staking keys and Primary Network data:
Terraform + Ansible:

```bash
make reset-l1
```

Kubernetes:

```bash
make k8s-reset-l1
```

The Kubernetes path scales down pods, cleans chain data from PVCs (preserving staking keys), removes the L1 tracking config, and scales back up.
Tear Down Infrastructure
Permanently destroy all cloud resources:
```bash
# Destroy L1 infrastructure
make destroy

# Destroy Primary Network infrastructure
make primary-destroy CLOUD=aws
```

This permanently deletes all VMs, disks, and networking. Staking keys previously backed up to S3 are preserved, but node databases are permanently lost.
Command Reference
L1 Operations
| Command | Description |
|---|---|
| `make status` | Check node sync status |
| `make health-checks` | Run comprehensive health checks |
| `make logs` | View node logs |
| `make rolling-restart` | Zero-downtime rolling restart |
| `make upgrade VERSION=x.y.z` | Rolling AvalancheGo upgrade |
| `make monitoring` | Deploy Prometheus + Grafana |
| `make reset-l1` | Wipe L1 chain data (keeps keys) |
| `make destroy` | Tear down all infrastructure |
Primary Network Operations
| Command | Description |
|---|---|
| `make primary-status CLOUD=aws` | Check Primary Network node status |
| `make backup-keys CLOUD=aws` | Backup staking keys to S3 |
| `make restore-keys CLOUD=aws SOURCE=... TARGET_IP=...` | Restore staking keys |
| `make create-snapshot CLOUD=aws NODE=...` | Create database snapshot |
| `make restore-snapshot CLOUD=aws TARGET=...` | Restore database snapshot |
| `make list-snapshots CLOUD=aws` | List available S3 snapshots |
| `make prepare-migration CLOUD=aws NODE=...` | Prepare node for migration |
| `make migrate-validator CLOUD=aws SOURCE=... TARGET=...` | Execute validator migration |
| `make primary-destroy CLOUD=aws` | Tear down Primary Network infra |
Kubernetes Operations
| Command | Description |
|---|---|
| `make k8s-health-checks` | Run comprehensive health checks |
| `make k8s-backup-keys BACKUP_BUCKET=...` | Deploy staking key backup CronJob |
| `make k8s-reset-l1` | Wipe L1 chain data (keeps keys) |
| `make k8s-init-validator-manager` | Initialize ValidatorManager contract |
| `make k8s-erpc` | Deploy eRPC load balancer |
| `make k8s-faucet` | Deploy token faucet |
| `make k8s-blockscout` | Deploy Blockscout block explorer |
| `make k8s-graph-node` | Deploy The Graph Node |
| `make k8s-safe` | Deploy Safe multisig infrastructure |
| `make k8s-monitoring` | Deploy Prometheus + Grafana |
| `make k8s-icm-relayer` | Deploy ICM Relayer |
| `make k8s-cleanup` | Remove all Helm releases |