Troubleshooting Runtime Issues
Diagnose and resolve common tmpnet runtime issues for local processes and Kubernetes deployments
This guide helps you diagnose and resolve common issues with tmpnet's different runtime environments. Issues are organized by runtime type for quick reference.
Local Process Runtime Issues
Port Conflicts
Symptom: Error messages like "address already in use" or "bind: address already in use" when starting a network.
Cause: A previous network is still running, or another application is using the ports.
Solution:
```bash
# Check for orphaned avalanchego processes
ps aux | grep avalanchego

# Kill any orphaned processes
pkill -f avalanchego

# Verify ports are free
lsof -i :9650-9660
```

Prevention: Always use dynamic port allocation by setting ports to "0":
```go
network.DefaultFlags = tmpnet.FlagsMap{
    "http-port":    "0", // Let OS assign available port
    "staking-port": "0", // Let OS assign available port
}
```

Avoid hardcoding port numbers unless you have a specific reason. Dynamic ports prevent conflicts when running multiple networks or tests concurrently.
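The mechanism behind "0" is ordinary OS port assignment. This stdlib-only sketch (independent of tmpnet) shows it directly:

```go
package main

import (
    "fmt"
    "net"
)

func main() {
    // Binding to port 0 asks the OS for any free port, which is what
    // tmpnet's "http-port": "0" relies on.
    listener, err := net.Listen("tcp", "127.0.0.1:0")
    if err != nil {
        panic(err)
    }
    defer listener.Close()

    // Read back the port the OS actually assigned.
    fmt.Println("assigned port:", listener.Addr().(*net.TCPAddr).Port)
}
```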
Process Not Stopping
Symptom: After calling network.Stop(), avalanchego processes remain running in the background.
Cause: Process termination may fail silently, or cleanup may not complete properly.
Solution:
```bash
# Find all avalanchego processes
ps aux | grep avalanchego

# Try graceful termination first
pkill -TERM -f avalanchego
sleep 5

# If processes still running, force kill as last resort
pkill -9 -f avalanchego

# Clean up temporary directories if needed
# First verify which network you want to delete
ls -lt ~/.tmpnet/networks/

# Then delete the specific network directory
rm -rf ~/.tmpnet/networks/20250312-143052.123456
```

Use pkill -9 (SIGKILL) only as a last resort after graceful termination fails. SIGKILL doesn't allow cleanup and can leave the database in an inconsistent state.
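To do the same escalation programmatically, say in test teardown, here is a stdlib sketch; the PID and the 10-second grace period are placeholders:

```go
package main

import (
    "fmt"
    "os"
    "syscall"
    "time"
)

// stopProcess asks a process to exit gracefully, then force-kills it.
func stopProcess(pid int, grace time.Duration) error {
    proc, err := os.FindProcess(pid)
    if err != nil {
        return err
    }
    // Ask for graceful shutdown first so avalanchego can flush its database.
    if err := proc.Signal(syscall.SIGTERM); err != nil {
        return err
    }
    deadline := time.Now().Add(grace)
    for time.Now().Before(deadline) {
        // Signal 0 checks for existence without delivering a signal.
        if err := proc.Signal(syscall.Signal(0)); err != nil {
            return nil // Process has exited.
        }
        time.Sleep(500 * time.Millisecond)
    }
    // Last resort: SIGKILL cannot be caught, so no cleanup runs.
    return proc.Signal(syscall.SIGKILL)
}

func main() {
    if err := stopProcess(12345, 10*time.Second); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}
```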
Prevention: Always use context with timeout for Stop operations:
```go
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()

if err := network.Stop(ctx); err != nil {
    // Log error but continue cleanup
    log.Printf("Failed to stop network cleanly: %v", err)
}
```

Binary Not Found
Symptom: Error "avalanchego not found" or "executable file not found in $PATH" when starting nodes.
Cause: The avalanchego binary path is incorrect or not specified.
Solution:
```bash
# Verify the binary exists
ls -lh /path/to/avalanchego

# Use absolute path when configuring
export AVALANCHEGO_PATH="$(pwd)/bin/avalanchego"
```

Or specify the path in code:

```go
runtimeCfg := &tmpnet.ProcessRuntimeConfig{
    AvalancheGoPath: "/absolute/path/to/avalanchego",
}
```

Verification:
```bash
# Test the binary works
/path/to/avalanchego --version
# Should output version information
```

When using relative paths, ensure they resolve correctly from your test working directory. Absolute paths are more reliable for test automation.
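A defensive sketch of that advice, using only the standard library; the AVALANCHEGO_PATH environment variable mirrors the shell example above, and the resolved path can then be passed to ProcessRuntimeConfig:

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// resolveBinary converts a possibly-relative path into an absolute one
// and fails fast if the binary is missing or not executable.
func resolveBinary(path string) (string, error) {
    abs, err := filepath.Abs(path)
    if err != nil {
        return "", err
    }
    info, err := os.Stat(abs)
    if err != nil {
        return "", fmt.Errorf("avalanchego binary not found at %s: %w", abs, err)
    }
    if info.Mode()&0o111 == 0 {
        return "", fmt.Errorf("%s is not executable", abs)
    }
    return abs, nil
}

func main() {
    abs, err := resolveBinary(os.Getenv("AVALANCHEGO_PATH"))
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("using binary:", abs)
    // abs can now be assigned to tmpnet.ProcessRuntimeConfig.AvalancheGoPath.
}
```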
Logs Location
Where to find logs: Node logs are stored in the network directory under each node's subdirectory.
```bash
# Find the latest network
ls -lt ~/.tmpnet/networks/

# Use the 'latest' symlink
tail -f ~/.tmpnet/networks/latest/NodeID-*/logs/main.log

# Or specify the timestamp directory
tail -f ~/.tmpnet/networks/20250312-143052.123456/NodeID-7Xhw2mX5xVHr1ANraYiTgjuB8Jqdbj8/logs/main.log
```

Useful log commands:
```bash
# View all node logs simultaneously
tail -f ~/.tmpnet/networks/latest/NodeID-*/logs/main.log

# Search for errors across all nodes
grep -r "ERROR" ~/.tmpnet/networks/latest/*/logs/

# Monitor a specific node
export NODE_ID="NodeID-7Xhw2mX5xVHr1ANraYiTgjuB8Jqdbj8"
tail -f ~/.tmpnet/networks/latest/$NODE_ID/logs/main.log
```
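If you want the same log paths from Go, for instance to attach them to a test failure report, here is a stdlib sketch that resolves the latest symlink layout shown above:

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func main() {
    home, err := os.UserHomeDir()
    if err != nil {
        panic(err)
    }
    // The 'latest' symlink points at the most recently started network.
    pattern := filepath.Join(home, ".tmpnet", "networks", "latest",
        "NodeID-*", "logs", "main.log")
    logs, err := filepath.Glob(pattern)
    if err != nil {
        panic(err)
    }
    for _, path := range logs {
        fmt.Println("node log:", path)
    }
}
```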
Kubernetes Runtime Issues

Pod Stuck in Pending
Symptom: Node pods remain in "Pending" state and never start.
Common causes:
- Insufficient cluster resources (CPU/memory)
- Node selector constraints not met
- Storage class unavailable
- Image pull errors (see below)
Diagnosis:
```bash
# Check pod status details
kubectl describe pod avalanchego-node-0 -n tmpnet

# Look for events section
kubectl get events -n tmpnet --sort-by='.lastTimestamp'

# Check node resources
kubectl top nodes
```

Solutions:
```bash
# If resource limits are too high, adjust them
kubectl edit statefulset avalanchego -n tmpnet

# Verify your cluster has available nodes
kubectl get nodes

# Check for node taints
kubectl describe nodes | grep -i taint
```

Image Pull Errors
Symptom: Pod status shows "ImagePullBackOff" or "ErrImagePull".
Cause: Cannot pull the Docker image from the registry.
Diagnosis:
```bash
# Check image pull status
kubectl describe pod avalanchego-node-0 -n tmpnet | grep -A 5 "Events:"

# Verify image name
kubectl get pod avalanchego-node-0 -n tmpnet -o jsonpath='{.spec.containers[0].image}'
```

Solutions:
```bash
# Verify image exists in registry
docker pull avaplatform/avalanchego:latest

# If using private registry, check image pull secrets
kubectl get secrets -n tmpnet

# Create image pull secret if needed
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n tmpnet
```

Alternative: Use a local image with kind:
```bash
# Load image into kind cluster
kind load docker-image avaplatform/avalanchego:latest --name tmpnet-cluster
```

Ingress Not Working
Symptom: Cannot reach node APIs through ingress endpoints, connection refused or timeouts.
Cause: Ingress controller not installed, misconfigured, or ingress rules not applied.
Diagnosis:
```bash
# Check if ingress controller is running
kubectl get pods -n ingress-nginx

# Verify ingress resource exists
kubectl get ingress -n tmpnet

# Check ingress details
kubectl describe ingress avalanchego-ingress -n tmpnet

# Test service directly (bypassing ingress)
kubectl port-forward svc/avalanchego-node-0 9650:9650 -n tmpnet
curl http://localhost:9650/ext/health
```

Solutions:
```bash
# Install ingress controller if missing
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml

# Verify ingress host configuration
kubectl get ingress -n tmpnet -o yaml | grep host:

# Check service endpoints
kubectl get endpoints -n tmpnet
```

For kind clusters, ensure you created the cluster with extraPortMappings to expose ports 80/443. See the kind ingress documentation.
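To exercise the ingress rule itself rather than the backing service, send a request to the controller's port with an explicit Host header. A stdlib sketch; the host tmpnet.example.com and port 80 are placeholders for whatever your ingress actually uses:

```go
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Talk to the ingress controller (port 80 on a kind cluster with
    // extraPortMappings), not to the node service directly.
    req, err := http.NewRequest(http.MethodGet, "http://127.0.0.1:80/ext/health", nil)
    if err != nil {
        panic(err)
    }
    // The Host header selects the ingress rule; replace with your ingress host.
    req.Host = "tmpnet.example.com"

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err) // Connection refused here usually means no ingress controller.
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status, string(body))
}
```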
StatefulSet Not Updating
Symptom: After updating the StatefulSet (e.g., changing image version), pods still run the old image.
Cause: StatefulSet update strategy is set to OnDelete by default, requiring manual pod deletion.
Solution:
```bash
# Check update strategy
kubectl get statefulset avalanchego -n tmpnet -o jsonpath='{.spec.updateStrategy}'

# Manually delete pods to trigger update
kubectl delete pod avalanchego-node-0 -n tmpnet
# StatefulSet will recreate with new image

# Or delete all pods
kubectl delete pods -l app=avalanchego -n tmpnet
```

Change to rolling updates:
```bash
kubectl patch statefulset avalanchego -n tmpnet -p '{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}'
```
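The same patch can be applied from Go with client-go, for example as part of test setup. A sketch assuming a kubeconfig at the default location and the resource names used above:

```go
package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    home, _ := os.UserHomeDir()
    cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // Same strategic merge patch as the kubectl command above.
    patch := []byte(`{"spec":{"updateStrategy":{"type":"RollingUpdate"}}}`)
    sts, err := clientset.AppsV1().StatefulSets("tmpnet").Patch(
        context.Background(), "avalanchego",
        types.StrategicMergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("update strategy is now:", sts.Spec.UpdateStrategy.Type)
}
```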
Persistent Volume Issues

Symptom: Pod cannot start with error "FailedMount" or "PVC not bound".
Cause: Persistent Volume Claims (PVCs) cannot be provisioned or bound.
Diagnosis:
```bash
# Check PVC status
kubectl get pvc -n tmpnet
# Should show "Bound" status

# If "Pending", check details
kubectl describe pvc data-avalanchego-node-0 -n tmpnet

# Verify storage class exists
kubectl get storageclass
```

Solutions:
```bash
# If using kind or minikube, ensure default storage class exists
kubectl get storageclass
# For kind, standard storage class should be available by default
# For custom clusters, install a storage provisioner

# Delete stuck PVCs if needed (will delete data!)
kubectl delete pvc data-avalanchego-node-0 -n tmpnet
```

General Runtime Issues
Health Check Failures
Symptom: Node reports as unhealthy or IsHealthy() returns false in tests.
Cause: Node may still be bootstrapping, or there's a configuration issue.
Health check endpoint: GET /ext/health/liveness on the HTTP port.
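A programmatic equivalent of the curl check below, using only the standard library and assuming the default port 9650:

```go
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    resp, err := http.Get("http://localhost:9650/ext/health/liveness")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Only the top-level "healthy" field matters here; the per-check
    // details are ignored.
    var health struct {
        Healthy bool `json:"healthy"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
        panic(err)
    }
    fmt.Println("healthy:", health.Healthy)
}
```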
Diagnosis:
```bash
# Check health endpoint directly
curl http://localhost:9650/ext/health/liveness
# Expected healthy response (abridged):
# {"checks":{"network":{"message":{},"timestamp":"...","duration":123,"contiguousFailures":0,"timeOfFirstFailure":null}},"healthy":true}

# Check if a chain is still bootstrapping (the info API is JSON-RPC over POST)
curl -X POST -H 'content-type: application/json' \
  --data '{"jsonrpc":"2.0","id":1,"method":"info.isBootstrapped","params":{"chain":"P"}}' \
  http://localhost:9650/ext/info | jq '.result.isBootstrapped'
```

Solutions:
Wait longer - bootstrapping can take time:
```go
// Use generous timeout for health checks
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()

err := node.WaitForHealthy(ctx)
if err != nil {
    return fmt.Errorf("node failed to become healthy: %w", err)
}
```

Check logs for errors:
```bash
# Look for bootstrap progress
tail -f ~/.tmpnet/networks/latest/NodeID-*/logs/main.log | grep -i "bootstrap"

# Check for errors
tail -f ~/.tmpnet/networks/latest/NodeID-*/logs/main.log | grep -i "error"
```

The first node in a network typically takes longer to start because it must wait for staking to be enabled. Subsequent nodes bootstrap from the first node.
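Because nodes become healthy at different times, tests usually wait on the whole set. A sketch that reuses node.WaitForHealthy from above; the tmpnet import path and the Node.NodeID field are assumptions based on the avalanchego repository layout:

```go
package mytest

import (
    "context"
    "fmt"

    "github.com/ava-labs/avalanchego/tests/fixture/tmpnet"
)

// waitForAllHealthy blocks until every node reports healthy or ctx expires.
func waitForAllHealthy(ctx context.Context, nodes []*tmpnet.Node) error {
    for _, node := range nodes {
        // A single shared ctx bounds the total wait across all nodes.
        if err := node.WaitForHealthy(ctx); err != nil {
            return fmt.Errorf("node %s failed to become healthy: %w", node.NodeID, err)
        }
    }
    return nil
}
```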
Monitoring Not Working
Symptom: No metrics or logs appear in Prometheus/Grafana/Loki dashboards.
Diagnosis:
```bash
# Check if collectors are running
ps aux | grep prometheus
ps aux | grep promtail

# Verify environment variables
echo $PROMETHEUS_URL
echo $LOKI_URL

# Check service discovery configs exist
ls -la ~/.tmpnet/prometheus/file_sd_configs/
ls -la ~/.tmpnet/promtail/file_sd_configs/
```

Solutions:
```bash
# Start collectors if not running
tmpnetctl start-metrics-collector
tmpnetctl start-logs-collector

# Verify binaries are in PATH
which prometheus
which promtail

# If using nix, ensure development shell is active
nix develop

# Check collector logs
tail -f ~/.tmpnet/prometheus/*.log
tail -f ~/.tmpnet/promtail/*.log
```

Verify metrics are being collected:
```bash
# Query Prometheus directly
curl -s "${PROMETHEUS_URL}/api/v1/query?query=up" \
  -u "${PROMETHEUS_USERNAME}:${PROMETHEUS_PASSWORD}" \
  | jq
```
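The same query from Go using only the standard library; it reads the PROMETHEUS_* environment variables the curl command above relies on:

```go
package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    // Same query as the curl example: the `up` metric lists scrape targets.
    url := os.Getenv("PROMETHEUS_URL") + "/api/v1/query?query=up"
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        panic(err)
    }
    req.SetBasicAuth(os.Getenv("PROMETHEUS_USERNAME"), os.Getenv("PROMETHEUS_PASSWORD"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}
```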
Performance Troubleshooting

Slow Network Bootstrap
Symptom: Network takes longer than 5 minutes to bootstrap.
Common causes:
- Network too large (many nodes/subnets)
- Insufficient system resources
- Debug logging enabled
Solutions:
Reduce network size for testing:
```go
// Use fewer nodes for faster tests
network.Nodes = tmpnet.NewNodesOrPanic(3) // Instead of 5+
```

Reduce logging verbosity:
```go
network.DefaultFlags = tmpnet.FlagsMap{
    "log-level": "info", // Instead of "debug" or "trace"
}
```

Increase system resources:
```bash
# Check current resource usage
top
df -h ~/.tmpnet/

# Clean up old networks
rm -rf ~/.tmpnet/networks/202*
```

High Memory Usage
Symptom: avalanchego processes consume excessive memory, system becomes slow.
Diagnosis:
```bash
# Check memory usage per process
ps aux | grep avalanchego | awk '{print $2, $4, $11}'

# Monitor over time
watch -n 5 'ps aux | grep avalanchego'
```

Solutions:
Limit database size:
```go
network.DefaultFlags = tmpnet.FlagsMap{
    "db-type":            "memdb", // Use in-memory DB for tests
    "pruning-enabled":    "true",
    "state-sync-enabled": "false", // Disable if not needed
}
```

Stop old networks:
```bash
# Stop all running networks
for dir in ~/.tmpnet/networks/*/; do
  export TMPNET_NETWORK_DIR="$dir"
  tmpnetctl stop-network
done
```

Debugging Techniques
Enable Verbose Logging
Increase log verbosity to diagnose issues:
```go
node.Flags = tmpnet.FlagsMap{
    "log-level":         "trace", // Most verbose
    "log-display-level": "trace",
}
```

Capture Process Output
Redirect process output to see initialization errors:
```bash
# Run avalanchego manually with the same config
# Replace NodeID-* with an actual node directory name; note that ~ does
# not expand after '=' in a flag, so use $HOME instead
/path/to/avalanchego \
  --config-file="$HOME/.tmpnet/networks/latest/NodeID-*/flags.json" \
  2>&1 | tee avalanchego-debug.log
```
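The same capture from Go, useful when reproducing a startup failure inside a test; the binary and config paths are placeholders:

```go
package main

import (
    "fmt"
    "os/exec"
)

func main() {
    // CombinedOutput interleaves stdout and stderr, which is where
    // avalanchego reports flag and config errors. It blocks until the
    // process exits, so this is for reproducing startup failures,
    // not for supervising a long-running node.
    cmd := exec.Command(
        "/path/to/avalanchego",
        "--config-file=/home/user/.tmpnet/networks/latest/NodeID-EXAMPLE/flags.json",
    )
    out, err := cmd.CombinedOutput()
    fmt.Println(string(out))
    if err != nil {
        fmt.Println("process exited with error:", err)
    }
}
```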
Network State Inspection
Inspect the network state directory:
```bash
# View network configuration
cat ~/.tmpnet/networks/latest/config.json | jq

# View node flags
cat ~/.tmpnet/networks/latest/NodeID-*/flags.json | jq

# Check process status
cat ~/.tmpnet/networks/latest/NodeID-*/process.json | jq
```

Test Individual Components
Test components in isolation:
```go
// Test just node health
func TestNodeHealth(t *testing.T) {
    node := network.Nodes[0]

    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    err := node.WaitForHealthy(ctx)
    require.NoError(t, err)
}
```

Getting Help
If you're still experiencing issues:
- Check logs - Always check node logs first for error messages
- Search GitHub issues - Check avalanchego issues for similar problems
- Ask the community - Post in Avalanche Discord #developers channel
- Include details - Share error messages, logs, and your configuration
Information to include when asking for help:
- tmpnet version: `go list -m github.com/ava-labs/avalanchego`
- Runtime type: local process or Kubernetes
- Operating system and version
- Error messages and relevant log excerpts
- Network configuration (redact sensitive data)
- Steps to reproduce the issue