Recover from Regional Failover
When a region goes down, Route 53 health checks detect the failure and automatically redirect traffic to the healthy region. This runbook describes how to diagnose the failed region, restore it, and confirm that both regions are serving traffic again.
Configure your values
The commands in this runbook reference your cluster context ARNs, the failed region's values file, and its health check ID. The values shown are placeholders; substitute your own before running each command.
Step 1 - Confirm failover is active
Verify which health check is failing:
aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table
Confirm the healthy region is still passing:
aws route53 get-health-check-status --health-check-id yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table
During failover, the platform continues to operate normally through the healthy region. Users may experience slightly higher latency if they are geographically closer to the failed region.
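The per-checker table from the commands above can also be reduced to a single failure count. The following is a minimal sketch, assuming that StatusReport.Status strings begin with "Success" or "Failure"; count_failed_observations and summarize_health_check are hypothetical helper names, not part of the AWS CLI.

```shell
# Count checker observations whose status starts with "Failure".
# Reads statuses from stdin, tab-separated or one per line.
count_failed_observations() {
  # grep -c exits non-zero when the count is 0, so ignore its exit code.
  tr '\t' '\n' | grep -c '^Failure' || true
}

# Hypothetical wrapper: print a one-line summary for a health check ID ($1).
summarize_health_check() {
  statuses="$(aws route53 get-health-check-status --health-check-id "$1" \
    --query 'HealthCheckObservations[].StatusReport.Status' --output text)"
  failed="$(printf '%s\n' "$statuses" | count_failed_observations)"
  echo "health check $1: $failed checker(s) reporting failure"
}
```

Call summarize_health_check once with the failed region's health check ID and once with the healthy region's ID. A non-zero count for the healthy region suggests the problem is not limited to one region.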
Step 2 - Diagnose the failed region
Check the state of the platform pods in the failed region:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get pods -n vcluster-platform -l app=loft
Check pod logs for errors:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
logs -n vcluster-platform -l app=loft --tail=50
Check the ALB and ingress status:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get ingress -n vcluster-platform loft
Common failure causes:
- Pod crashes (CrashLoopBackOff): Check logs for database connectivity errors or resource exhaustion.
- Node failures: Check kubectl get nodes for NotReady nodes.
- ALB unhealthy: Verify the ALB target group health in the AWS console or with aws elbv2 describe-target-health.
- Network partition: Verify that VPC peering between the cluster VPC and the database VPC is active, that routes exist on the correct route tables (including public route tables if nodes are in public subnets), and that the database security group allows port 3306 from the cluster VPC CIDR.
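The node and ALB checks from the list above could be scripted roughly as follows. This is a sketch under assumptions: filter_not_ready and check_nodes_and_targets are hypothetical helpers, and the target group ARN is a value you must look up yourself, since it does not appear in this runbook.

```shell
# Print the names of nodes whose STATUS column contains "NotReady".
# Expects the output of: kubectl get nodes --no-headers
filter_not_ready() {
  awk '$2 ~ /NotReady/ { print $1 }'
}

# $1: kube context of the failed region
# $2: target group ARN behind the ALB (assumed; not given in this runbook)
check_nodes_and_targets() {
  kubectl --context "$1" get nodes --no-headers | filter_not_ready
  aws elbv2 describe-target-health --target-group-arn "$2" \
    --query 'TargetHealthDescriptions[].{Target:Target.Id,State:TargetHealth.State}' \
    --output table
}
```

An empty node list together with unhealthy targets points toward the pods or the ingress rather than the nodes.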
Step 3 - Restore the failed region
The recovery steps depend on the failure cause.
If pods are crashlooping or unhealthy, restart the deployment:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout restart deployment/loft -n vcluster-platform
If the deployment is scaled to zero or missing, re-apply the values file:
Using the vCluster CLI:
vcluster platform start \
--namespace vcluster-platform \
--kube-context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
--values platform-us-east-1-values.yaml \
--upgrade \
--no-tunnel
Using Helm:
helm upgrade loft vcluster-platform --install --create-namespace --repository-config='' \
--namespace vcluster-platform \
--repo "https://charts.loft.sh/" \
--version 4.8.0 \
--kube-context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
-f platform-us-east-1-values.yaml \
--server-side=true --force-conflicts
Then scale up if needed:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
scale deployment -n vcluster-platform loft --replicas=3
If the EKS cluster itself is down, follow the AWS documentation to restore the cluster or create a replacement, then redeploy the platform using the region's values file.
Step 4 - Wait for the rollout to complete
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout status deployment/loft -n vcluster-platform
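If you run this step from a script, a bounded wait avoids blocking forever on a rollout that never converges. A minimal sketch, assuming a 5-minute default that is an illustrative value rather than a recommendation from this runbook; wait_for_rollout is a hypothetical helper name.

```shell
# Wait for the loft deployment to finish rolling out, giving up after a timeout.
# $1: kube context
# $2: timeout (optional; the 5m default is an assumption, not from the runbook)
wait_for_rollout() {
  kubectl --context "$1" rollout status deployment/loft \
    -n vcluster-platform --timeout="${2:-5m}"
}
```

kubectl rollout status exits with a non-zero status when the timeout expires, so the caller can branch on the result and return to the diagnosis in Step 2.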
Step 5 - Verify recovery
Confirm the Route 53 health check for the restored region returns healthy:
aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table
Route 53 health checks run every 10 seconds with a failure threshold of 3. After the platform becomes healthy, expect up to 30 seconds before the health check status updates. DNS clients with cached responses may take an additional 60 seconds (the ALB alias record TTL) to start resolving to the restored region.
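The timing above works out as follows. This sketch only restates the arithmetic using the interval, threshold, and TTL quoted in this section; adjust the variables if your health check or record settings differ.

```shell
interval=10   # seconds between health check requests
threshold=3   # consecutive results needed to change status
ttl=60        # seconds, ALB alias record TTL

# Time for the health check status to update after the platform is healthy.
status_delay=$((interval * threshold))
# Rough upper bound for a client holding a cached DNS response.
client_delay=$((status_delay + ttl))

echo "health check status update: up to ${status_delay}s"
echo "cached DNS clients: up to ${client_delay}s"
```

If the health check still reports failure well after this window, treat the region as not yet recovered and return to Step 2.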
Verify both regions are serving traffic:
for CTX in arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 arn:aws:eks:eu-west-1:123456789012:cluster/platform-multi-region-eu-west-1; do
echo "=== $CTX ==="
kubectl --context "$CTX" get pods -n vcluster-platform -l app=loft
echo
done
Confirm the custom DERP server is operational by checking for NetworkPeer resources:
kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get networkpeers.storage.loft.sh -A
Both peers should show Online: true.