Version: v4.9 Stable

Recover from Regional Failover

info
This feature is available in Platform v4.8.0 and later.

When a region goes down, Route 53 health checks detect the failure and automatically redirect traffic to the healthy region. This runbook describes how to diagnose the failed region, restore it, and confirm that both regions are serving traffic again.

Configure your values

The commands in this runbook reference your cluster context ARNs, the failed region's values file, and its health check ID. The examples use placeholder values such as the account ID 123456789012 and the health check ID xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx; replace them with your own before running each command.

Step 1 - Confirm failover is active

Check the status of the failed region's health check:

aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

Confirm the healthy region is still passing:

aws route53 get-health-check-status --health-check-id yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

note
During failover, the platform continues to operate normally through the healthy region. Users may experience slightly higher latency if they are geographically closer to the failed region.
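
The two status checks above can be condensed into a single number per health check. The following is a minimal sketch, not part of the platform tooling: count_failing_checkers is a hypothetical helper name, and it assumes that healthy checkers report a status string beginning with "Success".

```shell
# Sketch: count Route 53 checkers that do not report success.
# Reads one StatusReport.Status string per line on stdin.
# Assumption: healthy checkers report a string that starts with "Success".
count_failing_checkers() {
  awk 'NF && !/^Success/ { n++ } END { print n + 0 }'
}

# Example invocation (needs AWS credentials; the ID is this page's placeholder).
# With --output text the statuses arrive tab-separated on one line, so tr
# puts each status on its own line before counting:
#
# aws route53 get-health-check-status \
#   --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
#   --query 'HealthCheckObservations[].StatusReport.Status' \
#   --output text | tr '\t' '\n' | count_failing_checkers
```

A result of 0 for the healthy region and a non-zero result for the failed region is consistent with an active failover.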

Step 2 - Diagnose the failed region

Check the state of the platform pods in the failed region:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get pods -n vcluster-platform -l app=loft

Check pod logs for errors:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
logs -n vcluster-platform -l app=loft --tail=50

Check the ALB and ingress status:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get ingress -n vcluster-platform loft

Common failure causes:

  • Pod crashes (CrashLoopBackOff): Check logs for database connectivity errors or resource exhaustion.
  • Node failures: Run kubectl get nodes and look for nodes in the NotReady state.
  • ALB unhealthy: Verify the ALB target group health in the AWS console or with aws elbv2 describe-target-health.
  • Network partition: Verify that VPC peering between the cluster VPC and the database VPC is active, that routes exist on the correct route tables (including public route tables if nodes are in public subnets), and that the database security group allows port 3306 from the cluster VPC CIDR.
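
The node and ALB checks from the list above could be scripted roughly as follows. This is a sketch under assumptions: not_ready_nodes is a hypothetical helper, and TG_ARN stands in for your ALB target group ARN, which does not appear on this page.

```shell
# Sketch: list nodes whose STATUS column is anything other than "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# Note: this also flags cordoned nodes ("Ready,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Example invocations (the context is this page's placeholder; TG_ARN is a
# hypothetical variable holding your target group ARN):
#
# kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
#   get nodes --no-headers | not_ready_nodes
#
# aws elbv2 describe-target-health --target-group-arn "$TG_ARN" \
#   --query 'TargetHealthDescriptions[].{Target:Target.Id,State:TargetHealth.State}' \
#   --output table
```

Empty output from not_ready_nodes suggests the nodes themselves are fine and the cause lies elsewhere in the list.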

Step 3 - Restore the failed region

The recovery steps depend on the failure cause.

If pods are crashlooping or unhealthy, restart the deployment:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout restart deployment/loft -n vcluster-platform

If the deployment is scaled to zero or missing, re-apply the values file:

vcluster platform start \
--namespace vcluster-platform \
--kube-context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
--values platform-us-east-1-values.yaml \
--upgrade \
--no-tunnel

Then scale up if needed:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
scale deployment -n vcluster-platform loft --replicas=3

If the EKS cluster itself is down, follow the AWS documentation to restore the cluster or create a replacement, then redeploy the platform using the region's values file.

Step 4 - Wait for the rollout to complete

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout status deployment/loft -n vcluster-platform
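
In an automated recovery script it can help to bound the wait so a stuck rollout does not block indefinitely. The sketch below wraps the command above with kubectl's --timeout flag; wait_for_rollout and the 5m default are illustrative choices, not values from this runbook.

```shell
# Sketch: wait for the loft deployment rollout, giving up after a timeout.
# $1 = kube context, $2 = timeout (optional, defaults to an assumed 5m).
wait_for_rollout() {
  local ctx="$1" timeout="${2:-5m}"
  if kubectl --context "$ctx" rollout status deployment/loft \
       -n vcluster-platform --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not complete (timeout $timeout or error)" >&2
    return 1
  fi
}

# Example invocation with this page's placeholder context:
#
# wait_for_rollout arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 10m
```

If the function returns non-zero, go back to the diagnosis commands in Step 2 before retrying.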

Step 5 - Verify recovery

Confirm the Route 53 health check for the restored region returns healthy:

aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

note
Route 53 health checks run every 10 seconds with a failure threshold of 3. After the platform becomes healthy, expect up to 30 seconds before the health check status updates. DNS clients with cached responses may take an additional 60 seconds (the ALB alias record TTL) to start resolving to the restored region.
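
The timing in the note can be worked through as simple arithmetic. The variable names below are illustrative; the numbers are the ones stated in the note.

```shell
# Worked example of the recovery timing described in the note.
interval_s=10    # seconds between Route 53 health checks
threshold=3      # consecutive checks needed before the status changes
dns_ttl_s=60     # ALB alias record TTL in seconds

detect_s=$(( interval_s * threshold ))      # 10 * 3 = 30
worst_case_s=$(( detect_s + dns_ttl_s ))    # 30 + 60 = 90

echo "health check status update: up to ${detect_s}s"
echo "clients with cached DNS: up to ${worst_case_s}s"
```

Under these figures, allow roughly a minute and a half after the platform becomes healthy before treating the restored region as unreachable.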

Verify both regions are serving traffic:

for CTX in arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 arn:aws:eks:eu-west-1:123456789012:cluster/platform-multi-region-eu-west-1; do
  echo "=== $CTX ==="
  kubectl --context "$CTX" get pods -n vcluster-platform -l app=loft
  echo
done
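
To turn the pod listing above into a pass/fail signal, the output can be reduced to a count of fully ready pods. This is a sketch only: ready_pod_count is a hypothetical helper, and it assumes the default kubectl column order (NAME, READY, STATUS, ...).

```shell
# Sketch: count pods that are Running with all containers ready.
# Reads `kubectl get pods --no-headers` output on stdin.
# Assumption: column 2 is READY (for example 1/1) and column 3 is STATUS.
ready_pod_count() {
  awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ }
       END { print n + 0 }'
}

# Example invocation for one region (repeat per context):
#
# kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
#   get pods -n vcluster-platform -l app=loft --no-headers | ready_pod_count
```

Compare the result for each region against the replica count you expect for that region.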

Confirm the custom DERP server is operational by checking for NetworkPeer resources:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get networkpeers.storage.loft.sh -A

Both peers should show Online: true.