Version: v4.10 Stable

Recover from Regional Failover

info

This feature is available from the Platform version v4.8.0

When a region goes down, Route 53 health checks detect the failure and automatically redirect traffic to the healthy region. This runbook describes how to diagnose the failed region, restore it, and confirm that both regions are serving traffic again.

Configure your values

The commands in this runbook reference your cluster context ARNs, the failed region's values file, and its health check ID. Set them below once and all commands update automatically.

Expand to set page variables

Modify the following with your specific values to replace on the whole page and generate copyable commands:

FIRST_REGION_CONTEXT

SECOND_REGION_CONTEXT

FAILED_REGION_CONTEXT

FAILED_REGION_VALUES_FILE

FAILED_REGION_HEALTH_CHECK_ID

Step 1 - Confirm failover is active

Verify which health check is failing:

aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

Confirm the healthy region is still passing:

Modify the following with your specific values to generate a copyable command:

HEALTHY_REGION_HEALTH_CHECK_ID

aws route53 get-health-check-status --health-check-id yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

note

During failover, the platform continues to operate normally through the healthy region. Users may experience slightly higher latency if they are geographically closer to the failed region.

Step 2 - Diagnose the failed region

Check the state of the platform pods in the failed region:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get pods -n vcluster-platform -l app=loft

Check pod logs for errors:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
logs -n vcluster-platform -l app=loft --tail=50

Check the ALB and ingress status:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get ingress -n vcluster-platform loft

Common failure causes:

Pod crashes (CrashLoopBackOff): Check logs for database connectivity errors or resource exhaustion.
Node failures: Check kubectl get nodes for NotReady nodes.
ALB unhealthy: Verify the ALB target group health in the AWS console or with aws elbv2 describe-target-health.
Network partition: Verify that VPC peering between the cluster VPC and the database VPC is active, that routes exist on the correct route tables (including public route tables if nodes are in public subnets), and that the database security group allows port 3306 from the cluster VPC CIDR.

Step 3 - Restore the failed region

The recovery steps depend on the failure cause.

If pods are crashlooping or unhealthy, restart the deployment:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout restart deployment/loft -n vcluster-platform

If the deployment is scaled to zero or missing, re-apply the values file:

vCluster CLI
Helm

vcluster platform start \
--namespace vcluster-platform \
--kube-context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
--values platform-us-east-1-values.yaml \
--upgrade \
--no-tunnel

Modify the following with your specific values to generate a copyable command:

CHART_VERSION

helm upgrade loft vcluster-platform --install --create-namespace --repository-config='' \
--namespace vcluster-platform \
--repo "https://charts.loft.sh/" \
--version 4.8.0 \
--kube-context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
-f platform-us-east-1-values.yaml \
--server-side=true --force-conflicts

Then scale up if needed:

Modify the following with your specific values to generate a copyable command:

REPLICA_COUNT

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
scale deployment -n vcluster-platform loft --replicas=3

If the EKS cluster itself is down, follow the AWS documentation to restore the cluster or create a replacement, then redeploy the platform using the region's values file.

Step 4 - Wait for the rollout to complete

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
rollout status deployment/loft -n vcluster-platform

Step 5 - Verify recovery

Confirm the Route 53 health check for the restored region returns healthy:

aws route53 get-health-check-status --health-check-id xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx \
--query 'HealthCheckObservations[].{Region:Region,Status:StatusReport.Status}' \
--output table

note

Route 53 health checks run every 10 seconds with a failure threshold of 3. After the platform becomes healthy, expect up to 30 seconds before the health check status updates. DNS clients with cached responses may take an additional 60 seconds (the ALB alias record TTL) to start resolving to the restored region.

Verify both regions are serving traffic:

for CTX in arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 arn:aws:eks:eu-west-1:123456789012:cluster/platform-multi-region-eu-west-1; do
echo "=== $CTX ==="
kubectl --context "$CTX" get pods -n vcluster-platform -l app=loft
echo
done

Confirm the custom DERP server is operational by checking for NetworkPeer resources:

kubectl --context arn:aws:eks:us-east-1:123456789012:cluster/platform-multi-region-us-east-1 \
get networkpeers.storage.loft.sh -A

Both peers should show Online: true.

Configure your values​

Step 1 - Confirm failover is active​

Step 2 - Diagnose the failed region​

Step 3 - Restore the failed region​

Step 4 - Wait for the rollout to complete​

Step 5 - Verify recovery​

Configure your values

Step 1 - Confirm failover is active

Step 2 - Diagnose the failed region

Step 3 - Restore the failed region

Step 4 - Wait for the rollout to complete

Step 5 - Verify recovery