cloudconsulting.agustin

Azure Disaster Recovery - Top 10 Checklist

Design and operational hardening checklist focused on RPO, RTO, failover speed, replication, orchestration, backups, cost and testability for Azure disaster recovery.

🔧

1. Azure Site Recovery (ASR)

ASR is the primary orchestration tool for VM replication, failover, failback and recovery plans; design for target RPO/RTO and automation of runbooks.

Top 5 design recommendations

  1. Define RPO/RTO targets per workload and map them to ASR replication frequency and recovery plan actions.
  2. Choose the replication approach (VM-level or agentless) and the recovery point type (crash-consistent or application-consistent), and pick a storage performance tier that can sustain the target RPO.
  3. Segment DR protection groups by application and dependency (DB, app, cache) to orchestrate consistent failover.
  4. Design recovery plans with pre/post scripts, automated IP re-mapping and sequencing for multi-tier apps.
  5. Use ASR with paired-region selection and network mapping (VNet, VPN, peering) to minimize failover complexity.
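
The grouping and sequencing recommendations above can be pictured as a dependency graph. The following is a minimal Python sketch, not ASR's API: the tier names and the `failover_groups` helper are hypothetical, and a real recovery plan would express the same ordering as groups inside the plan.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each tier lists the tiers it depends on.
# The names are illustrative and not taken from any real workload.
DEPENDENCIES = {
    "db": set(),
    "cache": set(),
    "app": {"db", "cache"},
    "web": {"app"},
}


def failover_groups(dependencies):
    """Return start-up groups; each tier depends only on tiers in earlier groups."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    groups = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tiers whose dependencies are all up
        groups.append(ready)
        sorter.done(*ready)
    return groups


groups = failover_groups(DEPENDENCIES)
```

Tiers in the same group have no dependency on each other, so they could be brought up in parallel; a circular dependency fails early instead of surfacing during a failover.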

Top 5 operational best practices

  1. Automate replication health checks and configure alerts for replication lag and RPO breaches.
  2. Run scheduled test failovers using recovery plans; validate app integrity and DNS/traffic cutover.
  3. Keep recovery playbooks and runbook scripts in source control and deploy via pipelines for consistency.
  4. Document failover roles, runbook owners, and a rollback plan; practice failback procedures regularly.
  5. Monitor ASR storage/throughput and plan scaling for peak replication windows to avoid increased RPO.
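
As an illustration of the replication health check in item 1, here is a small Python sketch that flags RPO breaches. It assumes the latest recovery point timestamp per protected item has already been exported from your monitoring data; the function name, item names and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(latest_recovery_points, rpo_targets, now=None):
    """Return {item: lag} for items whose newest recovery point is older than its RPO target."""
    now = now or datetime.now(timezone.utc)
    breaches = {}
    for item, recovery_point in latest_recovery_points.items():
        lag = now - recovery_point
        if lag > rpo_targets[item]:
            breaches[item] = lag
    return breaches


# Illustrative values only.
check_time = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
points = {
    "web-vm": check_time - timedelta(minutes=5),
    "db-vm": check_time - timedelta(minutes=40),
}
targets = {"web-vm": timedelta(minutes=15), "db-vm": timedelta(minutes=15)}
breached = rpo_breaches(points, targets, now=check_time)
```

A non-empty result is what would feed the alert; the per-item target keeps the check aligned with the RPO defined for each workload.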

📦

2. Azure Backup and Restore

Backups provide point-in-time recovery for data and VMs; ensure retention, encryption, and restore speed align with RTO/RPO requirements.

Top 5 design recommendations

  1. Define backup SLAs per data class and select appropriate retention and frequency to meet RPO targets.
  2. Use Azure Backup for application-consistent VM backups, Azure Backup for SQL Server on Azure VMs for IaaS databases, and native PaaS backups for managed databases.
  3. Enable soft-delete and immutable vault options where compliance and ransomware protection are required.
  4. Store backups in geo-redundant vaults that match your DR region strategy for fast recoverability.
  5. Design restore procedures (full, file-level, DB point-in-time) and measure restore times during planning to meet RTOs.
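
To make item 5 concrete, the arithmetic behind a restore-time estimate can be sketched in Python. This is a rough planning model, not a measurement: the throughput figure should come from your own restore tests, and the function names and sample numbers are assumptions.

```python
def estimated_restore_hours(data_gib, throughput_mib_per_s, fixed_overhead_hours=0.0):
    """Estimate restore duration from data size and a measured restore throughput."""
    if throughput_mib_per_s <= 0:
        raise ValueError("throughput must be positive")
    transfer_seconds = (data_gib * 1024) / throughput_mib_per_s
    return fixed_overhead_hours + transfer_seconds / 3600


def meets_rto(data_gib, throughput_mib_per_s, rto_hours, fixed_overhead_hours=0.0):
    """True when the estimated restore time fits inside the RTO."""
    estimate = estimated_restore_hours(data_gib, throughput_mib_per_s, fixed_overhead_hours)
    return estimate <= rto_hours


# Illustrative: 1 TiB restored at a measured 100 MiB/s against a 4-hour RTO.
fits = meets_rto(data_gib=1024, throughput_mib_per_s=100, rto_hours=4)
```

The fixed overhead term stands in for steps that do not scale with data size, such as provisioning the target or running post-restore validation.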

Top 5 operational best practices

  1. Automate backup job monitoring and daily restore verification tests to validate RTO assumptions.
  2. Track backup size/growth and storage costs; archive long-term backups to reduce expense where possible.
  3. Secure vault access with RBAC and keep operational and billing accounts separate for cost visibility.
  4. Document and automate recovery runbooks for common restore scenarios and store them with runbook artifacts.
  5. Integrate backup alerts into incident management and ensure ticketing runbooks include restore steps.

📊

3. Azure SQL Geo-Replication & Failover Groups

For databases, choose replication patterns (active geo-replication, auto-failover groups) that deliver required RTO/RPO and failover speed.

Top 5 design recommendations

  1. Set RPO expectations and map them to the replication mode: active geo-replication and failover groups replicate asynchronously across regions, so plan for a small but non-zero RPO; synchronous replication applies only to replicas within a region.
  2. Use auto-failover groups for coordinated failover across multiple databases; plan listener and connection string patterns to minimize cutover time.
  3. Design secondary region DB compute sizing and read/write readiness (warm or passive) for desired failover speed.
  4. Test transactional consistency for multi-db applications and use transaction-level replication or application-level compensation as needed.
  5. Plan backup and PITR in addition to replication to handle logical data corruption or operator errors affecting both primary and replica.
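
The listener pattern from item 2 could look like the following Python sketch. It assumes an Azure SQL Database failover group, where the read-write listener is `<group>.database.windows.net` and the read-only listener is `<group>.secondary.database.windows.net`; verify the endpoint form for your service before relying on it. The group and database names are hypothetical, and authentication settings are intentionally left out.

```python
def listener_host(failover_group, read_only=False):
    """Build the failover group listener host name (assumed Azure SQL Database form)."""
    suffix = ".secondary.database.windows.net" if read_only else ".database.windows.net"
    return failover_group + suffix


def odbc_connection_string(failover_group, database, read_only=False):
    """Compose an ODBC connection string that targets the listener, not a server name."""
    parts = [
        "Driver={ODBC Driver 18 for SQL Server}",
        f"Server=tcp:{listener_host(failover_group, read_only)},1433",
        f"Database={database}",
        "Encrypt=yes",
        "TrustServerCertificate=no",
        "Connection Timeout=30",
    ]
    if read_only:
        parts.append("ApplicationIntent=ReadOnly")
    return ";".join(parts) + ";"


# Hypothetical names for illustration.
primary_conn = odbc_connection_string("fg-orders", "orders")
reporting_conn = odbc_connection_string("fg-orders", "orders", read_only=True)
```

Because the application points at the listener rather than at an individual server, the connection string does not need to change when the group fails over.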

Top 5 operational best practices

  1. Monitor replication lag and failover readiness; set alerts for latency that threatens RPO goals.
  2. Practice planned and unplanned failovers regularly and measure application-level RTO end-to-end.
  3. Automate DNS/connection string updates via application configuration or use listener endpoints to reduce downtime.
  4. Maintain runbooks for data sync verification, failback procedures and data reconciliation post-failover.
  5. Keep a tested backup plan for logical recovery even when replication exists; validate PITR restores periodically.

🌐

4. Cosmos DB Multi-Region and Consistency

Cosmos DB offers multi-region writes and tunable consistency—choose the model that balances RTO, RPO, latency and throughput.

Top 5 design recommendations

  1. Select multi-region configuration (single write vs multi-write) aligned to application RTO/RPO and conflict handling requirements.
  2. Tune consistency level (strong, bounded staleness, session, consistent prefix, eventual) for the trade-off between RPO and read latency.
  3. Distribute RU/s and partitioning strategy to ensure replica regions can handle production load after failover.
  4. Plan indexing and TTL policies to reduce geo-replication storage and replication lag.
  5. Define region failover priority and automated vs manual failover strategy based on business impact.
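
The failover priority in item 5 can be modeled with a few lines of Python, for example inside a manual failover runbook that decides which region to promote. This sketch assumes the convention that priority 0 is the current write region and higher numbers are promoted later; the region list and function name are hypothetical.

```python
def next_write_region(failover_priorities, available_regions):
    """Pick the available region with the lowest priority number (0 = highest priority)."""
    candidates = [
        (priority, region)
        for region, priority in failover_priorities.items()
        if region in available_regions
    ]
    if not candidates:
        raise RuntimeError("no configured region is available")
    return min(candidates)[1]


# Illustrative configuration: East US is the write region and is currently down.
priorities = {"eastus": 0, "westus": 1, "northeurope": 2}
promoted = next_write_region(priorities, available_regions={"westus", "northeurope"})
```

Keeping the priority list in configuration rather than in someone's head makes the manual path as predictable as the automated one.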

Top 5 operational best practices

  1. Monitor replication latency, RU consumption and regional availability; set alerts for SLA deviations.
  2. Test failover and failback processes, including application behavior under different consistency settings.
  3. Automate region management and promote/demote regions using documented runbooks and pipelines.
  4. Maintain data residency and compliance checks when adding or failing over to regions.
  5. Keep application-level conflict resolution strategies and telemetry to detect divergence after multi-region write conflicts.

📦

5. Azure Storage Replication (LRS, GRS, RA-GRS)

Choose appropriate storage redundancy (LRS/GRS/RA-GRS) to meet RPO, read availability across regions and recovery needs for blobs/files/disks.

Top 5 design recommendations

  1. Map data criticality to redundancy: LRS for local durability; GRS for geo-resilience; RA-GRS when the secondary must also be readable before a failover.
  2. For hot failover-read needs, prefer RA-GRS to allow read access in secondary while primary is unavailable.
  3. Design naming and lifecycle policies to make cross-region restores predictable and scriptable.
  4. Consider snapshot and backup cadence for managed disks and file shares to provide point-in-time recovery beyond replication.
  5. Plan storage account region pairs and limits to align with ASR and application failover targets.

Top 5 operational best practices

  1. Monitor replication status and secondary read availability; alert on replication delays that impact RPO.
  2. Test emergency restores from secondary and validate data integrity and performance post-restore.
  3. Track storage costs per redundancy tier and use lifecycle/archive tiers for cost control where feasible.
  4. Document procedures to rehydrate archived data and to switch application endpoints to secondary storage locations.
  5. Include storage replication checks in DR test plans and recovery runbooks.
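
A DR test plan might include a check like the Python sketch below. It assumes an RA-GRS account, whose secondary blob endpoint follows the `<account>-secondary` naming form, and it assumes the account's last sync time has already been retrieved by your tooling. The account name and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def secondary_blob_endpoint(account_name):
    """Derive the read-only secondary blob endpoint of an RA-GRS account."""
    return f"https://{account_name}-secondary.blob.core.windows.net"


def at_risk_window(last_sync_time, failure_time):
    """Writes made after last_sync_time may be missing from the secondary."""
    return max(failure_time - last_sync_time, timedelta(0))


# Illustrative values only.
failure = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
last_sync = failure - timedelta(minutes=12)
window = at_risk_window(last_sync, failure)
rpo_ok = window <= timedelta(minutes=15)
endpoint = secondary_blob_endpoint("contosodata")
```

Comparing the at-risk window with the RPO target turns "replication is enabled" into a measurable pass or fail during the drill.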

🛠

6. Recovery Orchestration and Runbooks

Orchestration links replication, DNS, traffic failover, infra provisioning and app-level scripts—critical for minimizing manual steps and achieving RTO.

Top 5 design recommendations

  1. Use ASR recovery plans combined with Azure Automation runbooks or Logic Apps to automate pre/post failover tasks.
  2. Store runbook scripts, templates and artifacts (Bicep/ARM) in a repository and version them for repeatability.
  3. Design stages: infrastructure bring-up, DB promotion/synchronization, app config update, traffic cutover, validation.
  4. Keep secrets and certificates in Key Vault and integrate runbooks to fetch them securely during recovery.
  5. Provide idempotent runbooks and safeguards (confirmation gates) for production failovers to avoid accidental double actions.
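
Idempotency and confirmation gates (item 5) can be sketched in Python as follows. This is an in-memory illustration, not an Azure Automation runbook: the `RecoveryRun` class is hypothetical, and a real implementation would persist the completed stages somewhere durable so a re-run after a crash still skips finished work.

```python
class RecoveryRun:
    """Tracks completed stages so re-running a recovery never repeats an action."""

    def __init__(self, confirm):
        self.confirm = confirm  # callable(stage_name) -> bool, the confirmation gate
        self.completed = []     # stage names that have already been executed

    def run_stage(self, name, action, requires_confirmation=False):
        if name in self.completed:
            return "skipped"    # idempotent: already done, do nothing
        if requires_confirmation and not self.confirm(name):
            return "blocked"    # gate refused, leave the stage pending
        action()
        self.completed.append(name)
        return "done"


log = []
run = RecoveryRun(confirm=lambda stage: False)  # gate that always refuses
first = run.run_stage("infrastructure", lambda: log.append("infrastructure"))
repeat = run.run_stage("infrastructure", lambda: log.append("infrastructure"))
gated = run.run_stage("traffic-cutover", lambda: log.append("cutover"),
                      requires_confirmation=True)
```

Gating only the irreversible stages, such as traffic cutover, keeps the early stages fully automated while still requiring a human decision where a mistake is expensive.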

Top 5 operational best practices

  1. Keep recovery artifacts (templates, scripts) in CI/CD pipelines and test deploy to DR subscription on a schedule.
  2. Schedule runbook dry runs and log the results; integrate with incident management for transparent reporting.
  3. Rotate automation credentials and restrict runbook edit/run permissions via RBAC.
  4. Maintain a playbook index mapping runbooks to recovery steps for each recovery plan and role assignment.
  5. Monitor runbook execution times and optimize long-running tasks to meet RTO SLAs.

🚀

7. Traffic Routing and Failover (Traffic Manager, Front Door)

DNS and edge routing control how traffic is redirected during failover—design TTLs, health probes and origin validation to control failover speed and correctness.

Top 5 design recommendations

  1. Choose DNS-based (Traffic Manager) or HTTP-based (Front Door) failover depending on application protocol and desired cutover speed.
  2. Set low TTLs for failover-critical records and align probe frequency to avoid false failovers while enabling fast recovery.
  3. Use origin validation, managed private origins or IP allow-listing to protect secondary/DR origins from unauthorized traffic.
  4. Design multi-layer failover: edge (Front Door) + regional (Traffic Manager) for complex global DR scenarios.
  5. Document automated vs manual failover triggers and include a break-glass procedure for emergency DNS changes with pre-authorized approvers.
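
The interaction between TTL and probe settings in item 2 can be estimated with a simple Python model. This is a deliberately conservative simplification, not an official formula: it assumes an endpoint is marked unhealthy after `tolerated_failures + 1` consecutive failed probes and that clients keep a cached DNS answer for the full TTL. All parameter values are illustrative.

```python
def worst_case_cutover_seconds(probe_interval_s, tolerated_failures,
                               probe_timeout_s, dns_ttl_s):
    """Rough upper bound on DNS-based failover time under the stated assumptions."""
    detection = (tolerated_failures + 1) * probe_interval_s + probe_timeout_s
    return detection + dns_ttl_s


# Compare a relaxed and an aggressive configuration (illustrative numbers).
relaxed = worst_case_cutover_seconds(probe_interval_s=30, tolerated_failures=3,
                                     probe_timeout_s=10, dns_ttl_s=300)
aggressive = worst_case_cutover_seconds(probe_interval_s=10, tolerated_failures=1,
                                        probe_timeout_s=5, dns_ttl_s=30)
```

Lower values shorten the cutover but increase query volume and the chance of a false failover, so the result is best compared directly against the RTO for the workload.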

Top 5 operational best practices

  1. Regularly test DNS/Front Door failovers; automate validation checks post-cutover to confirm app health and latency.
  2. Monitor probe results, rule hits and global traffic distribution to detect regional degradations early.
  3. Keep DNS records in IaC and pipelines for quick, audited updates during emergencies.
  4. Document expected propagation windows and fallback steps for slow DNS caches or ISP caching behavior.
  5. Automate Front Door rules and WAF exceptions as part of recovery plans to avoid manual edge configuration under pressure.

💻

8. Infrastructure as Code, Pipelines and Artifacts

IaC and CI/CD artifacts ensure DR environments can be recreated quickly and consistently—critical for reproducible RTOs and minimal manual error.

Top 5 design recommendations

  1. Store ARM/Bicep templates, container images, and pipeline definitions in a versioned repo to allow reproducible DR builds.
  2. Design parameterized templates for region/size/sku differences to support quick reprovisioning in alternate regions.
  3. Include artifacts for networking, NSGs, ASGs, Route Tables and recovery-specific components in DR templates.
  4. Define pipeline stages for DR environment creation, validation and teardown to avoid drift and control cost.
  5. Pre-publish required images and artifacts to the DR region (Azure Compute Gallery, ACR) to reduce provisioning time during failover.
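
Parameterization per region (item 2) might be driven by a small generator such as this Python sketch, which writes a deployment parameters document in the ARM parameter file layout. The parameter names (`environment`, `vmSize`, `instanceCount`, `location`) are hypothetical and would have to match the parameters your own templates declare.

```python
import json

# Hypothetical defaults shared by every region.
BASE = {"environment": "dr", "vmSize": "Standard_D4s_v5", "instanceCount": 2}

# Hypothetical per-region differences, for example a smaller warm standby.
REGION_OVERRIDES = {"northeurope": {"instanceCount": 1}}


def build_parameters(region, base, overrides):
    """Merge base values with region overrides into an ARM-style parameters document."""
    merged = {**base, "location": region, **overrides.get(region, {})}
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {name: {"value": value} for name, value in merged.items()},
    }


dr_parameters = build_parameters("northeurope", BASE, REGION_OVERRIDES)
dr_parameters_json = json.dumps(dr_parameters, indent=2)
```

Generating the file in the pipeline keeps the template itself identical across regions, so only the overrides need review when the DR region changes.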

Top 5 operational best practices

  1. Keep DR pipeline runs in a dedicated DR subscription to validate that templates produce expected artifacts without impacting production.
  2. Automate artifact replication (ACR geo-replication, Azure Compute Gallery image replication) to DR regions as part of CI/CD.
  3. Run periodic template-driven rebuilds in non-production DR sandboxes to detect template drift and permission issues.
  4. Keep secrets and keys outside templates (Key Vault) and ensure pipeline service principals have least-privilege access in DR subscriptions.
  5. Document and version dependency graphs so orchestration (templates + runbooks) can rebuild the application stack within RTO targets.

🔍

9. DR Testing, Exercises and Playbooks

Regular, audited DR tests validate RPO/RTO assumptions, expose gaps, and train teams to execute recovery workflows under pressure.

Top 5 design recommendations

  1. Define test cadence (quarterly, semi-annual) and test types: tabletop, partial failover, full failover, rollback validation.
  2. Construct measurable success criteria (time to DNS change, application health, data integrity checks) tied to RTO/RPO.
  3. Design isolated test environments that reuse DR automation but do not impact production resources or exceed the test budget.
  4. Include stakeholders and cross-team runbooks in test plans to validate human processes and escalation paths.
  5. Document and automate post-test retrospectives and action item tracking to close identified gaps quickly.

Top 5 operational best practices

  1. Automate test scheduling and evidence collection (logs, screenshots, metrics) and retain results for auditability.
  2. Run partial drills (DB failover, traffic switch) regularly to maintain muscle memory and validate teams/automation.
  3. Measure actual RTO/RPO during tests and reconcile differences with SLAs and runbook updates.
  4. Include application smoke tests and synthetic transactions post-failover to validate end-to-end functionality.
  5. Ensure test outcomes feed into a prioritized remediation backlog and track closure across cycles.
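
Measuring actual RTO and RPO during a drill (item 3) comes down to a handful of timestamps. The Python sketch below assumes the drill records three hypothetical events: the last write known to be replicated, the moment the outage was declared, and the moment the service was validated in the DR region.

```python
from datetime import datetime, timedelta


def measure_drill(events, rto_target, rpo_target):
    """Compute achieved RTO/RPO from drill timestamps and compare with targets."""
    rto = events["service_validated"] - events["outage_declared"]
    rpo = events["outage_declared"] - events["last_replicated_write"]
    return {
        "rto": rto,
        "rpo": rpo,
        "rto_met": rto <= rto_target,
        "rpo_met": rpo <= rpo_target,
    }


# Illustrative drill timeline.
drill = {
    "last_replicated_write": datetime(2025, 3, 1, 9, 57),
    "outage_declared": datetime(2025, 3, 1, 10, 0),
    "service_validated": datetime(2025, 3, 1, 11, 30),
}
report = measure_drill(drill, rto_target=timedelta(hours=1),
                       rpo_target=timedelta(minutes=5))
```

Storing the resulting report with the drill evidence gives each test cycle a comparable record and makes a missed target an explicit backlog item.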

💰

10. Cost, Regions, Capacity and SLA Planning

DR decisions impact cost and capacity; plan region selection, compute sizing, autoscale and cross-region quotas to meet SLAs within budget.

Top 5 design recommendations

  1. Perform a business impact analysis (BIA) to assign criticality and acceptable RTO/RPO; translate the results into resource and replication sizing for DR regions.
  2. Choose region pairs considering latency, paired-region services, capacity quotas and legal/data residency constraints.
  3. Design autoscale and warm-standby patterns to balance cost vs failover readiness; reserve capacity where fast RTO is required.
  4. Estimate cross-region networking egress and replication costs and include them in DR budget and run-cost models.
  5. Plan quota and subscription limits (cores, IPs, storage) in DR regions in advance to avoid provisioning failures during failover.
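
The quota planning in item 5 can be checked ahead of time with a comparison like this Python sketch. It assumes the limits and current usage for the DR region have already been exported by your own tooling; the resource keys and numbers are hypothetical.

```python
def quota_shortfalls(required, limits, current_usage):
    """Return {resource: missing_amount} where the DR region cannot fit the failover load."""
    shortfalls = {}
    for resource, needed in required.items():
        available = limits.get(resource, 0) - current_usage.get(resource, 0)
        if needed > available:
            shortfalls[resource] = needed - available
    return shortfalls


# Illustrative numbers for a DR region.
required = {"cores": 96, "public_ips": 4, "storage_accounts": 3}
limits = {"cores": 100, "public_ips": 10, "storage_accounts": 250}
usage = {"cores": 20, "public_ips": 2, "storage_accounts": 12}
gaps = quota_shortfalls(required, limits, usage)
```

Running the check on a schedule surfaces a shortfall while there is still time to request an increase, rather than during the failover itself.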

Top 5 operational best practices

  1. Track DR run costs in regular reports and run tabletop cost-impact scenarios for expected failover durations.
  2. Pre-request and increase quotas in DR regions and keep contact with Azure Support for emergency quota increases.
  3. Maintain an inventory of reserved instances, prepaid commitments and their applicability to DR workloads.
  4. Audit cross-region network and replication egress regularly; optimize replication windows to reduce peak costs.
  5. Include cost/ROI as part of DR after-action reviews and adapt DR posture to evolving business priorities.

This article was originally published on 2025-NOV-20 and last reviewed on 2025-NOV-20.