📈
1. Define Monitoring Goals for DR and Migrated Servers
Define availability, RPO/RTO monitoring, telemetry retention and incident response SLAs for DR and migrated servers.
Top 5 design recommendations
- Classify servers by criticality and map to monitoring retention and alert severity.
- Choose Log Analytics workspace strategy (per environment, per subscription, or central) for DR visibility.
- Define metrics and logs required for RPO/RTO verification (replication health, recovery points, failover tests).
- Plan network and private endpoint access for monitoring agents on DR/migrated servers.
- Design tagging to link monitored resources to migration waves and DR runbooks.
Top 5 operational best practices
- Baseline pre-migration telemetry and compare post-migration to validate parity.
- Schedule synthetic availability tests for DR endpoints and migrated app endpoints.
- Automate onboarding of migrated servers to Log Analytics and Defender policies.
- Run periodic failover drills and validate monitoring alerts trigger expected incidents.
- Keep runbooks and escalation contacts in sync with migration waves.
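The first design recommendation (classify servers by criticality and map them to retention and alert severity) can be captured as a small lookup table. The sketch below is illustrative only: the tier names, the `criticality` tag key, and every number in `PROFILES` are assumptions, not values from this guide. Severity follows the Azure Monitor convention where 0 is the most critical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringProfile:
    retention_days: int   # Log Analytics retention for this tier
    alert_severity: int   # Azure Monitor severity, 0 = critical ... 4 = verbose
    rpo_minutes: int      # recovery point objective
    rto_minutes: int      # recovery time objective


# Hypothetical tiers and figures; replace with your own SLA values.
PROFILES = {
    "tier1": MonitoringProfile(retention_days=365, alert_severity=0, rpo_minutes=15, rto_minutes=60),
    "tier2": MonitoringProfile(retention_days=180, alert_severity=1, rpo_minutes=60, rto_minutes=240),
    "tier3": MonitoringProfile(retention_days=90, alert_severity=2, rpo_minutes=240, rto_minutes=1440),
}


def profile_for(tags: dict) -> MonitoringProfile:
    """Resolve a server's monitoring profile from its resource tags.

    Untagged or unknown servers fall back to the lowest tier.
    """
    tier = str(tags.get("criticality", "tier3")).lower()
    return PROFILES.get(tier, PROFILES["tier3"])
```

Keeping the mapping in one place means alert rules, retention settings and dashboards can all read the same tier definition instead of hard-coding thresholds separately.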
💻
2. Instrumentation: AMA, Extensions and Defender Agents
Standardize on Azure Monitor Agent (AMA) and Defender onboarding for consistent telemetry on DR and migrated servers.
Top 5 design recommendations
- Use AMA (Azure Monitor Agent) for logs/metrics and migrate from the legacy Log Analytics agent (MMA/OMS) before cutover.
- Plan Log Analytics workspace sizing and retention for DR test logs and long-term forensic needs.
- Define required VM extensions and Defender for Servers plan for hybrid machines.
- Use managed identities or service principals for agent onboarding at scale.
- Design DCRs (Data Collection Rules) to separate DR telemetry from normal operations if needed.
Top 5 operational best practices
- Automate agent deployment via ARM/Policy/Automation runbooks during migration waves.
- Validate agent connectivity to workspace and Sentinel ingestion after failover tests.
- Monitor agent health and extension update status centrally.
- Document rollback steps to remove agents from decommissioned DR VMs.
- Use deployment tags to track which servers were instrumented pre- and post-migration.
👥
3. Microsoft Sentinel Workspace Design
Design Sentinel workspaces and data connectors to ensure DR and migrated servers feed security telemetry and alerts.
Top 5 design recommendations
- Decide between a central and a per-environment Sentinel workspace for incident correlation across DR and production.
- Enable required connectors (Azure Activity, Security Events, Syslog, Windows Event) for migrated servers.
- Plan retention and storage for long-term forensic needs from DR tests and failovers.
- Map analytics rules to DR scenarios (replication anomalies, unexpected failovers, post-cutover anomalies).
- Define playbooks for automated response during failover and post-migration incidents.
Top 5 operational best practices
- Validate connectors ingesting DR server logs during test failovers.
- Tune analytics rules to reduce noise from expected DR activities.
- Use watchlists to track migrated server inventories and asset owners.
- Run hunting queries after cutover to detect configuration drift or suspicious activity.
- Document incident playbook steps specific to DR failover and failback events.
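Sentinel watchlists are typically loaded from a CSV file, so the inventory of migrated servers and owners can be generated from whatever migration tracker you already keep. The sketch below is an assumption-laden example: the column names `ServerName`, `Owner` and `MigrationWave` are hypothetical, and `ServerName` is assumed to be the column you would pick as the watchlist search key.

```python
import csv
import io

# Hypothetical column layout for the watchlist.
FIELDS = ["ServerName", "Owner", "MigrationWave"]


def watchlist_csv(servers: list) -> str:
    """Render server records as CSV text, sorted by server name."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in sorted(servers, key=lambda r: r["ServerName"]):
        writer.writerow(record)
    return buf.getvalue()
```

For example, `watchlist_csv([{"ServerName": "app01", "Owner": "alice@example.com", "MigrationWave": "wave1"}])` yields a header row followed by one data row, ready to upload as a watchlist source.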
📥
4. Alerts, Action Groups and Escalation Paths
Define alert rules, action groups and escalation for DR and migrated server incidents to meet RTO targets.
Top 5 design recommendations
- Classify alerts by severity and map to action groups with on-call rotations for DR waves.
- Use metric alerts for replication health and log alerts for failover events.
- Integrate alerts with ITSM and incident management for automated ticketing during failovers.
- Design alert suppression windows for planned failovers and tests to avoid noise.
- Define SLA-driven escalation timelines and automated runbook triggers.
Top 5 operational best practices
- Test action groups and escalation during DR drills to confirm notifications reach responders.
- Use SMS/voice/email and webhook channels for multi-channel escalation.
- Maintain an on-call roster aligned to migration waves and DR owners.
- Document alert tuning changes made for migration windows and revert after stabilization.
- Audit alert delivery and incident creation after each failover test.
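In Azure Monitor, suppression for planned failovers is normally configured with alert processing rules. The decision those rules encode can be sketched as follows, which may help when reviewing or testing a suppression design. The window dates are made up, and the choice to never suppress severity 0 alerts is an assumption about policy, not a platform behavior.

```python
from datetime import datetime, timezone

# Hypothetical planned DR test windows as (start, end) pairs in UTC.
PLANNED_WINDOWS = [
    (
        datetime(2025, 3, 1, 22, 0, tzinfo=timezone.utc),
        datetime(2025, 3, 2, 2, 0, tzinfo=timezone.utc),
    ),
]


def is_suppressed(fired_at, severity, windows=PLANNED_WINDOWS):
    """Return True if an alert should be silenced.

    Alerts fired inside a planned window are suppressed, except
    severity 0, which always passes so real outages stay visible.
    """
    if severity == 0:
        return False
    return any(start <= fired_at < end for start, end in windows)
```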
⚙
5. Dashboards and Workbooks for DR Validation
Create workbooks and dashboards that surface replication health, failover status and post-migration validation metrics.
Top 5 design recommendations
- Build a DR overview workbook showing replication health, RPO, last recovery point and test failover results.
- Create per-application dashboards that compare pre- and post-migration performance metrics.
- Expose Sentinel incidents and Defender alerts in a single operations dashboard for SOC/IT ops.
- Design role-based dashboards for app owners, security, and platform teams.
- Include runbook links and remediation steps directly from dashboards for rapid response.
Top 5 operational best practices
- Keep dashboards lightweight and focused on actionable metrics for DR decision makers.
- Refresh dashboards after each migration wave and DR test to reflect current baselines.
- Share dashboards with stakeholders and embed in runbooks and post-mortems.
- Automate snapshot exports of dashboards for audit and compliance evidence after failovers.
- Use workbook templates to standardize DR reporting across applications.
🔍
6. Test Migrations and Failover Validation
Validate monitoring, detection and response during test failovers and post-migration cutovers.
Top 5 design recommendations
- Plan test failovers with monitoring validation checkpoints (alerts, logs, metrics, Sentinel incidents).
- Define success criteria and automated checks for application health post-failover.
- Ensure Log Analytics retention covers test artifacts for forensic review.
- Include Defender scans and Sentinel hunting queries as part of validation steps.
- Design rollback triggers if monitoring shows critical regressions during cutover.
Top 5 operational best practices
- Run full monitoring validation during each test failover and record results.
- Execute Sentinel hunting queries immediately after tests to detect anomalies.
- Run Defender vulnerability assessments post-migration to find new vulnerabilities, and review Defender recommendations to detect configuration drift.
- Update runbooks with lessons learned and monitoring tuning changes.
- Communicate test outcomes to stakeholders and schedule remediation for findings.
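The validation checkpoints and success criteria above can be recorded as named checks and evaluated mechanically after each test failover. This is a minimal sketch: the check names in `REQUIRED_CHECKS` are hypothetical, and how each result is gathered (queries, probes, manual sign-off) is left open.

```python
# Hypothetical checkpoint names; align them with your own test plan.
REQUIRED_CHECKS = {
    "heartbeat_received",
    "replication_healthy",
    "test_alert_fired",
    "sentinel_incident_created",
    "app_health_probe_ok",
}


def evaluate_failover(results: dict, required=REQUIRED_CHECKS):
    """Return (passed, missing, failed) for one test failover.

    missing: required checks that were never recorded
    failed:  required checks recorded with a falsy result
    """
    missing = sorted(required - set(results))
    failed = sorted(name for name in required if name in results and not results[name])
    passed = not missing and not failed
    return passed, missing, failed
```

Treating an unrecorded check as a failure to pass (rather than silently ignoring it) keeps skipped validation steps visible in the test record.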
📄
7. Log Analytics Workspace Strategy
Design workspaces for scale, retention and cross-subscription correlation for DR and migrated servers.
Top 5 design recommendations
- Choose between a central workspace and per-environment workspaces based on data sovereignty and query performance.
- Plan retention tiers for DR test artifacts and long-term forensic evidence.
- Use Azure Monitor Private Link Scopes (AMPLS) and private endpoints for secure ingestion from DR networks.
- Define workspace naming and tagging to map to migration waves and application owners.
- Consider workspace capacity planning for peak ingestion during failover tests.
Top 5 operational best practices
- Monitor ingestion and query costs and adjust retention or sampling for DR test windows.
- Archive critical test logs to long-term storage if required for compliance.
- Use workspace permissions and RBAC to limit access to sensitive forensic logs.
- Automate workspace provisioning for new migration waves.
- Document workspace mapping to applications and DR plans.
🛠
8. Data Collection Rules (DCRs)
Use Data Collection Rules to control which logs and metrics are collected from DR and migrated servers.
Top 5 design recommendations
- Define DCRs per workload type to limit noise and control costs during failover tests.
- Use DCRs to route sensitive logs to dedicated workspaces with stricter access controls.
- Plan DCR versioning and change control for migration waves.
- Map DCRs to agent configuration and extension deployment processes.
- Include sampling and filtering rules for high-volume telemetry sources.
Top 5 operational best practices
- Test DCRs in a staging workspace before applying to production DR servers.
- Monitor DCR application and agent compliance across migrated servers.
- Keep DCRs in source control and document changes tied to migration waves.
- Use automation to apply DCRs during cutover and remove or adjust after stabilization.
- Audit DCR effectiveness and tune filters to reduce unnecessary ingestion.
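If DCRs are kept in source control, a small lint step can catch broken definitions before they reach DR servers. The sketch below holds a trimmed DCR body as a Python dict and checks that every data flow points at a declared destination. The DCR fragment is abbreviated and illustrative: the names `criticalEvents` and `drWorkspace` are hypothetical, and the workspace ID is left as a placeholder.

```python
# Abbreviated, illustrative DCR body; names and the XPath filter are examples.
SAMPLE_DCR = {
    "properties": {
        "dataSources": {
            "windowsEventLogs": [
                {
                    "name": "criticalEvents",
                    "streams": ["Microsoft-Event"],
                    # Collect only critical and error events to limit volume.
                    "xPathQueries": ["System!*[System[(Level=1 or Level=2)]]"],
                }
            ]
        },
        "destinations": {
            "logAnalytics": [
                {"name": "drWorkspace", "workspaceResourceId": "<workspaceResourceId>"}
            ]
        },
        "dataFlows": [
            {"streams": ["Microsoft-Event"], "destinations": ["drWorkspace"]}
        ],
    }
}


def undeclared_destinations(dcr: dict) -> list:
    """List destination names used by data flows but never declared."""
    props = dcr.get("properties", {})
    declared = set()
    for group in props.get("destinations", {}).values():
        # Some destination types are a single object rather than a list.
        items = group if isinstance(group, list) else [group]
        declared.update(item["name"] for item in items if "name" in item)
    used = {
        name
        for flow in props.get("dataFlows", [])
        for name in flow.get("destinations", [])
    }
    return sorted(used - declared)
```

Running `undeclared_destinations` in a pull-request check is one way to tie DCR changes to the change control described above.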
💻
9. Agent Lifecycle and Management
Manage agent deployment, updates and health for Azure Monitor and Defender agents on DR/migrated servers.
Top 5 design recommendations
- Standardize agent versions and define maintenance windows for updates.
- Use Azure Policy to enforce agent installation and configuration.
- Plan rollback procedures for agent failures during cutover.
- Design monitoring for agent heartbeat and extension status.
- Use managed identities for agent onboarding to reduce credential sprawl.
Top 5 operational best practices
- Automate agent patching and monitor for failed updates.
- Alert on missing heartbeats and remediate via runbooks.
- Keep a secure inventory of agented servers and their workspace assignments.
- Test agent reinstallation procedures during DR drills.
- Document agent configuration baselines for migrated workloads.
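Missing-heartbeat alerting reduces to comparing each server's last heartbeat time with a maximum allowed age. The sketch below assumes you have already pulled the latest heartbeat per server (for example from the Log Analytics `Heartbeat` table) into a dict; the 10-minute threshold is an arbitrary example.

```python
from datetime import datetime, timedelta, timezone


def stale_agents(last_heartbeat: dict, now: datetime, max_age=timedelta(minutes=10)) -> list:
    """Return the names of servers whose last heartbeat is older than max_age."""
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_age
    )
```

The returned list can feed a remediation runbook or be compared against the inventory of agented servers to spot machines that never reported at all.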
🔒
10. Defender for Cloud Policies and Plans
Enable Defender plans and policies to protect migrated servers and DR environments from threats.
Top 5 design recommendations
- Choose Defender plans (Servers, SQL, Storage) appropriate for migrated workloads.
- Define policy assignments and exclusions for DR test environments to avoid noise.
- Map Defender recommendations to remediation runbooks and automation.
- Plan secure score targets and compliance baselines for migrated servers.
- Design integration points between Defender alerts and Sentinel incidents.
Top 5 operational best practices
- Enable auto-provisioning for Defender agents on new migrated VMs.
- Review Defender recommendations after each migration wave and remediate high-risk items.
- Use secure score to track improvement across migration waves.
- Integrate Defender alerts with ITSM and Sentinel playbooks for automated response.
- Document exceptions and maintain an approval log for policy deviations.
📊
11. Analytics Rules and Threat Detection
Create and tune Sentinel analytics rules to detect threats specific to DR and migrated servers.
Top 5 design recommendations
- Map detection rules to DR-specific scenarios (unexpected replication stops, unauthorized failover triggers).
- Use machine learning and UEBA rules where applicable to detect anomalous post-migration behavior.
- Design rule severity and suppression to avoid alert storms during planned failovers.
- Leverage built-in Microsoft rules and customize with KQL for environment specifics.
- Plan rule testing and validation during migration test windows.
Top 5 operational best practices
- Run rule tuning sessions after each migration wave to reduce false positives.
- Document rule rationale and expected outcomes for auditability.
- Use test incidents to validate playbooks and response actions.
- Monitor rule performance and adjust thresholds based on baseline changes.
- Keep a change log for analytics rule updates tied to migration events.
🔧
12. Playbooks, Runbooks and Automated Response
Automate common remediation and DR-specific responses using Logic Apps and Automation runbooks.
Top 5 design recommendations
- Design playbooks for common DR incidents (failed replication, failed test failover, post-cutover anomalies).
- Integrate playbooks with action groups and Sentinel automation rules.
- Use runbooks for environment remediation tasks (agent re-install, workspace reassignment).
- Plan secure credentials and managed identities for playbook actions.
- Include manual approval steps for high-impact automated actions.
Top 5 operational best practices
- Test playbooks during DR drills and validate expected outcomes.
- Keep playbooks in source control and use CI/CD for updates.
- Monitor playbook run history and failures for continuous improvement.
- Limit playbook permissions and audit actions taken by automation.
- Document rollback steps if automated remediation causes unintended effects.
📈
13. Baseline Performance and Health Metrics
Establish baselines for performance and health to detect regressions after migration or failover.
Top 5 design recommendations
- Capture historical metrics for CPU, memory, disk I/O and network to set expected ranges.
- Define SLA thresholds and alerting bands for migrated workloads.
- Include application-level metrics and synthetic transactions in baselines.
- Plan for seasonal or batch-driven load patterns when defining baselines.
- Store baseline snapshots for comparison after each migration wave.
Top 5 operational best practices
- Recompute baselines after major migrations and update alert thresholds accordingly.
- Use automated reports to compare pre- and post-migration performance.
- Run load tests post-migration to validate capacity planning.
- Document deviations and remedial actions taken during cutover.
- Share baseline dashboards with application owners for acceptance.
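Comparing pre- and post-migration metrics can be automated with a simple tolerance band. This sketch assumes "higher is worse" metrics (CPU percentage, latency) and a 20% tolerance; both are illustrative choices, and metric names are hypothetical.

```python
def regressions(baseline: dict, current: dict, tolerance: float = 0.20) -> dict:
    """Return metrics whose current value exceeds baseline by more than tolerance.

    Values are relative changes, e.g. 0.6 means 60% above baseline.
    Metrics missing from `current` or with a zero baseline are skipped.
    """
    flagged = {}
    for metric, base in baseline.items():
        if metric not in current or base == 0:
            continue
        change = (current[metric] - base) / base
        if change > tolerance:
            flagged[metric] = round(change, 3)
    return flagged
```

For throughput-style metrics where lower is worse, invert the comparison or store the metric as a negated value before calling the function.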
📶
14. Network and Connectivity Monitoring
Monitor connectivity, latency and private endpoint health for DR and migrated servers.
Top 5 design recommendations
- Instrument network paths with Network Watcher, connection monitors and synthetic tests.
- Monitor private endpoints and VPN/ExpressRoute links used by DR replication.
- Design alerts for latency, packet loss and route changes that impact RTO.
- Include NSG flow logs and Azure Firewall logs in Sentinel for security correlation.
- Plan network baselines for expected throughput during failover tests.
Top 5 operational best practices
- Run connection monitors before and after failovers to validate connectivity.
- Alert on VPN/ExpressRoute status changes and automate remediation where possible.
- Collect and retain NSG flow logs for post-incident analysis.
- Include network checks in cutover runbooks and dashboards.
- Validate private DNS resolution for migrated servers and endpoints.
📦
15. Storage, Backup and Replication Monitoring
Monitor replication health, backup jobs and storage performance for migrated and DR servers.
Top 5 design recommendations
- Monitor replication health metrics (ASR) and backup job success/failure rates.
- Instrument storage accounts and disks for IOPS, latency and capacity alerts.
- Plan retention and recovery point verification for compliance and RPO validation.
- Integrate backup and replication alerts into Sentinel for security correlation.
- Design dashboards showing backup coverage and last successful recovery point.
Top 5 operational best practices
- Validate backup and replication jobs after migration and record recovery points.
- Alert on failed backups and automate ticket creation for remediation.
- Run periodic restore tests and include monitoring validation steps.
- Monitor storage capacity and forecast growth after migration.
- Document backup/replication ownership and escalation paths.
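RPO validation comes down to checking the age of each server's latest recovery point against its target. The sketch below assumes the latest recovery point timestamps have already been collected (for example from replication health data) and uses a single RPO for all servers for brevity.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(latest_recovery_point: dict, rpo: timedelta, now: datetime) -> dict:
    """Return {server: recovery point age} for servers exceeding the RPO."""
    breaches = {}
    for server, point in latest_recovery_point.items():
        age = now - point
        if age > rpo:
            breaches[server] = age
    return breaches
```

Recording the returned ages alongside each test failover gives a simple evidence trail for the recovery point verification described above.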
👤
16. Identity and Access Monitoring
Monitor identity events, privileged access and conditional access changes for migrated and DR servers.
Top 5 design recommendations
- Ingest Microsoft Entra ID (formerly Azure AD) sign-in and audit logs into Sentinel for correlation with server events.
- Monitor privileged role assignments and Just-In-Time elevation events.
- Design alerts for suspicious sign-ins on migrated or DR admin accounts.
- Include conditional access policy changes in monitoring scope.
- Map identity events to application owners for rapid response.
Top 5 operational best practices
- Review privileged access logs after migration and revoke unnecessary roles.
- Test conditional access policies post-migration to ensure expected behavior.
- Alert on anomalous admin activity and integrate with Sentinel playbooks.
- Keep identity owner contact info up to date for escalations.
- Run periodic access reviews tied to migration waves.
📝
17. Compliance, Policies and Secure Score
Use Defender secure score and policy assessments to track compliance for migrated and DR servers.
Top 5 design recommendations
- Map compliance controls to Defender recommendations and policy initiatives.
- Define acceptable secure score thresholds for production and DR environments.
- Plan policy exemptions for temporary DR test activities with approval tracking.
- Integrate compliance evidence collection into monitoring workbooks.
- Design automated remediation for common policy violations.
Top 5 operational best practices
- Review secure score changes after each migration wave and remediate high-impact items.
- Document policy exceptions and their expiration dates.
- Use automated evidence collection for audits after failover tests.
- Schedule periodic compliance reviews with stakeholders.
- Track remediation progress and report to governance boards.
🚨
18. Incident Response and Playbook Testing
Prepare incident response plans and test playbooks specifically for DR and migrated server incidents.
Top 5 design recommendations
- Define incident types and response steps for DR-specific scenarios.
- Map playbooks to incident severity and required automated actions.
- Include forensic collection steps and evidence preservation in playbooks.
- Design communication templates for stakeholders during failovers.
- Plan post-incident reviews and remediation tracking.
Top 5 operational best practices
- Run tabletop exercises and live playbook tests during migration windows.
- Validate forensic collection and chain-of-custody steps after tests.
- Record lessons learned and update playbooks accordingly.
- Ensure incident roles and contacts are current for each migration wave.
- Automate incident post-mortem data collection for continuous improvement.
📑
19. Forensics, Evidence Collection and Retention
Plan log retention, export and forensic evidence collection for DR tests and post-migration investigations.
Top 5 design recommendations
- Define retention policies for logs and alerts required for compliance and forensics.
- Plan secure export paths to immutable storage for critical incident evidence.
- Include forensic collection steps in playbooks and runbooks.
- Map retention to regulatory requirements for migrated workloads.
- Design access controls for forensic evidence to preserve chain of custody.
Top 5 operational best practices
- Automate export of critical logs after major failover tests to long-term storage.
- Validate integrity of exported evidence and document checksums.
- Limit access to forensic stores and log all access events.
- Include forensic evidence steps in incident post-mortems.
- Review retention policies annually and after major migrations.
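Validating the integrity of exported evidence can be done with a checksum manifest: hash every exported file once, store the manifest with the evidence, and re-verify later. The sketch below uses SHA-256 from the Python standard library; the directory layout and the idea of storing the manifest as a dict are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(export_dir) -> dict:
    """Map each file's path (relative to export_dir) to its SHA-256 digest."""
    root = Path(export_dir)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(export_dir, manifest: dict) -> list:
    """Return files from the manifest that are missing or whose hash changed."""
    current = build_manifest(export_dir)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

An empty list from `verify_manifest` means every file recorded in the manifest is still present and unchanged; it does not report files added after the manifest was built.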
💰
20. Cost Management and Data Ingestion Controls
Control ingestion costs and monitor data volume spikes during migration and DR tests.
Top 5 design recommendations
- Estimate ingestion and retention costs for Log Analytics and Sentinel during migration waves.
- Use sampling, filtering and DCRs to limit high-volume telemetry during tests.
- Design workspace tiers and retention to balance cost and forensic needs.
- Plan alerts for unexpected ingestion spikes during failovers.
- Map cost centers to migration waves and tag resources for chargeback.
Top 5 operational best practices
- Monitor ingestion and query costs daily during migration windows.
- Apply temporary retention reductions for non-critical logs during tests.
- Use budgets and alerts to prevent runaway costs from unexpected telemetry.
- Report cost variances after each migration wave and adjust plans.
- Archive older logs to cheaper storage when forensic needs allow.
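An ingestion spike alert can be as simple as comparing the latest day's volume with a trailing average. The sketch below assumes daily ingestion totals in GB are already available as a list (for example, exported from the workspace `Usage` table); the 7-day window and 1.5x factor are arbitrary starting points.

```python
from statistics import mean


def ingestion_spike(daily_gb: list, window: int = 7, factor: float = 1.5) -> bool:
    """Flag the latest day if it exceeds factor x the mean of the prior window.

    Returns False when there is not enough history to compare against.
    """
    if len(daily_gb) < window + 1:
        return False
    history = daily_gb[-(window + 1):-1]
    return daily_gb[-1] > factor * mean(history)
```

During planned failover tests a spike is expected, so pair this check with the test calendar before raising an incident.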
🛠
21. Multi-Subscription and Multi-Tenant Monitoring
Design monitoring across subscriptions and tenants to maintain visibility for DR and migrated servers.
Top 5 design recommendations
- Decide between a central monitoring subscription and a distributed model for cross-subscription correlation.
- Use Azure Lighthouse or cross-tenant connectors for managed monitoring scenarios.
- Plan RBAC and least-privilege access across subscriptions for monitoring teams.
- Design workspace linking and data export strategies for multi-subscription queries.
- Map subscription boundaries to migration waves and ownership.
Top 5 operational best practices
- Automate workspace and connector provisioning across subscriptions for each migration wave.
- Validate cross-subscription queries and dashboards after cutover.
- Audit cross-subscription access and remove stale permissions post-migration.
- Use consistent tagging and naming across subscriptions for reporting.
- Document multi-subscription monitoring topology for operations teams.
📱
22. Third-Party SIEM and ITSM Integrations
Integrate Sentinel, Monitor and Defender with existing SIEMs and ITSM tools for unified incident handling.
Top 5 design recommendations
- Plan connectors and data flows to external SIEMs and ITSM platforms for DR incident workflows.
- Define which alerts and incidents should be forwarded and which remain in Sentinel.
- Design secure webhook and API integrations with proper authentication and throttling.
- Map ticketing fields to Sentinel incident properties for consistent triage.
- Include integration tests in migration validation plans.
Top 5 operational best practices
- Test end-to-end ticket creation and closure during DR drills.
- Monitor integration health and queue lengths for forwarded incidents.
- Keep mapping documentation current for incident fields and priorities.
- Use retries and dead-letter handling for failed webhook deliveries.
- Audit integration access and rotate credentials regularly.
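Retries with dead-letter handling for webhook deliveries can be sketched independently of any particular ITSM product. In the example below, `send` is a hypothetical callable that raises on failure, the dead-letter queue is a plain list standing in for durable storage, and the exponential backoff values are illustrative.

```python
import time


def deliver_with_retry(send, payload, attempts=3, base_delay=1.0,
                       dead_letter=None, sleep=time.sleep):
    """Try to deliver payload, backing off exponentially between attempts.

    Returns True on success. After the final failure the payload is
    appended to dead_letter (if provided) and False is returned.
    """
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except Exception:
            # Wait 1x, 2x, 4x ... base_delay, but not after the last attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    if dead_letter is not None:
        dead_letter.append(payload)
    return False
```

Passing `sleep` as a parameter keeps the function easy to exercise in DR drills and unit tests without real waiting.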
📄
23. Reporting, Dashboards and Audit Evidence
Provide stakeholders with regular reports and audit evidence for DR tests and migrated servers.
Top 5 design recommendations
- Design standardized reports for DR readiness, test outcomes and security posture.
- Include secure score, incident trends and remediation status in executive reports.
- Automate report generation and distribution after each migration wave.
- Store audit evidence and runbook execution logs in a central, immutable store.
- Map reporting cadence to governance and compliance requirements.
Top 5 operational best practices
- Distribute post-test reports to stakeholders and include remediation plans.
- Keep an audit trail of who ran tests and who approved exceptions.
- Archive reports and evidence for compliance retention periods.
- Use automated dashboards for near-real-time status during cutovers.
- Review reporting templates annually and after major migrations.
📚
24. Training, Runbooks and Knowledge Transfer
Train operations and security teams on monitoring, playbooks and DR runbooks for migrated servers.
Top 5 design recommendations
- Create runbooks for onboarding, failover validation and incident response specific to migrated servers.
- Design role-based training for app owners, platform ops and SOC analysts.
- Include monitoring and Sentinel query examples in training materials.
- Plan knowledge transfer sessions aligned to migration waves.
- Maintain a runbook library with versioning and approval history.
Top 5 operational best practices
- Run hands-on training and tabletop exercises before each migration wave.
- Keep runbooks updated with lessons learned from tests and incidents.
- Record training sessions and make them available for on-demand refreshers.
- Validate runbook steps during live DR drills and update as needed.
- Track training completion for critical roles and enforce refresh cycles.
🔧
25. Troubleshooting, Diagnostics and Playbooks
Maintain troubleshooting guides and playbooks for common monitoring and DR issues on migrated servers.
Top 5 design recommendations
- Document common failure modes and diagnostic steps for replication, agent, and connectivity issues.
- Design playbooks for automated diagnostics (collect logs, run health checks, restart agents).
- Include escalation paths and contact lists in troubleshooting guides.
- Map diagnostics outputs to runbook remediation steps for rapid resolution.
- Keep troubleshooting artifacts and scripts in a secure, versioned repository.
Top 5 operational best practices
- Maintain a searchable knowledge base of past incidents and resolutions.
- Run periodic diagnostics drills to ensure playbooks and scripts work as expected.
- Automate common fixes where safe and include manual approval for risky actions.
- Review and update troubleshooting guides after each migration wave.
- Train on diagnostic tools and ensure access for on-call responders.
💻
PowerShell Quick Commands (Azure Monitor / Sentinel / Defender)
Sample Az PowerShell and module commands to automate monitoring, Sentinel and Defender tasks for DR and migrated servers.
Azure Monitor (10)
Connect-AzAccount
Get-AzMetric -ResourceId <resourceId> -MetricName "Percentage CPU"
Get-AzOperationalInsightsWorkspace -ResourceGroupName <rg>
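New-AzDiagnosticSetting -Name <name> -ResourceId <resId> -WorkspaceId <workspaceId> -Log <logSettings> -Metric <metricSettings>
Add-AzMetricAlertRuleV2 -Name "ReplicationHealth" -ResourceGroupName <rg> -TargetResourceId <resId> -WindowSize <window> -Frequency <frequency> -Severity <severity> -Condition <condition>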
Get-AzActionGroup -ResourceGroupName <rg>
New-AzActionGroup -Name <name> -ResourceGroupName <rg> -Location "Global" -GroupShortName <short> -EmailReceiver <receiver>
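New-AzScheduledQueryRule -ResourceGroupName <rg> -Name "DR-Log-Alert" -Location <location> -Scope <workspaceResourceId> -Severity <severity> -WindowSize <window> -EvaluationFrequency <frequency> -CriterionAllOf (New-AzScheduledQueryRuleConditionObject -Query <kql> -TimeAggregation Count -Operator GreaterThan -Threshold 0)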
Get-AzMetric -ResourceId <resId>
Get-AzActivityLog -StartTime (Get-Date).AddHours(-1)
Microsoft Sentinel (10)
Install-Module -Name Az.SecurityInsights
Connect-AzAccount
Get-AzSentinelAlertRule -ResourceGroupName <rg> -WorkspaceName <ws>
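New-AzSentinelAlertRule -ResourceGroupName <rg> -WorkspaceName <ws> -Kind Scheduled -DisplayName "DR-Failover-Alert" -Query <kql> -QueryFrequency <frequency> -QueryPeriod <period> -TriggerOperator GreaterThan -TriggerThreshold 0 -Severity High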
Get-AzSentinelDataConnector -ResourceGroupName <rg> -WorkspaceName <ws>
New-AzSentinelDataConnector -ResourceGroupName <rg> -WorkspaceName <ws> -Name "AzureActivity"
Get-AzSentinelIncident -ResourceGroupName <rg> -WorkspaceName <ws>
New-AzSentinelWatchlist -ResourceGroupName <rg> -WorkspaceName <ws> -Name "MigratedServers" -Source <csv>
Invoke-AzSentinelPlaybook -ResourceGroupName <rg> -WorkspaceName <ws> -PlaybookName <name>
Get-AzSentinelAutomationRule -ResourceGroupName <rg> -WorkspaceName <ws>
Microsoft Defender (11)
Install-Module -Name Az.Security
Connect-AzAccount
Set-AzContext -Subscription <subscriptionId>; Register-AzResourceProvider -ProviderNamespace "Microsoft.Security"
Get-AzSecurityAlert -ResourceGroupName <rg>
Set-AzSecurityPricing -Name "VirtualMachines" -PricingTier "Standard"
Get-AzSecurityAssessment -ResourceGroupName <rg>
Invoke-AzSecurityAssessment -Name <assessmentName> -ResourceGroupName <rg>
Get-AzSecurityContact
Set-AzSecurityAutoProvisioningSetting -Name "default" -EnableAutoProvision
Get-AzSecurityTask -ResourceGroupName <rg>
Get-AzSecuritySecureScore -SubscriptionId <subscriptionId>
Notes
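- Replace angle-bracket placeholders (such as <rg>, <ws>, <resId>, <subscriptionId>) with your values.
- Use modules Az.Accounts, Az.Monitor, Az.OperationalInsights, Az.SecurityInsights and Az.Security for full automation.
- Cmdlet names and parameters differ between Az module versions; confirm each command with Get-Command and Get-Help before scripting it.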