cloudconsulting.agustin

Azure Monitoring + Microsoft Sentinel + Defender Top 25

Checklist reordered for monitoring disaster recovery (DR) servers and migrated servers: planning, instrumentation, alerting, detection, response and validation.

📈

1. Define Monitoring Goals for DR and Migrated Servers

Define availability, RPO/RTO monitoring, telemetry retention and incident response SLAs for DR and migrated servers.

Top 5 design recommendations

  1. Classify servers by criticality and map to monitoring retention and alert severity.
  2. Choose Log Analytics workspace strategy (per environment, per subscription, or central) for DR visibility.
  3. Define metrics and logs required for RPO/RTO verification (replication health, recovery points, failover tests).
  4. Plan network and private endpoint access for monitoring agents on DR/migrated servers.
  5. Design tagging to link monitored resources to migration waves and DR runbooks.
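
One way to make recommendation 1 concrete is a small lookup from criticality tier to monitoring settings. The sketch below is illustrative only: the tier names, the `criticality` tag key and the retention values are assumptions, not values prescribed by this checklist. The severity numbers follow the Azure Monitor scale of 0 (critical) to 4 (verbose).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringProfile:
    retention_days: int  # workspace retention for this tier (assumed value)
    alert_severity: int  # Azure Monitor severity: 0 = critical ... 4 = verbose


# Hypothetical tiers; replace with your own classification.
PROFILES = {
    "tier1": MonitoringProfile(retention_days=365, alert_severity=0),
    "tier2": MonitoringProfile(retention_days=180, alert_severity=1),
    "tier3": MonitoringProfile(retention_days=90, alert_severity=2),
}


def profile_for(tags: dict) -> MonitoringProfile:
    """Return the monitoring profile for a server from its resource tags.

    Servers with a missing or unknown tier get the strictest profile so that
    untagged machines are not silently under-monitored.
    """
    tier = tags.get("criticality", "").strip().lower()
    return PROFILES.get(tier, PROFILES["tier1"])
```

Defaulting unknown servers to the strictest tier costs more in retention but surfaces tagging gaps early; a cheaper default may suit test environments.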

Top 5 operational best practices

  1. Baseline pre-migration telemetry and compare post-migration to validate parity.
  2. Schedule synthetic availability tests for DR endpoints and migrated app endpoints.
  3. Automate onboarding of migrated servers to Log Analytics and Defender policies.
  4. Run periodic failover drills and validate monitoring alerts trigger expected incidents.
  5. Keep runbooks and escalation contacts in sync with migration waves.

💻

2. Instrumentation: AMA, Extensions and Defender Agents

Standardize on Azure Monitor Agent (AMA) and Defender onboarding for consistent telemetry on DR and migrated servers.

Top 5 design recommendations

  1. Use AMA (Azure Monitor Agent) for logs/metrics and migrate from legacy agents before cutover.
  2. Plan Log Analytics workspace sizing and retention for DR test logs and long-term forensic needs.
  3. Define required VM extensions and Defender for Servers plan for hybrid machines.
  4. Use managed identities or service principals for agent onboarding at scale.
  5. Design DCRs (Data Collection Rules) to separate DR telemetry from normal operations if needed.

Top 5 operational best practices

  1. Automate agent deployment via ARM/Policy/Automation runbooks during migration waves.
  2. Validate agent connectivity to workspace and Sentinel ingestion after failover tests.
  3. Monitor agent health and extension update status centrally.
  4. Document rollback steps to remove agents from decommissioned DR VMs.
  5. Use deployment tags to track which servers were instrumented pre- and post-migration.

👥

3. Microsoft Sentinel Workspace Design

Design Sentinel workspaces and data connectors to ensure DR and migrated servers feed security telemetry and alerts.

Top 5 design recommendations

  1. Decide central vs per-environment Sentinel workspace for incident correlation across DR and production.
  2. Enable required connectors (Azure Activity, Windows Security Events, Syslog, Windows Forwarded Events) for migrated servers.
  3. Plan retention and storage for long-term forensic needs from DR tests and failovers.
  4. Map analytics rules to DR scenarios (replication anomalies, unexpected failovers, post-cutover anomalies).
  5. Define playbooks for automated response during failover and post-migration incidents.

Top 5 operational best practices

  1. Validate connectors ingesting DR server logs during test failovers.
  2. Tune analytics rules to reduce noise from expected DR activities.
  3. Use watchlists to track migrated server inventories and asset owners.
  4. Run hunting queries after cutover to detect configuration drift or suspicious activity.
  5. Document incident playbook steps specific to DR failover and failback events.
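
Sentinel watchlists are commonly populated from CSV content, so the migrated-server inventory in practice 3 can be generated rather than hand-edited. This is a minimal sketch: the column names (`ServerName`, `MigrationWave`, `Owner`) are hypothetical and should match whatever your analytics rules join on.

```python
import csv
import io


def build_watchlist_csv(servers: list) -> str:
    """Render server records as CSV text with a fixed header row."""
    columns = ["ServerName", "MigrationWave", "Owner"]  # assumed column names
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for server in servers:
        # Keep only the expected columns; a missing key raises KeyError.
        writer.writerow({column: server[column] for column in columns})
    return buffer.getvalue()
```

The resulting text can be saved to a file and supplied as the watchlist source.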

📥

4. Alerts, Action Groups and Escalation Paths

Define alert rules, action groups and escalation for DR and migrated server incidents to meet RTO targets.

Top 5 design recommendations

  1. Classify alerts by severity and map to action groups with on-call rotations for DR waves.
  2. Use metric alerts for replication health and log alerts for failover events.
  3. Integrate alerts with ITSM and incident management for automated ticketing during failovers.
  4. Design alert suppression windows for planned failovers and tests to avoid noise.
  5. Define SLA-driven escalation timelines and automated runbook triggers.
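
The suppression windows in recommendation 4 reduce to a simple decision: was the alert raised inside a planned window for that server's migration wave? In Azure this is normally configured with alert processing rules; the sketch below only illustrates the logic. The `PlannedWindow` type and the rule that severity 0 is never suppressed are assumptions for illustration.

```python
from datetime import datetime
from typing import NamedTuple


class PlannedWindow(NamedTuple):
    wave: str        # migration wave the window applies to
    start: datetime  # inclusive
    end: datetime    # exclusive


def should_notify(alert_time: datetime, severity: int,
                  resource_wave: str, windows: list) -> bool:
    """Return False only for non-critical alerts raised inside a planned window."""
    if severity == 0:
        return True  # assumed policy: never suppress critical alerts
    in_window = any(
        w.wave == resource_wave and w.start <= alert_time < w.end
        for w in windows
    )
    return not in_window
```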

Top 5 operational best practices

  1. Test action groups and escalation during DR drills to confirm notifications reach responders.
  2. Use SMS/voice/email and webhook channels for multi-channel escalation.
  3. Maintain an on-call roster aligned to migration waves and DR owners.
  4. Document alert tuning changes made for migration windows and revert after stabilization.
  5. Audit alert delivery and incident creation after each failover test.

5. Dashboards and Workbooks for DR Validation

Create workbooks and dashboards that surface replication health, failover status and post-migration validation metrics.

Top 5 design recommendations

  1. Build a DR overview workbook showing replication health, RPO, last recovery point and test failover results.
  2. Create per-application dashboards that compare pre- and post-migration performance metrics.
  3. Expose Sentinel incidents and Defender alerts in a single operations dashboard for SOC/IT ops.
  4. Design role-based dashboards for app owners, security, and platform teams.
  5. Include runbook links and remediation steps directly from dashboards for rapid response.

Top 5 operational best practices

  1. Keep dashboards lightweight and focused on actionable metrics for DR decision makers.
  2. Refresh dashboards after each migration wave and DR test to reflect current baselines.
  3. Share dashboards with stakeholders and embed in runbooks and post-mortems.
  4. Automate snapshot exports of dashboards for audit and compliance evidence after failovers.
  5. Use workbook templates to standardize DR reporting across applications.

🔍

6. Test Migrations and Failover Validation

Validate monitoring, detection and response during test failovers and post-migration cutovers.

Top 5 design recommendations

  1. Plan test failovers with monitoring validation checkpoints (alerts, logs, metrics, Sentinel incidents).
  2. Define success criteria and automated checks for application health post-failover.
  3. Ensure Log Analytics retention covers test artifacts for forensic review.
  4. Include Defender scans and Sentinel hunting queries as part of validation steps.
  5. Design rollback triggers if monitoring shows critical regressions during cutover.
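
The validation checkpoints in recommendation 1 can be scored automatically once each check reports a pass or fail. A minimal sketch, assuming each checkpoint is recorded as a boolean under hypothetical key names:

```python
# Checkpoints named in the checklist; the key spellings are assumptions.
REQUIRED_CHECKPOINTS = ("alerts", "logs", "metrics", "sentinel_incidents")


def evaluate_failover_test(results: dict) -> tuple:
    """Return (passed, failed_checkpoints) for one test failover.

    A checkpoint that was never recorded counts as failed, so a skipped
    check cannot produce a passing test.
    """
    failed = [name for name in REQUIRED_CHECKPOINTS if results.get(name) is not True]
    return (not failed, failed)
```

The list of failed checkpoints can feed the rollback trigger in recommendation 5 or the recorded results in the operational practices.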

Top 5 operational best practices

  1. Run full monitoring validation during each test failover and record results.
  2. Execute Sentinel hunting queries immediately after tests to detect anomalies.
  3. Run Defender vulnerability assessments post-migration to find unpatched software, and review Defender recommendations for configuration drift.
  4. Update runbooks with lessons learned and monitoring tuning changes.
  5. Communicate test outcomes to stakeholders and schedule remediation for findings.

📄

7. Log Analytics Workspace Strategy

Design workspaces for scale, retention and cross-subscription correlation for DR and migrated servers.

Top 5 design recommendations

  1. Choose between central workspace or per-environment workspaces based on data sovereignty and query performance.
  2. Plan retention tiers for DR test artifacts and long-term forensic evidence.
  3. Use Azure Monitor Private Link Scope (AMPLS) and private endpoints for secure ingestion from DR networks.
  4. Define workspace naming and tagging to map to migration waves and application owners.
  5. Consider workspace capacity planning for peak ingestion during failover tests.

Top 5 operational best practices

  1. Monitor ingestion and query costs and adjust retention or sampling for DR test windows.
  2. Archive critical test logs to long-term storage if required for compliance.
  3. Use workspace permissions and RBAC to limit access to sensitive forensic logs.
  4. Automate workspace provisioning for new migration waves.
  5. Document workspace mapping to applications and DR plans.

🛠

8. Data Collection Rules (DCRs)

Use Data Collection Rules to control which logs and metrics are collected from DR and migrated servers.

Top 5 design recommendations

  1. Define DCRs per workload type to limit noise and control costs during failover tests.
  2. Use DCRs to route sensitive logs to dedicated workspaces with stricter access controls.
  3. Plan DCR versioning and change control for migration waves.
  4. Map DCRs to agent configuration and extension deployment processes.
  5. Include sampling and filtering rules for high-volume telemetry sources.

Top 5 operational best practices

  1. Test DCRs in a staging workspace before applying to production DR servers.
  2. Monitor DCR application and agent compliance across migrated servers.
  3. Keep DCRs in source control and document changes tied to migration waves.
  4. Use automation to apply DCRs during cutover and remove or adjust after stabilization.
  5. Audit DCR effectiveness and tune filters to reduce unnecessary ingestion.

💻

9. Agent Lifecycle and Management

Manage agent deployment, updates and health for Azure Monitor and Defender agents on DR/migrated servers.

Top 5 design recommendations

  1. Standardize agent versions and define maintenance windows for updates.
  2. Use Azure Policy to enforce agent installation and configuration.
  3. Plan rollback procedures for agent failures during cutover.
  4. Design monitoring for agent heartbeat and extension status.
  5. Use managed identities for agent onboarding to reduce credential sprawl.

Top 5 operational best practices

  1. Automate agent patching and monitor for failed updates.
  2. Alert on missing heartbeats and remediate via runbooks.
  3. Keep a secure inventory of agented servers and their workspace assignments.
  4. Test agent reinstallation procedures during DR drills.
  5. Document agent configuration baselines for migrated workloads.
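
The missing-heartbeat alert in practice 2 comes down to comparing each server's last heartbeat with a maximum age. The sketch assumes the last-seen timestamps were already retrieved (for example from the workspace's Heartbeat table); the 15-minute default is an illustrative threshold, not a recommendation from this checklist.

```python
from datetime import datetime, timedelta


def stale_agents(last_heartbeat: dict, now: datetime,
                 max_age: timedelta = timedelta(minutes=15)) -> list:
    """Return the names of servers whose last heartbeat is older than max_age.

    last_heartbeat maps server name -> datetime of the most recent heartbeat.
    The result is sorted so repeated runs produce stable output.
    """
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > max_age
    )
```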

🔒

10. Defender for Cloud Policies and Plans

Enable Defender plans and policies to protect migrated servers and DR environments from threats.

Top 5 design recommendations

  1. Choose Defender plans (Servers, SQL, Storage) appropriate for migrated workloads.
  2. Define policy assignments and exclusions for DR test environments to avoid noise.
  3. Map Defender recommendations to remediation runbooks and automation.
  4. Plan secure score targets and compliance baselines for migrated servers.
  5. Design integration points between Defender alerts and Sentinel incidents.

Top 5 operational best practices

  1. Enable auto-provisioning for Defender agents on new migrated VMs.
  2. Review Defender recommendations after each migration wave and remediate high-risk items.
  3. Use secure score to track improvement across migration waves.
  4. Integrate Defender alerts with ITSM and Sentinel playbooks for automated response.
  5. Document exceptions and maintain an approval log for policy deviations.

📊

11. Analytics Rules and Threat Detection

Create and tune Sentinel analytics rules to detect threats specific to DR and migrated servers.

Top 5 design recommendations

  1. Map detection rules to DR-specific scenarios (unexpected replication stops, unauthorized failover triggers).
  2. Use machine learning and UEBA rules where applicable to detect anomalous post-migration behavior.
  3. Design rule severity and suppression to avoid alert storms during planned failovers.
  4. Leverage built-in Microsoft rules and customize with KQL for environment specifics.
  5. Plan rule testing and validation during migration test windows.

Top 5 operational best practices

  1. Run rule tuning sessions after each migration wave to reduce false positives.
  2. Document rule rationale and expected outcomes for auditability.
  3. Use test incidents to validate playbooks and response actions.
  4. Monitor rule performance and adjust thresholds based on baseline changes.
  5. Keep a change log for analytics rule updates tied to migration events.

🔧

12. Playbooks, Runbooks and Automated Response

Automate common remediation and DR-specific responses using Logic Apps and Automation runbooks.

Top 5 design recommendations

  1. Design playbooks for common DR incidents (failed replication, failed test failover, post-cutover anomalies).
  2. Integrate playbooks with action groups and Sentinel automation rules.
  3. Use runbooks for environment remediation tasks (agent re-install, workspace reassignment).
  4. Plan secure credentials and managed identities for playbook actions.
  5. Include manual approval steps for high-impact automated actions.

Top 5 operational best practices

  1. Test playbooks during DR drills and validate expected outcomes.
  2. Keep playbooks in source control and use CI/CD for updates.
  3. Monitor playbook run history and failures for continuous improvement.
  4. Limit playbook permissions and audit actions taken by automation.
  5. Document rollback steps if automated remediation causes unintended effects.

📈

13. Baseline Performance and Health Metrics

Establish baselines for performance and health to detect regressions after migration or failover.

Top 5 design recommendations

  1. Capture historical metrics for CPU, memory, disk I/O and network to set expected ranges.
  2. Define SLA thresholds and alerting bands for migrated workloads.
  3. Include application-level metrics and synthetic transactions in baselines.
  4. Plan for seasonal or batch-driven load patterns when defining baselines.
  5. Store baseline snapshots for comparison after each migration wave.

Top 5 operational best practices

  1. Recompute baselines after major migrations and update alert thresholds accordingly.
  2. Use automated reports to compare pre- and post-migration performance.
  3. Run load tests post-migration to validate capacity planning.
  4. Document deviations and remedial actions taken during cutover.
  5. Share baseline dashboards with application owners for acceptance.
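
A pre- versus post-migration comparison can be as simple as checking whether the new mean sits well outside the baseline's normal spread. The sketch below is one possible rule, not the only one: it flags a regression when the post-migration mean is more than `k` standard deviations above the baseline mean, which suits metrics where higher is worse (CPU, latency). The function name and the default `k = 3` are assumptions.

```python
from statistics import mean, stdev


def exceeds_baseline(baseline: list, post_migration: list, k: float = 3.0) -> bool:
    """True when the post-migration mean is more than k sample standard
    deviations above the baseline mean."""
    if len(baseline) < 2 or not post_migration:
        raise ValueError("need at least two baseline samples and one post-migration sample")
    threshold = mean(baseline) + k * stdev(baseline)
    return mean(post_migration) > threshold
```

For workloads with seasonal or batch-driven load, compare like-for-like windows (same weekday and hour) rather than a single global baseline.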

📶

14. Network and Connectivity Monitoring

Monitor connectivity, latency and private endpoint health for DR and migrated servers.

Top 5 design recommendations

  1. Instrument network paths with Network Watcher, connection monitors and synthetic tests.
  2. Monitor private endpoints and VPN/ExpressRoute links used by DR replication.
  3. Design alerts for latency, packet loss and route changes that impact RTO.
  4. Include NSG flow logs and Azure Firewall logs in Sentinel for security correlation.
  5. Plan network baselines for expected throughput during failover tests.

Top 5 operational best practices

  1. Run connection monitors before and after failovers to validate connectivity.
  2. Alert on VPN/ExpressRoute status changes and automate remediation where possible.
  3. Collect and retain NSG flow logs for post-incident analysis.
  4. Include network checks in cutover runbooks and dashboards.
  5. Validate private DNS resolution for migrated servers and endpoints.

📦

15. Storage, Backup and Replication Monitoring

Monitor replication health, backup jobs and storage performance for migrated and DR servers.

Top 5 design recommendations

  1. Monitor replication health metrics (ASR) and backup job success/failure rates.
  2. Instrument storage accounts and disks for IOPS, latency and capacity alerts.
  3. Plan retention and recovery point verification for compliance and RPO validation.
  4. Integrate backup and replication alerts into Sentinel for security correlation.
  5. Design dashboards showing backup coverage and last successful recovery point.
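
RPO validation in recommendation 3 is a time difference: how old is the latest recovery point compared with the target? A minimal sketch, assuming the last recovery point timestamp has already been read from the replication or backup service; the function and key names are hypothetical.

```python
from datetime import datetime, timedelta


def rpo_status(last_recovery_point: datetime, rpo_target: timedelta,
               now: datetime) -> dict:
    """Report how far the latest recovery point lags and whether it meets the RPO."""
    lag = now - last_recovery_point
    return {
        "lag_minutes": lag.total_seconds() / 60,
        "within_rpo": lag <= rpo_target,
    }
```

Passing `now` explicitly keeps the check reproducible in tests and lets a dashboard evaluate historical points.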

Top 5 operational best practices

  1. Validate backup and replication jobs after migration and record recovery points.
  2. Alert on failed backups and automate ticket creation for remediation.
  3. Run periodic restore tests and include monitoring validation steps.
  4. Monitor storage capacity and forecast growth after migration.
  5. Document backup/replication ownership and escalation paths.

👤

16. Identity and Access Monitoring

Monitor identity events, privileged access and conditional access changes for migrated and DR servers.

Top 5 design recommendations

  1. Ingest Microsoft Entra ID (formerly Azure AD) sign-in and audit logs into Sentinel for correlation with server events.
  2. Monitor privileged role assignments and Just-In-Time elevation events.
  3. Design alerts for suspicious sign-ins on migrated or DR admin accounts.
  4. Include conditional access policy changes in monitoring scope.
  5. Map identity events to application owners for rapid response.

Top 5 operational best practices

  1. Review privileged access logs after migration and revoke unnecessary roles.
  2. Test conditional access policies post-migration to ensure expected behavior.
  3. Alert on anomalous admin activity and integrate with Sentinel playbooks.
  4. Keep identity owner contact info up to date for escalations.
  5. Run periodic access reviews tied to migration waves.

📝

17. Compliance, Policies and Secure Score

Use Defender secure score and policy assessments to track compliance for migrated and DR servers.

Top 5 design recommendations

  1. Map compliance controls to Defender recommendations and policy initiatives.
  2. Define acceptable secure score thresholds for production and DR environments.
  3. Plan policy exemptions for temporary DR test activities with approval tracking.
  4. Integrate compliance evidence collection into monitoring workbooks.
  5. Design automated remediation for common policy violations.

Top 5 operational best practices

  1. Review secure score changes after each migration wave and remediate high-impact items.
  2. Document policy exceptions and their expiration dates.
  3. Use automated evidence collection for audits after failover tests.
  4. Schedule periodic compliance reviews with stakeholders.
  5. Track remediation progress and report to governance boards.

🚨

18. Incident Response and Playbook Testing

Prepare incident response plans and test playbooks specifically for DR and migrated server incidents.

Top 5 design recommendations

  1. Define incident types and response steps for DR-specific scenarios.
  2. Map playbooks to incident severity and required automated actions.
  3. Include forensic collection steps and evidence preservation in playbooks.
  4. Design communication templates for stakeholders during failovers.
  5. Plan post-incident reviews and remediation tracking.

Top 5 operational best practices

  1. Run tabletop exercises and live playbook tests during migration windows.
  2. Validate forensic collection and chain-of-custody steps after tests.
  3. Record lessons learned and update playbooks accordingly.
  4. Ensure incident roles and contacts are current for each migration wave.
  5. Automate incident post-mortem data collection for continuous improvement.

📑

19. Forensics, Evidence Collection and Retention

Plan log retention, export and forensic evidence collection for DR tests and post-migration investigations.

Top 5 design recommendations

  1. Define retention policies for logs and alerts required for compliance and forensics.
  2. Plan secure export paths to immutable storage for critical incident evidence.
  3. Include forensic collection steps in playbooks and runbooks.
  4. Map retention to regulatory requirements for migrated workloads.
  5. Design access controls for forensic evidence to preserve chain of custody.

Top 5 operational best practices

  1. Automate export of critical logs after major failover tests to long-term storage.
  2. Validate integrity of exported evidence and document checksums.
  3. Limit access to forensic stores and log all access events.
  4. Include forensic evidence steps in incident post-mortems.
  5. Review retention policies annually and after major migrations.
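
Documenting checksums (practice 2) can be scripted with a standard hash. The sketch uses SHA-256 from the Python standard library and reads files in chunks so large exports do not need to fit in memory. The helper names and the manifest layout (checksum, two spaces, file name) are assumptions chosen to resemble common checksum tools.

```python
import hashlib
from pathlib import Path


def sha256_of_chunks(chunks) -> str:
    """Return the hex SHA-256 digest of an iterable of byte chunks."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file without loading it fully into memory."""
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        return sha256_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def manifest_line(path: Path) -> str:
    """One 'checksum  filename' line for an evidence manifest."""
    return f"{sha256_of_file(path)}  {path.name}"
```

Store the manifest separately from the evidence itself so that tampering with one does not silently invalidate the other.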

💰

20. Cost Management and Data Ingestion Controls

Control ingestion costs and monitor data volume spikes during migration and DR tests.

Top 5 design recommendations

  1. Estimate ingestion and retention costs for Log Analytics and Sentinel during migration waves.
  2. Use sampling, filtering and DCRs to limit high-volume telemetry during tests.
  3. Design workspace tiers and retention to balance cost and forensic needs.
  4. Plan alerts for unexpected ingestion spikes during failovers.
  5. Map cost centers to migration waves and tag resources for chargeback.
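
The spike alert in recommendation 4 can start from a simple ratio against recent history. This sketch assumes daily ingestion volumes in GB are already available; the `factor` default and the per-GB price are caller-supplied assumptions, not published rates.

```python
def ingestion_spike(history_gb: list, today_gb: float, factor: float = 2.0) -> bool:
    """True when today's volume exceeds `factor` times the trailing daily average."""
    if not history_gb:
        return False  # no baseline yet, so nothing to compare against
    average = sum(history_gb) / len(history_gb)
    return today_gb > factor * average


def estimated_cost(gb_ingested: float, price_per_gb: float) -> float:
    """Rough ingestion cost; price_per_gb must come from your own agreement."""
    return round(gb_ingested * price_per_gb, 2)
```

During planned failover tests a higher `factor` avoids alerting on expected bursts.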

Top 5 operational best practices

  1. Monitor ingestion and query costs daily during migration windows.
  2. Apply temporary retention reductions for non-critical logs during tests.
  3. Use budgets and alerts to prevent runaway costs from unexpected telemetry.
  4. Report cost variances after each migration wave and adjust plans.
  5. Archive older logs to cheaper storage when forensic needs allow.

🛠

21. Multi-Subscription and Multi-Tenant Monitoring

Design monitoring across subscriptions and tenants to maintain visibility for DR and migrated servers.

Top 5 design recommendations

  1. Decide central monitoring subscription vs distributed model for cross-subscription correlation.
  2. Use Azure Lighthouse or cross-tenant connectors for managed monitoring scenarios.
  3. Plan RBAC and least-privilege access across subscriptions for monitoring teams.
  4. Design workspace linking and data export strategies for multi-subscription queries.
  5. Map subscription boundaries to migration waves and ownership.

Top 5 operational best practices

  1. Automate workspace and connector provisioning across subscriptions for each migration wave.
  2. Validate cross-subscription queries and dashboards after cutover.
  3. Audit cross-subscription access and remove stale permissions post-migration.
  4. Use consistent tagging and naming across subscriptions for reporting.
  5. Document multi-subscription monitoring topology for operations teams.

📱

22. Third-Party SIEM and ITSM Integrations

Integrate Sentinel, Monitor and Defender with existing SIEMs and ITSM tools for unified incident handling.

Top 5 design recommendations

  1. Plan connectors and data flows to external SIEMs and ITSM platforms for DR incident workflows.
  2. Define which alerts and incidents should be forwarded and which remain in Sentinel.
  3. Design secure webhook and API integrations with proper authentication and throttling.
  4. Map ticketing fields to Sentinel incident properties for consistent triage.
  5. Include integration tests in migration validation plans.

Top 5 operational best practices

  1. Test end-to-end ticket creation and closure during DR drills.
  2. Monitor integration health and queue lengths for forwarded incidents.
  3. Keep mapping documentation current for incident fields and priorities.
  4. Use retries and dead-letter handling for failed webhook deliveries.
  5. Audit integration access and rotate credentials regularly.
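
Retries with dead-letter handling (practice 4) follow a common pattern: retry with exponential backoff, then park the payload for later inspection instead of dropping it. The sketch is transport-agnostic; `send` is any callable that raises on failure, and the in-memory `dead_letters` list stands in for a durable queue. All names are hypothetical.

```python
import time


def deliver_with_retry(send, payload: dict, dead_letters: list,
                       attempts: int = 3, base_delay: float = 1.0,
                       sleep=time.sleep) -> bool:
    """Try to deliver payload; park it in dead_letters if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except Exception as error:  # a real integration would catch narrower errors
            last_error = error
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, then 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
    dead_letters.append({"payload": payload, "error": str(last_error)})
    return False
```

Injecting `sleep` makes the function testable without real delays; monitoring the dead-letter count covers the queue-length check in practice 2.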

📄

23. Reporting, Dashboards and Audit Evidence

Provide stakeholders with regular reports and audit evidence for DR tests and migrated servers.

Top 5 design recommendations

  1. Design standardized reports for DR readiness, test outcomes and security posture.
  2. Include secure score, incident trends and remediation status in executive reports.
  3. Automate report generation and distribution after each migration wave.
  4. Store audit evidence and runbook execution logs in a central, immutable store.
  5. Map reporting cadence to governance and compliance requirements.

Top 5 operational best practices

  1. Distribute post-test reports to stakeholders and include remediation plans.
  2. Keep an audit trail of who ran tests and who approved exceptions.
  3. Archive reports and evidence for compliance retention periods.
  4. Use automated dashboards for near-real-time status during cutovers.
  5. Review reporting templates annually and after major migrations.

📚

24. Training, Runbooks and Knowledge Transfer

Train operations and security teams on monitoring, playbooks and DR runbooks for migrated servers.

Top 5 design recommendations

  1. Create runbooks for onboarding, failover validation and incident response specific to migrated servers.
  2. Design role-based training for app owners, platform ops and SOC analysts.
  3. Include monitoring and Sentinel query examples in training materials.
  4. Plan knowledge transfer sessions aligned to migration waves.
  5. Maintain a runbook library with versioning and approval history.

Top 5 operational best practices

  1. Run hands-on training and tabletop exercises before each migration wave.
  2. Keep runbooks updated with lessons learned from tests and incidents.
  3. Record training sessions and make them available for on-demand refreshers.
  4. Validate runbook steps during live DR drills and update as needed.
  5. Track training completion for critical roles and enforce refresh cycles.

🔧

25. Troubleshooting, Diagnostics and Playbooks

Maintain troubleshooting guides and playbooks for common monitoring and DR issues on migrated servers.

Top 5 design recommendations

  1. Document common failure modes and diagnostic steps for replication, agent, and connectivity issues.
  2. Design playbooks for automated diagnostics (collect logs, run health checks, restart agents).
  3. Include escalation paths and contact lists in troubleshooting guides.
  4. Map diagnostics outputs to runbook remediation steps for rapid resolution.
  5. Keep troubleshooting artifacts and scripts in a secure, versioned repository.

Top 5 operational best practices

  1. Maintain a searchable knowledge base of past incidents and resolutions.
  2. Run periodic diagnostics drills to ensure playbooks and scripts work as expected.
  3. Automate common fixes where safe and include manual approval for risky actions.
  4. Review and update troubleshooting guides after each migration wave.
  5. Train on diagnostic tools and ensure access for on-call responders.

💻

PowerShell Quick Commands (Azure Monitor / Sentinel / Defender)

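Sample Az PowerShell commands to automate monitoring, Sentinel and Defender tasks for DR and migrated servers. Cmdlet names and required parameters differ between Az module releases, so confirm each one with Get-Help before running it.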

Azure Monitor (10)

Connect-AzAccount
Get-AzMetric -ResourceId <resourceId> -MetricName "Percentage CPU"
Get-AzOperationalInsightsWorkspace -ResourceGroupName <rg>
New-AzDiagnosticSetting -Name <name> -ResourceId <resId> -WorkspaceId <workspaceId> -Log <logSettings>
Add-AzMetricAlertRuleV2 -Name "ReplicationHealth" -ResourceGroupName <rg> -TargetResourceId <resId> -Condition <condition> -WindowSize <window> -Frequency <frequency> -Severity <severity>
Get-AzActionGroup -ResourceGroupName <rg>
New-AzActionGroup -Name <name> -ShortName <short> -Receiver <receiver>
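# <condition> is built with New-AzScheduledQueryRuleConditionObject -Query <kql> plus threshold settings
New-AzScheduledQueryRule -ResourceGroupName <rg> -Name "DR-Log-Alert" -Location <location> -Scope <workspaceId> -CriterionAllOf <condition> -Severity <severity> -WindowSize <window> -EvaluationFrequency <frequency>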
Get-AzMetric -ResourceId <resId>
Get-AzActivityLog -StartTime (Get-Date).AddHours(-1)
            

Microsoft Sentinel (10)

Install-Module -Name Az.SecurityInsights
Connect-AzAccount
Get-AzSentinelAlertRule -ResourceGroupName <rg> -WorkspaceName <ws>
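New-AzSentinelAlertRule -ResourceGroupName <rg> -WorkspaceName <ws> -Kind Scheduled -Enabled -DisplayName "DR-Failover-Alert" -Query <kql> -QueryFrequency <frequency> -QueryPeriod <period> -Severity "High" -TriggerOperator GreaterThan -TriggerThreshold 0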
Get-AzSentinelDataConnector -ResourceGroupName <rg> -WorkspaceName <ws>
New-AzSentinelDataConnector -ResourceGroupName <rg> -WorkspaceName <ws> -Name "AzureActivity"
Get-AzSentinelIncident -ResourceGroupName <rg> -WorkspaceName <ws>
New-AzSentinelWatchlist -ResourceGroupName <rg> -WorkspaceName <ws> -Name "MigratedServers" -Source <csv>
Invoke-AzSentinelPlaybook -ResourceGroupName <rg> -WorkspaceName <ws> -PlaybookName <name>
Get-AzSentinelAutomationRule -ResourceGroupName <rg> -WorkspaceName <ws>
            

Microsoft Defender (11)

Install-Module -Name Az.Security
Connect-AzAccount
Set-AzContext -Subscription <subscriptionId>; Register-AzResourceProvider -ProviderNamespace "Microsoft.Security"
Get-AzSecurityAlert -ResourceGroupName <rg>
Set-AzSecurityPricing -Name "VirtualMachines" -PricingTier "Standard"
Get-AzSecurityAssessment -ResourceGroupName <rg>
Invoke-AzSecurityAssessment -Name <assessmentName> -ResourceGroupName <rg>
Get-AzSecurityContact
Set-AzSecurityAutoProvisioningSetting -Name "default" -EnableAutoProvision
Get-AzSecurityTask -ResourceGroupName <rg>
Get-AzSecuritySecureScore -SubscriptionId <subscriptionId>
            

Notes

  1. Replace placeholders (<rg>, <ws>, <resId>, <subscriptionId>) with your values.
  2. Install the Az.Accounts, Az.Monitor, Az.SecurityInsights and Az.Security modules before running these commands.