SDx2 unavailable due to DNS issues

Incident Report for HxGN SDx2

Postmortem

What happened?

Between 15:45 UTC on 29 October and 00:05 UTC on 30 October 2025, customers and Microsoft services leveraging Azure Front Door (AFD) may have experienced increased latency, timeouts, and errors.

Affected Azure services include, but are not limited to: App Service, Azure Active Directory B2C, Azure Communication Services, Azure Databricks, Azure Healthcare APIs, Azure Maps, Azure Portal, Azure SQL Database, Azure Virtual Desktop, Container Registry, Media Services, Microsoft Copilot for Security, Microsoft Defender External Attack Surface Management, Microsoft Entra ID (Mobility Management Policy Service, Identity & Access Management, and User Management UX), Microsoft Purview, Microsoft Sentinel (Threat Intelligence), and Video Indexer.

Customer configuration changes to AFD remain temporarily blocked. We will notify customers once this block has been lifted. While error rates and latency are back to pre-incident levels, a small number of customers may still be seeing issues and we are still working to mitigate this long tail. Updates will be provided directly via Azure Service Health.

What went wrong and why?

An inadvertent tenant configuration change within Azure Front Door (AFD) triggered a widespread service disruption affecting both Microsoft services and customer applications dependent on AFD for global content delivery. The change introduced an invalid or inconsistent configuration state that caused a significant number of AFD nodes to fail to load properly, leading to increased latencies, timeouts, and connection errors for downstream services.

As unhealthy nodes dropped out of the global pool, traffic distribution across healthy nodes became imbalanced, amplifying the impact and causing intermittent availability even for regions that were partially healthy. We immediately blocked all further configuration changes to prevent additional propagation of the faulty state and began deploying a ‘last known good’ configuration across the global fleet. Recovery required reloading configurations across a large number of nodes and rebalancing traffic gradually to avoid overload conditions as nodes returned to service. This deliberate, phased recovery was necessary to stabilize the system while restoring scale and ensuring no recurrence of the issue.
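The gradual rebalancing described above can be illustrated with a minimal sketch. This is not AFD's actual mechanism: the function name `ramp_weights`, the linear ramp, and the node names are all hypothetical, chosen only to show how recovering nodes might be given a growing share of traffic instead of full load at once.

```python
def ramp_weights(healthy, recovering, step, total_steps):
    """Return each node's share of traffic during a phased recovery.

    Hypothetical model: healthy nodes carry full weight (1.0), while
    recovering nodes ramp linearly from 0.0 to 1.0 over total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    # Clamp the ramp fraction to the range [0.0, 1.0].
    fraction = min(max(step / total_steps, 0.0), 1.0)

    raw = {node: 1.0 for node in healthy}
    raw.update({node: fraction for node in recovering})

    total = sum(raw.values())
    if total == 0:
        raise ValueError("no node is able to take traffic")

    # Normalize so that the shares sum to 1.0.
    return {node: weight / total for node, weight in raw.items()}


if __name__ == "__main__":
    for step in range(5):
        print(step, ramp_weights(["node-a", "node-b"], ["node-c"], step, 4))
```

In this sketch, the recovering node takes no traffic at step 0 and an equal share only at the final step, which is the property that avoids overloading a node as it returns to service.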

The trigger was traced to a faulty tenant configuration deployment process. The protection mechanisms intended to validate and block erroneous deployments failed because of a software defect, which allowed the deployment to bypass safety validations. Safeguards have since been reviewed, and additional validation and rollback controls have been implemented to prevent similar issues in the future.
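As a rough illustration of the kind of safeguard described above, the sketch below pairs a validation gate with a 'last known good' rollback. It is not the hosting provider's implementation: the `validate` rules, the `ConfigDeployer` class, and the config shape (a dict with a `routes` list) are assumptions made purely for the example.

```python
def validate(config):
    """Return a list of problems found in a candidate config (hypothetical rules)."""
    errors = []
    routes = config.get("routes") or []
    if not routes:
        errors.append("config must define at least one route")
    for route in routes:
        if not route.get("origin"):
            errors.append(f"route {route.get('path', '?')!r} has no origin")
    return errors


class ConfigDeployer:
    """Keeps the active config and a last-known-good copy to roll back to."""

    def __init__(self, initial):
        if validate(initial):
            raise ValueError("initial config is invalid")
        self.active = initial
        self.last_known_good = initial

    def deploy(self, candidate):
        """Activate the candidate only if it passes validation."""
        if validate(candidate):
            return False  # rejected; the active config is left untouched
        self.active = candidate
        return True

    def mark_healthy(self):
        """Promote the active config once it has proven healthy."""
        self.last_known_good = self.active

    def rollback(self):
        """Restore the last config that was marked healthy."""
        self.active = self.last_known_good


if __name__ == "__main__":
    good = {"routes": [{"path": "/", "origin": "app.example.net"}]}
    bad = {"routes": [{"path": "/"}]}
    deployer = ConfigDeployer(good)
    print(deployer.deploy(bad))  # the invalid candidate is rejected
```

The point of the design is that validation runs on every deployment path, and a config becomes the rollback target only after it has been explicitly marked healthy.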

What happens next?

Our team will complete an internal retrospective to understand the incident in more detail. Once the retrospective is complete, generally within 14 days, we will publish a final Post Incident Review (PIR) to all impacted customers.

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs

The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources. For guidance on implementing monitoring to understand granular impact, refer to https://aka.ms/AzPIR/Monitoring

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

Posted Oct 30, 2025 - 11:06 UTC

Resolved

This incident has been resolved.

Posted Oct 30, 2025 - 11:04 UTC

Monitoring

The hosting provider has completed the rollback to the last known good state. You should begin to see a gradual return to normal functionality. However, please note that you may experience slower performance during this time.

Posted Oct 29, 2025 - 21:12 UTC

Update

Our hosting platform is still experiencing DNS-related issues, and the rollback to the last known good state is still in progress. The hosting provider continues to work diligently to restore services as quickly as possible. We will provide another update within the next 60 minutes or as soon as we have more information to share.

Posted Oct 29, 2025 - 20:12 UTC

Update

Posted Oct 29, 2025 - 18:57 UTC

Identified

Our hosting platform continues to experience DNS-related issues, which have resulted in a loss of availability of some services. The hosting provider is currently performing a rollback to the last known good state. We do not currently have an ETA for when the rollback will be completed. We will provide another update within 60 minutes.

Posted Oct 29, 2025 - 17:58 UTC

This incident affected: West Europe (SDx2, Visualization Services, Integration 2.0, Data Take On), Southeast Asia (SDx2, Visualization Services, Integration 2.0, Data Take On), Australia East (SDx2, Visualization Services, Integration 2.0, Data Take On), Central Canada (SDx2, Visualization Services, Integration 2.0, Data Take On), Central US (SDx2, Visualization Services, Integration 2.0, Data Take On), and UAE North (SDx2, Visualization Services, Integration 2.0, Data Take On).