
Microsoft's defences "amplified" a DDoS attack to cause global Azure "usage spike" outage

Posted by The Stack on July 31, 2024

An implementation error in Microsoft's protective mechanisms worsened the effect of a cyberattack, sparking a worldwide incident.

It started with a DDoS attack - which is exactly where the story should also have ended.

But on July 30th at about 11.45 am, something went wrong with Microsoft's cybersecurity defences.

When an attack triggered Microsoft's protection mechanisms, an implementation error meant the DDoS wasn't deflected.

Instead, it was "amplified", sparking a global outage that brought down Azure App Services, Application Insights, Azure IoT Central, Azure Log Search Alerts, Azure Policy, as well as the Azure portal itself and a subset of Microsoft 365 and Microsoft Purview services.

The worldwide Azure outage came less than two weeks after the CrowdStrike debacle, an incident that hit 8.5 million machines and was quickly hailed as the biggest IT outage of all time.

At first, the Azure outage was blamed on a mysterious "usage spike". Security researcher Kevin Beaumont was then among the first to call it.

"The Azure outage today is due to a DDoS attack," he wrote. "Microsoft needs to be more transparent about customer impacting DDoS attacks, since they haven't told you again."

Microsoft later admitted that he was right.

"While the initial trigger event was a Distributed Denial-of-Service (DDoS) attack, which activated our DDoS protection mechanisms, initial investigations suggest that an error in the implementation of our defences amplified the impact of the attack rather than mitigating it," it wrote in a status update and mitigation report.

"An unexpected usage spike resulted in Azure Front Door (AFD) and Azure Content Delivery Network (CDN) components performing below acceptable thresholds, leading to intermittent errors, timeout, and latency spikes."

How did Azure respond to the outage?

As soon as Microsoft understood the nature of this usage spike, it implemented networking configuration changes and carried out failover plans, switching to "alternate networking paths to provide relief."

These initial network configuration changes successfully mitigated the majority of the impact at just after 2pm UK time, although some customers reported less than 100% availability and had to wait until about 6pm, when Microsoft began mitigating their remaining issues.

A fix was first rolled out across regions in Europe and Asia Pacific. Once it had been validated and proven to have "eliminated the side effect impacts of the initial mitigation," it was rolled out across the Americas.

Failure rates returned to normal, pre-incident levels by 19:43 UTC.

"After monitoring traffic and services to ensure that the issue was fully mitigated, we declared the incident mitigated at 20:48 UTC," Microsoft wrote. "Some downstream services took longer to recover, depending on how they were configured to use AFD and/or CDN."

Azure is now completing an "internal retrospective to understand the incident in more detail" and will publish a Preliminary Post Incident Review (PIR) within roughly 72 hours, which will give more details on the incident and how Microsoft responded.

A Final Post Incident Review with "additional details and learnings" will be published in 14 days.

Microsoft's DDoS defences in detail

Microsoft describes its DDoS protections as "unique". "The cornerstone of Microsoft's DDoS strategy is global presence," it wrote. "Microsoft engages with Internet providers, peering providers (public and private), and private corporations all over the world. This engagement gives Microsoft a significant Internet presence and enables Microsoft to absorb attacks across a large surface area.

"As Microsoft's edge capacity has grown over time, the significance of attacks against individual edges has substantially diminished. Because of this decrease, Microsoft has separated the detection and mitigation components of its DDoS prevention system. Microsoft deploys multi-tiered detection systems at regional datacenters to detect attacks closer to their saturation points while maintaining global mitigation at the edge nodes. This strategy ensures that Microsoft services can handle multiple simultaneous attacks."

It describes reducing service attack surfaces as "one of the most effective and low-cost defences" and drops unwanted traffic at the network edge instead of "analysing, processing, and scrubbing data inline."

Microsoft uses special-purpose security devices for firewall, network address translation, and IP filtering functions at the interface with the public network, as well as global equal-cost multi-path (ECMP) routing - which is a network framework that ensures there are multiple global paths to a service.
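To illustrate the general idea behind ECMP, here is a minimal sketch of hash-based path selection. It is not Microsoft's implementation: the function name pick_path, the path labels, and the example addresses are all hypothetical, and real routers use hardware hash functions rather than SHA-256. The point is only that every packet in a flow hashes to the same path, while different flows spread across all available paths.

```python
import hashlib

def pick_path(flow, paths):
    """Pick one of several equal-cost paths by hashing the flow's 5-tuple.

    Hypothetical helper for illustration; not taken from any vendor's code.
    """
    # Join the 5-tuple fields into one byte string to hash.
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer, then map it to a path.
    index = int.from_bytes(digest[:8], "big") % len(paths)
    return paths[index]

# Hypothetical next hops and a flow using documentation-range IP addresses.
paths = ["edge-a", "edge-b", "edge-c"]
flow = ("203.0.113.7", 51234, "198.51.100.10", 443, "tcp")

chosen = pick_path(flow, paths)
```

If one path is withdrawn, calling pick_path with the shorter list still returns a valid path, which is the sense in which multiple paths give a service room to fail over.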

Microsoft's Azure DDoS Protection is "designed not only to withstand external attacks, but also attacks from other Azure tenants". It uses detection and mitigation techniques such as SYN cookies, rate limiting, and connection limits to protect against DDoS attacks.
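Rate limiting, one of the techniques named above, is often explained with a token bucket. The sketch below is a generic illustration under that assumption, not a description of how Azure DDoS Protection works internally; the class name TokenBucket and its parameters are hypothetical. Time is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (illustrative, not Azure's code)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0           # timestamp of the previous check

    def allow(self, now):
        """Return True if a request arriving at time `now` may proceed."""
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One token per second, bursts of up to two requests.
bucket = TokenBucket(rate=1, capacity=2)
burst = [bucket.allow(0.0) for _ in range(3)]  # three requests at once
later = bucket.allow(1.0)                      # one request a second later
```

Here the first two simultaneous requests pass and the third is dropped; a request one second later passes again because the bucket has refilled by one token.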

This time around, those defences fell short. The outage is a timely reminder that even rudimentary attacks can now have global consequences, particularly in a concentrated market. Azure is estimated to hold about 24% of the global cloud market and is therefore piled high in the stacks of businesses in all sorts of industries and locations.

"The Microsoft outage demonstrates the ease at which DDoS actors can wreak havoc against critical business services," said Donny Chong, Director of the DDoS-focused security firm Nexusguard. "Anyone can carry out an attack of this magnitude from their own bedroom if they have the right equipment."

Data from Nexusguard found that attack sizes increased by an average of 183% last year. It also suggested 81% of attacks are now shorter than 90 minutes in duration.

"This shows both the scale of the task at hand for stretched cybersecurity teams and that attacks are now more efficient than ever when inflicting disruption on businesses," Chong added.

"The tech community would benefit from more transparency on how many DDoS attacks companies thwart and how they mitigate them."

Source: https://www.thestack.technology/microsoft-ddos-attack-azure-outage/