Wednesday, December 16, 2015

Microsoft's December 3 Office 365 outage: What went wrong

Just under two weeks ago, a number of Office 365 customers in Europe were hit by an outage that lasted several hours.
Microsoft officials recently shared some of the behind-the-scenes details regarding what went wrong via an incident report available from the Office 365 Dashboard. I first discovered the report, PIR IS3496, thanks to a blog post from Tony Redmond on the Windows IT Pro site. (I wasn't able to find the incident report in my own Office 365 Dashboard, for whatever reason, but I did get to look at a full copy of it.)
According to that report, the December 3 Office 365 outage lasted approximately four hours, starting at roughly 9 a.m. UTC. The report acknowledged that "many customers served from the European region were affected by this issue." And some customers from other regions who authenticated through Europe also may have experienced problems that day.
"Approximately 1% of Outlook and 35% of OOTW (Outlook on the Web) requests were impacted," the report noted.
"Affected users were unable to sign in to the Office 365 portal. Additionally, some users were unable to access Office 365 services, including the SharePoint Online service, Power BI, Microsoft Intune, Yammer, and Exchange Online. For Exchange Online, Outlook on the web (OOTW) users experienced the most impact, but impact to Outlook and Exchange ActiveSync (EAS) mobile devices was minimal," the incident report says.

Access to the Service Health Dashboard was hit as well. Even though Microsoft has a backup "Emergency Broadcast System" (EBS), customers from the European region were unable to see updates to this page due to an EBS failure.
At its root, the December 3 outage was a sign-in/identity problem. The cause was two-fold, the Softies said:
"1. A recent update exposed a configuration problem between the production and pre-production authentication infrastructure. This resulted in some requests being misrouted and creating a backlog of authentication requests on the Azure Active Directory (AAD) front ends.
2. The backlog of misrouted requests in AAD had a cascading effect that resulted in high system resource utilization, which further compounded the problem as traffic increased during normal business hours in the European region. This led to intermittent authentication request failures within the European Data Centers and caused failures in the AAD authentication service, which resulted in impact to Office 365 services."
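To make the cascading effect a bit more concrete, here is a toy model in Python. It is only a sketch of the general queueing idea, not anything taken from Microsoft's report: the function name, the traffic numbers, the 20% misrouting share and the extra cost per misrouted request are all made-up assumptions for illustration.

```python
def simulate_backlog(arrivals, capacity, misrouted_percent, extra_cost):
    """Return the backlog (in work units) after each tick.

    arrivals: request counts per tick (hypothetical numbers)
    capacity: work units a front end can process per tick (hypothetical)
    misrouted_percent: share of requests that get misrouted (hypothetical)
    extra_cost: extra work units each misrouted request consumes (hypothetical)
    """
    backlog = 0
    history = []
    for n in arrivals:
        # Every request costs one unit; misrouted ones cost extra.
        work = n + (n * misrouted_percent * extra_cost) // 100
        # Whatever cannot be processed this tick carries over to the next.
        backlog = max(0, backlog + work - capacity)
        history.append(backlog)
    return history


# Traffic ramps up as the business day starts.
ramp = [60, 80, 100, 120]
healthy = simulate_backlog(ramp, capacity=130, misrouted_percent=0, extra_cost=0)
degraded = simulate_backlog(ramp, capacity=130, misrouted_percent=20, extra_cost=3)
print(healthy)   # [0, 0, 0, 0]
print(degraded)  # [0, 0, 30, 92]
```

With these invented numbers, the healthy system keeps up all morning, while the one with misrouted requests copes at low traffic and then falls further behind with every tick once load rises. That is the same shape as the report's description of a backlog compounding as business hours began.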
What's next for Microsoft to prevent similar issues moving forward?
The company plans to add more fault-injection techniques to improve its testing procedures, as well as more fallback mechanisms that would allow it to use an older version of the authentication service, the report said.
To thwart the potential for misrouted requests caused by high CPU utilization, Microsoft plans to add more overload detection and recovery mechanisms and to improve isolation across service endpoints to head off cascading failures, the report added.
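The report doesn't say how Microsoft's overload detection will work, so the following is only a generic illustration of one common approach, often called load shedding: cap the queue and reject new requests once it is full, rather than letting the backlog grow without limit. The class name, method names and queue limit here are hypothetical and are not drawn from the report or from any Microsoft API.

```python
from collections import deque


class BoundedFrontEnd:
    """Toy front end that sheds new requests once its queue is full."""

    def __init__(self, max_queue):
        self.max_queue = max_queue  # hypothetical overload threshold
        self.queue = deque()
        self.rejected = 0

    def submit(self, request):
        """Queue a request, or reject it if the backlog is at the limit."""
        if len(self.queue) >= self.max_queue:
            self.rejected += 1
            return False
        self.queue.append(request)
        return True

    def drain(self, n):
        """Process up to n queued requests, oldest first."""
        done = []
        for _ in range(min(n, len(self.queue))):
            done.append(self.queue.popleft())
        return done


front_end = BoundedFrontEnd(max_queue=3)
accepted = [front_end.submit(f"req-{i}") for i in range(5)]
print(accepted)            # [True, True, True, False, False]
print(front_end.rejected)  # 2
```

Rejecting quickly keeps the work already in the queue bounded, so the front end can recover as soon as it drains, and callers get a fast failure they can retry elsewhere instead of a slow timeout.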
As an incorrect content delivery network (CDN) link prevented users from seeing updates on http://status.office.com, Microsoft plans to review its switchover options for cases when access to the Office 365 portal is impacted.
Microsoft's report lists the completion date for all of these next steps as "December 2015."
I asked Microsoft officials if users affected by the December 3 outage will be compensated in some way and was told the company had no comment.
I'm also curious why Microsoft made this post-mortem available as a dashboard report rather than as a publicly facing blog post, as it has in previous cases of Office 365 and Azure outages. Again, no comment from the company on that, either.
