My blogs are all about basic software testing concepts. I am trying to contribute some knowledge to freshers and graduates who aspire to have a career in software testing.
Wednesday, December 16, 2015
Microsoft's December 3 Office 365 outage: What went wrong
Just under two weeks ago, a number of Office 365 customers in Europe were hit by an outage that lasted several hours.
Microsoft officials recently shared some of the behind-the-scenes
details regarding what went wrong via an incident report available from
the Office 365 Dashboard. I first discovered the report, PIR IS3496,
thanks to a blog post from Tony Redmond on the Windows IT Pro site.
(I wasn't able to find the incident report in my own Office 365
Dashboard, for whatever reason, but I did get to look at a full copy of
it.)
According to that report, the December 3 Office 365 outage
lasted approximately four hours, starting at roughly 9 a.m. UTC. The
report acknowledged that "many customers served from the European region
were affected by this issue." And some customers from other regions who
authenticated through Europe also may have experienced problems that
day.
"Approximately 1% of Outlook and 35% of OOTW (Outlook on the Web) requests were impacted," the report noted.
"Affected users were unable to sign in to the Office 365 portal.
Additionally, some users were unable to access Office 365 services,
including the SharePoint Online service, Power BI, Microsoft Intune,
Yammer, and Exchange Online. For Exchange Online, Outlook on the web
(OOTW) users experienced the most impact, but impact to Outlook and
Exchange ActiveSync (EAS) mobile devices was minimal," the incident
report says.
Access to the Service Health Dashboard was also hit. Even though Microsoft has
a backup "Emergency Broadcast System" (EBS), customers from the
European region were unable to see updates to this page due to an EBS failure.
At its root, the December 3rd outage was a sign-in/identity problem. The cause was two-fold, the Softies said:
"1. A recent update exposed a configuration problem between the
production and pre-production authentication infrastructure. This
resulted in some requests being misrouted and creating a backlog of
authentication requests on the Azure Active Directory (AAD) front ends.
2. The backlog of misrouted requests in AAD had a cascading
effect that resulted in high system resource utilization, which further
compounded the problem as traffic increased during normal business hours
in the European region. This led to intermittent authentication request
failures within the European Data Centers and caused failures in the
AAD authentication service, which resulted in impact to Office 365
services."
What's next for Microsoft to prevent similar issues moving forward? The company plans to add additional fault-injection
techniques to improve its testing procedures, as well as additional
fallback mechanisms to allow it to use an older version of the
authentication service, the report said.
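
For readers newer to testing, here is a minimal sketch of what fault injection combined with a fallback to an older service version could look like. It is an illustration only and not Microsoft's implementation: the names `FaultInjector`, `authenticate_current`, `authenticate_previous` and `sign_in` are all hypothetical, and the "services" are plain functions standing in for real authentication back ends.

```python
import random


class AuthServiceError(Exception):
    """Raised when the (hypothetical) authentication backend fails."""


class FaultInjector:
    """Wraps a callable and makes it fail at a configurable rate."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
        # fails and a rate of 0.0 never fails.
        if self.rng.random() < self.failure_rate:
            raise AuthServiceError("injected fault")
        return self.func(*args, **kwargs)


def authenticate_current(user):
    # Hypothetical "new" version of the sign-in service.
    return f"{user}:token-v2"


def authenticate_previous(user):
    # Hypothetical older version kept around as a fallback.
    return f"{user}:token-v1"


def sign_in(user, primary, fallback):
    """Try the primary service; fall back to the older one on failure."""
    try:
        return primary(user)
    except AuthServiceError:
        return fallback(user)


if __name__ == "__main__":
    # Force every call to the current version to fail, then check that
    # sign-in still succeeds through the fallback path.
    always_failing = FaultInjector(authenticate_current, failure_rate=1.0)
    print(sign_in("alice", always_failing, authenticate_previous))
```

The point of the test is that the failure is deliberate: by wrapping the primary service in an injector, a tester can confirm the fallback path actually works before a real outage exercises it.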
To thwart the potential
for misrouted requests caused by high CPU utilization, Microsoft plans
to add more overload detection and recovery mechanisms and to improve
isolation across service endpoints to head off cascading failures, the
report added.
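
As a rough illustration of the overload detection idea, the sketch below shows a toy front end that rejects new requests once its backlog reaches a limit, rather than letting the queue grow without bound. The report does not describe how Microsoft implements this; the class `BoundedFrontEnd`, its `max_backlog` parameter and the `OverloadedError` exception are assumptions made up for this example.

```python
from collections import deque


class OverloadedError(Exception):
    """Raised when the front end refuses new work to protect itself."""


class BoundedFrontEnd:
    """Toy front end that sheds load once its backlog reaches a limit."""

    def __init__(self, max_backlog):
        self.max_backlog = max_backlog
        self.backlog = deque()
        self.rejected = 0

    def submit(self, request):
        # Overload detection: reject early instead of queueing without bound.
        if len(self.backlog) >= self.max_backlog:
            self.rejected += 1
            raise OverloadedError(f"backlog full, rejecting {request!r}")
        self.backlog.append(request)

    def process(self, n):
        """Process up to n queued requests and return them, oldest first."""
        done = []
        while self.backlog and len(done) < n:
            done.append(self.backlog.popleft())
        return done


if __name__ == "__main__":
    front_end = BoundedFrontEnd(max_backlog=3)
    for i in range(5):
        try:
            front_end.submit(f"req-{i}")
        except OverloadedError as err:
            print(err)
    # Draining part of the backlog lets the front end accept work again.
    print(front_end.process(2))
    front_end.submit("req-5")
    print(len(front_end.backlog), front_end.rejected)
```

Failing fast in this way trades a few rejected requests for a front end that stays responsive, which is one common way to keep a backlog from cascading into a wider failure.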
As an incorrect content delivery network (CDN) link prevented users from seeing updates on http://status.office.com, Microsoft plans to review its switchover options for cases when access to the Office 365 portal is impacted.
Microsoft's report lists the completion date for all of these next steps as "December 2015."
I asked Microsoft officials if users affected by the December 3 outage
will be compensated in some way and was told the company had no comment.
I'm also curious why Microsoft made this post-mortem available
as a dashboard report rather than as a publicly facing blog post, as it
has in previous cases of Office 365 and Azure outages. Again, no comment from the company on that, either.