The outage that crashed Microsoft’s cloud services earlier this month was caused by one corrupted file in the company’s DNS service.
The problem stemmed from an update to a load-balancing tool that went awry, leaving users around the world unable to sign in to mail and document accounts on Microsoft’s Office 365, Hotmail and SkyDrive.
“A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption,” said Arthur de Haan, vice president for Windows Live Test and Service Engineering, on a company blog.
According to de Haan, services were restored within an hour and a half, but he admitted the fix took more time to replicate around the globe, so some users had to wait longer.
“The file corruption was a result of two rare conditions occurring at the same time. The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string - the software was unable to parse an incorrectly constructed line in the configuration file.
“The second condition was related to how the configuration is synchronised across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.”
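The two conditions de Haan describes, a parser that chokes on a malformed line and a synchronisation step that spreads the result, can be illustrated with a short sketch. This is not Microsoft's firmware or its configuration format, neither of which has been published. The `key = value` syntax, the `parse_config` and `sync_config` names, and the use of plain dictionaries as stand-ins for DNS nodes are all assumptions made for illustration. The idea shown is simply that a file is validated in full before any copy is replicated, so a bad line is rejected rather than distributed.

```python
from typing import Dict, List


class ConfigError(ValueError):
    """Raised when a configuration line cannot be parsed."""


def parse_config(text: str) -> Dict[str, str]:
    """Parse hypothetical 'key = value' lines; fail loudly on a malformed one."""
    settings: Dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ConfigError(f"line {lineno}: malformed entry {raw!r}")
        settings[key] = value
    return settings


def sync_config(text: str, replicas: List[Dict[str, str]]) -> bool:
    """Copy a config to every replica only if the whole file parses."""
    try:
        settings = parse_config(text)
    except ConfigError:
        return False  # every replica keeps its last known-good settings
    for replica in replicas:
        replica.clear()
        replica.update(settings)
    return True
```

In this sketch a file containing a line such as `weight 10`, with no equals sign, makes `sync_config` return `False` and leaves every replica untouched, while a well-formed file is applied to all of them.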
The company said it had reviewed the failure and would be “hardening the DNS service” and improving the recovery process, a change that “will decrease the time it takes to resolve outages”.
This article originally appeared at pcpro.co.uk