Google Publishes Outage Analysis: SRE Configuration Change Overloaded System

via: 博客园     time: 2019/3/15 17:02:54

Google has released an analysis of the large-scale service outage of March 12, pointing out that a change made by its SRE team overloaded part of the system and raised the error rate of Google Cloud Storage. On March 12, many users around the world reported problems with Gmail, YouTube, Google Drive, Google Music and other Google services.

Parts of North America, South America, Europe and Asia were affected, and Google subsequently acknowledged the failure. The Google Cloud Status Dashboard showed that the failure affected all regions of Google Cloud Storage.

On the 14th, local time, Google released an analysis of the incident.

Google said its internal blob storage service suffered a service disruption lasting four hours and ten minutes, and set out the root cause. On March 11, the storage resources used for metadata by the internal blob service increased significantly. On March 12, to reduce resource usage, SRE made a configuration change. As a side effect, the change overloaded a key part of the system that looks up the location of blob data, and the increased load eventually led to a cascading failure.

More specifically, from 18:40 to 22:50 on March 12, the error rate of Google's internal blob storage service rose, averaging 20% and peaking at 31% during the incident. User-visible Google services that use the blob storage service, including Gmail, Photos and Google Drive, also saw elevated error rates. The caching and redundancy built into these services greatly reduced the user impact; without them, the consequences would have been more serious.
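The reported time window can be cross-checked against the stated duration of four hours and ten minutes. The following is a minimal sketch, assuming both clock times fall on the same calendar day (the article implies this but does not state a time zone):

```python
from datetime import datetime

# Start and end of the elevated-error window, as reported in the article.
# Assumption: both times fall on the same calendar day.
start = datetime.strptime("18:40", "%H:%M")
end = datetime.strptime("22:50", "%H:%M")

# Split the elapsed time into whole hours and leftover minutes.
outage = end - start
hours, remainder = divmod(int(outage.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} h {minutes} min")  # prints "4 h 10 min"
```

The result agrees with the duration Google gave for the disruption.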

The major impacts of the incident include the following. Google Cloud Storage experienced high long-tail latency and an average error rate of 4.8%; all bucket locations and storage classes were affected, as were Google Cloud Platform services that depend on Cloud Storage. Stackdriver Monitoring saw an error rate of up to 5% when retrieving historical time series data, while recent time series data remained available and alerting was unaffected. App Engine's Blobstore API experienced elevated latency and an error rate that peaked at 21% when fetching blob data, and App Engine deployments saw errors peaking at 90%. Serving static files from App Engine also experienced an elevated error rate.

Google said that the impact on services outside Google Cloud Platform will be covered in a separate incident report.

Google apologized to customers whose services and applications were affected by the incident and said it is taking measures to improve availability and prevent such outages from happening again.

For details, see:

https://status.cloud.google.com/incident/storage/19002
