On Sunday, 5th June 2016 at 15:25, an outage at Amazon Web Services in Sydney (Region: ap-southeast-2) impacted all Agileware-hosted websites. Website services were down for the duration of the outage. Agileware apologises to all our customers for the interruption to services.

The majority of our customers were back on-line by 17:20. The remaining customers were restored from our backup after AWS advised Agileware that the affected server could not be recovered. All services are now operational again.

We believe the cause of the outage was the massive storm that hit the Eastern States, and Sydney particularly hard. Amazon have not confirmed the exact cause beyond a “loss of power” and “hardware failure”. Given how resilient data centres are built to be these days, this must have been a very serious event.

There are lessons to be learnt from this outage, and we will be reviewing our procedures with a view to improving our resilience and our ability to respond more quickly.

If you have any questions or concerns, please call us or submit an issue at https://projects.agileware.com.au to discuss.

Related coverage in the IT News media:
http://www.smh.com.au/technology/technology-news/amazon-web-services-outage-causes-australian-website-chaos-20160605-gpc41p.html
http://www.itnews.com.au/news/aws-sydney-outage-downs-big-name-web-companies-420478
https://au.news.yahoo.com/technology/a/31770525/amazon-web-services-outage-catches-sydney-customers-short/
http://www.theregister.co.uk/2016/06/05/aws_oz_downed_by_weather/
http://www.lifehacker.com.au/2016/06/amazon-web-services-outage-brings-down-a-ton-of-websites-across-australia/

Details of the events and actions performed by Agileware are provided below.
05/06/2016, 15:25:
The outage started with no prior warning and was immediate. Our monitoring alerted Agileware to the issue.
Agileware requested updates from AWS Tech. Support on the status of the services.
05/06/2016, 17:20:
Majority of servers were back on-line
Two servers refused to restart
AWS Tech. Support advised that they were continuing to restart servers
05/06/2016, 19:35:
Two servers were still off-line
Agileware requested updates from AWS Tech. Support on status of last server
Agileware notified affected customers
05/06/2016, 20:05:
One of the two servers was restarted and came back on-line within 5 minutes
Agileware notified affected customers
Agileware requested updates from AWS Tech. Support on status of the last server
AWS Tech. Support advised that they were continuing to restart servers
05/06/2016, 20:30:
Agileware started the restore process from the backup taken on 04/06 at 10pm.
Agileware notified affected customers
06/06/2016, 5:00:
Agileware requested updates from AWS Tech. Support on status of last server
AWS Tech. Support advised that they were continuing to restart servers
06/06/2016, 5:30:
Agileware declared the last server unrecoverable
Replacement server deployed
Website re-installation process started
Agileware notified affected customers
06/06/2016, 8:15:
Agileware completed the installation process and all remaining websites were back on-line
Agileware notified affected customers
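
For reference, the downtime windows implied by the timeline above can be worked out directly from the timestamps (times are local Sydney time; this is simply arithmetic on the figures already given):

```python
from datetime import datetime

# Timestamps from the timeline above (local Sydney time)
outage_start = datetime(2016, 6, 5, 15, 25)       # outage began
majority_restored = datetime(2016, 6, 5, 17, 20)  # majority of servers back on-line
last_site_restored = datetime(2016, 6, 6, 8, 15)  # final websites re-installed

print(majority_restored - outage_start)   # downtime for most customers: 1:55:00
print(last_site_restored - outage_start)  # downtime for worst-affected customers: 16:50:00
```

So most customers saw just under two hours of downtime, while the worst-affected sites were out for almost seventeen hours.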

Amazon Web Services provided the following updates via their Status Page, http://status.aws.amazon.com
Jun 4, 10:47 PM PDT We are investigating increased connectivity issues for EC2 instances in the AP-SOUTHEAST-2 Region.
Jun 4, 11:08 PM PDT We continue to investigate connectivity issues for some instances in a single Availability Zone and increased API error rates for the EC2 APIs in the AP-SOUTHEAST-2 Region.
Jun 4, 11:49 PM PDT We can confirm that instances have experienced a power event within a single Availability Zone in the AP-SOUTHEAST-2 Region. Error rates for the EC2 APIs have improved and launches of new EC2 instances are succeeding within the other Availability Zones in the Region.
Jun 5, 12:31 AM PDT We have restored power to the affected Availability Zone and are working to restore connectivity to the affected instances.
Jun 5, 1:17 AM PDT We continue to work on restoring connectivity to all instances in the affected Availability Zone. Launches of new EC2 instances are now starting to succeed in the affected Availability Zone but we are experiencing delays in updating instance state data.
Jun 5, 2:47 AM PDT We continue to make progress in restoring connectivity to all instances in the affected Availability Zone. Launches of new EC2 instances are now succeeding in all Availability Zones and instance state data is no longer delayed in the affected Availability Zone.
Jun 5, 4:43 AM PDT Connectivity has been restored for the majority of the instances in the affected Availability Zone and the APIs continue to operate normally. The remaining impacted instances are taking longer to recover. For immediate recovery, we recommend customers that are able to, launch replacement instances.
Jun 5, 11:50 AM PDT On June 4th at 10:25 PM PDT a significant number of EC2 instances and EBS volumes within a single Availability Zone in the AP-SOUTHEAST-2 Region experienced a loss of power. Beginning at this same time, EC2 API calls in the AP-SOUTHEAST-2 Region experienced increased error rates and latencies as well as delays in propagation of instance state data in the affected Availability Zone. Instances and volumes in the other Availability Zones were unaffected. At 11:46 PM PDT, power was restored to the facility and instances and volumes started to recover. By June 5th at 1:00 AM PDT, 80% of the affected instances and volumes had been recovered by our automated systems. At 2:45 AM PDT the increased error rates and latencies for the EC2 APIs and the delayed propagation of instance state data were fully resolved. A couple of unexpected issues prevented our automated systems from recovering the remaining instances and volumes. The team was able to fix these issues, and by 8:00 AM PDT, all but a small number of instances and volumes were recovered. Since 8:00 AM PDT our teams have been working to recover these remaining instances and volumes. Most of these instances are hosted on hardware which was adversely affected by the loss of power. While we will continue to work to recover any affected instances or volumes, we recommend replacing any remaining affected instances or volumes if possible.
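
Note that the AWS status updates are timestamped in US Pacific daylight time (UTC-7), while the timeline above uses local Sydney time (AEST, UTC+10; no daylight saving in June). Converting AWS's reported power-loss time confirms the two accounts line up:

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))    # US Pacific daylight time
AEST = timezone(timedelta(hours=10))   # Sydney time in June (no daylight saving)

# AWS reported the loss of power at June 4th, 10:25 PM PDT
power_loss = datetime(2016, 6, 4, 22, 25, tzinfo=PDT)

print(power_loss.astimezone(AEST).strftime('%d/%m/%Y, %H:%M'))
# 05/06/2016, 15:25 -- matching the start of the Agileware timeline above
```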

Image by WiredforLego, CC-BY-2.0, https://flic.kr/p/aH79bV


About the author: Justin

Director at Agileware. Justin has been developing and supporting software since the 90s. A strong advocate of free software and consumer rights.

Contact us

The Agileware office is located in Canberra, providing services locally and around the world. Talk to us today and we'll help you find a solution that works for your organisation.