AWS critical outage
Incident Report for Callwell
Postmortem

The procedure we had in place to recover from a complete failure of our Availability Zone was successful. We can still do more to reduce the downtime when restoring to a different Availability Zone.

Posted Mar 01, 2021 - 13:47 GMT

Resolved
Duration
Start time: 2021-02-01 11:11:00 GMT
Resolution time: 2021-02-01 12:35:00 GMT
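
As a quick check on the figures above, the total outage length can be worked out from the two timestamps. This is only an illustrative sketch: it assumes both timestamps fall on 2021-02-01 GMT, and the helper name `outage_minutes` is hypothetical.

```python
from datetime import datetime

# Timestamp layout used in the Duration section of this report.
FMT = "%Y-%m-%d %H:%M:%S"


def outage_minutes(start: str, end: str) -> int:
    """Return the whole number of minutes between two timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    if delta.total_seconds() < 0:
        raise ValueError("resolution time is earlier than start time")
    return int(delta.total_seconds() // 60)


# Assumption: both times are on 2021-02-01 (GMT).
print(outage_minutes("2021-02-01 11:11:00", "2021-02-01 12:35:00"))  # 84
```

On these assumptions the outage lasted 84 minutes, of which the database restore accounted for about 40.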

Summary
There was a critical failure in our system architecture. AWS technical support confirmed immediately that this was a serious issue affecting the Availability Zone our database and server instances were running in. The instances were immediately restarted in a different Availability Zone and our procedure to move our database to our backup Availability Zone was initiated. This move took 40 minutes, followed by reconfiguring the system to use the new database location before the system was put back online.

Customers impacted
All customers

Services impacted
All services
All original emails to branches were unaffected
No enquiries were lost; however, most will have taken up to 5 hours after the event to get from Mailgun to Callwell
Branch email notifications temporarily disabled

Incident details
There was a critical failure on eu-west-2b on AWS that took down the Callwell RDS database, prevented it from being rebooted, and prevented any new databases from being restored from backups.

An investigation with the AWS engineering team started immediately and identified the issue before it was made public on the AWS status page.

As this was the same situation as last year, and one we had planned for, we immediately activated our procedure to restore our database to a different data center. This process takes approximately an hour, which means we can be back up and running in that time even if there is no resolution from AWS. After a 40-minute restore to the new database, we reconfigured the infrastructure to use the new location.
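
For readers who want a concrete picture of this kind of restore, here is a minimal sketch of restoring an RDS snapshot into a different Availability Zone. It is not Callwell's actual runbook. The helper names, snapshot and instance identifiers, and the target zone are all hypothetical. The client is assumed to be a boto3 RDS client, whose `restore_db_instance_from_db_snapshot` call and `db_instance_available` waiter are used here.

```python
from typing import Any, Dict


def build_restore_request(snapshot_id: str, new_instance_id: str,
                          target_az: str) -> Dict[str, Any]:
    """Build keyword arguments for restore_db_instance_from_db_snapshot."""
    if not target_az:
        raise ValueError("target_az must be set")
    return {
        "DBSnapshotIdentifier": snapshot_id,
        "DBInstanceIdentifier": new_instance_id,
        # Pin the new instance to a specific zone; RDS does not allow
        # AvailabilityZone together with MultiAZ=True.
        "AvailabilityZone": target_az,
        "MultiAZ": False,
    }


def restore_to_other_az(rds_client: Any, failed_az: str, target_az: str,
                        snapshot_id: str, new_instance_id: str) -> str:
    """Restore a snapshot into target_az and wait until it is available.

    rds_client is assumed to be created elsewhere, for example with
    boto3.client("rds", region_name="eu-west-2").
    """
    if target_az == failed_az:
        raise ValueError("target zone must differ from the failed zone")
    request = build_restore_request(snapshot_id, new_instance_id, target_az)
    rds_client.restore_db_instance_from_db_snapshot(**request)
    # Block until the restored instance reports the "available" state.
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=new_instance_id)
    return new_instance_id
```

Once the new instance is available, the application still has to be pointed at its endpoint, which corresponds to the reconfiguration step described above.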

The Callwell primary database was restored from backup onto eu-west-2b which brought the system back online for all customers.
Posted Feb 01, 2021 - 11:11 GMT