Brief overview:
We had a total downtime of 6 hours and 47 minutes, which impacted all US-hosted customers and took all of their communities completely offline.
This occurred due to a critical failure in our US-based infrastructure. Mitigating actions were taken late due to lapses in the incident procedures in place.
The end result was that between 00:00 and 06:47 UTC (01:00-07:47 CET; 4:00pm-10:47pm Pacific), all users were faced with an unbranded, technical 502 error message.
This post-mortem gives an overview of the timeline, the incident breakdown, the actions taken and the actions planned.
Detailed Incident Timeline (CET):
01:00 : Start - the certificate expired and the alarm went off
01:01 : Noticed by non-engineering staff; the incident procedure was kicked off immediately via a phone call to the on-call engineer
03:20 : First response from engineering - the problem was diagnosed immediately. However, due to insufficient platform permissions, the responding engineer could not rectify the problem and had to escalate once more
07:47 : A senior engineer responded to the escalation and resolved the incident by renewing the certificate
Incident breakdown:
Problem 1: All US-hosted communities went down at 1AM CET
At 01:00 CET the SSL certificate for a load balancer expired, which caused all services to go down and become unable to connect to each other.
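As an illustration of the kind of safeguard that addresses this root cause, below is a minimal, hypothetical sketch of an automated expiry check. It is not our actual tooling; the hostname and alerting threshold are placeholders.

    #!/usr/bin/env python3
    # Illustrative only: a minimal certificate-expiry check, not our production tooling.
    # It connects to a host, reads the certificate's expiry date and exits non-zero
    # when fewer than WARN_DAYS remain, so a scheduled job or monitor can raise an
    # alert long before the certificate actually lapses.
    import socket
    import ssl
    import sys
    from datetime import datetime, timezone

    HOST = "us-lb.example.com"  # placeholder load balancer hostname
    PORT = 443
    WARN_DAYS = 21              # placeholder alerting threshold

    def days_until_expiry(host: str, port: int) -> int:
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # 'notAfter' is a string such as 'Jun  1 12:00:00 2025 GMT'
        expiry = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), timezone.utc
        )
        return (expiry - datetime.now(timezone.utc)).days

    if __name__ == "__main__":
        remaining = days_until_expiry(HOST, PORT)
        print(f"{HOST}: certificate valid for {remaining} more days")
        sys.exit(1 if remaining < WARN_DAYS else 0)

Run regularly from a scheduler or monitoring system, a check like this turns certificate expiry from a sudden outage into a routine renewal task.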
Problem 2: Our incident response mechanism did not work and the on-call engineers did not respond to the call
Non-engineering staff noticed the problem and initiated the Incident Response procedure. Due to misconfigured phones and systems, the on-call engineers were not reached.
Problem 3: The status page was not updated
Even though the issue was noted by staff within minutes, the status page was not updated until the incident was resolved.
None of the people involved in the initial escalation had the permissions required to publish anything on the status page. None of the support team received any notifications, as there is no out-of-hours notification procedure for support staff, only for engineers.
Actionable Items:
We will enact the following actionable items in the next few days. This is not an exhaustive list; further measures will be taken in the weeks following.
1. Quarterly tests of the out-of-hours incident process - to ensure that the process performs as expected
Summary:
We are deeply sorry for this unacceptable lapse in service. While we at inSided realise that incidents happen, in this case our out-of-hours response was simply not good enough. US customers were left with no service, no acknowledgement from us and no way to contact us. We are treating this with the highest possible priority and will of course learn from it to prevent any repetition in the future. We have outlined a number of points we can act on quickly to address a significant point of failure; beyond that, we need to improve our internal incident escalation process so that we respond properly should anything like this happen again out of hours.