Brief overview:
We had a total downtime of 6 hours and 47 minutes, which impacted all US-hosted customers and took all of their communities completely offline.
This occurred due to a critical failure in our US-based infrastructure. Mitigating actions were taken late due to lapses in the incident procedures in place.
The end result was that between 00:00 and 06:47 UTC (01:00-07:47 CET; 4:00pm-10:47pm Pacific), all users were faced with an unbranded, technical 502 error message.
This post-mortem gives an overview of the timeline, the incident breakdown, the actions taken and the actions planned.
Detailed Incident Timeline (CET):
01:00 : Start - the certificate expired and the alarm went off
01:01 : Noticed by non-engineering staff; the incident procedure was kicked off immediately via a phone call to the on-call engineer
03:20 : First response from engineering - the problem was diagnosed immediately. However, due to insufficient platform permissions, the responding engineer could not rectify the problem and had to escalate once more
07:47 : A senior engineer responded to the escalation and resolved the incident by renewing the certificate
Incident breakdown:
Problem 1: All US-hosted communities went down at 1AM CET
At 01:00 CET the SSL certificate for a load balancer expired, which caused all services to go down and become unable to connect to each other.
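As an illustration of the kind of safeguard that addresses this root cause, below is a minimal, hypothetical sketch of an automated expiry check. It is not our actual tooling; the hostname and alerting threshold are placeholders.

    #!/usr/bin/env python3
    # Illustrative only: a minimal certificate-expiry check, not our production tooling.
    # It connects to a host, reads the certificate's expiry date and exits non-zero
    # when fewer than WARN_DAYS remain, so a scheduled job or monitor can raise an
    # alert long before the certificate actually lapses.
    import socket
    import ssl
    import sys
    from datetime import datetime, timezone

    HOST = "us-lb.example.com"  # placeholder load balancer hostname
    PORT = 443
    WARN_DAYS = 21              # placeholder alerting threshold

    def days_until_expiry(host: str, port: int) -> int:
        context = ssl.create_default_context()
        with socket.create_connection((host, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # 'notAfter' is a string such as 'Jun  1 12:00:00 2025 GMT'
        expiry = datetime.fromtimestamp(
            ssl.cert_time_to_seconds(cert["notAfter"]), timezone.utc
        )
        return (expiry - datetime.now(timezone.utc)).days

    if __name__ == "__main__":
        remaining = days_until_expiry(HOST, PORT)
        print(f"{HOST}: certificate valid for {remaining} more days")
        sys.exit(1 if remaining < WARN_DAYS else 0)

Run regularly from a scheduler or monitoring system, a check like this turns certificate expiry from a sudden outage into a routine renewal task.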
Problem 2: Our incident response mechanism did not work and the on-call engineers did not respond to the call
Non-engineering staff noticed the problem and initiated the Incident Response procedure. Due to misconfigured phones and systems, the on-call engineers were not reached.
Problem 3: The status page was not updated
Even though the issue was noted by staff within minutes, the status page was not updated until the incident was resolved.
None of the people involved in the initial escalation had the permissions required to publish anything on the status page. None of the support team received any notifications, as there is no out-of-hours notification procedure for support staff, only for engineers.
Actionable Items:
We will enact the following actionable items in the next few days. This is not an exhaustive list; further measures will be taken in the weeks following.
1. Quarterly tests of the out-of-hours incident process - to ensure that the process performs as expected
Summary:
We are deeply sorry for this unacceptable lapse in service. While we at inSided realise that incidents happen, in this case our out-of-hours response was simply not good enough. US customers were left with no service, no acknowledgement from us and no way to contact us. We are treating this with the highest possible priority and will of course learn from it to prevent any repetition in the future. We have outlined a number of points we can act on quickly to address a significant point of failure; beyond that, we need to improve our internal incident escalation process so that we respond properly should anything like this happen again out of hours.