Communities are currently not accessible

Incident Report for CC Status Page

Postmortem

September 13th - Platform Outage (all regions) RCA

‌

Issue

Every user accessing the platform community pages were greeted with an error message instead of the requested community page.

This then resulted in a situation where no community functions could be performed or the content viewed which then corresponded to a complete global outage across both regions.

‌

Cause

A code change to the platform related to another issue was deployed which then caused this issue to occur.

In short; a code comment was not being parsed correctly by the production server and as a result it created a fatal front end error for every customer and community across the EU and US region.

Why did our automated tests not pick this up?

End to End tests are normally run against a Test Instance, which did not contain the problematic change.
We identified insufficient automatic tests/lints against these particular front end templates (tests that would have picked this issue up) and as a result this issue managed to slip through and be deployed to the production servers.

‌

Resolution

The issue was swiftly mitigated by reverting the code deploy that was causing the issue, however this led to a 27 minute period where the communities were not available and this error message displayed to all users instead.

‌

Prevention steps

Add linter for template files to be able to sense these rendering issues in future - ensure that we are receiving a status 200 (ok) code for page loads.
Additional health checks for most critical staging resources
- Standard 5-10 checks on the most critical pages to ensure that the correct response codes are being received back before giving the option to approve for production
- Add these checks to the current deployment pipeline that is currently being worked on
Reinforce checking staging consistently before approving deployment to production
- More stringent visual checks e.g on staging before the button for production deployment is pressed.
- More stringent logging checks e.g on staging before the button for production deployment is pressed.

‌

Timeline

16:42 this was escalated to critical incident internally and a status page update was posted
16:43 the incident was recognised by engineering and the process for initiating a rollback was then started immediately
16:52 the rollback was triggered once the specific deployment causing the issue was located
17:07 EU rollback completed and communities were back online
17:09 US rollback completed and communities were back online

Posted Sep 26, 2023 - 13:37 CEST

Resolved

This incident has been resolved.

Posted Sep 13, 2023 - 18:06 CEST

Monitoring

[EU+US] Communities should be coming back online. We will be monitoring the status.

Posted Sep 13, 2023 - 17:11 CEST

Identified

The issue has been identified and a fix is being implemented.

Posted Sep 13, 2023 - 16:47 CEST

Investigating

We are currently investigating this issue.

Posted Sep 13, 2023 - 16:44 CEST

This incident affected: Status of our EMEA Community Infrastructure (Status of our EMEA Community Infrastructure) and Status of our US Community Infrastructure (Status of our US Community Infrastructure).