US Communities - Currently experiencing issues with API calls

Incident Report for CC Status Page

Postmortem

Public API Downtime in US Region - Root Cause Analysis

Date of Incident: June 4, 2025
Region Impacted: United States
Duration: Approximately 3 hours

Summary

On June 4, 2025, our Public API in the US region experienced an unexpected outage beginning around 14:00 UTC. The root cause was the database underlying the API exhausting its allocated storage, which made the API inaccessible. Service was fully restored after intervention from our cloud provider.

Impact

During the outage, customers in the US region were unable to access the Public API for approximately 3 hours. This may have affected real-time data operations and integrations. No data loss occurred, and all data integrity was maintained post-recovery. However, we recognize that the service disruption impacted customer experience and system reliability.

Timeline (UTC)

  • 12:10 - Initial internal alert for high database connections triggered
  • 14:00 - First customer-facing alert of API unavailability
  • 14:00-14:15 - Preliminary investigation into logs
  • 14:20 - Engineering team escalated the issue
  • Multiple recovery attempts initiated, including expanding storage and restoring from backup
  • 17:00 - Cloud provider intervened and successfully restored database service
  • ~17:00-17:15 - API services resumed normal operation

Root Cause Analysis

The issue was caused by the exhaustion of allocated storage on the database instance supporting the US Public API. This instance was operating under an outdated configuration with limited storage and without automatic scaling. In addition, no specific monitoring alert for storage exhaustion was in place, contributing to delayed detection and response.

Preventative Actions

To prevent recurrence of similar issues, we are implementing the following actions:

Monitoring Improvements

  • Introduce storage-specific monitoring and alerting for all production databases
  • Enable automatic storage scaling on all critical database instances
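As an illustration of the first action above, the check below is a minimal sketch of storage-specific monitoring: it measures how full the filesystem holding a database's data directory is and reports whether an alert threshold has been crossed. The path and the 85% threshold are illustrative assumptions, not our actual production values, and a real deployment would feed this into an alerting pipeline rather than a standalone script.

```python
import shutil

# Illustrative threshold: fire an alert when 85% of allocated storage is used.
ALERT_THRESHOLD = 0.85


def storage_usage_ratio(path: str) -> float:
    """Return the fraction of storage used on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_storage(path: str, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True if storage usage at `path` meets or exceeds `threshold`."""
    return storage_usage_ratio(path) >= threshold
```

A scheduler or monitoring agent would run `check_storage("/var/lib/db")` (path assumed) periodically and page on a `True` result, giving the early warning that was missing during this incident.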

Infrastructure Enhancements

  • Upgrade legacy database instances to more scalable configurations
  • Migrate to more reliable M-series instances
  • Audit and validate configurations for all production infrastructure

These improvements are planned during a scheduled maintenance window over the weekend of June 21-22, 2025, with minimal expected downtime.

Lessons Learned

  • Relying on legacy instance types for production workloads increases risk
  • Clear and specific alerting is essential for early detection of infrastructure issues
  • Automatic scaling and infrastructure validation should be standard across services
  • Operational coverage during holidays must be more robust to ensure timely responses

We sincerely apologise for the inconvenience this incident may have caused. Ensuring the reliability and performance of our services is our highest priority, and we are committed to learning from this event to prevent a recurrence.

If you have any further questions or concerns, please do not hesitate to reach out to our support team at ccsupport@gainsight.com.

Posted Jun 10, 2025 - 11:31 CEST

Resolved

The problem has been resolved and normal service has been restored. Thank you for your patience while we worked on this. If you experience any further issues, please reach out to our support team.
Posted Jun 02, 2025 - 20:09 CEST

Identified

We have identified an infrastructure issue that is causing the problem. We are in the process of fixing it now and hope to be back to normal very soon.
Posted Jun 02, 2025 - 17:13 CEST

Investigating

We are currently investigating an issue with some API calls not working for US-hosted communities; the API credentials page is also displaying an error 500 message.
Posted Jun 02, 2025 - 16:28 CEST
This incident affected: Status of our US Community Infrastructure (API).