GCP Outage Incident Review
Summary
Starting on 19 July 2022, a cooling-related failure in a Google Cloud data centre impacted multiple cloud services, including connectivity to VALR’s primary cloud-hosted database.
The data centre incident automatically triggered a failover from our primary database to a standby instance; however, the failover became stuck and never completed. Because the database is a cloud-managed service, we had no way to manually cancel the failover and resume normal operation.
Following the failure of this first level of redundancy, and only after the relevant cloud services were restored in the region, we opted to switch operations over to a second replica database. This was a carefully considered decision, which we took only once the cloud provider had assured us that no further outages were expected in the region. Our primary concern has always been, and always will be, ensuring the integrity of our customers’ funds, personal information, and other data. In a process like this, it’s critical that we move systematically, with many checks and balances, to ensure integrity is upheld.
During our manual switchover to the second replica, our cloud provider brought the original database back online and assured us that no further impact to running services was expected. Based on this, we decided to revert our secondary failover procedure and continue normal operations as before.
We took great care in consolidating customer accounts and balances while resuming normal operations. Due to the extended downtime, we brought the system back up progressively, in small incremental steps with focused verification at each one, to do so safely and securely.
Current Status
All normal operations have resumed. We’re still working on restoring missing transaction history and order status records for activity that occurred just before the outage, using internal state and logs. We expect to have this completed by the end of the week.
Impact
The outage lasted a total of 12 hours, during which our customers were unable to trade, deposit, or withdraw funds. There was never any risk to our customers’ funds, personal information, or other data, and none of these were affected.
What we’ll add going forward
We continually work to improve our redundancy and our ability to recover from adverse events. We’re looking to add further regional redundancy, as well as improved internal tooling that would allow for faster recovery while maintaining consistency.