Technical Resilience
Our AWS stack is designed to handle two primary failure scenarios: Failover and Disaster Recovery. Failover is handled entirely within a single geographic region using a highly available primary environment. In this primary environment, data is replicated synchronously between multiple database servers, and redundant systems are used to maximise continuity of service.
These redundant systems are distributed across several AWS Availability Zones (AZs) in a single geographic region (Dublin, Ireland). Each AWS region contains multiple AZs, and each AZ has discrete power and internet connectivity. We use two AZs simultaneously for web traffic, which reduces the effect of any failure on the availability of the service, and three AZs for hosting database services.
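The zone layout described above can be sanity-checked with a small model. The following is a minimal sketch only: the AZ names and the assignment of tiers to zones are hypothetical placeholders, and "tolerates" here means nothing more than "each tier still has at least one zone left", which is an illustrative simplification rather than a statement about how the platform measures availability.

```python
# Hypothetical layout: the AZ names and tier assignments are illustrative
# assumptions, not taken from the actual deployment.
TIER_ZONES = {
    "web": {"eu-west-1a", "eu-west-1b"},                    # two AZs for web traffic
    "database": {"eu-west-1a", "eu-west-1b", "eu-west-1c"}, # three AZs for databases
}


def surviving_zones(tier_zones, failed_zone):
    """Return, per tier, the zones still available after one AZ is lost."""
    return {tier: zones - {failed_zone} for tier, zones in tier_zones.items()}


def tolerates_single_az_loss(tier_zones):
    """True if every tier keeps at least one zone, whichever single AZ fails."""
    all_zones = set().union(*tier_zones.values())
    return all(
        all(remaining for remaining in surviving_zones(tier_zones, az).values())
        for az in all_zones
    )


if __name__ == "__main__":
    print(tolerates_single_az_loss(TIER_ZONES))
    print(surviving_zones(TIER_ZONES, "eu-west-1a"))
```

A tier placed in only one zone would fail this check, which is the situation the multi-AZ design is meant to avoid.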
Disaster Recovery is provided from a secondary geographic region (Frankfurt, Germany). This mode is intended to meet a 4-hour Recovery Time Objective (RTO) in the event of total loss or failure of the primary environment. It is facilitated by regularly shipping backups to encrypted storage in the secondary region.
Configuration management and automation allow us to spin up the other platform components in the secondary region, supporting a deployment of the system in the absence of our primary geographic location.
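As an illustration of how the 4-hour target might be evaluated after a recovery exercise, the sketch below compares the elapsed recovery time against the RTO. Only the 4-hour figure comes from the text; the function name, its parameters, and the example timestamps are hypothetical and do not describe our actual tooling.

```python
from datetime import datetime, timedelta

# The 4-hour target is stated in the text; everything else is illustrative.
RTO = timedelta(hours=4)


def meets_rto(failure_declared_at, service_restored_at, rto=RTO):
    """Return True if the service was restored within the RTO.

    Both arguments are datetimes in the same timezone (hypothetical inputs,
    for example taken from an incident or drill log).
    """
    if service_restored_at < failure_declared_at:
        raise ValueError("restoration cannot precede the declared failure")
    return service_restored_at - failure_declared_at <= rto


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 9, 0)   # example timestamps only
    restored = datetime(2024, 1, 1, 12, 30)
    print(meets_rto(declared, restored))
```

Measuring from the moment the failure is declared, rather than from when it first occurred, is an assumption of this sketch; the appropriate starting point depends on how the RTO is defined in the recovery plan.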
The existing technical environment is designed to be resilient, but there are always risks that could impact the availability of our service. Known risks are recorded on a risk register in accordance with our risk management framework and are monitored for changes in status. Opportunities for improvement are sought as part of the ongoing risk management process and the strategic development of the business.