Secret of CubeFS｜ Multi-AZ Erasure Coding Disaster Tolerance
Multi-AZ disaster tolerance refers to deploying applications and data across multiple availability zones (AZs) in a cloud computing environment to achieve high availability and disaster tolerance. A single AZ may fail or become unavailable, so deploying data in multiple AZs improves system availability and ensures business continuity and stability. It also enables data backup and recovery, ensuring data security and integrity. In addition, users demand increasingly high levels of availability and stability, and multi-AZ disaster tolerance helps meet that demand, improving user experience and satisfaction.
This article introduces the AZ-level disaster tolerance scheme of the Blobstore erasure coding subsystem in CubeFS. Because the multi-replica storage mode suffers from high storage cost and low efficiency, consumes substantial network bandwidth in the multi-AZ case, and must keep data consistent in a timely manner, we discuss only the erasure coding (EC) storage mode. Blobstore uses erasure coding to encode user data and persistently store it across multiple AZs, achieving high availability and disaster tolerance. The article covers the EC calculation principle in multiple AZs, as well as how to keep writes highly available, keep reads efficient, and reduce cross-AZ data recovery. The reliability of EC data is not elaborated here and can be found in our related articles.
In cloud computing environments, data centers are divided into multiple availability zones, each with independent power, network, and physical facilities, to achieve high availability and disaster tolerance and to provide cloud services comparable to those users run locally.
EC is a data redundancy encoding and decoding technology that divides data into blocks and generates multiple encoded blocks, which are then stored on different storage nodes to achieve high availability and disaster tolerance of data. It is precisely because of the redundancy characteristic of EC that multi-AZ disaster tolerance can be achieved at low cost and high efficiency.
- Disaster Tolerance Capability
Disaster tolerance capability refers to the number of AZs that a multi-AZ deployment can lose while still serving data, even when the unavailable AZs cannot provide any help for data recovery. The simple relationship between the number of AZs, minimum redundancy, and disaster tolerance capability is shown below:
Table-1: AZ tolerance capability
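The relationship can be illustrated with a minimal sketch. It is not the article's original table. It assumes that encoded blocks are spread evenly across AZs, so that losing f of n AZs loses a fraction f/n of all blocks, and that the surviving blocks must still number at least D. Under that assumption the minimum redundancy (D+P)/D is n/(n-f). The function name `min_redundancy` is hypothetical.

```python
from fractions import Fraction


def min_redundancy(az_count: int, az_failures: int) -> Fraction:
    """Minimum (D+P)/D ratio needed to survive the loss of `az_failures` AZs.

    Assumption: blocks are distributed evenly over `az_count` AZs, so the
    surviving fraction of blocks is (az_count - az_failures) / az_count and
    it must cover at least D blocks.
    """
    if not 0 <= az_failures < az_count:
        raise ValueError("az_failures must be in [0, az_count)")
    return Fraction(az_count, az_count - az_failures)


# Print the redundancy needed for a few small deployments.
for n in (2, 3, 4):
    for f in range(1, n):
        ratio = float(min_redundancy(n, f))
        print(f"AZs={n} tolerated failures={f} min redundancy={ratio:.2f}x")
```

For example, three AZs tolerating one failure need at least 1.5x redundancy under this assumption, which EC (12,9) satisfies with 21/12 = 1.75x.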
Advantages and Disadvantages
Advantages:
- When one AZ fails or becomes unavailable, the other AZs in the system are not affected, ensuring the continuity and stability of data.
- When an AZ suffers a disastrous, unrecoverable failure, the security and integrity of data are still ensured.
- While maintaining high reliability, storage capacity can be provided with lower redundancy and therefore at lower cost.
Disadvantages:
- As shown in Table-1, to obtain better disaster tolerance, data redundancy must be increased by a factor of two, which essentially conflicts with cost efficiency. Therefore, we generally only discuss tolerating the failure of a single AZ.
- Multi-AZ requires the generation of limited adaptive encoded blocks and connections to multiple data nodes during writing and reading, which consumes more computing resources and network bandwidth.
- Write quorum with no AZ failure: During data encoding and writing, we do not count the local blocks in LRC as effective data, because in some scenarios local blocks do not help with data recovery (as in the case of the lost L31 in AZ-3 in Figure-1).
The minimum Quorum for data writing is D + (D+P)/AZ = 12+(12+9)/3 = 19; in extreme cases, to ensure the disaster tolerance of one AZ, the EC (12,9) model can only allow up to 2 effective data blocks to fail to be written, otherwise data loss may occur during AZ failure (as in the case of AZ-3 failure in Figure-1).
In actual production, the probability of AZ failure is very low, and the write Quorum can be appropriately adjusted to reduce some long-tail effects and improve user experience.
- Write with one failed AZ: If an AZ fails during writing, the system can still remain available and write data successfully. However, in this situation the minimum write Quorum cannot be satisfied. It is therefore necessary to ensure that all effective data in the other AZs is written successfully (as in the case of AZ-3 failure in Figure-1, where all 14 effective blocks in AZ-1 and AZ-2 need to be written), and the data recovery process should be initiated as soon as the failed AZ recovers.
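The two write rules above can be sketched as follows. This is a minimal illustration, not Blobstore's implementation: the function names and the `written_per_az` and `down_az` parameters are hypothetical, blocks are assumed to be spread evenly across AZs, and only effective blocks (global data and global parity, excluding LRC local blocks) are counted.

```python
from typing import List, Optional


def min_write_quorum(d: int, p: int, az: int) -> int:
    """Minimum effective blocks to write: D + (D+P)/AZ.

    With this many blocks written, losing any one AZ (which holds
    (D+P)/AZ blocks) still leaves at least D blocks for decoding.
    """
    if (d + p) % az != 0:
        raise ValueError("D+P must divide evenly across AZs in this sketch")
    return d + (d + p) // az


def write_succeeds(written_per_az: List[int], d: int, p: int,
                   down_az: Optional[int] = None) -> bool:
    """Decide whether a write may be acknowledged.

    written_per_az[i] is the number of effective blocks written in AZ i.
    down_az is the index of an AZ known to be failed, or None.
    """
    az = len(written_per_az)
    per_az = (d + p) // az
    if down_az is None:
        # No AZ failure: the total must reach the minimum quorum.
        return sum(written_per_az) >= min_write_quorum(d, p, az)
    # One AZ failed: every effective block in the other AZs must be written.
    return all(n == per_az
               for i, n in enumerate(written_per_az) if i != down_az)


print(min_write_quorum(12, 9, 3))               # quorum for EC (12,9), 3 AZs
print(write_succeeds([7, 7, 5], 12, 9))         # two blocks missing
print(write_succeeds([7, 7, 0], 12, 9, down_az=2))
```

For EC (12,9) over three AZs the quorum is 19, so at most two effective blocks may fail to be written when no AZ is down, matching the figures above.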
- EC-Read: Since a single AZ does not hold enough data blocks, the remaining blocks must be read across AZs to complete the EC recovery. The cross-AZ data traffic is equal to DataSize * (1 - (D+P)/AZ/D) = DataSize * (1 - (12+9)/3/12) = DataSize * 5/12.
- Range-Read: When reading partial data (which requires reading less data than a data block), the corresponding data block in the corresponding AZ should be directly read to avoid wasting AZ bandwidth due to data recovery. If the corresponding data block is damaged, some of the effective data blocks can be read to perform the EC-Read process.
- RS-Code: The cross-AZ bandwidth for ordinary EC data recovery is the same as the EC-Read mentioned above.
- LRC: In an EC group, the probability of damaging a single data block is much higher (1000x) than the probability of damaging multiple data blocks. If a bad block can be detected in a timely manner, it is usually sufficient to recover only one block in most cases. At this time, enabling LRC can complete data recovery within the AZ, avoiding cross-AZ bandwidth and IO consumption.
- AzureLRC+1: Based on Azure LRC, a local check block is also computed over the global check blocks, and combined with a suitable placement strategy, better recovery performance can be achieved. In extreme cases (single AZ failure + additional single block failure), the system still has the ability to repair data.
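The EC-Read traffic figure quoted above can be checked with a short sketch. It assumes the reader sits in one AZ, reads every block held locally, and fetches only the remaining blocks needed to reach D from other AZs. The function names are hypothetical.

```python
from fractions import Fraction


def cross_az_fraction(d: int, p: int, az: int) -> Fraction:
    """Fraction of the data that must cross AZ boundaries during an EC-Read.

    Equivalent to 1 - (D+P)/AZ/D when blocks are spread evenly over AZs.
    """
    if (d + p) % az != 0:
        raise ValueError("D+P must divide evenly across AZs in this sketch")
    local_blocks = (d + p) // az           # blocks available in the local AZ
    remote_blocks = max(d - local_blocks, 0)
    return Fraction(remote_blocks, d)


def cross_az_bytes(data_size: int, d: int, p: int, az: int) -> Fraction:
    """Cross-AZ bytes needed to read `data_size` bytes of user data."""
    return data_size * cross_az_fraction(d, p, az)


print(cross_az_fraction(12, 9, 3))                    # EC (12,9) over 3 AZs
print(float(cross_az_bytes(12 * 1024 * 1024, 12, 9, 3)))
```

When LRC repairs a single bad block inside its own AZ, this cross-AZ share drops to zero for that repair, which is the motivation for enabling it.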
Optimization for Fragmented Files
Figure-2: 1K data with EC
In erasure coding, small fragmented files are read and written especially inefficiently. To address this issue, we stipulate that each data block in EC has a minimum size, which can be defined according to the situation (e.g., 4K). When the actual data is smaller than D*4K, it is padded with zeros, encoded, and written to each AZ. Taking Figure-2 as an example, the advantages of this optimization are as follows:
- File systems generally use 4K as a Block size.
- During reading, the first D-data blocks can be read directly, or the recovery read process can be started from any P-data block, reducing the IO path and not consuming cross-AZ bandwidth.
- When data is damaged, it can be recovered from any parity block.
- The data redundancy is increased to (P+1), improving reliability. However, some space is wasted. In the cloud computing scenario, the performance requirements for small files are higher than storage costs. The idea of trading space for time effectively solves this problem and makes the storage abstraction layer more unified and complete.
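The padding step can be sketched as below. This is a simplified illustration under stated assumptions, not Blobstore's code: the function `pad_and_split` and its parameters are hypothetical, the block size for larger payloads is assumed to be the payload size divided by D and rounded up, and parity encoding is omitted.

```python
from typing import List


def pad_and_split(data: bytes, d: int, min_block: int = 4096) -> List[bytes]:
    """Zero-pad `data` to d equal blocks of at least `min_block` bytes each."""
    # Ceiling division gives the block size needed to fit the payload;
    # small payloads are raised to the minimum block size.
    block_size = max(min_block, -(-len(data) // d))
    padded = data.ljust(d * block_size, b"\x00")
    return [padded[i * block_size:(i + 1) * block_size] for i in range(d)]


payload = b"x" * 1024                     # 1K of user data, as in Figure-2
blocks = pad_and_split(payload, d=12)

print(len(blocks), len(blocks[0]))        # number of blocks and block size
print(blocks[0][:1024] == payload)        # payload sits in the first block
print(all(b == bytes(4096) for b in blocks[1:]))  # the rest are all zeros
```

In this sketch the whole 1K payload lands in the first data block and the remaining data blocks contain only zeros, so a read can be served from a single block without touching other AZs.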
 Erasure Code
 CubeFS Blobstore