
Secret of CubeFS Technology | Metadata Subsystem Design

2023-08-02 Zhixiang Tang

About the author: Zhixiang Tang, one of the CubeFS maintainers, currently works at OPPO and is responsible for the erasure coding system.

Introduction

The previous article, CubeFS Storage Technology Revealed - Metadata Management[1], introduced the design concepts of the CubeFS metadata system. That article mentioned that traditional file systems suffer from a single-point bottleneck at the metadata node, and that CubeFS's scalable metadata design effectively solves this problem. This article looks at the metadata system in more detail, covering some of its implementation and design considerations around reliability, scalability, and access performance. Before reading this article, you should understand the CubeFS system architecture, the responsibilities of each key component, and the system's terminology.

[Figure]

Metadata System Design

This chapter introduces the design ideas behind the CubeFS metadata service. You will learn the following:

1. Design Principles of CubeFS's Scalable Metadata Service

2. Key data structures of CubeFS metadata service

3. Implementation scheme of CubeFS metadata reliability

Metadata Sharding

Each mounted file system corresponds to a volume, and a volume represents a virtual file system. All of a user's file metadata is stored in metadata partitions (meta partition, mp for short). CubeFS shards metadata by range: each mp is responsible for a range of inodes (an inode identifies a file or directory, and each file or directory has a unique inode in the file system), and the last mp supports splitting. This is the key design that allows CubeFS to scale metadata horizontally.

The following shows the mp splitting process. Each volume is initially allocated 3 mps when created. Assume each mp is responsible for 1000 inodes (in the production environment each mp is responsible for 16 million inodes); the last mp is responsible for an open-ended inode range.

A background task of the Master component periodically checks the last mp of each volume. When the memory usage of a MetaNode reaches the threshold (generally set to 75% of total memory), that MetaNode becomes read-only. If the last mp of a volume falls on such a MetaNode, or the last mp has already allocated a certain number of inodes, the Master splits the last mp into two mps. Suppose the mp distribution of the cluster at a certain moment is as shown below:

[Figure]

The result after the split is shown in the figure below. The original mp3 was responsible for the inodes in [2001, +∞); after the split a new mp4 is created that is responsible for allocating inodes in [3001, +∞), while mp3 is now responsible for inodes in [2001, 3000]. Which MetaNode the newly created mp lands on is covered in the load balancing section.

[Figure]

It is worth noting that inode allocation can be understood as a polling (round-robin) strategy. A volume contains multiple mps, and whenever the client creates a new file it round-robins the allocation request across them. Assuming files A and B are created one after another, file A is allocated inode 100 on mp1 and file B is allocated inode 1100 on mp2. Once all inodes managed by an mp have been consumed, the mp automatically switches to the read-only state.
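To make range sharding, polling, and splitting concrete, here is a minimal, self-contained Go sketch. The metaPartition type, the step size, and the function names are hypothetical and for illustration only; they do not match the real CubeFS structures.

package main

import (
	"fmt"
	"math"
)

// metaPartition is a simplified stand-in for an mp's inode range
// (hypothetical type for illustration; the real structures differ).
type metaPartition struct {
	ID    uint64
	Start uint64 // first inode this mp is responsible for
	End   uint64 // last inode; math.MaxUint64 for the last mp
}

// locateMP finds the mp responsible for a given inode via its range.
func locateMP(mps []metaPartition, inode uint64) *metaPartition {
	for i := range mps {
		if inode >= mps[i].Start && inode <= mps[i].End {
			return &mps[i]
		}
	}
	return nil
}

// pickMPForCreate round-robins new-file creations across mps,
// mirroring the polling strategy described above.
func pickMPForCreate(mps []metaPartition, epoch uint64) *metaPartition {
	return &mps[epoch%uint64(len(mps))]
}

// splitLastMP caps the last mp at cursor+step and appends a new mp that
// takes over the open-ended range, as in the mp3/mp4 example above.
func splitLastMP(mps []metaPartition, cursor, step uint64) []metaPartition {
	last := &mps[len(mps)-1]
	last.End = cursor + step
	return append(mps, metaPartition{ID: last.ID + 1, Start: last.End + 1, End: math.MaxUint64})
}

func main() {
	mps := []metaPartition{
		{ID: 1, Start: 1, End: 1000},
		{ID: 2, Start: 1001, End: 2000},
		{ID: 3, Start: 2001, End: math.MaxUint64},
	}
	fmt.Println(locateMP(mps, 1100).ID)     // 2
	fmt.Println(pickMPForCreate(mps, 7).ID) // 7 % 3 == 1, so mp2
	mps = splitLastMP(mps, 2000, 1000)      // cursor 2000, step 1000
	fmt.Println(mps[2].End, mps[3].Start)   // 3000 3001
}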

Data Structure

CubeFS stores metadata in BTrees; each mp maintains two of them, the inodeTree and the dentryTree. The BTree implementation is Google's open-source Go version[2]. It uses copy-on-write (COW) on writes to improve write performance and a free list to allocate new tree nodes, reducing memory overhead. If you want to dig into other features, you can read the source code.

The schematic diagram of the inodeTree is as follows, using the inode id as the index of the BTree.

For readability, the figure below draws the structure with pointers. In fact, the BTree is implemented with slices and does not need pointers; also, the inode id is a field inside the inodeInfo structure.

[Figure]
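As an illustration of keying the inodeTree by inode id, here is a minimal sketch built on the non-generic v1 API of the github.com/google/btree library mentioned above. The inodeItem wrapper and the tree degree are assumptions for the example, not the actual CubeFS types.

package main

import (
	"fmt"

	"github.com/google/btree"
)

// inodeItem wraps an inode id so it can be stored in the BTree
// (hypothetical wrapper; the real item carries full inode metadata).
type inodeItem struct {
	Inode uint64
}

// Less orders items by inode id, which makes the inode id the BTree index.
func (i *inodeItem) Less(than btree.Item) bool {
	return i.Inode < than.(*inodeItem).Inode
}

func main() {
	tree := btree.New(32) // degree 32, chosen only for the example
	for _, ino := range []uint64{100, 1100, 2003} {
		tree.ReplaceOrInsert(&inodeItem{Inode: ino})
	}
	if it := tree.Get(&inodeItem{Inode: 1100}); it != nil {
		fmt.Println("found inode", it.(*inodeItem).Inode)
	}
}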

Each file corresponds to a unique inode, which records the detailed metadata of the file, including the inode id, file type, creation time, etc.

The inodeInfo structure below carries the detailed metadata of each file; the fields are explained in the comments. CubeFS supports both multi-replica and erasure-coded storage: Extents records the location of the file data in the multi-replica engine, and ObjExtents records its location in the erasure-coding engine.

type inodeInfo struct {
	sync.RWMutex
	inode      uint64 // inode id
	Type       uint32 // file type and permissions
	Uid        uint32 // user id
	Gid        uint32 // group id
	Size       uint64 // file size
	Generation uint64 // version number, incremented on each modification
	CreateTime int64  // creation time
	AccessTime int64  // access time
	ModifyTime int64  // modification time
	LinkTarget []byte // target path of the source file for a symbolic link
	NLink      uint32 // number of hard links
	Flag       int32  // deletion mark
	Extents    *SortedExtents    // location of the file data in the multi-replica engine
	ObjExtents *SortedObjExtents // location of the file data in the erasure-coding engine
}

Extents is the location information in the multi-replica system. It is an array []ExtentKey that records where the file lands in multiple different extents (an extent is the structure that stores file data; it is managed by a DP on a DataNode, see the replica subsystem article for details). A file is written into different extents in fragments.

// ExtentKey defines the extent key struct.
type ExtentKey struct {
	FileOffset   uint64 // offset of this fragment within the file
	PartitionId  uint64 // id of the data partition (dp)
	ExtentId     uint64 // extent id
	ExtentOffset uint64 // offset of the fragment within the extent
	Size         uint32 // size of the fragment
	CRC          uint32 // crc checksum
}
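As an illustration of how such a sorted fragment list can be used, the hypothetical helper below (not CubeFS code) walks a slice of ExtentKey, assumed to be sorted by FileOffset as in the struct above, and returns the fragments that overlap a requested read range.

// coveringExtents returns the fragments that overlap the byte range
// [offset, offset+size), assuming eks is sorted by FileOffset.
func coveringExtents(eks []ExtentKey, offset, size uint64) []ExtentKey {
	var out []ExtentKey
	end := offset + size
	for _, ek := range eks {
		ekStart := ek.FileOffset
		ekEnd := ek.FileOffset + uint64(ek.Size)
		if ekEnd <= offset {
			continue // fragment ends before the requested range
		}
		if ekStart >= end {
			break // sorted by FileOffset, nothing further can overlap
		}
		out = append(out, ek)
	}
	return out
}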

ObjExtents is the location information in the erasure-coding system and is an array []ObjExtentKey. Similarly, a file is cut into multiple blobs (the encoding unit size defined by the erasure-coding system) and written into the erasure-coding system, so there can be multiple ObjExtentKey entries.

// ObjExtentKey defines the object extent key struct.
type ObjExtentKey struct {
	Cid        uint64 // cluster id of the erasure-coding cluster; multiple clusters are supported
	CodeMode   uint8  // erasure coding mode
	BlobSize   uint32 // size of each blob in the erasure-coding system
	BlobsLen   uint32 // number of blobs (bids), i.e. the length of Blobs
	Size       uint64 // file size
	Blobs      []Blob // location information of each bid in the erasure-coding system, not covered here
	FileOffset uint64 // offset within the file
	Crc        uint32 // crc checksum
}
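To illustrate the blob mapping, the hypothetical helpers below (not CubeFS code) compute how many fixed-size blobs a file of a given size occupies and which blob a byte offset falls into, assuming blobs of BlobSize bytes as described above.

// blobCount returns how many BlobSize-sized blobs a file needs.
func blobCount(fileSize uint64, blobSize uint32) uint64 {
	if blobSize == 0 {
		return 0
	}
	return (fileSize + uint64(blobSize) - 1) / uint64(blobSize) // ceiling division
}

// blobIndexFor returns which blob a byte offset falls into.
func blobIndexFor(offset uint64, blobSize uint32) uint64 {
	return offset / uint64(blobSize)
}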

Each dentry represents a directory entry and contains information such as the inode of the parent directory and the name of the file (or directory). The detailed structure of a dentry is as follows:

// Dentry defines the dentry struct.
type Dentry struct {
	ParentId uint64 // inode of the parent directory
	Name     string // name of the current directory entry
	inode    uint64 // inode of the current file
	Type     uint32 // file type and permissions
}

The main function of the dentryTree is to speed up file lookups. Its index consists of parentId and name together: when querying, parentId is compared first, and if equal, name is compared. The dentry of each file is stored in the partition that holds the inode of its parent directory, so the dentries of all child files of a directory are saved in the same mp. This design takes locality into account: listing the files in a directory only touches one mp rather than every mp in the cluster.

[Figure]
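The composite index and the directory listing can be sketched with the same google/btree v1 API. The dentryItem type and the listDir helper below are illustrative assumptions, not the real CubeFS implementation.

package main

import (
	"fmt"

	"github.com/google/btree"
)

// dentryItem is an illustrative item for the dentryTree: the index is the
// (ParentId, Name) pair, compared ParentId first and then Name.
type dentryItem struct {
	ParentId uint64
	Name     string
	Inode    uint64
}

func (d *dentryItem) Less(than btree.Item) bool {
	o := than.(*dentryItem)
	if d.ParentId != o.ParentId {
		return d.ParentId < o.ParentId
	}
	return d.Name < o.Name
}

// listDir walks all dentries whose ParentId equals dir, i.e. the children
// of one directory, which all live in the same mp.
func listDir(tree *btree.BTree, dir uint64) []*dentryItem {
	var out []*dentryItem
	tree.AscendGreaterOrEqual(&dentryItem{ParentId: dir}, func(i btree.Item) bool {
		d := i.(*dentryItem)
		if d.ParentId != dir {
			return false // left this directory's key range, stop iterating
		}
		out = append(out, d)
		return true
	})
	return out
}

func main() {
	tree := btree.New(32)
	tree.ReplaceOrInsert(&dentryItem{ParentId: 1, Name: "A", Inode: 1001})
	tree.ReplaceOrInsert(&dentryItem{ParentId: 1001, Name: "C", Inode: 1002})
	tree.ReplaceOrInsert(&dentryItem{ParentId: 1001, Name: "file4", Inode: 1003})
	for _, d := range listDir(tree, 1001) {
		fmt.Println(d.Name, d.Inode) // C 1002, file4 1003
	}
}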

Think about it:

Could a B+tree also be used to store the inodeTree and dentryTree?

Reliability

Each mp is replicated on three MetaNodes, and the reliability of metadata is guaranteed by the Raft algorithm; Multi-Raft is used to reduce the heartbeat overhead between nodes. Each MetaNode hosts multiple mps, and each mp's replica group elects a leader to handle read and write requests.

[Figure]

Every five minutes, each mp takes a snapshot to persist its metadata. Each snapshot generates a new file that replaces the old one if nothing goes wrong (to guard against data corruption, the old file is backed up first). All metadata changes between two snapshots are recorded in Raft's WAL log, so that after a node failure the mp can quickly recover to the state at the time of the failure. The recovery process is as follows: after a failure, first load the data from the latest snapshot, then replay the WAL entries after the recorded apply index to restore the in-memory data.

Note: the snapshot here is not a Raft snapshot but a snapshot of the mp itself. The contents recorded in the snapshot files are as follows (this article is based on versions before v3.3.0; snapshots after v3.3.0 add transaction-related content):

  • apply: the maximum apply index value of Raft
  • inode: file metadata information, stored as a BTree
  • dentry: directory entry (dentry) information, stored as a BTree
  • cursor: the largest inode id the current mp has allocated
  • extend: extended attributes, currently used by object storage, stored as a BTree
  • multipart: multipart upload information for object storage, stored as a BTree
  • .sign: the crc32 values of dentry, extend, inode, and multipart, used for verification (a verification sketch follows this list)
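A minimal verification sketch, assuming the snapshot file names listed above and a .sign file containing space-separated crc32 values in the order dentry, extend, inode, multipart; this is an illustration, not the actual CubeFS check.

package main

import (
	"fmt"
	"hash/crc32"
	"os"
	"path/filepath"
	"strings"
)

// verifySnapshot recomputes the crc32 of each snapshot file and compares it
// with the values recorded in .sign.
func verifySnapshot(dir string) error {
	files := []string{"dentry", "extend", "inode", "multipart"}
	raw, err := os.ReadFile(filepath.Join(dir, ".sign"))
	if err != nil {
		return err
	}
	want := strings.Fields(string(raw))
	for i, name := range files {
		data, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			return err
		}
		got := fmt.Sprint(crc32.ChecksumIEEE(data))
		if i < len(want) && want[i] != got {
			return fmt.Errorf("%s: crc mismatch, want %s got %s", name, want[i], got)
		}
	}
	return nil
}

func main() {
	if err := verifySnapshot("./snapshot"); err != nil {
		fmt.Println(err)
	}
}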

[Figure]

Load Balancing

Balanced mp creation

To keep the number of mps balanced across MetaNodes, a newly created mp must choose a suitable MetaNode. The MetaNode is selected according to node weights, where a node's weight is positively correlated with its remaining unused memory: the more free memory, the higher the weight. In addition, some randomness is added to the selection so that a newly expanded MetaNode is not assigned a large number of mps in a short period of time. This strategy effectively improves the stability and reliability of the system.

Allocation strategy

  • First select the zone. If the volume does not span zones, zoneNum = 1; otherwise, if the number of zones in the cluster is less than the replica count of 3, zoneNum = 2, else zoneNum = 3. A zone can be understood as an availability zone;
  • Then select a nodeset from the zone. A nodeset is a set of nodes and can be used to separate fault domains;
  • Obtain all writable MetaNodes in the nodeset. A MetaNode must be active (MetaNodes heartbeat to the Master periodically; if the heartbeat interval, 18s by default, is exceeded, the node is marked inactive), its memory usage must not exceed 0.75, its mp count must not exceed the threshold, and it must not be read-only;
  • Among the filtered MetaNodes, the maximum total memory is taken as maxTotal and used as the benchmark value;
  • Each node is initialized with an InitCarry in the range [0.0, 1.0];
  • Each node has a weight: an unused node has weight 1, and a used node has weight = (maxTotal - used) / maxTotal;
  • Finally, carry = InitCarry + weight for each node, and nodes are sorted by carry in descending order;
  • The MetaNodes with the largest carry values are selected and returned, and each selected MetaNode's carry is decremented (carry--) to reduce its probability of being selected next time (see the sketch after this list). As the cluster grows and new MetaNodes are added, the number of mps on old MetaNodes tends to be larger than on new ones. When the mp skew becomes severe, mps on old nodes can be taken offline through the mp offline strategy; the system then recreates new mp replicas, which will most likely be allocated to the newly added MetaNodes.
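The sketch below illustrates the carry-based selection under stated assumptions: the nodeStat type, its field names, and the memory figures are hypothetical, and the real Master logic includes more filtering than shown here.

package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// nodeStat is a simplified view of a MetaNode used for selection.
type nodeStat struct {
	Addr  string
	Total uint64  // total memory
	Used  uint64  // used memory
	Carry float64 // initialized randomly in [0.0, 1.0)
}

// selectMetaNodes adds weight = (maxTotal - used) / maxTotal to each node's
// carry, sorts by carry descending, picks the top nodes, and decrements the
// carry of every picked node to lower its chance of being picked again soon.
func selectMetaNodes(nodes []nodeStat, replicas int) []string {
	var maxTotal uint64
	for _, n := range nodes {
		if n.Total > maxTotal {
			maxTotal = n.Total
		}
	}
	for i := range nodes {
		weight := float64(maxTotal-nodes[i].Used) / float64(maxTotal)
		nodes[i].Carry += weight
	}
	sort.Slice(nodes, func(i, j int) bool { return nodes[i].Carry > nodes[j].Carry })
	var picked []string
	for i := 0; i < replicas && i < len(nodes); i++ {
		picked = append(picked, nodes[i].Addr)
		nodes[i].Carry-- // reduce the probability of being chosen next time
	}
	return picked
}

func main() {
	nodes := []nodeStat{
		{Addr: "mn1", Total: 64 << 30, Used: 48 << 30, Carry: rand.Float64()},
		{Addr: "mn2", Total: 64 << 30, Used: 16 << 30, Carry: rand.Float64()},
		{Addr: "mn3", Total: 128 << 30, Used: 32 << 30, Carry: rand.Float64()},
		{Addr: "mn4", Total: 128 << 30, Used: 96 << 30, Carry: rand.Float64()},
	}
	fmt.Println(selectMetaNodes(nodes, 3))
}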

Balanced distribution of leader mps

As mentioned earlier, each mp has three replicas, which elect a leader mp through the Raft protocol. The leader mp handles file read and write requests (follower read can be enabled for scenarios with lower consistency requirements), and CubeFS's Raft uses lease read to optimize reads and reduce the overhead of readIndex-based reads.

All read and write requests are concentrated on the leader mps, which means that the more leader mps a MetaNode hosts, the greater its read and write pressure. Assume the distribution of mp replicas in the cluster at a certain moment is as follows: MetaNode1 holds 3 leader mps while MetaNode3 and MetaNode4 hold none, which leads to an unbalanced load across MetaNodes.

[Figure]

Through an active balancing strategy, the leader mps can be distributed evenly across nodes. Under normal circumstances, the leaders of the Raft groups are balanced among the MetaNodes in the cluster, but for various reasons (process restarts, excessive load, hardware failures, etc.) the distribution of leader mps may gradually tilt. The active balancing mechanism effectively addresses the stability problems caused by an unbalanced leader distribution; a simplified sketch of such a rebalancing pass is shown below.
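As a simplified illustration of what such a rebalancing pass might compute (not the Master's actual algorithm), the sketch below moves leaders from the most loaded MetaNode to the least loaded one until the counts differ by at most one.

package main

import "fmt"

// suggestTransfers proposes leader moves (from, to) until the spread of
// leader counts across MetaNodes is at most 1. Illustrative only.
func suggestTransfers(leaders map[string]int) [][2]string {
	var moves [][2]string
	for {
		var maxN, minN string
		for n := range leaders {
			if maxN == "" || leaders[n] > leaders[maxN] {
				maxN = n
			}
			if minN == "" || leaders[n] < leaders[minN] {
				minN = n
			}
		}
		if leaders[maxN]-leaders[minN] <= 1 {
			return moves
		}
		leaders[maxN]--
		leaders[minN]++
		moves = append(moves, [2]string{maxN, minN})
	}
}

func main() {
	counts := map[string]int{"MetaNode1": 3, "MetaNode2": 1, "MetaNode3": 0, "MetaNode4": 0}
	fmt.Println(suggestTransfers(counts)) // two moves away from MetaNode1
}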

[Figure]

Summary

This chapter introduced the metadata design ideas of CubeFS and how metadata scales horizontally through mp splitting. It also described the detailed metadata data structures and how snapshots and Raft guarantee metadata reliability. Hopefully this chapter gives you a deeper understanding of the CubeFS metadata system.

Key Processes for File Access

Assume the mount path is CubeFS/mnt. When the volume is initialized it is allocated 3 mps, and assume each mp is responsible for 1000 inodes. Initialization creates the root directory "/"; at the beginning an mp is randomly selected for this write. Here we assume mp1 is selected and inode 1 is allocated. For file creation the client maintains an auto-incrementing epoch value that increases with every creation, and the mp used to allocate the inode is chosen by epoch % count(mp). If directory A is created next, it is allocated inode 1001 on mp2, and creating directory B allocates inode 2001 on mp3.

If files are created in the order of [directory A, directory B, file1, directory C, file2, file3, file4], the final inode distribution is as follows.

[Figure]

Create a file

Now a new file file5 needs to be created under the directory /A/. The core process is as follows (a sketch of the flow follows the steps):

1. After the FUSE client starts, it periodically pulls the mp information of the volume, including the inode range each mp is responsible for and the member MetaNodes of each mp together with their IP addresses.

2. The inode of the root directory is 1 by default, so the client can quickly locate inode 1 on mp1 according to the inode range each mp is responsible for. The dentryTree of mp1 records {ParentId: 1, Name: "A", Inode: 1001}, from which the inode of directory A is found to be 1001. The client caches dentries it has already looked up for quicker location next time.

3. Check whether /A/file5 exists in mp2. The dentryTree of mp2 has no entry indexed by {ParentId: 1001, Name: "file5"}, so a file-not-found error is returned.

4. The client requests a new inode for file5. According to the polling strategy this request is sent to mp3, which creates the file metadata with inode 2003 and stores it in its in-memory inodeTree.

5. After the inode is created successfully, a dentry record is inserted into the mp that owns the parent directory A of file5. If the dentry insertion fails, the newly created inode is deleted synchronously to keep the operation atomic.
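The metaClient interface and its methods below are hypothetical, introduced only to show the ordering of steps 2 to 5 and the rollback that keeps the operation atomic; the real CubeFS client API differs.

package main

import "fmt"

// metaClient is a hypothetical interface used only for this sketch.
type metaClient interface {
	LookupDentry(parentIno uint64, name string) (ino uint64, found bool, err error)
	CreateInode(mode uint32) (ino uint64, err error)       // picks an mp by polling
	CreateDentry(parentIno, ino uint64, name string) error // goes to the parent's mp
	UnlinkInode(ino uint64) error                          // rollback path
}

// createFile checks for an existing dentry, allocates the inode, inserts the
// dentry, and rolls the inode back if the dentry insert fails.
func createFile(mc metaClient, parentIno uint64, name string, mode uint32) (uint64, error) {
	if _, found, err := mc.LookupDentry(parentIno, name); err != nil {
		return 0, err
	} else if found {
		return 0, fmt.Errorf("%s already exists", name)
	}
	ino, err := mc.CreateInode(mode)
	if err != nil {
		return 0, err
	}
	if err := mc.CreateDentry(parentIno, ino, name); err != nil {
		_ = mc.UnlinkInode(ino) // keep the operation atomic
		return 0, err
	}
	return ino, nil
}

func main() {} // placeholder so the sketch compiles standalone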

[Figure]

Finally, the newly created inode and dentry are inserted into the BTrees of their respective mps.

[Figure]

Read File

Suppose you need to read the file /A/file5. The lookup is similar to the creation flow; the key steps are as follows:

1. Search for the parent directory A of file5 starting from the root directory. The dentry of directory A was cached on the client while creating file5, so directory A can be located directly on mp2

2. Find the dentry corresponding to file5 in mp2: {ParentId: 1001, Name: "file5", Inode: 2003}

3. According to the mp information cached by the client, inode 2003 can be quickly located on mp3

4. A Streamer structure is created to obtain the latest metadata of the file from mp3 and update the Streamer's cache; the location of the file data (ExtentKey or ObjExtentKey) is then obtained from the file metadata. The figure below assumes that the data of file5 is stored in dp1 on DataNode1.

[Figure]

File List

If you need to obtain the file list under directory A, the key process of metadata access is as follows:

1. Find directory A starting from the root directory /. Since the inode of the root directory is 1 and the client also holds the inode ranges managed by each mp, the dentry of directory A, {ParentId: 1, Name: "A", Inode: 1001}, can be quickly found in mp1

2. List all files under directory A in mp2. Because the dentryTree of mp2 records the dentries of all of A's child files, it contains the dentry of directory C {ParentId: 1001, Name: "C", Inode: 1002}, the dentry of file4 {ParentId: 1001, Name: "file4", Inode: 1003}, and the dentry of file5 {ParentId: 1001, Name: "file5", Inode: 2003}

3. Access the mp where each child's inode resides to obtain its metadata; for example, the detailed metadata of directory C and file4 is obtained from mp2 according to inode 1002 and inode 1003, and the detailed metadata of file5 is obtained from mp3 according to inode 2003

The figure below shows the structure of the dentryTree in mp2. The process of locating the mp by inode is similar to the read flow and is not drawn again.

[Figure]

Delete Files

[Figure]

Suppose you need to delete the /A/file5 just created; the key steps are as follows:

1. Likewise, start the lookup from the root directory and find the file's parent directory. Because the client has cached the dentry of directory A, the mp where directory A resides can be found directly.

2. Delete the dentry of file5 in mp2: {ParentId: 1001, Name: "file5", Inode: 2003}

3. Send the inode deletion request to mp3 where file5 resides; after mp3 reaches consensus through Raft, the inode's NLink is decremented

4. If the inode's NLink value drops to 0, the inode is marked as deleted, added to the mp's freeList, and the file's data is released asynchronously in the background

5. Each mp starts a background goroutine that periodically cleans up the inodes in the freeList

a. Take a batch of inodes from the freeList each time and execute the deletion; for the deletion of file data, see CubeFS Source Code Interpretation - Replica Subsystem[3]

b. Before executing the deletion, write the deleted inode numbers to the INODE_DEL file in the mp's root directory for persistence

c. The ExtentKeys of each inode are aggregated by DP to reduce the number of deletion requests sent to each DP (a small aggregation sketch follows)
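The hypothetical helper below (not CubeFS code) illustrates step c: it groups the ExtentKeys of inodes pending deletion by PartitionId, so one delete request can be sent per data partition instead of one per extent fragment.

// groupByPartition groups extent fragments by their data partition id.
func groupByPartition(eks []ExtentKey) map[uint64][]ExtentKey {
	grouped := make(map[uint64][]ExtentKey)
	for _, ek := range eks {
		grouped[ek.PartitionId] = append(grouped[ek.PartitionId], ek)
	}
	return grouped
}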

Example of MetaNode Storage Layout

The directory structure stored on a MetaNode consists of two top-level directories, meta and Raft. meta stores the metadata of all mps on the MetaNode, while Raft records the Raft information of each mp. Directories prefixed with expired_partition_ are mps that have been migrated to other MetaNodes, and a background task periodically cleans up these expired_partition_ directories. The expired_raft_ entries were added after v3.3.0 and may not be present in earlier versions.

├── meta
│   ├── constcfg
│   ├── expired_partition_357   // partition migrated to another MetaNode
│   ├── expired_partition_358
│   ├── partition_308
│   ├── partition_309
│   ├── partition_310
│   └── partition_314
└── Raft
    ├── expired_raft_357
    ├── expired_raft_358
    ├── 308
    ├── 309
    ├── 310
    └── 314

meta directory

constcfg

constcfg saves the Raft listen, replication, and heartbeat port information. With Multi-Raft, all mps on a MetaNode send heartbeats through the same set of ports:

{"listen":"17210","Raft ReplicaPort":"17240","Raft HeartbetPort":"17230"}

partition

The metadata directory of each mp contains the following:

├── EXTENT_DEL_V2_0
├── inode_DEL
├── .snapshot
├── .snapshot_backup
├── meta
├── .meta
└── snapshot
    ├── apply
    ├── dentry
    ├── extend
    ├── inodes
    ├── multipart
    └── .sign

inode_DEL records the numbers of deleted inodes.

EXTENT_DEL_V2_0

Records deleted extent information, which may be ExtentKey or ObjExtentKey entries.

meta

meta records the metadata of the current mp itself: mainly the inode range it manages, the volume it belongs to, and the nodes of its replica group:

{
    "end": 9223372036854775807,//The maximum inode value, the last partition is infinite
    "partition_id": 418,
    "partition_type": 0,
    "peers": [
        {
            "addr": "127.0.0.1:17210",
            "id": 4
        },
        {
            "addr": "127.0.0.2:17210",
            "id": 8
        },
        {
            "addr": "127.0.0.3:17210",
            "id": 116
        }
    ],
    "start": 33554433, // minimum inode value
    "vol_name": "smux-test"
}

start and end define the inode interval; the inode ranges of the mps within the same volume are contiguous. The figure below shows the metadata ranges that three mps are responsible for on a certain volume.

[Figure]

snapshot

The snapshot directory saves the following information:

  • apply: for example 70466|33554433; the value before the '|' is Raft's maximum apply index, and the cursor value after it is the largest inode id that has been allocated (a small parser for this format is sketched after the listing below)
  • If the cursor exceeds the end value recorded in meta, the inodes of the current mp are exhausted and the mp switches to read-only
  • The cursor is persisted periodically (every minute by default); later versions optimize this to persist it together with Raft snapshots
  • dentry: dentry information, stored as a BTree
  • inode: inode information, stored as a BTree
  • extend: extended attributes, currently used by object storage, stored as a BTree
  • multipart: multipart upload information for object storage, stored as a BTree
  • .sign: the crc32 values of dentry, extend, inode, and multipart, used for verification
├── apply       // example: 70466|33554433
├── dentry
├── extend
├── inodes
├── multipart
└── .sign       // example: 206325109 0 3523407757 3523407757
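The apply format shown above can be parsed as sketched below; this is an illustration only, assuming the "<apply index>|<cursor>" layout described in the list, not the actual CubeFS parsing code.

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseApply splits "<raft apply index>|<inode cursor>" into its two values.
func parseApply(s string) (applyIndex, cursor uint64, err error) {
	parts := strings.Split(strings.TrimSpace(s), "|")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("unexpected apply format: %q", s)
	}
	if applyIndex, err = strconv.ParseUint(parts[0], 10, 64); err != nil {
		return 0, 0, err
	}
	if cursor, err = strconv.ParseUint(parts[1], 10, 64); err != nil {
		return 0, 0, err
	}
	return applyIndex, cursor, nil
}

func main() {
	fmt.Println(parseApply("70466|33554433")) // 70466 33554433 <nil>
}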

Raft directory

Saves Raft metadata and WAL logs:

  • META saves term, vote, commit, snapTerm, snapIndex and other information
.
├── 0000000000000001-0000000000000001.log
└── META

References

[1] CubeFS Storage Technology Revealed - Metadata Management

[2] Google's open-source Go version of BTree

[3] CubeFS Source Code Interpretation - Replica Subsystem