Preface

Today, I would like to introduce a new member of the CubeFS storage family: the kernel-mode client. The Linux kernel is a beautifully elegant piece of code. Let's briefly discuss the relationship between the kernel and the client; experts in the Linux kernel can skip this article.

A core module of the Linux kernel is the Virtual File System (VFS), which provides a set of interface functions. Storage drivers only need to implement these interface functions to handle data reading and writing. In simple terms, the Linux kernel is like a finely tuned machine. To become a part of its internal components, you must precisely implement the VFS interfaces it exposes.

To put it simply, the CubeFS kernel client essentially interfaces with the VFS and sends data to the storage servers. Of course, this is a very simplified explanation. In reality, the author of the kernel client (who is not me) implemented the replica mode based on the design of the user-mode client, which is truly impressive. The work involves handling JSON packets, arrays, linked lists, and kernel-specific function interfaces, making it quite complex.

I want to emphasize two points here.

  1. Implementing the VFS interface to integrate with the elegant Linux kernel.
  2. Connecting to other components such as the master, metanode, and datanode through the network.

Framework

When introducing a piece of software, we can't overlook the overall framework, as it provides a comprehensive top-down understanding. For the CubeFS kernel client, however, the framework is similar to that of other file systems such as ext3 and NFS; the difference is that the CubeFS kernel client stores and retrieves data on a CubeFS cluster. If that is all you need, you can skip this introduction.

[Figure: position of the CubeFS kernel client within the kernel architecture]

In the diagram above, the green box labeled "CubeFS Client" indicates the position of the CubeFS kernel client within the kernel system. Similar to other Linux kernel file systems, it serves as a submodule beneath the VFS.

Modules

Next, we'll dive into the various modules of the kernel client. Given the limited length of this article and the general preference for concise discussions, we'll focus on the key modules involved in the data flow. If you're interested in other aspects, feel free to explore the source code on your own. Reading the source code is undoubtedly the deepest way to understand a piece of software, though it can be quite daunting.

The CubeFS kernel client module consists of the following main components:

[Figure: main modules of the CubeFS kernel client]

Let's focus on the modules highlighted in color in the diagram. Of course, a complete driver also includes various other details, such as JSON parsing and B-tree operations. You read that correctly: the kernel client is essentially a driver. It serves as the Linux driver for the CubeFS storage cluster.

Detailed Design

Initialization

The module initialization of the kernel client is similar to that of other kernel modules; the difference lies in the implementations tailored to the functionality of CubeFS. Readers who are not interested can feel free to skip this section.

When introducing a kernel module, it's essential to mention the module's init and exit functions. This part of the code is identical to that of other Linux modules and serves as the best entry point for beginners looking at the code. The initialization and unloading code for the CubeFS kernel client is located in cfs_core.c. The initialization function is cfs_init, and the unloading function is cfs_exit.

This section of code consists of initialization routines that other modules also require at the start, so we won't elaborate on it further.
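For orientation, here is a minimal sketch of what these two entry points look like, assuming the usual kernel module pattern; it keeps only the file-system registration, and everything else the real cfs_init and cfs_exit do (logging, caches, and so on) is omitted.

#include <linux/module.h>
#include <linux/fs.h>

extern struct file_system_type cfs_fs_type;     /* defined by the client (shown later) */

/* Sketch only: the real cfs_init/cfs_exit in cfs_core.c also set up and
 * tear down caches, logging, and other subsystems omitted here. */
static int __init cfs_init(void)
{
    return register_filesystem(&cfs_fs_type);   /* expose "cubefs" to mount(2) */
}

static void __exit cfs_exit(void)
{
    unregister_filesystem(&cfs_fs_type);        /* remove "cubefs" from the kernel */
}

module_init(cfs_init);
module_exit(cfs_exit);
MODULE_LICENSE("GPL");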

One thing worth noting is the RDMA module: it needs caches that not every user requires. RDMA is enabled, and its caches are created, only when the user mounts with enable_rdma turned on. For this reason, RDMA cache initialization is implemented in the mount path rather than in the module init function.
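As a sketch of that placement, assuming a mount-option flag corresponding to enable_rdma; the structure cfs_mount_options and the helper cfs_rdma_create_caches are hypothetical names used only for illustration.

#include <linux/types.h>

struct cfs_mount_options {
    bool enable_rdma;                   /* set when the user mounts with enable_rdma */
    /* other options omitted */
};

int cfs_rdma_create_caches(void);       /* hypothetical helper */

/* Hypothetical sketch: RDMA caches are created on mount, and only when
 * the user asked for RDMA, instead of unconditionally at module init. */
static int cfs_setup_rdma_if_needed(const struct cfs_mount_options *opts)
{
    if (!opts->enable_rdma)
        return 0;                       /* nothing to do for non-RDMA mounts */
    return cfs_rdma_create_caches();
}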

VFS Interface

The VFS interface should not be the main focus of our discussion, but any mention of storage drivers has to touch on the VFS. The core job of the kernel client is to implement a series of VFS interfaces so that it can be registered with and mounted into the kernel. This is akin to building a car: having four wheels is essential, and here the VFS interfaces are those four wheels.

Readers can skip the technical details of this section.

Like other file systems that sit beneath the VFS, the CubeFS kernel client implements a series of function interfaces. Through these interfaces, the client is integrated into the Linux kernel as a submodule under the VFS. Not every function in these operation tables has to be implemented; only the main ones are needed. Because the definitions change as the Linux kernel source evolves, we present the definitions corresponding to the 3.10 kernel; other versions are similar, with only minor differences.

First, we have the function interfaces for read and write operations, as well as direct I/O.

const struct address_space_operations cfs_address_ops = {
    .readpage = cfs_readpage,
    .readpages = cfs_readpages,
    .writepage = cfs_writepage,
    .writepages = cfs_writepages,
    .write_begin = cfs_write_begin,
    .write_end = cfs_write_end,
    .set_page_dirty = __set_page_dirty_nobuffers,
    .invalidatepage = NULL,
    .releasepage = NULL,
    .direct_IO = cfs_direct_io,
};

Next, we have the file operation interfaces.

const struct file_operations cfs_file_fops = {
    .open = cfs_open,
    .release = cfs_release,
    .llseek = generic_file_llseek,
    .aio_read = generic_file_aio_read,
    .aio_write = generic_file_aio_write,
    .mmap = generic_file_mmap,
    .fsync = cfs_fsync,
    .flush = cfs_flush,
};

And the interfaces for file inodes.

const struct inode_operations cfs_file_iops = {
    .permission = cfs_permission,
    .setattr = cfs_setattr,
    .getattr = cfs_getattr,
};

Finally, we have the interfaces for the superblock and the definition of the CubeFS file system.

const struct super_operations cfs_super_ops = {
    .alloc_inode = cfs_alloc_inode,
    .destroy_inode = cfs_destroy_inode,
    .drop_inode = cfs_drop_inode,
    .put_super = cfs_put_super,
    .statfs = cfs_statfs,
    .show_options = cfs_show_options,
};

struct file_system_type cfs_fs_type = {
    .name = "cubefs",
    .owner = THIS_MODULE,
    .kill_sb = cfs_kill_sb,
    .mount = cfs_mount,
};

We will skip over other definitions such as directory operations, directory inode interfaces, link file interfaces, special file interfaces, and directory cache. In summary, implementing a kernel file system involves creating this series of VFS interfaces. By configuring these interface functions during module initialization and integrating them into the kernel, the file system connects seamlessly with the Linux kernel file system.
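To illustrate how these tables take effect, here is a hedged sketch of the usual wiring: when the client creates an inode, it points the inode's operation fields at the tables above. The helper name cfs_init_file_inode is an assumption for illustration, not necessarily the client's actual function.

#include <linux/fs.h>

extern const struct inode_operations cfs_file_iops;
extern const struct file_operations cfs_file_fops;
extern const struct address_space_operations cfs_address_ops;

/* Hypothetical sketch: attach the operation tables to a newly created
 * regular-file inode so that VFS calls are routed into CubeFS. */
static void cfs_init_file_inode(struct inode *inode)
{
    inode->i_op = &cfs_file_iops;                /* permission/getattr/setattr  */
    inode->i_fop = &cfs_file_fops;               /* open/release/read/write/... */
    inode->i_mapping->a_ops = &cfs_address_ops;  /* page cache and direct I/O   */
}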

Mounting

The mounting process is also very standard. When the CubeFS module is loaded into the kernel, it registers a file system type named cubefs. The superuser (root) can then use the mount command to mount the network storage. This process is no different from that of other file systems. Just as cars come in many brands but are driven the same way, mounting here works the same way it does for any other file system.

The kernel client registers the CubeFS file system type within the initialization function cfs_init. The code for this is as follows:

    ret = register_filesystem(&cfs_fs_type);
    if (ret < 0) {
        cfs_pr_err("register file system error %d\n", ret);
        goto exit;
    }

The definition of cfs_fs_type is as follows:

struct file_system_type cfs_fs_type = {
    .name = "cubefs",
    .owner = THIS_MODULE,
    .kill_sb = cfs_kill_sb,
    .mount = cfs_mount,
};

Here, cfs_mount is the function called during the mount process, while cfs_kill_sb is the function called during unmounting. Below, we briefly describe the mounting process.

[Figure: mounting process]

Among these, cfs_fs_fill_super is the function responsible for filling the superblock. Once the superblock information is available, further operations can be performed on the mounted directory.
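The general shape of such a function, assuming CubeFS follows the common kernel pattern, looks like the sketch below; CFS_ROOT_INO and cfs_iget are illustrative names, not the client's real ones.

#include <linux/fs.h>
#include <linux/err.h>

#define CFS_ROOT_INO 1                           /* illustrative root inode number */

extern const struct super_operations cfs_super_ops;
struct inode *cfs_iget(struct super_block *sb, unsigned long ino);  /* hypothetical */

/* Hypothetical sketch of the usual fill_super pattern: install the
 * super_operations, build the root inode, and attach the root dentry. */
static int cfs_fs_fill_super(struct super_block *sb, void *data, int silent)
{
    struct inode *root;

    sb->s_op = &cfs_super_ops;

    root = cfs_iget(sb, CFS_ROOT_INO);           /* fetch the root inode from the cluster */
    if (IS_ERR(root))
        return PTR_ERR(root);

    sb->s_root = d_make_root(root);              /* root dentry of the mount */
    if (!sb->s_root)
        return -ENOMEM;
    return 0;
}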

After setting up the superblock, operations on the mounted directory are handled through VFS interfacing functions, such as:

  • cfs_open for opening files
  • cfs_release for closing files
  • cfs_readpages for reading file contents
  • cfs_writepages, cfs_write_begin, and cfs_write_end for writing file contents
  • cfs_direct_io for direct read and write operations

These functions facilitate the interaction between the CubeFS kernel client and the VFS.

IO Process

Data writing and reading can be considered the core paths of a storage file system, much like the fuel in a vehicle that flows through pipes into the engine, ignites, and generates power. These last two detailed design sections are the most complex parts of the kernel client. If you are interested in this part of the process, I encourage you to read the corresponding code for a deeper understanding.

In simple terms, there are two ways to write data to the kernel client.

  1. In the first method, the data is already stored in a page (the page cache).
  2. In the second method, the data is still in user space, and the driver is given only the memory address and length. This is commonly referred to as the direct write mode.

Regardless of the method used, the driver calculates the maximum length for each transmission (1 MB for tiny mode and 128 KB for normal mode) and then sends the data in chunks of that size, as the sketch below illustrates.
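The chunking amounts to the small loop below; the sizes come from the text above, while cfs_send_chunk and the tiny flag are hypothetical stand-ins for the client's real packet-sending path.

#include <linux/kernel.h>
#include <linux/types.h>

#define CFS_TINY_MAX_LEN    (1024 * 1024)   /* 1 MB per send in tiny mode     */
#define CFS_NORMAL_MAX_LEN  (128 * 1024)    /* 128 KB per send in normal mode */

int cfs_send_chunk(const char *buf, size_t len);   /* hypothetical sender */

/* Illustrative sketch: split a buffer into chunks no larger than the
 * per-mode maximum and send them one after another. */
static ssize_t cfs_send_in_chunks(const char *buf, size_t len, bool tiny)
{
    size_t max = tiny ? CFS_TINY_MAX_LEN : CFS_NORMAL_MAX_LEN;
    size_t sent = 0;

    while (sent < len) {
        size_t n = min(max, len - sent);
        int ret = cfs_send_chunk(buf + sent, n);
        if (ret < 0)
            return ret;
        sent += n;
    }
    return sent;
}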

The difference lies in how the data is sent:

  1. In the first method, only a single 4 KB page is sent at a time.
  2. In the second method, a temporary kernel buffer is allocated, the user data is copied into it, and the data is then sent out according to the maximum length.

For those who are not interested in the technical details, the description above is sufficient; there is no need to delve into the lengthy data-flow diagrams below.

The I/O read and write process in CubeFS varies based on factors such as whether direct I/O is enabled, the type of read/write (random or sequential), and whether the files are regular or small. However, the basic implementation principles remain consistent.

The general steps are as follows:

  1. Update Cache: Start by updating the cache.
  2. Generate Extent List: Create a list of extents.
  3. Segment Length: For small files, segment to a maximum length of 1MB; for regular files, segment to 128KB.
  4. Send Request: Send the read or write request.
  5. Receive Response: Finally, receive the response packet and update the cache and metadata (for the write process).

This structured approach ensures efficient data handling within the CubeFS kernel client.

Since these processes are quite similar, we will focus on the direct write process.

The CubeFS kernel client registers cfs_direct_io in the structure struct address_space_operations cfs_address_ops, which is defined as follows:

const struct address_space_operations cfs_address_ops = {
    .readpage = cfs_readpage,
    .readpages = cfs_readpages,
    .writepage = cfs_writepage,
    .writepages = cfs_writepages,
    .write_begin = cfs_write_begin,
    .write_end = cfs_write_end,
    .direct_IO = cfs_direct_io,
};

The entry function for both read and write operations is cfs_direct_io. The actual implementation of this function is cfs_extent_direct_io. In cfs_extent_direct_io, the cache is first updated, and then an extent information list is created based on the cache data. Depending on the type of file operation, data is either written or read sequentially.

For the write process, the data flow involves copying data from user space to kernel space, and then sending it out through the network socket.

The user space address for the data being written is represented by the parameter iov, the file offset is specified by offset, and the file information is contained within the iocb. The length of the data is calculated using iov_length(iov, nr_segs).

static ssize_t cfs_direct_io(int type, struct kiocb *iocb,
                 const struct iovec *iov, loff_t offset,
                 unsigned long nr_segs);

In kernel versions that support iov_iter, the parameters are more streamlined. The file information is still in iocb, while the offset is specified by offset. The starting address of the user space and the length are contained within the iter structure.

static ssize_t cfs_direct_io(struct kiocb *iocb, struct iov_iter *iter,
                 loff_t offset)
{
    struct file *file = iocb->ki_filp;
    struct inode *inode = file_inode(file);

    return cfs_extent_direct_io(CFS_INODE(inode)->es, iov_iter_rw(iter), iter, offset);
}

The function cfs_extent_write_iter implements the main functionality for direct writes. Like sequential writes, direct writes fall into three categories: random write, small file mode, and regular mode. In summary, the processes for direct reads/writes and sequential reads/writes share many similarities. The main difference is that direct reads/writes copy user-space data directly into memory allocated in the kernel, while sequential reads/writes first cache the data in pages through the VFS. As a result, direct read/write socket packets can reach the maximum length of 1MB, whereas sequential read/write packets are limited to 4KB (the size of a page).
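To make that data flow concrete, here is a heavily simplified sketch of the direct-write idea: copy user data from the iov_iter into a kernel buffer, at most one packet's worth at a time, then send each chunk at the right file offset. This is not the client's cfs_extent_write_iter; the helper cfs_send_write_packet and the simple error handling are assumptions for illustration.

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/uio.h>

int cfs_send_write_packet(const char *buf, size_t len, loff_t off);  /* hypothetical */

/* Hypothetical sketch of the direct-write data flow: user space ->
 * kernel buffer -> network, chunked to the per-packet maximum. */
static ssize_t cfs_sketch_direct_write(struct iov_iter *iter, loff_t offset)
{
    size_t max_len = 128 * 1024;                 /* normal-mode packet limit */
    char *buf = kmalloc(max_len, GFP_KERNEL);
    ssize_t done = 0;

    if (!buf)
        return -ENOMEM;

    while (iov_iter_count(iter) > 0) {
        size_t n = min(max_len, iov_iter_count(iter));

        if (copy_from_iter(buf, n, iter) != n) { /* copy user data into the kernel */
            done = done ? done : -EFAULT;
            break;
        }
        if (cfs_send_write_packet(buf, n, offset + done) < 0) {
            done = done ? done : -EIO;
            break;
        }
        done += n;
    }
    kfree(buf);
    return done;
}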

[Figure: direct read/write flow]

Flow Design

This section is a brief explanation for those who enjoy technical details; everyone else can feel free to skip it. Among the many car enthusiasts out there, only a small number study the aerodynamics behind a car's shape.

Sequential writes and reads both follow a pipelined process. Since the two share many similarities, we will only describe the flow for sequential writes.

Each cfs_extent_writer represents a pipeline for sequential writes. Its important member variables include:

struct cfs_extent_writer {
    struct list_head tx_packets;        // send queue
    struct list_head rx_packets;        // receive (awaiting-reply) queue
    atomic_t write_inflight;            // number of packets in flight
    volatile unsigned flags;            // writer status flags
    struct cfs_extent_writer *recover;  // recovery writer
    // other members omitted
};

The functions it includes and their functionalities are as follows:

cfs_extent_writer_new       // create a writer
cfs_extent_writer_release   // release a writer
cfs_extent_writer_flush     // flush pending packets
cfs_extent_writer_request   // enqueue a packet into the writer
extent_writer_tx_work_cb    // request-packet processor
extent_writer_rx_work_cb    // reply-packet processor
extent_writer_recover       // recover the writer

The schematic diagram of its main functionalities is as follows:

[Figure: extent writer pipeline]

Packets are handled as a pipeline by running extent_writer_tx_work_cb and extent_writer_rx_work_cb concurrently: the first stage sends packets, while the second stage waits for their responses.
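The bodies below are a schematic sketch of what the two stages do, not the actual implementation of extent_writer_tx_work_cb and extent_writer_rx_work_cb; cfs_packet, cfs_packet_send, and cfs_packet_wait_reply are illustrative names, and locking and error handling are omitted.

#include <linux/list.h>
#include <linux/atomic.h>

struct cfs_packet {                              /* illustrative packet type */
    struct list_head list;
};

int cfs_packet_send(struct cfs_extent_writer *w, struct cfs_packet *p);        /* hypothetical */
int cfs_packet_wait_reply(struct cfs_extent_writer *w, struct cfs_packet *p);  /* hypothetical */

/* Sketch of the send stage: push queued packets to the socket and hand
 * them over to the receive stage. */
static void sketch_tx_stage(struct cfs_extent_writer *w)
{
    struct cfs_packet *pkt, *tmp;

    list_for_each_entry_safe(pkt, tmp, &w->tx_packets, list) {
        cfs_packet_send(w, pkt);
        list_move_tail(&pkt->list, &w->rx_packets);
    }
}

/* Sketch of the receive stage: wait for each reply and retire the packet. */
static void sketch_rx_stage(struct cfs_extent_writer *w)
{
    struct cfs_packet *pkt, *tmp;

    list_for_each_entry_safe(pkt, tmp, &w->rx_packets, list) {
        cfs_packet_wait_reply(w, pkt);
        list_del(&pkt->list);
        atomic_dec(&w->write_inflight);          /* one fewer packet in flight */
    }
}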

Due to the introduction of a recovery process, the pipeline has become more complex. To simplify the flow, we no longer use the pipeline approach in the recovery process. Instead, we wait for each packet to be successfully returned after sending it. While this method is less efficient than the pipeline approach, it simplifies the design and code, thereby reducing potential errors. In the recovery process, which occurs in only a few scenarios, this design choice is considered reasonable.
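As a sketch of that simplified path, reusing the illustrative helpers from the sketch above: each packet is sent and its reply awaited before the next one goes out.

/* Hypothetical sketch of the recovery path: strictly one packet at a
 * time, trading throughput for simpler error handling. */
static int sketch_recover_one(struct cfs_extent_writer *recover,
                              struct cfs_packet *pkt)
{
    int ret;

    ret = cfs_packet_send(recover, pkt);         /* send the request */
    if (ret < 0)
        return ret;
    return cfs_packet_wait_reply(recover, pkt);  /* wait for its reply before the next send */
}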

Exercise

To truly understand a piece of software, reading the code is the most in-depth method, but it can be time-consuming and exhausting. A more convenient approach is to set up an environment and use the software, gaining hands-on experience. Using the software is always the fastest way to learn.

It's similar to driving a car: we don't need to study the scientific principles behind its mechanics; we just need to know how to drive it to reach our destination.

Before compiling and loading, interested readers can refer to the documentation in the source code, specifically Chinese_readme and readme, which provide detailed instructions on compiling and using the kernel module. For the sake of completeness in this article, we will briefly describe the compilation, loading, and testing processes.

Compiling

I encourage those using the kernel-mode client to compile it themselves. Different kernels have variations in details, and our software only supports a subset of kernel versions. Unfortunately, we don't have the resources to adapt to many kernel versions. This approach is consistent with other kernel storage clients, and I hope for your understanding.

Of course, the validated Linux kernels are documented in detail in the README file.

In the cubefs/client_kernel directory, run the command ./configure. This command generates a macro definition file named config.h based on the current system configuration, containing several macros used for compilation. This step is only necessary during the first compilation; if the compilation environment remains unchanged, there is no need to run it again. The process is similar to the configure step used when building other Linux software from source.

Next, run make to generate the kernel module cubefs.ko. If RDMA interfaces are not needed, you can use the command make no_rdma to obtain a module that does not include RDMA support. The current kernel client RDMA interface is not connected to the released version release-3.4.0-beta_rdma, but to an internal, unreleased version.

To clean the build environment, run make clean.

Loading

With the CubeFS module loaded, to mount the volume whc with owner whc from a cluster whose master addresses are 10.177.182.171:17010,10.177.80.10:17010,10.177.80.11:17010, the command is:

mount -t cubefs -o owner=whc,dentry_cache_valid_ms=5000,attr_cache_valid_ms=30000 //10.177.182.171:17010,10.177.80.10:17010,10.177.80.11:17010/whc /mnt/cubefs

Make sure to replace /mnt/cubefs with the actual directory where you want to mount the file system.

Usage

After successfully running the command above, readers can perform various file operations in the directory /mnt/cubefs. However, this assumes that the entire server-side system has been properly set up.

Conclusion

The kernel client is a new addition to the CubeFS software and currently offers limited functionality and applications. In the future, other developers may enhance it with an EC-mode implementation and address remaining bugs. It will take some time and refinement for the software to mature.

Building upon the kernel client, we can further provide kernel RDMA functionality, and ultimately, GDS functionality. Additionally, the kernel client has performance advantages because it does not require FUSE for data forwarding.

Because buffered I/O sends packets in 4 KB page units, benchmarked speeds there may not show significant improvement, but direct reads and writes of large files do have a clear performance advantage.

Author

Wu Huocheng, CubeFS Maintainer, is currently involved in the development of the kernel-mode client and RDMA for the CubeFS project.

Abstract

This article briefly introduces the implementation mechanism of the CubeFS kernel-mode client and is intended as a straightforward technical read. We only provide a concise overview of the overall data logic framework. Readers who are interested can refer to the pipeline diagram for a deeper study of the code.
