Ceph* is a widely used distributed-storage solution whose performance varies greatly across configurations and environments. Many production clusters are deployed on hard disks, and different workload types bring different performance requirements.
We built a Ceph cluster based on the Open-CAS caching framework and adjusted its characteristics so that the system still copes well with large-scale sequential-access workloads while providing much better support for trading-system applications based on small-block random access.
Random access on an HDD is limited by the seek time of the magnetic head, so random-access performance drops severely compared with an SSD. For a 10,000 RPM mechanical hard disk, the IOPS (input/output operations per second) of random reads and writes is only about 350. Ceph clusters based on mechanical hard drives cost less and suit sequential access to large-scale data, but they are not suitable for small-block data access in OLTP (online transaction processing) workloads. How can we improve the performance of small random accesses at a competitive cost? We propose the Open-CAS caching framework to accelerate Ceph OSD nodes.
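The seek-time limit can be illustrated with a first-order model. The seek time and per-access figures below are assumptions for illustration, not measurements from this article; this queue-depth-1 model gives a lower bound, and real drives reach higher figures (such as the ~350 cited above) with command queueing and deeper queues.

```python
# Rough first-order model of HDD random IOPS: each random access pays
# an average seek plus half a rotation before any data moves.

def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half a revolution, in milliseconds."""
    return (60_000.0 / rpm) / 2

def random_iops(avg_seek_ms: float, rpm: float) -> float:
    """IOPS at queue depth 1; command queueing on real drives raises this."""
    service_time_ms = avg_seek_ms + rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms

print(round(rotational_latency_ms(10_000), 1))  # 3.0 ms per access
print(round(random_iops(3.8, 10_000)))          # 147 IOPS at QD1, assuming a 3.8 ms seek
```

An SSD pays no mechanical cost per access, which is why its random IOPS are orders of magnitude higher.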
The baseline and optimization solutions are shown in Figure 1 below.
Figure 1: Ceph cluster performance optimization framework based on Open-CAS
- Baseline configuration: An HDD is used as a data partition of BlueStore, and metadata (RocksDB and WAL) are deployed on Intel® Optane™ SSDs.
- Optimized configuration: An HDD and NVMe* SSD are combined into a new CAS device through Open-CAS software as the data partition of BlueStore, and metadata (RocksDB and WAL) are deployed on Intel Optane SSDs.
Open-CAS is an open-source cache acceleration solution initiated by Intel. By loading its modules into the kernel, a high-speed media device used as a cache is "merged" with a slow device into a new CAS device, improving the system's overall device read and write performance.
The hierarchical structure of Open-CAS in the system is shown in Figure 2 below. OCF (the core of Open-CAS, a high-performance block storage cache library written in C) sits below the filesystem layer and processes IO requests at the block device layer.
Figure 2: Open-CAS framework
This article focuses on how to configure and optimize Open-CAS and build a Ceph cluster with it.
We built a distributed storage cluster based on Ceph. On the server side, the Ceph cluster is deployed by CeTune to provide storage services. On the client side, we deployed the Vdbench benchmark to verify our expectations. Vdbench is a popular IO benchmark tool written in Java* that supports multi-process concurrent read and write tests on block devices in various modes.
Table 1: Ceph Cluster Configuration
|– CPU||Server: 2 * Intel® Xeon® Platinum 8260L CPU @ 2.40 GHz; Client: Intel® Xeon® Gold 6252 CPU @ 2.10 GHz|
|– Memory||Server: 3 * 192 GB|
|– Network Switch||Bandwidth 100 Gb|
|– Intel Optane SSD||2 * Intel P4800X 375 GB|
|– NVMe SSD||6 * Intel P4510 1 TB|
|– HDD||12 * 1 TB 1W SAS HDD|
|– OS||Ubuntu* 18.04.5 LTS|
|– Kernel Version||Linux* version 5.4.0-48-generic|
|– Ceph Version||Nautilus 14.2.11|
|– Ceph Cluster||2 Server node, 1 Client node, replica = 2|
|– Open-CAS Version||20.09.0.0362.master|
|– Vdbench Version||5.04.07|
|· For each Ceph Node|
|– Intel Optane SSD||2 * 375 GB|
|– NVMe SSD||3 * 1.0 TB (6 * 250G)|
|– HDD||6 * 1.0 TB|
|– OSD Number||6 (HDD:OSD = 1:1)|
OPTIMIZED CONFIGURATION OF OPEN-CAS
For the optimized solution, we used Open-CAS for the data partition. Open-CAS can be used after compilation and installation: https://open-cas.github.io/guide_introduction.html.
The optimized configuration in Open-CAS is shown below:
# casadm -S -d /dev/nvme2n1p1 -c wb --force    // Create a new cache device; returns the cache ID
# casadm -A -i 1 -d /dev/sda                   // Add the backend device to the cache device and "merge" into a new CAS device
# casadm -L                                    // View the current Open-CAS configuration
type    id   disk             status    write policy   device
cache   1    /dev/nvme2n1p1   Running   wb             -
└core   1    /dev/sda         Active    -              /dev/cas1-1
cache   2    /dev/nvme2n1p2   Running   wb             -
└core   1    /dev/sdb         Active    -              /dev/cas2-1
cache   3    /dev/nvme3n1p1   Running   wb             -
└core   1    /dev/sdc         Active    -              /dev/cas3-1
cache   4    /dev/nvme3n1p2   Running   wb             -
└core   1    /dev/sdd         Active    -              /dev/cas4-1
cache   5    /dev/nvme4n1p1   Running   wb             -
└core   1    /dev/sde         Active    -              /dev/cas5-1
cache   6    /dev/nvme4n1p2   Running   wb             -
└core   1    /dev/sdf         Active    -              /dev/cas6-1

// The following is configured for each cache ID
# casadm -X -n seq-cutoff -i 1 -j 1 -p always -t 16    // seq-cutoff always, with a 16 KB threshold
# casadm --io-class --load-config --cache-id 1 -f ioclass-config.csv    // only cache requests with request_size <= 128K
# cat ioclass-config.csv
IO class id,IO class name,Eviction priority,Allocation
0,unclassified,22,0
1,request_size:le:131072,1,1
# casadm -X -n cleaning-alru -t 1000 -i 1    // alru cleaning policy with the activity threshold set to 1s
We made several optimizations for the above configuration:
- Sequential read and write cut-off: When the IO stream of sequential read and write reaches a certain threshold, the cut-off is turned on. All subsequent read and write requests go directly to backend storage until random access causes the cut-off to be terminated.
- IO classification and priority definition: We classify random access by request size and only cache data blocks of 128K or less. This requires adding a category based on request_size and setting a high eviction priority so that the cached data is not evicted too easily.
- Parameter adjustment of the cleaning policy: Keep the default alru policy and shorten the background cleaning reaction time. There are two cleaning policies, alru and acp. acp is the more active strategy, but alru is more suitable when there is free space in the cache, because it does not consume too much bandwidth on background cleaning.
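The first two optimizations amount to a per-request caching decision. The sketch below is a hypothetical helper, not the OCF implementation; it uses the 16 KB cut-off threshold and 128 KB io-class limit from our configuration.

```python
# Sketch of the caching decision implied by our Open-CAS settings:
# sequential streams past the cut-off threshold bypass the cache, and
# only random requests of 128 KiB or less are cached.

SEQ_CUTOFF_THRESHOLD = 16 * 1024    # bytes of contiguous IO before cut-off
MAX_CACHED_REQUEST   = 128 * 1024   # io-class rule: request_size:le:131072

def should_cache(request_size: int, contiguous_bytes: int) -> bool:
    if contiguous_bytes >= SEQ_CUTOFF_THRESHOLD:
        return False                # sequential stream: go to backend HDD
    return request_size <= MAX_CACHED_REQUEST

print(should_cache(4 * 1024, 0))          # True: small random write
print(should_cache(4 * 1024, 64 * 1024))  # False: part of a sequential stream
print(should_cache(512 * 1024, 0))        # False: larger than the io-class limit
```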
CEPH CLUSTER SETUP AND CONFIGURATION
We deployed a three-node Ceph cluster through the CeTune tool; two nodes are Ceph storage nodes, and one node is the client.
CeTune is a framework for deployment, benchmarking, and configuration and adjustment of Ceph cluster performance. It integrates some benchmarking tools and provides various parameter data for system indicators.
You can refer to the official documentation to compile and install it. Before deployment, we need to take care of several configuration items:
- Storage-node OSD configuration follows the format osd:data:db_wal. Each OSD requires three devices, corresponding to the OSD itself, the OSD's data partition, and the OSD's metadata partition.
- Network configuration. There is a public network, a cluster network, and a separate Ceph monitor network.
The configuration file is in conf/all.conf; the main contents are shown below:
head=CephCAS1                   # Head node
list_server=CephCAS1,CephCAS2   # OSD nodes
list_client=CEPHCLIENT-01
list_mon=CephCAS1
disk_format=osd:data:db_wal
CephCAS1=/dev/cas1-1p1:/dev/cas1-1p2:/dev/nvme1n1p5,/dev/cas2-1p1:/dev/cas2-1p2:/dev/nvme1n1p6,/dev/cas3-1p1:/dev/cas3-1p2:/dev/nvme1n1p7,/dev/cas4-1p1:/dev/cas4-1p2:/dev/nvme1n1p8,/dev/cas5-1p1:/dev/cas5-1p2:/dev/nvme1n1p9,/dev/cas6-1p1:/dev/cas6-1p2:/dev/nvme1n1p10
CephCAS2=/dev/cas1-1p1:/dev/cas1-1p2:/dev/nvme1n1p5,/dev/cas2-1p1:/dev/cas2-1p2:/dev/nvme1n1p6,/dev/cas3-1p1:/dev/cas3-1p2:/dev/nvme1n1p7,/dev/cas4-1p1:/dev/cas4-1p2:/dev/nvme1n1p8,/dev/cas5-1p1:/dev/cas5-1p2:/dev/nvme1n1p9,/dev/cas6-1p1:/dev/cas6-1p2:/dev/nvme1n1p10
…
public_network=192.168.10.0/24    # Based on the 100 Gb NIC
monitor_network=192.168.10.0/24
cluster_network=192.168.11.0/24
Then we executed the deployment script and created a storage pool.
# python run_deploy.py redeploy --gen_cephconf    // Deploy the Ceph cluster and wait for it to finish
…
# ceph osd pool create rbd 512 512    // With 12 OSDs, a PG num & PGP num of 512 is appropriate
# ceph osd pool set rbd size 2        // Set the pool replica count to 2
# ceph osd pool application enable rbd rbd --yes-i-really-mean-it
# rbd create --thick-provision --size 65536 rbd/rbd1 --image-format 2    // Create the rbd image and fill it automatically; this takes extra time
…
# rbd map rbd/rbd1                    // Map the rbd block device on the client side
PERFORMANCE AND COMPARATIVE ANALYSIS
VDBENCH BENCHMARK ANALYSIS
TEST ENVIRONMENT AND REPORT
The test configuration of Vdbench on the Ceph RBD block device is listed below.
|– DUT number||1|
|– Baseline Environment||12 * 1 TB HDD as OSD|
|– Optimization Environment||12 * 1 TB HDD and 12 * 250 GB NVMe SSD combination devices as OSD|
|– RBD volume size||20 * 64 GB|
|– Read and write mode||4K Random Read; 4K Random Write; 4K Random Read (70%) / Random Write (30%); 512K Sequential Read; 512K Sequential Write|
|– Queue depth||16|
|– Duration||600 seconds|
The test report is shown in the graphs below.
Figure 3: Raw performance results for random I/O
Figure 4: Raw performance results for sequential I/O
Figure 5: Open-CAS configuration comparison
As you can see from the figures above, with the Open-CAS optimized configuration, the IOPS of random reads (100% cache hits) and random writes increased by 119.54 times and 86.76 times, respectively, and the average latency was reduced to 0.8% and 1.2% of the baseline.
For sequential reads and writes, large sequential blocks pass directly to the backend, so we evaluated the performance loss introduced at the CAS layer: the bandwidth loss is 2.8% for sequential reads and 5.6% for sequential writes.
CONFIGURATION STRATEGY OF OPEN-CAS
The performance improvement for random access mainly comes from the random-access characteristics of NVMe SSDs. Open-CAS is the equivalent of glue, combining the random-access advantage of NVMe SSDs with the large capacity of HDDs.
For sequential access, multiple HDDs can be used to increase parallelism in a production environment. The more HDDs in a node, the better the performance of sequential access. This parallelism can offset the impact of limited performance of a single HDD.
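A back-of-envelope calculation shows why parallelism compensates for a single slow drive. The per-disk bandwidth and scaling-efficiency figures below are illustrative assumptions, not measurements from our benchmark.

```python
# Aggregate sequential bandwidth across many HDDs: striping data over
# multiple OSDs adds up to competitive throughput even though each
# individual drive is slow.

def aggregate_bandwidth_mb_s(per_hdd_mb_s: float, hdd_count: int,
                             efficiency: float = 0.8) -> float:
    """Assumes near-linear scaling discounted by an efficiency factor."""
    return per_hdd_mb_s * hdd_count * efficiency

# 12 HDDs at an assumed ~150 MB/s each, with 80% scaling efficiency
print(aggregate_bandwidth_mb_s(150, 12))  # 1440.0 MB/s in aggregate
```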
Figure 6: R/W performance comparison
|– Device / Mode||Average Latency (us)|
|– SAS HDD RandWrite||95640.00|
|– SAS HDD RandRead||107043.07|
|– NVMe SSD RandWrite||125.32|
|– NVMe SSD RandRead||130.99|
The mechanism used by Open-CAS to accelerate the write process mainly depends on whether there is free or clean cache space in the cache pool. When the cache pool is completely “contaminated” by dirty data, the cache pool is invalid for writes, which is also a common characteristic of the cache system.
The figure below shows the percentage of "dirty" cache during random writes. As time goes by, more and more data is written to the cache. Once the cache is entirely dirty, the data must be flushed back to backend storage, for data-consistency reasons, before cache space can be released.
Figure 7: Open-CAS dirty rate % increase
There are two solutions: increasing “income” and reducing “expenditure”.
- Increasing "income" means improving the cache flushing policy: if cached data is flushed to backend storage faster, the "contaminated" cache space is released sooner. In practice the improvement is limited, since the bottleneck during flushing is the HDD's poor random-write performance.
- Reducing "expenditure" means making optimal use of cache resources. Several configurations count here, such as cache-line settings, sequential access cut-off, and IO classification. The seq-cutoff configuration bypasses the cache and writes directly to backend storage when a sequential IO stream is detected; in this way, only randomly accessed data is cached.
The performance improvement of random read mainly depends on the read hit rate (read_hit_rate = read_cache_hit_number / total_read_access_number). We verified that 100% hit rate is the ceiling for performance improvement. In the actual environment, the read hit rate depends on many factors, such as the cache capacity and the data access mode. The read hit rate improvement requires a comprehensive design from the application to the cache layer.
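Using the device latencies from Figure 6, a simple weighted model illustrates why the hit rate dominates random-read performance. This is an assumption-laden sketch: it uses raw device latencies and ignores the Ceph software stack and network overhead that end-to-end measurements include.

```python
# Expected random-read latency as a function of cache hit rate:
# hits are served at NVMe SSD speed, misses at SAS HDD speed.

NVME_READ_US = 130.99      # NVMe SSD random-read latency (Figure 6)
HDD_READ_US  = 107043.07   # SAS HDD random-read latency (Figure 6)

def expected_read_latency_us(hit_rate: float) -> float:
    return hit_rate * NVME_READ_US + (1.0 - hit_rate) * HDD_READ_US

for hit_rate in (1.0, 0.99, 0.9, 0.5):
    print(hit_rate, round(expected_read_latency_us(hit_rate), 1))
```

Even a 1% miss rate roughly multiplies the expected latency by ten in this model, which is why cache capacity and access patterns matter so much.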
The following are read and write conditions in the Ceph OSD device (Open-CAS device combined with cache device and backend device) with different access modes, which is collected by the dstat tool:
- Due to the sequential cut-off setting, large-block sequential access goes directly to the backend storage.
Figure 8: Sequential read on OSD
Figure 9: Sequential write on OSD
- For random writes, if the write hits the cache or a free or clean cache block is available, the data can be cached. The following figure shows the ideal situation, in which all writes are cached.
Figure 10: Random write on OSD
- In the case of random read (100% hit rate), all data is read from the cache and uses the full cache capability.
Figure 11: Random read on OSD
CONCLUSIONS AND RECOMMENDATIONS
The configuration described in this article is applicable to general scenarios: workloads that combine large-scale sequential access with concentrated random access to the Ceph cluster.
For the sequential access pattern, the performance advantage of an SSD over an HDD is not obvious. We configured Open-CAS to cut off sequential access and send it directly to backend storage, which saves valuable cache-pool resources and maximizes the capacity available for small random-access data. Of course, some scenarios require short-term high performance and low latency; for these, try full read and write caching. The specific configuration depends on the number and concurrency of HDD and SSD (cache) devices.
For random write access with a small block size, the performance depends on the write hit rate and the capacity of clean or free cache blocks in the cache pool. Under ideal conditions (such as clean and free cache blocks or a very high write hit rate), in write-back mode, the write request directly returns from the cache to the application. When the cache pool is full of dirty cache blocks and the write hit rate is very low, the write performance will drop sharply because new write access data needs to be promoted to cache blocks and must wait for the old dirty blocks to be flushed to the backend storage.
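The collapse described above can be shown with a toy simulation. The rates below are arbitrary assumptions chosen only to illustrate the imbalance between cache-speed writes and HDD-speed cleaning; they are not measured values.

```python
# Minimal simulation of write-back saturation: incoming writes consume
# clean/free cache lines at SSD speed, while the cleaner frees lines
# only at HDD random-write speed.

def simulate(cache_lines: int, write_rate: int, clean_rate: int, steps: int):
    """Returns (dirty_lines, writes_stalled) after `steps` ticks."""
    dirty = 0
    stalled = 0
    for _ in range(steps):
        dirty = max(0, dirty - clean_rate)   # background flush to HDD
        if dirty + write_rate <= cache_lines:
            dirty += write_rate              # absorbed at cache speed
        else:
            stalled += 1                     # must wait for flushing
            dirty = cache_lines
    return dirty, stalled

# 10,000 cache lines, 500 incoming writes/tick, only 50 flushed/tick:
# the pool saturates and later writes stall behind the cleaner.
print(simulate(10_000, 500, 50, 100))
```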
For random read access with a small block size, the performance depends on the read hit rate. In case of a read miss, the cache needs to access the backend storage to fetch the data and promote it to the cache. In extreme cases, additional flushing of data to backend storage is required.
Here are several ways to improve the read hit rate on the cache side:
- Use different cache modes, such as write around mode, write invalid mode, etc.
- The promotion strategy can be set according to the application access pattern
- Use application pre-heating data
- Optimize the application access data model
For general scenarios, the Open-CAS reference configuration is: write-back mode, a 4KB cache line, sequential access cut-off always on, IO classification that caches only small random blocks (request size <= 128KB), and the default alru cleaning policy with more active parameter adjustments. In an ideal situation, random read IOPS increases by 119.54 times, random write IOPS increases by 86.76 times, and access latency is reduced to 0.8% and 1.2% of the baseline, respectively.
SUMMARY AND OUTLOOK
In an HDD-based storage environment, adding an NVMe SSD cache with Open-CAS on the Ceph storage nodes significantly improves the performance of Ceph client block storage for small-block random reads and writes. The replication mechanism in the Ceph storage cluster ensures the reliability of cached data, so write-back mode is suitable for a Ceph storage cluster.
Open-CAS supports multiple caching modes and has corresponding preferred configurations for read-only, write-only, and read-write modes. Using the Open-CAS caching framework can take advantage of the high speed random read and write of NVMe SSDs and the large capacity of HDDs.
There are several business scenarios that have reference value:
- Random read and write scenarios of small data blocks with low latency requirements, such as online transaction systems and banking services, which can take advantage of the high-speed random read and write capabilities of NVMe SSD.
- Large throughput scenarios with massive data, such as video on demand, big data analysis and processing, etc. The multiple concurrent HDD pass-through mode can ensure stable bandwidth and sequential read and write capabilities.
OPEN-CAS PARAMETERS DESCRIPTION
Open-CAS has many parameters. Based on our experience and testing environment, we evaluated and verified several key parameters that affect performance.
Open-CAS supports multiple cache modes. We wanted the mode best suited to mixed read/write operations, with the primary goal of improving random access with a small block size; we eventually chose write-back mode.
A client-side cache keeps only one copy of the data on the local disk; if that disk is physically damaged, the data is lost permanently. We therefore deploy the caching solution on the Ceph storage cluster, where replication guards against a single point of failure.
The read/write flow of the write-back mode is shown in the figure below. In addition to the read-write process, Open-CAS flushes dirty data (the cached data is inconsistent with backend storage data) to backend storage according to the cleaning policy.
Figure 12: IO request flow
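The write-back flow in Figure 12 can be sketched as a minimal model. This is a hypothetical structure for illustration only; the real OCF state machine (cache-line mapping, eviction, concurrency) is far richer.

```python
# Sketch of the write-back request path: writes complete at cache
# speed and are marked dirty; reads are served from the cache on a
# hit and promoted on a miss; a cleaner flushes dirty data later.

class WriteBackCache:
    def __init__(self):
        self.cache = {}      # lba -> data held on the SSD
        self.dirty = set()   # lbas not yet flushed to the HDD
        self.backend = {}    # lba -> data on the HDD

    def write(self, lba, data):
        self.cache[lba] = data        # completes at SSD speed
        self.dirty.add(lba)           # backend copy is now stale

    def read(self, lba):
        if lba in self.cache:         # hit: served from the SSD
            return self.cache[lba]
        data = self.backend.get(lba)  # miss: fetch from the HDD...
        self.cache[lba] = data        # ...and promote into the cache
        return data

    def flush(self):                  # the cleaning policy runs this
        for lba in self.dirty:
            self.backend[lba] = self.cache[lba]
        self.dirty.clear()

c = WriteBackCache()
c.write(7, b"hot")
print(c.read(7))       # b'hot' (hit; the backend has not been touched yet)
c.flush()
print(c.backend[7])    # b'hot' after the cleaner flushes dirty data
```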
A cache line is the smallest portion of data that can be mapped into a cache. Every mapped cache line is associated with a core line, which is a corresponding region on the backend storage. The relationship between a cache line and a core line is illustrated in the figure below.
Figure 13: Cache mappings
The Open-CAS cache-line size ranges from 4K up to 64K, as supported by the current system. It is specified when the cache device is created and cannot be modified at runtime.
For cache devices on NVMe SSDs, there is no HDD-style seek-time penalty, and bandwidth differs little between small-block and large-block reads. The larger the cache-line setting, the more cache space each read/write request occupies, so for small-block access a large cache line wastes cache space. In some high-throughput scenarios, such as with Intel® Optane™ Persistent Memory, you may need a larger cache line to increase bandwidth.
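The waste is simple arithmetic: a small request still occupies whole cache lines. The helper below is illustrative.

```python
# Cache-space efficiency for small requests under different cache-line
# sizes: a 4 KiB request mapped into a 64 KiB cache line leaves most
# of that line carrying unrequested data.

def line_utilization(request_size: int, cache_line: int) -> float:
    """Fraction of the allocated cache line(s) the request actually uses."""
    lines_needed = -(-request_size // cache_line)   # ceiling division
    return request_size / (lines_needed * cache_line)

print(line_utilization(4096, 4096))    # 1.0 -> no waste with 4K lines
print(line_utilization(4096, 65536))   # 0.0625 -> only 1/16 of a 64K line used
```

With 4K requests dominating our workload, the 4KB cache line in our reference configuration avoids this waste.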
SEQUENTIAL ACCESS CUT-OFF
When the sequential IO stream reaches a certain threshold, the cut-off is turned on. All subsequent sequential read and write requests are sent directly to the backend storage, until cut-off is terminated.
For small random blocks, the IOPS of an SSD has an obvious advantage over an HDD. However, for large-block sequential reads and writes, the advantage of an SSD as a cache is not obvious, especially when one SSD acts as a cache for multiple HDDs. Therefore, many cache solutions choose to bypass large sequential reads and writes and send them directly to the backend HDD devices. This saves cache space and accommodates more small-block random-access data.
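Detecting a sequential stream amounts to checking whether each request continues from where the previous one ended. The tracker below is a hypothetical sketch, not OCF's implementation; it uses the 16 KiB threshold from our `seq-cutoff` setting.

```python
# Sketch of sequential cut-off detection: count contiguous bytes per
# stream and bypass the cache once they exceed the threshold.

class SeqCutoff:
    def __init__(self, threshold=16 * 1024):
        self.threshold = threshold
        self.next_lba = None
        self.contiguous = 0

    def bypass(self, lba: int, size: int) -> bool:
        """True if this request should skip the cache and go to the HDD."""
        if lba == self.next_lba:
            self.contiguous += size   # stream continues
        else:
            self.contiguous = size    # random jump resets the stream
        self.next_lba = lba + size
        return self.contiguous > self.threshold

s = SeqCutoff()
print(s.bypass(0, 8 * 1024))          # False: stream below the threshold
print(s.bypass(8 * 1024, 8 * 1024))   # False: exactly at the threshold
print(s.bypass(16 * 1024, 4 * 1024))  # True: contiguous bytes exceed 16 KiB
print(s.bypass(999 * 1024, 4 * 1024)) # False: random access resets the cut-off
```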
We evaluated the performance of SAS HDD disks in different modes and block sizes using the FIO (jobs=4, queue_depth=8) benchmark. We found that random read and write bandwidth increased with block size (IOPS changed little), while sequential read and write performance was less affected by block size.
Figure 14: SAS HDD performance with FIO
In the cache system, Open-CAS provides flushing strategies and corresponding parameters to flush data from the cache to backend storage.
The default policy, alru, periodically refreshes dirty data. Using a modified least-recently-used algorithm, the refresh priority is relatively low. Another cleaning policy, acp, clears dirty cache lines as quickly as possible to maintain the maximum bandwidth of backend storage. The acp policy aims to stabilize the data refresh time of write-back mode and maintain more consistent cache performance.
We found the effect of the acp policy unsatisfactory: while normal IO runs in the foreground, acp flushes aggressively and continuously, seriously affecting foreground read and write performance. We therefore kept the alru policy and adjusted its parameters to make flushing more active. When the system has free or clean cache space, the alru policy is better.
Open-CAS also supports a many-to-one mode, in which multiple backend storage devices are cached by a single cache device and share one cache pool. The advantage is balanced data access: in Ceph, the data distribution across OSDs is not necessarily balanced, and sharing a cache pool offsets the cache-space waste caused by data imbalance (a typical OSD data distribution is shown below). But the disadvantage in write-back mode is that a single point of failure is unavoidable: if multiple OSDs share data in one cache pool, and that pool contains all copies of the data, the situation becomes complicated.
Therefore, we use a one-to-one model, where one cache device corresponds to one backend storage device.
In addition to these parameters, Open-CAS also provides some media-related configurations, such as support for trim and atomic writes, but these require a specific cache medium and kernel version.