Abstract: the database storage engine is a core component used in the database management system to process storage tasks and manage data. LSM-TREE and BTREE/HEAP are not perfect for both distributed databases and traditional general-purpose databases. LSM-TREE is suitable for high-concurrency writing, but the read performance may be affected. However, the BTREE/HEAP structure performs better in data reading and complex queries, but the concurrent writing performance has bottlenecks. However, for different types of database systems, such as standalone centralized databases, distributed databases, and shared storage databases with multi-read and multi-write architectures, there are always some defects behind them, currently, Oracle RAC, a shared storage peer-to-peer read/write architecture, is still the most application-friendly. Huawei's open-source Cantian engine is a middleware that enables a standalone database to have similar Oracle RAC capabilities. Currently, it supports InnoDB storage engines. Shentian provides a convenient way for databases to implement RAC-like architectures.
Text
the database storage engine is the core component used in the database management system to process different types of storage tasks and manage data. It is responsible for data storage, retrieval, update and deletion. Different storage engines have different features and performance to meet the needs of various application scenarios. The storage engine is the core component of the database. Some characteristics of the storage engine determine some basic performance characteristics of the database.
According to the underlying data structure and data organization, the most widely used storage engine structure in databases is no more than two series or three types. Distributed databases prefer to use LSM-TREE storage, and BTREE/HEAP storage commonly used in general-purpose databases. BTREE and HEAP storage are two very similar storage engine structures, which are considered to be the same storage structure in many cases. Databases such as Oracle,PostgreSQL, and MySQL all use the BTREE/HEAP structure. BTREE is a generic name. In fact, the data structure is an enhanced B TREE.
The large volume of database storage engines
this is a list of some database products from the two online camps. We can see that traditional databases and centralized databases use more B- TREE structures, while distributed databases often use LSM-TREE. LSM-TREE is short for Log-Structured Merge Tree, which is a typical immutable storage structure, to reduce the cost of writing data in a variable structure storage structure, you must first find the existing PAGE and then modify it, so as to achieve higher concurrent writing. Therefore, databases that use LSM-TREE are naturally friendly to highly concurrent write/modification operations. However, the LSM-TREE structure is not perfect, and the read performance needs to be coordinated among multiple replicas, so the read performance will be affected to some extent.
Oracle block structure, typical slotted page structure (HEAP)
although the HEAP/BTREE storage structure is similar in structure, there are still some differences. In our common databases, Oracle, PostgreSQL, and so on use the HEAP structure. Pages in this structure are different from different BTREE pages and use the slotted page. Databases such as Mysql and Dameng use the traditional BTREE storage structure.
MySQL BTREE storage structure
the preceding is a logical diagram of the MySQL innodb storage engine, which is a typical BTREE storage engine. The storage engines of the BTREE structure are not exactly the same. The main difference is whether the leaf node segment is stored in the same segment as the data. Unlike MySQL, Dameng is obviously separated. In fact, the BTREE structure stores all data in the order of primary keys.
In fact, this is not very accurate. It can only be said that the default storage engines of these databases use this data organization method. In fact, Oracle has heap tables, cluster tables, and mixed column compression tables. The cluster table is BTREE-structured. Dameng's storage engine has a variety of data organization methods: B- tree data, heap table data, column storage data, and bitmap index. B- tree data is the default organization method for common Dameng tables.
When it comes to the structure of the storage engine, there is a saying that the structure technology of so-and-so is more advanced than that of traditional databases. In fact, no matter how advanced the storage engine is, it also has its disadvantages. The advanced in some people's eyes only refers to the late appearance. After decades of development, the storage engine of the database has not seen many genres so far.
The massive use of LSM-TREE is that after the memory cost is greatly reduced, the performance of highly concurrent writes can be greatly optimized based on memtab. At the same time, SSTABLE can make full use of persistent storage, which is more conducive to data compression. LSM-TREE is applied to OLTP relational databases on a large scale only after SSD is gradually popularized, because of the huge overhead, it may even lead to unstable background compression of the system, which makes OLTP systems that are very sensitive to latency unbearable. SSD can alleviate this problem.
The B- TREE/HEAP storage structure (hereinafter referred to as the B- TREE engine) generally uses IN-PLACE-REPLACE(PostgreSQL ASTORE is a special case), so its performance is relatively stable, although some waste is caused by page fragments, it is generally balanced. Compared with LSM-TREE, the performance of concurrent writes is subject to bottlenecks due to the characteristics of B- tree. In the test, hundreds of millions of data records are written concurrently every 5 minutes and real-time statistical analysis is required. After 10 million data records are written per second in Oracle databases, it is obviously difficult to improve, in distributed databases that use LSM-TREE, it is easy to exceed 20 million/second. Although the performance of the B- tree storage structure cannot be comparable to that of the LSM-TREE, the B- tree storage structure has obvious advantages in data reading. In terms of the implementation of MVCC and the complex query of data, B- tree has obvious advantages over LSM-TREE.
In fact, some advantages and disadvantages of the two storage structures are not easy to distinguish clearly. If you stand on a certain position, there will be different analysis conclusions. For example, on the issue of write amplification, LSM-TREE supporters will say that the structure of the LSM-TREE is more compact and there are no fragments of data. However, the B- tree storage engine often shows that only a small portion of data is written in a certain PAGE, the write is amplified. In fact, although LSM-TREE does not have this write amplification, the underlying SSTABLE may be merged frequently in the case of high throughput writes, which may also cause write amplification. However, the ability of modern hardware to write IO has been greatly improved, which will not affect most scenarios.
The factors that determine the technical capability of the storage engine do not only depend on the way the underlying data is organized. As one of the core engines with large amounts of data in the database, in a broad sense, the main functions of the storage engine include:
- data storage: physically stores data on disks or other storage media.
- Data retrieval: quickly and efficiently retrieve data based on query criteria.
- Data Update: processes data insertion, update, and deletion operations.
- Transaction Management: ensures data Consistency and integrity, and implements Atomicity, Consistency, Isolation, and Durability (ACID) properties by supporting transactions.
- Concurrency control: ensures data consistency and integrity when multiple users access concurrently.
- Data Backup and Recovery: supports data backup and recovery to ensure data security.
Therefore, the storage engine requirements of an excellent database are not limited to the storage organization form. Currently, there are single-server centralized databases, shared storage, read/write splitting databases, shared storage, multi-read and multi-write databases, distributed databases, and other forms. Different forms of database systems have their own storage engines.
The standalone centralized database architecture is relatively simple, suitable for a wide range of scenarios, and simple O & M, which is popular with users. However, due to its stand-alone architecture, it can only SCALE UP and cannot SCALE out horizontally. For customers with fast business development, sometimes they dare not choose. In addition, a single-machine centralized database can only build high-availability replication groups through master-slave replication. When the master database fails, switching is troublesome, the degree of automation is low, and the switching speed cannot support some core business systems.
Although distributed databases solve some of the pain points of standalone centralized databases, due to the large-scale and complex problems caused by their distributed architectures, in addition, the impact of cluster network latency on the performance of applications with certain characteristics (such as the performance degradation of stored procedures), so it can only be used in some systems with relatively sufficient funds.
In fact, databases with shared storage, multi-read and multi-write architecture similar to Oracle RAC are currently the most widely used databases. Multiple compute nodes share a centralized storage system without wasting too much resources. By expanding cluster nodes horizontally, the Business load capacity can be expanded linearly. When a node fails, the application can automatically switch to other active nodes in seconds, these features provide sufficient capability guarantee for enterprise applications. At the same time, because the database is essentially a centralized database, it is also friendly to application development and is not difficult to maintain.
However, because the shared storage multi-read and multi-write architecture has high overall requirements on the storage engine, it needs to solve major problems such as buffer fusion, global lock management, global resource directory, global transaction management, and global consistency recovery, therefore, in addition to Oracle, excellent database products that can provide this architecture are very scarce.
On DTCC 2023, the Huawei team released a Mysql shared storage concurrent read/write solution. Its core technology is Huawei Cantian engine, which has been open source in Mulan community, the name is openEuer/Cantian. It is described in the simplest sentence: "Cantian is a middleware that can transform an ordinary standalone database into a database with similar Oracle RAC capabilities. Currently, innodb is supported, more database storage engines will be supported in the future." The Cantian engine adopts the storage and computing separation architecture, and adopts key technologies such as distributed cache technology, transaction MVCC mechanism, and high availability of multi-master clusters, this enables common standalone databases to have multi-read and multi-write capabilities similar to Oracle RAC, and can be loaded and run by databases such as MySQL in a non-intrusive manner without modifying the implementation of existing databases. It has been more than a year since the launch of Cantian, leading customers in the financial industry and operators have conducted business verification. With its business innovation, Cantian engine won the IT168 2024 Technology Excellence Award.
The above figure is a logical architecture diagram of Cantian. Based on Huawei's OceanStor Dorado high-end storage system, Cantian engine builds an intermediate layer between MySQL SQL engine and innodb storage engine, this middle layer can simulate innodb behavior, so MySQL SQL engine can be easily connected with it. Because innodb is closely related to transaction control, Cantian includes not only the MySQL storage layer but also the MySQL transaction management layer. The advantage of introducing the Cantian engine into MySQL is that after adding this intermediate layer, MySQL has the ability to read and write concurrently on multiple nodes, and becomes MySQL RAC. As an open-source engine, Cantian has largely referred to the successful practices of Oracle RAC to better solve the problem of consistent read and write across instances, it can be an important reference for domestic database manufacturers who want to build a shared storage multi-read and multi-write architecture.
From the above analysis, as the core component of RDBMS, the storage engine does determine many features of the database. However, there is no perfect storage engine at present. Each engine has its own advantages and disadvantages. The database system is no longer a simple software system. What we need to do is actually to make use of the characteristics of modern hardware to make up for its shortcomings as much as possible and develop its advantages, so that the advantages of the storage engine can be brought into play and the shortcomings can be covered up. In this respect, Cantian's thinking can be used for reference. On the other hand, when selecting database products for their applications, users also need to fully understand the advantages and disadvantages of a storage engine, so as to determine whether the database is suitable for their application scenarios.