A disk failure, an I/O timeout, or a spreading slow-disk problem triggers a fast failover of database O nodes within 5 seconds, followed by complex manual recovery. Is it worth it?
Bixushuo (毕须说)  2024-07-30 18:28  Published in China

Recently, I communicated with an organization to learn about the architecture and failover mechanism of the distributed database O.


NVMe SSDs are recommended for better performance. However, with no NVMe RAID controller cards available and concerns about software-RAID performance, the local disks are used in pass-through mode without RAID. A total of nine copies are configured: three in the production center, three in the same-city DR center, and three at a remote site. Each server is equipped with six 3.84 TB NVMe SSDs; the six disks are combined into a volume group (VG), logical volumes (LVs) are carved from it, and an XFS file system is created on each LV. Nodes work in active/standby mode. Multi-copy replication between the primary and secondary databases, and synchronous replication from the production center to the same-city DR center, are implemented at table-partition granularity.
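As a quick sanity check on what this layout implies, here is a back-of-the-envelope sketch in Python; the figures come from the description above, and the script itself is purely illustrative:

```python
# Per-server raw capacity and the physical footprint of the 9-copy layout
# described above (3 production + 3 same-city DR + 3 remote).
DISKS_PER_SERVER = 6
DISK_TB = 3.84
COPIES = 9

raw_per_server_tb = DISKS_PER_SERVER * DISK_TB
print(f"Raw capacity per server: {raw_per_server_tb:.2f} TB")

# Every byte of effective data is physically stored nine times.
effective_tb = 30
footprint_tb = effective_tb * COPIES
print(f"{effective_tb} TB effective data occupies {footprint_tb} TB of raw space")
```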


Hundreds of database servers have been deployed to handle less than 30 TB of effective data, an unimaginably low utilization rate!
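The utilization claim can be made concrete with a small calculation; the server count of 200 is an assumed stand-in for "hundreds", while the disk counts and 9-copy factor come from the setup described above:

```python
# Back-of-the-envelope utilization. 'Hundreds' of servers is assumed to be
# 200 for illustration; 6 x 3.84 TB disks per server and 9 copies are from
# the article's description.
servers = 200
raw_tb = servers * 6 * 3.84       # total raw NVMe capacity across the fleet
used_tb = 30 * 9                  # 30 TB effective data stored nine times
utilization = used_tb / raw_tb
print(f"raw = {raw_tb:.0f} TB, used = {used_tb} TB, utilization = {utilization:.1%}")
```

Even under these generous assumptions, utilization stays in the single digits.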


During a fault simulation test, when an NVMe SSD is pulled, the database fails its I/O detection within 5 seconds and initiates a node failover that takes about 8 seconds. Likewise, if any disk times out or a bug causes the system to hang, a node failover is triggered after the same 5-second detection window. In total, the failover takes about 13 seconds, which seems quite fast. In other words, the database cannot tolerate an I/O detection failure lasting more than 5 seconds; anything longer triggers a node failover.
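The trigger logic can be sketched as follows; the 5-second detection window and 8-second switchover come from the test above, while the function names and structure are invented for illustration:

```python
# Sketch of the failover trigger described above: once an I/O health probe
# goes unacknowledged for more than 5 s, a node failover is initiated, and
# the switchover itself takes about 8 s.
IO_DETECT_TIMEOUT_S = 5
FAILOVER_DURATION_S = 8

def should_failover(last_io_ack_age_s: float) -> bool:
    """A node is deemed unhealthy once I/O goes unacknowledged for > 5 s."""
    return last_io_ack_age_s > IO_DETECT_TIMEOUT_S

def total_outage_seconds(last_io_ack_age_s: float) -> float:
    """Detection window plus switchover time: ~13 s in the simulated test."""
    if should_failover(last_io_ack_age_s):
        return IO_DETECT_TIMEOUT_S + FAILOVER_DURATION_S
    return 0.0

print(total_outage_seconds(6.0))   # disk pulled: full 13 s failover
print(total_outage_seconds(2.0))   # transient blip under 5 s: no failover
```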


At this point, if a new disk is inserted, the OS assigns it a new device name, so the system cannot be restored in place. A new server must be used to replace the faulty one and be added to the cluster through a capacity-expansion process, while the original server is written off as unavailable because of a single faulty disk. Because the new server has no RAID, the VG, LVs, and XFS file systems must all be rebuilt, and the copy data recreated. How long will this process take?


Is it acceptable for a whole server to be taken out of service just because one disk or memory module fails, leaving the customer to perform manual restoration? If such an issue occurs at 3:00 a.m., can staff confidently perform these operations while still groggy?


It's hard to imagine how complex O&M will be for the customer. Especially as servers age past two or three years and component failure rates rise, will the customer be able to handle frequent failovers? Will a server vendor agree to replace an entire server with a new one simply because a disk or a memory module fails? It all comes down to cost.


In addition, if a disk frequently becomes slow, for example with an I/O latency of 20 ms, how does the database handle it? How can a large database cluster, especially the core database, be backed up quickly every day? And how can restoration be made fast?


To deal with these challenges, a financial institution uses an external storage solution for its databases, in which slow or unresponsive disks are quickly isolated. If a disk repeatedly shows I/O latency of dozens of milliseconds within a 30-second window and exceeds a set threshold, it is marked as sub-healthy. The system then serves reads and writes from RAID redundancy in degraded mode while continuing to monitor the disk's I/O latency in the background. If the latency remains abnormal, the system may even reset the disk, and if the problem persists, the disk is rapidly isolated. When an I/O timeout or second-level I/O latency is detected, the storage system automatically performs RAID reconstruction to restore service. Upper-layer services remain unaware of the fault, no database node failover is triggered, and O&M management is simplified.
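A minimal sketch of such a sliding-window sub-health check, assuming a 20 ms slow-I/O bar and a count threshold of 100 (both thresholds are assumptions for illustration, not the vendor's actual values):

```python
from collections import deque

# Sketch of the sub-health policy described above: count slow I/Os inside
# a sliding 30 s window; past a threshold, mark the disk sub-healthy.
WINDOW_S = 30
SLOW_IO_MS = 20              # "dozens of milliseconds"
SLOW_COUNT_THRESHOLD = 100   # assumed threshold

class DiskHealthMonitor:
    def __init__(self):
        self.slow_events = deque()   # timestamps of slow I/O completions

    def record_io(self, now_s: float, latency_ms: float) -> str:
        if latency_ms >= SLOW_IO_MS:
            self.slow_events.append(now_s)
        # Drop events that have fallen out of the 30 s sliding window.
        while self.slow_events and now_s - self.slow_events[0] > WINDOW_S:
            self.slow_events.popleft()
        if len(self.slow_events) > SLOW_COUNT_THRESHOLD:
            return "sub-healthy"     # degrade RAID reads, keep monitoring
        return "healthy"

mon = DiskHealthMonitor()
state = "healthy"
for i in range(200):                 # a burst of 200 slow I/Os over 20 s
    state = mon.record_io(now_s=i * 0.1, latency_ms=25)
print(state)
```

In a real array the sub-healthy verdict would also drive the degraded-read, reset, and isolation actions described above; here it is reduced to a state string.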


According to a user at a major bank, when a failover occurs because of a server reliability issue and a replacement server is brought in, they only need to map the external storage LUNs to the new server, add it to the cluster, and synchronize the incremental changes. With the local-disk solution, by contrast, they would have to rebuild the copies from scratch, which takes far longer and degrades the replication performance of the production network.

The management logic of compute resources differs from that of storage resources. Resource management should be fine-grained, not the coarse management seen in integrated compute-and-storage solutions. Storage-compute decoupling emerged in the 1990s and has proven itself in practice: decoupling storage from compute not only conserves resources but also improves management efficiency and simplifies O&M.
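The difference between the two recovery paths can be modeled roughly; the per-node data volume, copy rate, and size of the incremental delta are all assumptions for illustration, not figures from the bank:

```python
# Rough contrast of the two recovery paths described above.
# Assumptions: ~23 TB of replica data on a full node, an effective copy
# rate of 500 MB/s, and only ~50 GB of changes accrued during the outage.
def copy_hours(data_gb: float, rate_mb_s: float = 500) -> float:
    """Hours needed to copy data_gb gigabytes at rate_mb_s megabytes/s."""
    return data_gb * 1024 / rate_mb_s / 3600

full_rebuild_h = copy_hours(23.04 * 1024)   # local-disk path: recopy everything
incremental_h = copy_hours(50)              # external-storage path: remap LUNs,
                                            # then sync only the delta
print(f"full rebuild: ~{full_rebuild_h:.1f} h, incremental sync: ~{incremental_h:.2f} h")
```

Under these assumptions the full rebuild runs for over half a day, while the LUN-remap-plus-incremental path finishes in minutes.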

By BIXUSHUO

