Construction and practice of the new generation disaster recovery system of Weizhong Bank
社区小助理 (Community Assistant)  2025-05-15 18:10   Published in China

Text: Hu Panpan | Director of Database Platform, Weizhong Bank

 

I. Core summary

As the first digital bank in China, Weizhong Bank has, after ten years of development, served individual users more than 4100 million person-times, and the total number of small, medium and micro enterprise users applying for loans exceeds 5 million. Its data volume is growing rapidly. In a bank's IT system, data backup is a crucial component and the last line of defense in disaster and failure scenarios. As banking business develops rapidly, banks need to keep innovating and iterating on their data backup and recovery systems to continuously reduce backup storage costs and improve the efficiency of restoring backup data.

To this end, Weizhong Bank has built a new generation of ultra-large-scale database disaster recovery system for bank core systems, based on its self-developed backup control platform and data archiving platform and combined with Huawei OceanProtect storage products and a domestic Blu-ray storage system. Through core backup technologies such as native-format copy mounting, permanent incremental backup, and patented deduplication and compression, the system improves backup/recovery efficiency by 50%, restores TB-level data within 2 minutes, and saves 75% of storage space. It reuses the disaster recovery environment to provide a quasi-production information laboratory and to explore the business value of backup data. In large-scale recovery verification scenarios such as business version verification, disaster recovery drills, year-end settlement drills, and quarterly interest settlement, recovery time is compressed from the original 4.5 to 7 days to 1 day, an efficiency improvement of 77% to 85%. The system adopts all-domestic core databases and disaster recovery software and hardware, and takes the lead in achieving full-stack localization of a financial disaster recovery system to ensure the security of the financial supply chain.

 

II. Background and requirements of data disaster recovery

1. National and regulatory data backup requirements

Key industries such as finance, government affairs, and telecommunications are explicitly required to implement effective data protection measures when building information systems, in order to ensure data confidentiality, integrity, and availability. This includes clear provisions on disaster recovery and data backup capabilities so that critical data can be recovered quickly and effectively in emergencies. China has established a set of national standards and regulations for disaster recovery construction and data security, mainly including GB/T 20988-2007 (Information System Disaster Recovery Specifications), GB/T 22239-2019 (Information Security Technology: Basic Requirements for Information System Security Level Protection), the Cybersecurity Law, and the Data Security Law. Together, these standards and regulations form the basis of disaster recovery and data security for China's information systems, aiming to strengthen data protection and ensure the safe and stable operation of key information infrastructure.

At the same time, to strengthen data security and integration, the financial industry has formulated the following key standards in line with national guidance: JR/T 0071-2020 (Guidelines for the Implementation of Network Security Level Protection in the Financial Industry), JR/T 0205-2020 (Distributed Database Technology Financial Application Specification: Disaster Recovery Requirements), JR/T 0044-2008 (Banking Information System Disaster Recovery Management Specification), and others. These specifications aim to raise the level of data management and security protection in the financial industry and to ensure the security and recoverability of financial data.

2. Pain points and requirements of Weizhong Bank's data disaster recovery system

Weizhong Bank's data volume is huge: nearly 900 database instances, up to 600 TB of daily full backup data, 50 TB of daily incremental backup data, 30 TB of daily binlog data, and a total of 12 PB of backup data in stock.

How to make the backup system meet the requirements of high reliability and fast recovery during critical operating periods, such as business version verification, day-end batch processing, quarterly interest settlement, and year-end settlement drills, as well as in emergencies, became the challenge before us. In actual operation, we encountered several key pain points:

Backup and recovery efficiency cannot keep up with business growth: as business volume has increased sharply, a full backup takes more than 20 hours, recovering a single database takes more than 2 hours, and a disaster recovery drill takes 7 days on average, which cannot meet the requirements of fast business version verification and data recovery.

High construction costs caused by the surge in backup capacity: the three-replica policy of the Ceph distributed file system cluster results in low disk utilization, with storage utilization of only 33%. The lack of data deduplication and of an effective hot/cold data separation strategy keeps the total cost of ownership (TCO) high.

Ransomware incidents are frequent and systematic protection is lacking: in recent years, ransomware attacks in the industry have been frequent, creating high data security risks. Weizhong Bank also needs to introduce systematic protection measures, including ransomware detection, tamper prevention, data encryption, and automated response mechanisms, to ensure data security and compliance.

To meet these challenges, Weizhong Bank has built a new generation of ultra-large-scale database backup and recovery system for the bank's core systems, based on its self-developed backup control platform and data archiving platform and combined with Huawei storage products and a domestic Blu-ray storage system. It delivers a complete data backup and recovery system covering disaster recovery, backup, archiving, security, and data recovery, which improves backup and recovery efficiency, reduces backup storage costs, and ensures business continuity and recoverability.

 

III. Solution Architecture

1. Architecture and disadvantages before upgrading

As shown in Figure 1, the old data backup architecture used Ceph, an open source distributed file system, as the storage system for backup data. The Ceph distributed storage clusters were built from common x86 servers and SATA HDDs and used three-replica storage to ensure high reliability and availability of data. In terms of backup policy, we performed a full backup every Sunday and an incremental backup from Monday to Saturday, and backed up binlogs in near real time every 5 minutes. In terms of data retention policy, full/incremental backups and all binlog backups are retained for 3 months; for data aged 3 to 6 months, the full backup and full binlog backup from the last week of each month are retained; beyond 6 months, only the last full backup of each month is kept. Currently, the backup data in stock amounts to 12 PB, and daily incremental backup data is also on the order of hundreds of TB, occupying as many as 7 Ceph storage clusters with hundreds of servers. As mentioned above, as backup data keeps growing, the old architecture faces several major pain points: backup and recovery efficiency cannot keep up with business growth, the surge in backup capacity drives up construction costs, and systematic protection against frequent ransomware incidents is lacking. It urgently requires reconstruction and optimization.
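
The schedule and retention bands described above can be written down as a small rule set. The following Python sketch is only an illustration of those rules, not the bank's implementation: the function names are hypothetical, a "month" is approximated as 30 days, and "the last week of each month" is assumed to mean the final 7 calendar days of the month.

```python
import calendar
from datetime import date

MONTH_DAYS = 30  # assumption: a "month" is approximated as 30 days


def backup_kind(day: date) -> str:
    """Full backup on Sunday, incremental backup Monday to Saturday."""
    return "full" if day.weekday() == 6 else "incremental"


def in_last_week_of_month(day: date) -> bool:
    """Assumption: 'last week' means the final 7 calendar days of the month."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return day.day > days_in_month - 7


def should_retain(kind: str, taken_on: date, today: date) -> bool:
    """Apply the three retention bands to one backup.

    kind is one of 'full', 'incremental' or 'binlog'.
    """
    age_days = (today - taken_on).days
    if age_days <= 3 * MONTH_DAYS:
        # Within 3 months: full, incremental and binlog backups are all kept.
        return True
    if age_days <= 6 * MONTH_DAYS:
        # 3 to 6 months: full and binlog backups from the last week of a month.
        return kind in ("full", "binlog") and in_last_week_of_month(taken_on)
    # Older than 6 months: only the last full backup of each month.
    return kind == "full" and in_last_week_of_month(taken_on)
```

Because full backups run only on Sundays and any 7 consecutive days contain exactly one Sunday, "a full backup taken in the final 7 days of the month" is the same thing as "the last full backup of the month" under these assumptions.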

 

2. Upgraded architecture

In the new data disaster recovery solution, as shown in Figure 2 below, we introduced professional storage devices and systems such as Huawei OceanProtect backup clusters, archive storage, and Blu-ray storage, all managed and scheduled through our self-developed unified database backup management and control platform. By applying new technologies such as data compression, data deduplication, instant mounting, and multi-file hierarchical storage, the upgraded next-generation data disaster recovery system greatly improves the efficiency of data backup and recovery and reduces the storage cost of backup data. At the same time, a security anti-ransomware system based on a physical isolation mechanism is introduced to keep a "gold copy" of backup data and ensure data security in extreme scenarios.

 

Introduction to core modules

The following is a brief introduction to the key modules of the next-generation disaster recovery system architecture:

Backup control platform: integrates key system functions such as backup scheduling, recovery, archiving, policy updates, data management, monitoring, and permission control. Detailed backup records support monitoring and reporting, helping the enterprise with compliance audits and planning. The real-time monitoring engine tracks tasks, analyzes performance, predicts risks, and optimizes policies. The user interface and API improve operational convenience and enable efficient integration. As the management platform of the entire disaster recovery system, the backup control platform provides resource supervision, job scheduling, and security auditing across the whole data protection process, enhancing data security and backup management efficiency.

Database disaster recovery cluster: a cross-city disaster recovery cluster for the production database cluster, providing geographic redundancy and disaster recovery capabilities. The disaster recovery cluster adopts a master-slave architecture. During normal operations, its master node provides read-only access to support query and reporting requirements and to share the load of the master database. The secondary node backs up data regularly so that data can be quickly restored to the latest state when needed.

OceanProtect cluster: provides a series of key functions to ensure high performance, high reliability, and high security for data backup and recovery. The cluster is mainly responsible for high-performance data backup/restoration, instant mounting of backup copies, deduplication and compression of backup files, encryption, anti-ransomware protection, and desensitization of restored data. Its architecture supports horizontal scaling: storage capacity and performance can be scaled out by adding storage nodes to meet growing data volumes.

Quasi-production verification environment: a simulation environment built on a TDSQL quasi-production cluster, used for safe business simulation and verification so that production data is not put at risk. It supports key financial activities such as year-end settlement, quarterly interest settlement, and business version verification to ensure accurate and compliant accounts. Through server pooling and single-node deployment, the lab allocates resources efficiently and on demand at low cost to meet business requirements in non-production environments.

Archive storage: an S3-compatible storage solution self-developed by Weizhong Bank, featuring high availability, durability, low cost, and scalability. It is mainly used to archive warm data with low access frequency, effectively reducing long-term storage costs.

Blu-ray storage: a storage technology suited to long-term retention of cold data that meets regulatory requirements. Migrating historical data from object storage to Blu-ray discs provides cost-effective data backup.

Security anti-ransomware system: uses physically isolated AirGap technology to independently store a "gold copy" consisting of the past month's full backup data and full binlog backups.

Introduction to hierarchical storage policies

For copies that need to be retained for a long time, the system archives them to low-cost storage media at different tiers, which frees up the high-performance backup pool of the all-in-one backup machine while meeting both construction cost and security requirements.

In the OceanProtect backup storage cluster, we keep the continuous backup snapshots, full backup snapshots, and binlog backups of the latest 3 months. Thanks to OceanProtect's high performance and its data deduplication and compression functions, we can keep continuous backup data for the past three months to meet frequent and fast backup and recovery requirements.

For backup copies aged 3 to 6 months, we keep the full backup data for the last week of each month together with the full binlog backup data, and store them through the all-in-one backup machine, over the S3 protocol, in the self-developed archive storage as long-term backup copies of warm data.

For copies older than 6 months, we keep the full backup data of the last week of each month and transfer it in stages, through the all-in-one backup machine and then archive storage, to Blu-ray storage as permanent backup copies of cold data.
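
As a rough illustration of the tiering rule above, the sketch below maps the age of a backup copy to a storage tier and lists the copies that would need to move. It is a minimal sketch under stated assumptions: the tier labels, the `BackupCopy` type, and the 30-day month are all hypothetical and are not taken from the actual platform.

```python
from dataclasses import dataclass
from datetime import date

MONTH_DAYS = 30  # assumption: a "month" is approximated as 30 days


@dataclass
class BackupCopy:
    copy_id: str
    taken_on: date
    tier: str  # hypothetical labels: 'oceanprotect', 's3-archive', 'blu-ray'


def storage_tier(taken_on: date, today: date) -> str:
    """Pick the tier for a copy from its age, following the three bands."""
    age_days = (today - taken_on).days
    if age_days <= 3 * MONTH_DAYS:
        return "oceanprotect"  # latest 3 months: backup storage cluster
    if age_days <= 6 * MONTH_DAYS:
        return "s3-archive"    # 3 to 6 months: warm data in archive storage
    return "blu-ray"           # older than 6 months: cold data on Blu-ray


def plan_migrations(copies, today):
    """Return (copy_id, current_tier, target_tier) for copies that must move."""
    moves = []
    for copy in copies:
        target = storage_tier(copy.taken_on, today)
        if target != copy.tier:
            moves.append((copy.copy_id, copy.tier, target))
    return moves
```

A scheduler could run `plan_migrations` once a day; which copies are kept at all within each band (for example only the last week of each month) is a separate retention decision that this sketch deliberately leaves out.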

Mining the business value of backup data in the quasi-production verification environment

Traditional backup software often backs up data into proprietary backup sets, and the recovery process often takes several hours, which cannot meet business needs. As a result, the backup system has low utilization, and backup data becomes cold, dead data and a thorough "cost center". To improve the utilization of the backup system and explore the value of backup data, we use native-format copy mounting and copy desensitization technologies to restore business data in minutes while avoiding leakage of users' private data. The following application scenarios are currently supported:

Production version verification: a native-format copy is mounted and the production business environment is brought up in minutes, so that major business versions and major day-end cutover batches can be quickly replayed and verified against real production data.

Quarterly / annual / disaster recovery drills: batch parallel recovery is implemented with the automatic database drill function, so data recovery can be completed within 1 day. This quickly verifies the reliability of backup copies and produces drill reports to meet security and regulatory requirements.

Fast data recovery: in the quasi-production environment, data lost due to accidental deletion or other reasons can be recovered within 2 minutes, greatly reducing business interruption time, improving the real-time availability of data, and ensuring data integrity and business continuity.

 

IV. POC online verification and problem resolution

1. POC launch milestones

① October 2023: conducted in-depth research on the new-generation backup solutions on the market; evaluated different suppliers' products, technical performance, cost-benefit, and compatibility with existing systems; and determined the backup technology that best suited current and future needs.

② December 2023: deployed the in-bank DB-related environment versions and completed functional verification of 19 items in 9 categories based on production business scenarios.

③ May 2024: completed OceanProtect testing covering performance (backup / restore / instant mount), features (data reduction / active/standby switchover / data security / archiving), and reliability (hard disk / controller). All test items met expectations.

④ September 2024: started online POC testing of OceanProtect in Weizhong Bank's disaster recovery backup environment, to ensure that the functionality and performance of the new device were no lower than the results of the completed POC and that it could support high-performance backup and recovery for more than half (400) of the instances in the disaster recovery environment.

⑤ December 2024: after more than two months of close work, completed the performance and functional adaptation of the OceanProtect all-in-one backup machine in the disaster recovery environment; 400 database instances were smoothly connected and their backups ran successfully. The new and old backup systems are currently running in parallel. Racking and deployment of the second OceanProtect device started at the same time.

2. Core test cases

Across the three dimensions of function, performance, and availability, we summarized 19 core test cases in 9 categories, as shown in the following table:

 

3. Key POC data

Backup write bandwidth. Write bandwidth determines overall backup efficiency. Weizhong Bank's production environment requires full backups of as many as 900 database instances to complete within 24 hours. After multiple rounds of testing and verification, the average backup write bandwidth of a single OceanProtect device met the requirements.

Deduplication and compression ratio. In database backup scenarios, data files are mainly modified incrementally and contain a large amount of duplicate data, so deduplication and compression are highly effective. After repeated measurements, the deduplication and compression ratio of the backup data was 20:1, in line with expectations.

Average backup time. In actual testing with 40 database instances and a total data size of about 30 TB, the average time consumed by concurrent backups was as expected.

Average backup recovery time. In actual testing with 40 database instances and a total data size of about 30 TB, the average time taken to complete concurrent recovery was as expected.
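
To give a feel for what the backup window implies, the following Python sketch works out the average throughput needed to move a given volume within a given window, and the physical volume left after data reduction. It is back-of-the-envelope arithmetic only: it assumes that the roughly 600 TB of daily full backup data cited earlier must all be ingested within the 24-hour window, uses decimal units (1 TB = 1000 GB), and says nothing about the bandwidth actually measured in the POC, which the article does not state.

```python
def required_bandwidth_gb_per_s(data_tb: float, window_hours: float) -> float:
    """Average throughput in GB/s needed to move data_tb within window_hours."""
    return data_tb * 1000 / (window_hours * 3600)


def physical_write_tb(logical_tb: float, reduction_ratio: float) -> float:
    """Volume actually stored after deduplication and compression."""
    return logical_tb / reduction_ratio


# Assumed inputs taken from figures quoted in the article.
aggregate = required_bandwidth_gb_per_s(600, 24)  # about 6.94 GB/s on average
stored = physical_write_tb(600, 20)               # 30.0 TB at a 20:1 ratio
```

The aggregate figure is an average across all devices and the whole window; real sizing would also have to account for peaks, concurrency limits, and the number of backup devices sharing the load.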

4. Typical problem solving

The database backup and recovery system is itself complex. In a scenario where some 900 instances are backed up at scale, some problems inevitably occur during the POC. Two typical problems are listed here.

Problem 1: the scheduling module's CPU is saturated and scheduling tasks hang.

In the initial stage of POC testing, only 40 database instances were registered for verification. An Agent is deployed on each database instance for communication between the database instance and OceanProtect, and at this point task scheduling was normal. Later, as the number of registered instances was gradually increased to 300, the CPU of the scheduling module spiked and the scheduling module hung.

After joint analysis with Huawei, we confirmed that the cause was that the Agents on database instances reported heartbeats to the OceanProtect device too frequently (once per minute), which consumed excessive resources in the scheduling module and produced an avalanche-like effect. The Huawei R&D team released a patch version that optimized the modules and code with high CPU impact and changed Agent heartbeat reporting from once per minute to once every 5 minutes. With 300+ registered instances, the CPU utilization of the scheduling module is now kept within 5%.
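
The effect of the interval change on heartbeat volume is simple arithmetic, sketched below in Python. The function names are hypothetical, and the sketch only counts messages: it does not model the CPU cost per heartbeat or the code optimizations in the patch, which also contributed to the improvement.

```python
def heartbeats_per_minute(instances: int, interval_minutes: float) -> float:
    """Average number of heartbeats the scheduler receives per minute."""
    return instances / interval_minutes


def load_reduction(old_interval: float, new_interval: float) -> float:
    """Fractional drop in heartbeat volume when the interval is lengthened."""
    return 1 - old_interval / new_interval


before = heartbeats_per_minute(300, 1)  # 300.0 heartbeats per minute
after = heartbeats_per_minute(300, 5)   # 60.0 heartbeats per minute
```

Under these assumptions, moving from a 1-minute to a 5-minute interval cuts heartbeat traffic from 300 registered instances by 80%.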

Problem 2: the deduplication ratio is low, so overall backup bandwidth and backup efficiency do not meet expectations.

After connecting 300+ database instances, we randomly selected 40 instances each time for backup to verify backup bandwidth and efficiency, and found that the deduplication ratio of the backup data was always low (about 2:1), so the overall backup bandwidth and efficiency could not meet expectations.

After joint analysis with Huawei R&D, we found that the 40 instances randomly selected each time were different, so for most instances this was their first full backup. Because no historical backup snapshots existed to serve as a reference for duplicate data, the deduplication ratio was low, which is normal and expected behavior.

Later, to verify backup efficiency in realistic scenarios, we fixed a set of 40 database instances and backed them up repeatedly. The deduplication ratio improved significantly, and backup bandwidth and efficiency reached the expected levels.
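
Why a first full backup deduplicates poorly while a repeated backup deduplicates well can be shown with a toy model. The Python sketch below is an illustration only, not OceanProtect's algorithm: it uses fixed 4 KiB blocks and SHA-256 fingerprints, ignores compression entirely (so a first backup scores 1:1 here rather than the roughly 2:1 observed in the POC), and the 5% change rate between backups is an arbitrary choice.

```python
import hashlib
import random

BLOCK = 4096  # assumption: fixed 4 KiB blocks, chosen only for illustration


class DedupStore:
    """Keeps one copy of each distinct block, identified by its SHA-256 digest."""

    def __init__(self) -> None:
        self.seen = set()

    def backup(self, data: bytes) -> float:
        """Ingest one backup and return logical bytes / newly stored bytes."""
        logical = physical = 0
        for off in range(0, len(data), BLOCK):
            block = data[off:off + BLOCK]
            logical += len(block)
            digest = hashlib.sha256(block).digest()
            if digest not in self.seen:
                self.seen.add(digest)
                physical += len(block)
        return logical / max(physical, 1)


def simulate_two_backups():
    """Back up the same dataset twice, changing 5 of 100 blocks in between."""
    rng = random.Random(0)
    data = bytearray(rng.getrandbits(8) for _ in range(BLOCK * 100))
    store = DedupStore()
    first = store.backup(bytes(data))   # no history: every block is new
    for i in range(5):
        data[i * 20 * BLOCK] ^= 0xFF    # modify one byte in 5 distinct blocks
    second = store.backup(bytes(data))  # only the 5 changed blocks are new
    return first, second
```

In this model the first backup has nothing to reference, while the second stores only the changed blocks, which mirrors the difference between backing up randomly chosen instances and backing up a fixed set repeatedly.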

 

V. Key technology innovation

1. Full-stack localization solution

The new generation of data disaster recovery system adopts Weizhong Bank's self-developed backup control platform, the TDSQL distributed database, the OceanProtect all-in-one backup machine, self-developed S3 object storage, Blu-ray archive storage, Kunpeng servers, and the Euler operating system, so that architecture, software, and hardware are 100% independently built on domestic technology and supply chain security is guaranteed across the full link. At the same time, as a standardized backup system architecture, the system is readily replicable and generalizable.

2. Native format backup and instant mount technology

OceanProtect native-format backup stores backup data in a format that the application can recognize, and pushes data reduction and encryption down to the underlying storage layer. During incremental backup, incremental data is merged with full data so that every backup copy is a complete copy the application can recognize. The recovery time for a TB-level dataset is reduced from the previous 2 hours to only 2 minutes, greatly improving copy recovery efficiency, as shown in Figure 3.

 

3. Data re-deletion and compression technology

By pre-processing backup data (separating backup metadata from data), applying multi-layer inline variable-length deduplication, and using a feature compression algorithm (data cleansing, rearrangement, and redundancy removal based on data stream feature recognition), backup storage space is reduced by 75% overall compared with traditional data reduction technologies, lowering the total cost of ownership of in-bank backup data.
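
To illustrate why variable-length chunking helps deduplication, the Python sketch below compares content-defined chunking with fixed-size chunking when a single byte is inserted at the start of a dataset. It is a generic textbook-style sketch and not the patented algorithm used by the product: the gear-style rolling hash, the 6-bit cut mask (about 64-byte average chunks), and all function names are assumptions made for the example, and practical systems also enforce minimum and maximum chunk sizes.

```python
import hashlib
import random

_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
MASK = 0x3F  # cut when the low 6 bits are zero: roughly 64-byte chunks


def cdc_chunks(data: bytes) -> list:
    """Split data at content-defined boundaries using a gear-style hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # The low 6 bits of h depend only on the last 6 bytes seen.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list:
    """Split data into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reuse_fraction(chunker, old: bytes, new: bytes) -> float:
    """Fraction of the new version's chunks already stored from the old one."""
    index = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    hits = sum(hashlib.sha256(c).digest() in index for c in new_chunks)
    return hits / len(new_chunks)
```

With `old` as a buffer of random bytes and `new = b"X" + old`, `reuse_fraction(cdc_chunks, old, new)` stays close to 1 because the cut points move with the content, whereas `reuse_fraction(fixed_chunks, old, new)` collapses because every fixed block is shifted by one byte.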

4. Data security and anti-ransomware technology

The new generation of data disaster recovery solution fills the long-standing gap in data-layer protection in the traditional security defense system. When the network-layer and host-layer defenses are breached and ransomware encrypts stored data, storage-layer AirGap, data tamper-proofing, and detection and analysis technologies ensure that at least one untampered "gold copy" exists in an environment that has not been attacked. This data is "clean" and can be used for secure recovery to minimize losses to business systems.

 

VI. Implementation results and benefits

1. Implementation results

After nearly a year of research and POC verification, we have completed the parallel operation of the next-generation disaster recovery system and successfully launched one OceanProtect backup device, connecting 400 database instances and completing routine operations such as full backup, incremental backup, and binlog backup, as well as business scenarios such as multi-database mounting, batch recovery, and full recovery. We are currently preparing the online deployment and service cutover of the second OceanProtect backup device. In the second quarter of 2025, all database instances are expected to be connected for backup and the old system fully switched over to the new one.

2. Project benefits

The construction of a new generation of disaster recovery system has solved the four pain points of the original disaster recovery system, namely, cost, efficiency, backup data utilization and data security.

Cost savings: through new technologies such as data compression, deduplication, and hierarchical data storage, the deduplication and compression ratio of backup data reaches 20:1, saving 95% of the overall storage space occupied by backups. At the same time, backup data aged 3 to 6 months is stored in archive storage, and backup data older than 6 months is stored in Blu-ray storage, further reducing storage costs. End-to-end total cost of ownership (TCO) is reduced by 50%.

Efficiency improvement: through efficient backup control and scheduling and the core data backup and recovery architecture, combined with technologies such as instant mounting and hierarchical data storage, overall backup/recovery efficiency is greatly improved. Full backup duration drops from 22 hours to 10 hours; full recovery time is compressed from the original 4.5 to 7 days to 1 day, an efficiency improvement of 77% to 85%; and emergency data recovery time drops from 2 to 4 hours to about 2 minutes.

Further improving the efficiency of the quasi-production environment and the business value of backup data: thanks to instant mount technology and the improved efficiency of restoring backup data, business verification scenarios in the quasi-production environment can restore production data snapshots faster. For example, for business version verification in the quasi-production environment, the old system had to restore the backup data of several database instances to the quasi-production environment by copying data, and decompressing and copying the data usually took several hours. In the new architecture, instant mount technology mounts the instances directly into the quasi-production environment within 2 minutes, eliminating the time-consuming decompression and copying, and the business can use them directly for version verification. For scenarios that require large-scale recovery of full backup data, such as disaster recovery drills, year-end settlement, and quarterly interest settlement, the recovery time is also shortened from 7 days to about 1 day. Improving efficiency activates the "hot" business value of "cold" backup data and gives the backup system higher value.

Security enhancement and compliance assurance: by establishing an anti-ransomware data backup mechanism, the "gold copy" of the data is preserved to the greatest extent, avoiding data loss or damage caused by external virus attacks as well as the potential financial losses and compensation costs. Once built, the system meets the requirements of the Cybersecurity Law and the Data Security Law, avoiding fines and other additional costs arising from security risks.
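
The cost and efficiency percentages quoted above follow from the before/after figures by simple arithmetic, which the short Python sketch below reproduces. The function names are hypothetical; the inputs are the figures stated in this section.

```python
def space_saving(reduction_ratio: float) -> float:
    """Fraction of space saved at a given N:1 data reduction ratio."""
    return 1 - 1 / reduction_ratio


def time_reduction(before: float, after: float) -> float:
    """Fraction of time saved when a task drops from 'before' to 'after'."""
    return 1 - after / before


saving_20_to_1 = space_saving(20)             # 0.95, i.e. 95% of space saved
recovery_low = time_reduction(4.5, 1)         # about 0.778 (4.5 days to 1 day)
recovery_high = time_reduction(7, 1)          # about 0.857 (7 days to 1 day)
full_backup = time_reduction(22, 10)          # about 0.545 (22 h to 10 h)
```

The 77% to 85% range quoted for recovery corresponds to these two recovery values with the decimals dropped, and a 20:1 reduction ratio is equivalent to the 95% space saving.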

 

* This article is included in the Chinese version of the user special issue of "number of words" in January, 2025
