Construction Practice of a 10,000-Card Cluster: How to Greatly Improve Compute Utilization
Community Assistant · 2025-05-14 11:47 · Published in China

By Xu Ensong, Xunfei

 

I. Application Background

With the global popularity of ChatGPT, large models have become a hot topic. Since Xunfei first released the Spark model in May 2023, its capabilities have improved continuously. Model parameters keep growing, and so does the scale of the training cluster: from single-machine single-card training, to multi-machine multi-card training, and now to 10,000-card clusters. Storage read/write performance therefore directly affects the availability of the compute cluster. As clusters scale up, failures become more frequent, further reducing availability; across the industry, GPUs spend less than 50% of their time on effective training. The following figure shows the impact of storage read/write performance on compute cluster availability.

 

 

 

 

The Xunfei HPC training platform has been under construction since 2015. Over 8 years of development, the storage solution has gone through three major iterations, and we are now exploring a fourth-generation architecture.

 

The first-generation storage architecture used traditional disk arrays, which hit serious scaling bottlenecks as data grew.

The second-generation storage architecture used open-source distributed storage software. In operation we ran into many performance and stability challenges; with limited investment, these were difficult to solve in the short term, so we moved on to the third-generation architecture.

The third-generation storage architecture adopted a standard commercial distributed storage solution, which largely solved the performance and stability problems. However, it offered no effective means of data governance: low-frequency data could not be quickly retrieved from the massive dataset, and data could not be tiered, which drove up resource costs.

As data grows to a certain scale, our requirements for storage keep rising. Performance and reliability are the foundation; on top of that, we need to explore an effective data lifecycle management solution.

 

II. Challenges

Large models are evolving AI from perception and understanding toward generation and creation. It is fair to say that data quality determines the ceiling of AI intelligence. To improve model quality, we currently face the following data management and processing problems:

1. Difficult data governance: AI training sets contain tens of billions of files. The current siloed ("chimney") storage-cluster construction mode creates multiple data islands, so data must be migrated manually, which is inefficient. There is also no global data visibility, so hot/cold data and high-value data cannot be identified. Data governance is therefore difficult.

2. Low GPU utilization: AI large-model training is dominated by multi-machine multi-card jobs with a high failure frequency. Model loading and resumable-training checkpoint reads and writes place high I/O and bandwidth demands on the storage system. On average, a thousand-card cluster fails once a day, and resuming from a breakpoint takes 15+ minutes, costing hundreds of thousands of RMB each time.

3. Scattered, unreliable clusters: storage was built as multiple silos totaling dozens of PB, split into many scattered PB-level small clusters, which greatly increases management complexity. Building storage from separately sourced software and hardware further reduces cluster reliability and bandwidth.

4. Diverse data access patterns and complex heterogeneous storage management.

5. Growing sample sets: from TB-scale to PB-scale, with raw data potentially reaching EB-scale.

6. Growing training parameters: in model training, target model parameters can reach hundreds of billions or even trillions.

7. Data security: throughout training and data use, any data loss or data-security compliance incident is fatal to the quality and efficiency of model training.
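The GPU-utilization cost described in item 2 can be sketched with rough arithmetic. All numbers below are illustrative assumptions drawn from the figures in this article (a thousand-card cluster, one failure per day, 15-minute recovery), not measured values:

```python
# Rough estimate of GPU time lost to failures on a multi-card cluster.
# Illustrative numbers only; a whole-cluster stall is assumed during recovery.

def gpu_hours_lost(cards: int, failures_per_day: float, recovery_min: float) -> float:
    """GPU-hours idled per day while the whole cluster waits on recovery."""
    return cards * failures_per_day * recovery_min / 60.0

# A 1,000-card cluster failing once a day with a 15-minute resume time
# idles 250 GPU-hours per day; cutting recovery to 1 minute leaves ~17.
slow = gpu_hours_lost(1000, 1, 15)
fast = gpu_hours_lost(1000, 1, 1)
print(slow, fast)  # → 250.0 16.666...
```

The gap between the two figures is what faster checkpoint I/O buys back every day.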

 

III. Construction Plan

Xunfei and Huawei have joined forces to create a full-stack solution covering large-model compute and storage. Based on Huawei Ascend AI hardware and software, including Atlas series AI chips and servers, the MindSpore full-scenario AI framework, and the CANN heterogeneous computing architecture, we built the Xunfei Flint platform to support training and inference of the Xunfei Spark large model. We are also building a unified storage base on Huawei OceanStor distributed storage, and are carrying out a number of studies, including:

1. Compute-storage co-design research, targeting a 10%–20% improvement in training efficiency;

2. Accurate identification and management of data value, improving the efficiency with which data of different value is tiered and moved;

3. Explore scenario-based data reduction technologies to achieve data lifecycle management and reduce storage costs.

Xunfei is committed to building a future-oriented, efficient, and reliable large-model data infrastructure. We believe domestic large models only have a future when built on independently innovated compute and storage foundations. The details of the solution follow. First, we adopt a compute-storage separation architecture, as shown in the following figure:

 

At the computing and scheduling layer, we built a platform that supports diverse heterogeneous compute, including the latest generation of Huawei Ascend AI compute, providing a centralized, high-performance, stable, and secure compute cluster for large-model training.

At the data storage layer, we applied the concept of a unified storage platform for the first time. Based on OceanStor distributed storage, we built an integrated storage system that solves the following four key problems:

 

First, data of multiple sources and types can be collected quickly without migration. Multiple access protocols are provided, so data does not need to be copied or migrated between storage clusters, greatly saving data-preprocessing time and cost. Second, an adaptive processing mechanism for different I/O models meets the storage performance requirements of multi-scenario training jobs.

Third, data popularity is identified without manual intervention: through policy configuration, massive data is automatically tiered, meeting both the capacity requirements of massive raw data and the performance requirements of model training. Finally, a new-generation security system, covering independently secure infrastructure, end-to-end data high availability, service-level and data-usage permission management, and defense against external data security threats, provides flexible options and strong support for our model training.
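A policy-driven hot/cold tiering pass like the one described above can be illustrated with a minimal sketch. The thresholds, tier names, and access-time criterion here are assumptions for illustration, not OceanStor's actual policy engine:

```python
import time
from dataclasses import dataclass

@dataclass
class FileMeta:
    path: str
    last_access: float  # epoch seconds

def choose_tier(meta: FileMeta, now: float,
                hot_days: float = 7, warm_days: float = 30) -> str:
    """Map a file to a storage tier by how recently it was accessed."""
    idle_days = (now - meta.last_access) / 86400
    if idle_days <= hot_days:
        return "ssd"     # hot: keep on the performance tier
    if idle_days <= warm_days:
        return "hdd"     # warm: capacity tier
    return "object"      # cold: cheapest tier / archive

now = time.time()
print(choose_tier(FileMeta("/train/sample.bin", now - 2 * 86400), now))  # → ssd
```

A real system would run such a policy continuously over billions of files and move data between tiers in the background, which is exactly what removes the manual-intervention burden described above.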

GPUs account for the largest share of AI large-model IT infrastructure cost, about 50% or more, so improving GPU utilization directly saves money. In particular, loading and recovering model checkpoints often means hours or even days of waiting. How do we maximize storage performance so that storage is no longer the training bottleneck? There are two main approaches.

 

First, the multi-protocol lossless interconnection capability of distributed storage enables zero-copy of AI large-model data across the whole pipeline of collection, preprocessing, and training/inference. With one storage system, data of multiple sources and types can be collected, read, and written quickly without migration, and the output of one step can be used directly as the input of the next. This has two advantages: it reduces the time spent repeatedly copying data, and it avoids storing large volumes of data multiple times, which also saves storage cost.

 

The second approach is that distributed storage applies different processing mechanisms to large and small I/O, identified automatically on the storage side. Large I/O written from the client goes directly to the mechanical disks, reducing path overhead; small I/O is aggregated in the cache layer before being flushed to disk, greatly reducing the number of I/O interactions and improving IOPS. One storage system thus supports both high IOPS and high bandwidth.
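The large/small I/O split can be modeled with a toy write path. The 128 KB threshold and 1 MB flush buffer are assumptions for illustration; the real routing logic is internal to the storage system:

```python
class WritePath:
    """Toy model: large writes bypass the cache, small writes aggregate."""

    def __init__(self, large_io_threshold=128 * 1024, flush_size=1024 * 1024):
        self.threshold = large_io_threshold
        self.flush_size = flush_size
        self.buffer = bytearray()
        self.disk_writes = 0  # count of physical disk I/Os

    def write(self, data: bytes) -> None:
        if len(data) >= self.threshold:
            self.disk_writes += 1      # large I/O: straight to disk
        else:
            self.buffer += data        # small I/O: aggregate in cache
            if len(self.buffer) >= self.flush_size:
                self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.disk_writes += 1      # one disk I/O covers many small writes
            self.buffer.clear()

path = WritePath()
for _ in range(256):                   # 256 × 4 KB small writes = 1 MB
    path.write(b"x" * 4096)
path.write(b"y" * (256 * 1024))        # one large write
print(path.disk_writes)                # → 2: one aggregated flush + one large I/O
```

The point of the sketch: 257 client writes became 2 disk I/Os, which is the mechanism behind "high IOPS and high bandwidth from one system."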

AI models have now evolved to nearly EB-scale data volumes, and multiple tasks train in parallel on one training cluster. If a task is misconfigured and consumes too much storage I/O, performance contention can hurt the training efficiency of other tasks. Working with Huawei's data storage team, we creatively built multiple physically isolated storage pools on demand within the same storage system, while allowing one client to connect to multiple pools. Each large model's data thus gets its own storage pool with guaranteed performance and capacity, physical isolation at the file-system level, reduced performance contention, and a smaller fault domain. And because the pools share one storage system, management and maintenance remain simple.
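The pool-per-workload isolation can be sketched as a simple mapping. The pool names, tiers, and capacities below are hypothetical, introduced only to show the shape of the scheme:

```python
# Hypothetical mapping of training workloads to physically isolated pools.
# One client may mount several pools; each job writes only to its own pool,
# so I/O contention and fault domains stay confined to that pool.
POOLS = {
    "spark-pretrain": {"tier": "nvme", "capacity_pb": 4},
    "spark-finetune": {"tier": "nvme", "capacity_pb": 1},
    "raw-corpus":     {"tier": "hdd",  "capacity_pb": 20},
}

def pool_for(job: str) -> str:
    """Resolve a workload to its dedicated, physically isolated pool."""
    mapping = {"pretrain": "spark-pretrain",
               "finetune": "spark-finetune",
               "ingest":   "raw-corpus"}
    return mapping[job]

print(pool_for("pretrain"))  # → spark-pretrain
```

The key property is that no two workloads resolve to the same pool, while all pools live under one management plane.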

 

IV. Practical Analysis

Resumable-training recovery is 15× faster: the cluster provides TB-level bandwidth, which reduces checkpoint read/write time. Recovery time for resumable training dropped from 15 min to 1 min, saving hundreds of thousands of RMB per day.
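The 15× figure follows directly from checkpoint size divided by aggregate bandwidth. The checkpoint size and bandwidth values below are illustrative assumptions, chosen only to reproduce the 15 min → 1 min shape of the result:

```python
def recovery_minutes(checkpoint_tb: float, bandwidth_gbs: float) -> float:
    """Minutes to read a checkpoint of the given size (TB) at a given
    aggregate storage bandwidth (GB/s)."""
    return checkpoint_tb * 1024 / bandwidth_gbs / 60

# Illustrative: a 10 TB checkpoint at ~11 GB/s takes ~15 min;
# at TB-level aggregate bandwidth (~170 GB/s) it takes ~1 min.
print(round(recovery_minutes(10, 11.4), 1))   # → 15.0
print(round(recovery_minutes(10, 170.7), 1))  # → 1.0
```

Because recovery time scales inversely with bandwidth, the storage upgrade, not any change to the checkpoint itself, accounts for the speedup.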

Secure and reliable storage cluster: multiple StoragePools within a single AI storage cluster, with a unified management plane and separated data planes. Data-plane isolation prevents faults from spreading across the AI cluster. Sub-health management, large-ratio erasure coding (EC), and other features further improve storage reliability, achieving five-nines (99.999%) single-cluster reliability.

TCO reduced by 30% through lifecycle management: unified data-lake management with a GFS global file system and lossless multi-protocol interconnection eliminates data islands; data is globally visible, manageable, and flows efficiently, tripling cross-domain scheduling efficiency, and zero-copy data access accelerates end-to-end AI model development. Hundreds of billions of metadata entries can be retrieved in seconds, with intelligent hot/cold identification and accurate tiering balancing storage performance against capacity.

 

V. Benefits and Values


 

Xunfei has been deeply engaged in education for many years. Since the first release of the Xunfei Spark model, the technology has been applied to educational scenarios, and the Xunfei Changyan smart classroom product is a successful example of multimodal capability applied in education. Based on the multimodal capabilities of Xunfei Spark 2.0, such as image description, understanding, and reasoning, image recognition and creation, image generation, and virtual human synthesis, more and more data can be obtained from the real world. Learning, training, and deployment on product terminals effectively support classroom innovation and greatly reduce the production cycle and cost of classroom content.
