Facing the enormous challenges brought by the explosive growth of AI technology, storage media urgently needs to rethink its evolution path and build new competitiveness.
By Political Commissar Ruan
As the essential "fuel" of artificial intelligence, data's scale and quality directly determine the intelligence level of AI and continuously drive AI models to accelerate toward AGI.
As the solid foundation of the entire AI workflow, data storage is responsible for storing, transferring, and circulating data, supporting the training and inference of large models. Emerging AI application scenarios place higher demands on storage capacity, data processing speed, mobility, and security.
As a key component of data storage devices and the underlying engine of the new productivity revolution, storage media is undergoing pivotal technological change. In this intelligent revolution, what challenges does storage media face, and in which directions will it evolve?
Four Challenges for Storage Media in AI Scenarios
The end-to-end business process of an AI large model mainly comprises four phases: data collection, data preprocessing, model training, and inference deployment. Each phase runs different workloads and places different requirements on storage media.
Figure 1: End-to-end business process of an AI large model
The author believes that storage media, especially SSDs, face the following challenges across the end-to-end AI business process:
• Data collection and data preprocessing: SSD capacity challenges. Unstructured, multi-modal datasets have exploded, with raw datasets reaching the 100 PB scale. Because the local SSD capacity of data preprocessing servers is small, preprocessing must be done in multiple batches, which lengthens the preprocessing cycle. Clusters with more than 10,000 accelerator cards also demand highly scalable storage: with small per-disk SSD capacity, the capacity of a single enclosure is limited, so a single computing cluster has to connect to multiple storage clusters, making the storage estate large, complex, and hard to operate and maintain.
• Model training: data read/write efficiency challenges. The training dataset consists of hundreds of millions of small files that must be loaded quickly to reduce XPU idle time, demanding high random read/write capability (IOPS/TB) from storage media. Computing clusters fail frequently, so checkpoints (CKPT) must be saved and loaded often, demanding high read/write bandwidth from storage media (see the checkpoint sizing sketch after this list).
• Model inference: latency, bandwidth, and I/O challenges. In offline inference of the ResNet-50 model, a single A100 GPU processes 68,994 images per second, and each PFLOPS of computing power requires about 14 Gb/s of bandwidth, demanding high bandwidth from storage media. Edge inference services are latency-sensitive (Internet recommendation requires <30 ms), demanding low storage latency and fast vector retrieval. Inference with large batch sizes and long sequences is inefficient, and the KV Cache consumes large amounts of memory.
• Full process: reliability and security challenges. Industry private datasets face the risks of sensitive-information leakage and ransomware attacks, putting data security in jeopardy. Data encryption, security, and reliability therefore need to be strengthened at the storage media level.
• Full process: model training energy consumption challenges. The AI large model is a new "energy hog": research shows that training GPT-3 consumed about 1.287 GWh of electricity, roughly the annual electricity consumption of 120 American households. From the storage media perspective, the bit density of individual dies and of the SSD as a whole needs to increase to reduce TCO.
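To make the checkpoint bandwidth requirement in the model-training item above concrete, here is a back-of-the-envelope sketch. The 16 bytes per parameter (FP16 weights plus FP32 master weights and Adam states), the GPT-3-scale parameter count, and the 10 GB/s aggregate write bandwidth are all illustrative assumptions, not figures from the article.

```python
# Minimal sketch: estimate checkpoint size and save time for a large model,
# showing why frequent checkpointing demands high write bandwidth.
# 16 bytes/parameter assumes FP16 weights plus FP32 master weights and Adam states.

def checkpoint_seconds(n_params: float, bytes_per_param: int = 16,
                       write_gbps: float = 10.0) -> float:
    """Time (s) to flush one full checkpoint at the given aggregate write bandwidth (GB/s)."""
    ckpt_bytes = n_params * bytes_per_param
    return ckpt_bytes / (write_gbps * 1e9)

if __name__ == "__main__":
    params = 175e9                      # GPT-3-scale parameter count (assumed)
    size_tb = params * 16 / 1e12        # ~2.8 TB per checkpoint
    print(f"checkpoint ≈ {size_tb:.1f} TB, "
          f"≈ {checkpoint_seconds(params):.0f} s at 10 GB/s aggregate write")
```

Under these assumptions a single checkpoint is about 2.8 TB and takes roughly 280 seconds to persist at 10 GB/s, which is why storage write bandwidth directly limits how often checkpoints can be taken.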
To sum up, the storage media challenges in AI scenarios fall into four categories: capacity, performance, security, and energy efficiency. To build new competitiveness for storage media in the AI era, effort should be focused on these four aspects to solve the underlying, fundamental problems.
Figure 2: Four challenges for storage media
Demands and Development Trends of Storage Media in the AI Era
The challenges brought by AI are driving storage media toward larger capacity, higher performance, lower power consumption, lower cost, and stronger security.
(1) Large capacity: storing EB-scale datasets makes SSDs an ideal choice for AI services
From the early SLC and MLC to today's TLC and QLC, flash cell technology has kept evolving, and the number of NAND stacking layers keeps increasing; NAND Flash will soon break through 300 layers, significantly raising storage density. With breakthroughs in 3D NAND technology, the capacity of SSDs built on QLC media is growing rapidly and will evolve to 128 TB and 256 TB; even 1 PB per drive will no longer be a dream.
For example, DapuStor, Memblaze, and Micron have all released PCIe 5.0 30.72 TB TLC SSDs with read bandwidth of about 14 GB/s and write bandwidth of about 10 GB/s. In large-capacity QLC, Solidigm is the leader: its QLC SSD built on 192-layer 3D NAND has reached 61.44 TB (D5-P5336), with sequential read performance of about 7 GB/s and sequential write performance of about 3 GB/s, and a 122.88 TB QLC drive is planned for mass production in the first half of 2025. The domestic manufacturer DapuStor has also launched a 61.44 TB SSD (J5060) based on QLC media.
Figure 3: Insights into large-capacity SSD manufacturers
Compared with TLC SSDs, QLC SSDs offer comparable read performance while reducing energy consumption and footprint, which makes them well suited to read-intensive AI inference scenarios such as CDNs and OLAP databases and an ideal choice for AI services. As AI applications shift from training to inference, storage demand is moving toward localized deployment. To meet more customization requirements, SSDs with higher performance and larger capacity will be introduced: SK Hynix is reported to be developing a 300 TB ultra-large-capacity SSD to meet AI requirements and reduce the overall TCO of data centers.
(2) High performance: high throughput and low latency to accelerate AI workloads
Due to the limitations of the front-end protocol and back-end channel rate, SSD performance cannot grow linearly with capacity: NAND die bandwidth has increased tenfold in 5 years, while channel bandwidth has taken 10 years to increase tenfold. AI places high demands on SSD performance for loading small files and reading large files, with the goal of reducing XPU idle time and shortening the time for large models to reach production.
The front-end interface protocol has evolved from PCIe 3.0 and PCIe 4.0 to PCIe 5.0; compared with PCIe 4.0, PCIe 5.0-based SSDs roughly double performance.
Many mainstream SSD manufacturers, such as SK Hynix, Micron, Huawei, and DapuStor, have PCIe 5.0 SSDs in mass production. For example, SK Hynix's PS1010 delivers sequential read performance of 15,000 MB/s and sequential write performance of 10,200 MB/s.
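As a rough sanity check of such vendor figures, the minimal sketch below measures approximate sequential-read throughput on a Linux block device. The device path, read size, and use of buffered reads are assumptions for illustration only; a rigorous benchmark would use fio with O_DIRECT to bypass the page cache.

```python
# Minimal sketch: rough sequential-read throughput check for an NVMe device or large file.
import os
import time

TEST_PATH = "/dev/nvme0n1"      # hypothetical device or large test file (needs read permission)
BLOCK_SIZE = 4 * 1024 * 1024    # 4 MiB reads approximate sequential access
TOTAL_BYTES = 8 * 1024 ** 3     # read 8 GiB in total

def sequential_read_throughput(path: str) -> float:
    """Return observed sequential read throughput in MB/s (decimal, as vendors report)."""
    read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        while read < TOTAL_BYTES:
            chunk = os.read(fd, BLOCK_SIZE)
            if not chunk:           # end of file/device reached early
                break
            read += len(chunk)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return read / elapsed / 1e6

if __name__ == "__main__":
    print(f"~{sequential_read_throughput(TEST_PATH):.0f} MB/s sequential read")
```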
On the other hand, the development of CXL opens the way to faster and more flexible data transmission; the protocol has evolved to CXL 3.0 with a transfer rate of 64 GT/s. CXL connects devices to CPUs and decouples storage from compute, allowing the CPU to access larger memory pools on attached devices with low latency and high bandwidth, breaking through the limits of traditional DDR channels and thereby expanding memory capacity and bandwidth. For cache scenarios with extreme performance requirements, such as the KV Cache in large-model inference, which demands extreme bandwidth, CXL-attached devices can be used to accelerate data loading. Samsung has introduced the CMM-D memory expander based on the CXL protocol, which integrates seamlessly with existing DIMMs and increases bandwidth by up to 100%.
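To illustrate why the inference KV Cache outgrows local DRAM and benefits from CXL-expanded memory, the sketch below estimates its footprint using the standard per-token accounting. The model shape (80 layers, 8 KV heads, head dimension 128), the 32K-token context, and the batch size are illustrative assumptions, not figures from the article.

```python
# Minimal sketch: estimate the KV Cache footprint of a decoder-only transformer during inference.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # Per token and layer we keep one K and one V vector per KV head (factor of 2);
    # bytes_per_elem=2 assumes FP16/BF16 storage.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

if __name__ == "__main__":
    # Roughly 70B-class model shape with grouped-query attention (assumed).
    size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                          seq_len=32_000, batch=16, bytes_per_elem=2)
    print(f"KV cache ≈ {size / 1024**3:.1f} GiB")   # ~156 GiB under these assumptions
```

Even with grouped-query attention, a long-context, moderately batched workload already exceeds a single accelerator's HBM, which is what pushes the cache toward pooled or CXL-expanded memory tiers.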
(3) Green and low carbon: efficient, energy-saving storage for green data centers
Worldwide, energy conservation and emission reduction have become a shared mission, and every industry is actively pursuing the goals of "carbon peaking" and "carbon neutrality." AI workloads are power hungry, consuming enormous amounts of electricity from data center construction through day-to-day operation. Storage accounts for up to 35% of a data center's energy consumption, and data centers have shifted from competing on computing power to competing on energy. SSDs offer high density, high reliability, low latency, and low energy consumption, so replacing HDDs with SSDs has become an inevitable trend in the AI era; deploying all-flash SSDs at scale can greatly reduce the energy consumption of AI computing centers and achieve green, energy-efficient, sustainable development.
Many North American CSPs, such as xAI, have built large data centers and deployed Solidigm's large-capacity QLC SSDs to build AI data lakes and reduce data center TCO. Taking Alibaba Cloud's 10 PB storage solution as an example, Figure 4 compares the energy consumption of HDD and SSD solutions.
Figure 4: HDD solution vs. SSD solution comparison
• To meet IOPS requirements in AI scenarios, AI servers often over-provision HDDs, which drives up TCO.
• Compared with HDDs, SSDs deliver better performance per watt, bringing substantial cost savings: about 46% lower TCO over 5 years.
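As a rough framework for reproducing this kind of comparison, the sketch below shows how a 5-year TCO estimate can be assembled. Every parameter is a placeholder to be filled with real quotes and measured power; none of the values are taken from Figure 4.

```python
# Skeleton for a 5-year TCO comparison like the HDD vs. SSD case above.
from dataclasses import dataclass

@dataclass
class DriveOption:
    usable_tb: float          # usable capacity per drive
    price_usd: float          # purchase price per drive
    watts: float              # average power draw per drive
    iops: float               # sustained random-read IOPS per drive

def drives_required(opt: DriveOption, capacity_tb: float, required_iops: float) -> int:
    # HDD tiers are often sized by IOPS rather than capacity (the over-provisioning
    # noted above), so take the larger of the two counts.
    by_capacity = -(-capacity_tb // opt.usable_tb)   # ceiling division
    by_iops = -(-required_iops // opt.iops)
    return int(max(by_capacity, by_iops))

def five_year_tco(opt: DriveOption, capacity_tb: float, required_iops: float,
                  usd_per_kwh: float, pue: float, years: int = 5) -> float:
    n = drives_required(opt, capacity_tb, required_iops)
    capex = n * opt.price_usd
    energy_kwh = n * opt.watts * pue * 24 * 365 * years / 1000
    return capex + energy_kwh * usd_per_kwh
```

Filling DriveOption with vendor quotes and measured power for an HDD option and a QLC SSD option yields the two 5-year figures: the IOPS term is what usually forces HDD over-provisioning, and the PUE-weighted power term is where the SSD's better performance per watt shows up.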
(4) Security and reliability: built-in security in storage media to protect core data assets
In the era of AI large models, data reliability determines model accuracy, and industry data is mostly private data, an important data asset. At the same time, data has become the most vulnerable high-value asset: in the AI era, new attacks such as ransomware, data poisoning, and theft are emerging, constantly threatening the reliability of training data and the accuracy of model outputs, and causing serious economic losses.
For example, in March 2023 Meta's large model weights were leaked, and similar large models such as Alpaca, ChatLLaMA, ColossalChat, and FreedomGPT appeared one after another in the following weeks. Meta was forced to announce it as open source, and its earlier investment was largely lost.
Security measures include extracting behavior patterns by analyzing I/O operations on SSDs, aggregating feature analysis across multiple disks, and using machine-learning models inside the engine to detect abnormal behavior and defend against ransomware, as sketched below.
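A minimal sketch of this kind of engine-side anomaly detection, assuming per-window I/O telemetry features and scikit-learn's IsolationForest; the article does not specify a concrete feature set or model, so both are illustrative choices.

```python
# Minimal sketch: flag anomalous I/O behavior (e.g., the burst of overwrites and
# high-entropy writes typical of ransomware) from per-interval SSD I/O statistics.
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row: [write_ratio, mean_io_size_kb, overwrite_rate, write_entropy, iops]
# collected per disk per time window (hypothetical telemetry schema).
baseline = np.random.default_rng(0).normal(
    loc=[0.3, 64, 0.05, 4.0, 2000], scale=[0.05, 8, 0.01, 0.3, 300], size=(500, 5))

detector = IsolationForest(contamination=0.01, random_state=0).fit(baseline)

# A window dominated by overwrites of high-entropy (encrypted-looking) data.
suspicious = np.array([[0.95, 128, 0.80, 7.9, 9000]])
print(detector.predict(suspicious))   # -1 => anomaly, 1 => normal
```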
The continuing surge of AI is reshaping the evolutionary path of storage media. Facing the flood of massive data, the pressing demand for efficient storage and fast reads, the increasingly complex evolution of security threats, and the hard constraints of green transformation, storage media stands at the crossroads of a technological revolution. Only by grasping the intertwined main lines of technological evolution and actively exploring breakthroughs in storage media innovation can we seize the initiative in the development of the intelligent era.
Author: Political Commissar Ruan | certified expert of the Data Storage Committee's official WeChat account; senior engineer at Huawei, mainly engaged in storage media R&D and media technology insight and planning for AI scenarios.
This article is reprinted from the WeChat official account of the Data Storage Professional Committee of the China Power Standards Association.