2024 Strategic Roadmap for Storage
伏羲Dai · 2024-08-13 11:55 · published in China

 

Key Findings

A substantial increase in data volume, the proliferation of data at the enterprise edge, and the adoption of AI and generative AI (GenAI) workloads require the use of data classification tools to improve storage optimization, data life cycle enforcement, security risk mitigation and data workflow speeds.

A hybrid-platform-native architecture for consumption-based storage as a service (STaaS) provides substantial cost efficiencies, meaningful productivity gains, and a more resilient and durable data services environment.

Widespread and costly network storage threats, along with rising regulation and insurance costs, have forced storage professionals to adopt new approaches for active defense.

A common platform for file and object storage consolidates unstructured data workloads and reduces complexity, costs and vendors.

The deployment of applications across an expanding set of hybrid cloud and SaaS environments complicates the protection and recovery of business-critical data.

Introduction

Market forces propelling digital transformation, the adoption of on-premises and hybrid cloud strategies, and the rapid increase in data volume and proliferation across the data center, cloud and edge require disrupting long-standing paradigms. Infrastructure and operations (I&O) leaders must adopt new storage platform technologies and techniques.

This research gives I&O leaders insights into modern storage platform technologies, methods and storage vendor investments to intercept and enable rapid assimilation of changes in the IT infrastructure landscape. It provides an overarching plan and a timeline to gain control of and derisk those factors that may disrupt IT operations. The strategic storage roadmap also offers guidance on how to enhance IT outcomes in the age of hybrid cloud and digital data.

Advancements in storage technologies and consumption plan offerings, including data services such as backup, disaster recovery, ransomware prevention and governance, will substantially augment and automate complex and labor-intensive processes.

Over the next few years, I&O leaders responsible for multicloud data storage infrastructure will change how they source, deploy, finance and manage their storage assets and the data environment. In the hybrid cloud era, platform-native capabilities will drive storage infrastructure decisions and vendor investment strategies as the data center becomes more physically and logically distributed across multiple infrastructure domains. These include on-premises, colocation, public cloud and edge domains. I&O leaders will need to weigh the benefits of centralizing mission-critical application infrastructure in the public cloud against hybrid platform strategies that build on generational storage vendor capabilities.

I&O leaders appreciate the wholesale benefits of the cloud operating model. It replaces capital expenditure (CAPEX) financing and sourcing activities with consumption-based, vendor-managed offerings that substantially benefit IT operations across multiple facets. The shift to software-defined infrastructure methods and automation initiatives will have a favorable impact on talent retention and subject matter expertise issues, allowing IT leaders to exit hardware administration and leverage vendor life cycle management capabilities.

2024 Strategic Roadmap for Storage Overview

Future State:

Platform-native, centralized control and data services planes support hybrid IT environments.

Software-defined storage and NVMe-oF architecture replace conventional storage devices and become mainstream.

A single platform for file and object data enables consolidation of unstructured data infrastructure workloads.

Active defense–based cyber detection solutions mitigate resilience risks.

NVMe QLC flash SSD–based systems replace a large percentage of HDD-based hybrid systems.

Current State:

Storage systems lack automated self-service orchestration.

Storage appliances and the three-tier architecture are rigid and inflexible.

Dispersed file and object storage workloads create silos of unstructured data.

Ransomware and insider cyberattacks are increasingly severe security threats.

Storage systems consume 11% to 14% of data center power and drive a large percentage of carbon emissions.

Gap Analysis:

On-premises storage systems lack a platform-native data services architecture strategy.

Inflexible CAPEX financing and feature sourcing lack consumption-based as-a-service benefits.

Hardware-based storage administration does not take full advantage of AIOps.

Data protection schemes do not offer the resilient detection capabilities needed to counter ransomware threats.

Migration Plan:

Transition to platform-native hybrid IT operations.

Consolidate unstructured data workloads to a single file and object platform.

Invest in advanced data protection and network storage ransomware prevention methods.

Invest in high-density QLC flash SSD-based systems to reduce power consumption and carbon emissions.

The widespread adoption of the cloud operating model is ushering in new storage architecture and data services strategies, including a data services platform-centric operations approach delivered through software-defined technologies and an API-first mindset. It's imperative that I&O leaders lay the foundation for a rapid shift by leveraging a platform ecosystem of integrated as-a-service partners to unlock advanced and emerging storage capabilities. Further, I&O leaders have to take an entirely different approach to subject matter expert processes and IT operating models to gain control of unpredictable costs by employing advanced AIOps and vendor automation tools. Traditional CAPEX sourcing activities, labor-intensive IT data management methods and fragile legacy systems simply will not scale with the proliferation of digital data.

Enterprise storage infrastructure is at an inflection point, with hyperscalers offering on-premises storage services and on-premises storage vendors offering consumption services. I&O leaders will need to understand the competitive differences, narratives and nuances across the storage landscape to ensure a favorable outcome as the line blurs between the two approaches. Regardless, IT must embrace the hybrid cloud operating model and plot a course that synthesizes these insights and guidance into an action plan with the necessary resources.

In summary, the following market forces are shaping the enterprise storage landscape over the next five years:

Data volumes are remarkably large and undergoing rapid, unpredictable growth, with proportionally increasing risks, as data center activities and IT operations expand everywhere: Advancements in digitalization, the use of AI and GenAI, and the adoption of cloud operating models are putting enormous pressure on IT operations to transform outdated methods.

Freeing scarce staff from low-level administration and support activities will improve productivity: Advancements in AIOps and GenAI augment subject matter expertise, enabling higher levels of productivity, resilience, and automated and streamlined operations.

CAPEX processes are costly and inflexible, driving inefficiencies of up to two-thirds in sourcing and budgeting, extraneous costs, and drawn-out refresh cycles: Consumption-based as-a-service offerings, including block, file, object and data services, replace CAPEX sourcing and on-premises, upgrade-driven IT management.

Remediation and recovery from an ever-increasing array of unsustainable security threats are inadequate: Advancements in ransomware detection and prevention methods mitigate threat exposures and substantially improve recovery with minimal financial loss.

The data center edge is rising, with over 50% of data to be generated and processed there: Advancements in edge architecture and efficient data management need to be tightly coupled with platform-native central control and data plane capabilities.

GenAI and AI Systems

Of special note is the urgent and heightened interest in GenAI and AI storage workload infrastructure, and in addressing the myriad intrinsic storage-related technologies and best practices for optimal deployment. While AI is not a specific technology, and storage analysts are not recommending implementing it as part of the roadmap or timeline, it is important to provide insight into potential infrastructure strategies to ensure a successful project.

Not all enterprises will require building new storage specifically for running GenAI applications. For some enterprises, public cloud may be the right choice, depending on their LLM needs or GPU requirements. For other enterprises, their existing performant storage may be good enough, especially if they are just piloting an off-the-shelf language model and going directly to the inference stage, skipping the training part, or only refining an existing model.

From a feature and functionality perspective, storage for GenAI is not too different from storage for any other analytics application. The exception is that the performance capabilities required to feed the compute farm become even more relevant for GenAI and are amplified at larger scale. The training stage of the GenAI workflow can be very demanding from a performance point of view, depending on the model size. Not only must the storage layer support high throughput to feed the CPU or GPU farm, but it must also deliver the performance to support model checkpointing and recovery quickly enough to keep the compute farm running.
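To make the checkpointing requirement concrete, consider a back-of-the-envelope estimate like the one below. The model size, state multiplier and pause budget are illustrative assumptions, not benchmarks from the source roadmap.

```python
# Back-of-the-envelope estimate of the write throughput needed to checkpoint
# a large model without stalling the GPU farm. All figures are illustrative
# assumptions, not measurements.

model_params = 70e9        # assumed 70B-parameter model
bytes_per_param = 2        # fp16/bf16 weights
state_multiplier = 6       # rough multiplier: weights + gradients + optimizer state

checkpoint_bytes = model_params * bytes_per_param * state_multiplier
max_pause_s = 60           # assumed tolerable training stall per checkpoint

print(f"Checkpoint size: {checkpoint_bytes / 1e12:.2f} TB")
print(f"Required write throughput: {checkpoint_bytes / max_pause_s / 1e9:.0f} GB/s")
```

Under these assumptions, an approximately 0.84 TB checkpoint written within a one-minute stall budget implies roughly 14 GB/s of sustained write throughput; larger models or tighter budgets scale the requirement linearly.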

For GenAI storage, enterprises should consider not just the immediate training data for running a specific language model, but also the storage required to keep data for future GenAI applications. Such data is best stored on a data lake that is scalable and cost-effective, but not necessarily on high-performance storage as required for training the language model.
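As one possible illustration of such tiering, aging training data on an S3-compatible data lake can be moved off high-performance storage with a lifecycle rule. The bucket name, prefix and 30-day window below are hypothetical assumptions.

```python
# A minimal sketch of cost-based tiering on an S3-compatible data lake using
# boto3. The bucket name, prefix and 30-day window are illustrative
# assumptions, not recommendations from the source roadmap.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="genai-data-lake",                    # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-cold-training-data",
            "Filter": {"Prefix": "raw/"},        # hypothetical prefix
            "Status": "Enabled",
            # Move data kept for future GenAI use off high-performance tiers.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }]
    },
)
```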

The three broad deployment approaches to storage for GenAI applications are:

Use public cloud storage.

Build an end-to-end storage infrastructure layer with the right performance and data management features for each stage of GenAI workflow.

Build a single platform that can host all of the training data and can support the diverse workload characteristics that correspond to the different stages of GenAI workflow.

Each of the above approaches may be augmented by a third-party hosting firm that has the infrastructure and project management expertise to implement and manage deployment. If the above approaches lead to hosting the storage layer on-premises, it is not always required to deploy multiple platforms to support the different stages for the GenAI workflow, especially when the model size or the total size of the training data is not very large. Modern storage platforms are designed to handle a diverse set of workloads to support both training and inference stages.

Answers to the questions below will dictate which of the approaches above is best for the enterprise (a simple decision sketch follows the list):

Where will the generative AI model run — on-premises or in the public cloud?

Where will the compute/GPUs be located — on-premises or in the public cloud?

What is the total capacity of the data (in GB or TB) used for training or refining the language model?

For multimodal AI models, will your storage technologies deliver performance for a variety of data sources (text, code, image, video and audio)?

Where will the actual application be hosted to provide inference?
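As an illustration only, the sketch below maps answers to these questions onto the three deployment approaches above. The decision thresholds and the function name are hypothetical assumptions, not prescriptive guidance.

```python
# Hypothetical decision helper mapping answers to the questions above onto
# the three deployment approaches. Thresholds are illustrative assumptions.

def recommend_storage_approach(model_runs_in_cloud: bool,
                               gpus_in_cloud: bool,
                               training_data_tb: float) -> str:
    if model_runs_in_cloud and gpus_in_cloud:
        # Model and compute both in the cloud: keep data next to them.
        return "Use public cloud storage"
    if training_data_tb > 500:  # assumed scale where stage-specific tiers pay off
        return ("Build an end-to-end storage layer with stage-specific "
                "performance and data management")
    return ("Build a single platform that hosts all training data and serves "
            "the diverse workload stages")

# Example: a modest on-premises pilot.
print(recommend_storage_approach(model_runs_in_cloud=False,
                                 gpus_in_cloud=False,
                                 training_data_tb=120))
```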

Once the data is ingested and prepared, the two main activities involved in a generative AI workflow are the following (a rough sizing sketch follows the list):

Training: This is the workflow stage where the language model is created or refined from training data. This can be very compute-intensive, depending on the number of training parameters, and optimally requires a continuous data feed from the storage layer to the compute layer. The main storage capability required from a performance point of view is throughput measured in gigabytes per second at large scale. Various storage features contribute to higher throughput of the system.

Inference: This is the workflow stage where user input is run through the model to generate output. The main performance capability required in this stage is very low latency, measured in hundreds of microseconds.
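To make these two performance targets concrete, here is a rough sizing sketch. The GPU count, per-GPU feed rate and latency budget are illustrative assumptions.

```python
# Rough arithmetic behind the two performance targets named above.
# Every number here is an illustrative assumption.

gpus = 256                 # assumed GPU count in the training farm
per_gpu_feed_gbps = 2.0    # assumed sustained read feed per GPU, in GB/s

# Training: aggregate read throughput the storage layer must sustain.
print(f"Aggregate training read throughput: {gpus * per_gpu_feed_gbps:.0f} GB/s")

# Inference: a latency budget in the hundreds of microseconds leaves little
# room for storage; e.g., a 500 microsecond end-to-end budget with 20%
# allotted to storage implies a ~100 microsecond storage read target.
latency_budget_us = 500
storage_share = 0.2
print(f"Storage read latency target: {latency_budget_us * storage_share:.0f} us")
```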

Other considerations when selecting storage infrastructure for GenAI workloads:

Single platform: A common platform, built on a key-value datastore for fast data access, that stores all of the enterprise's training data. Support for GPUDirect Storage (GDS) is required to maximize data access and transfer speeds and keep the GPUs optimally busy. Use NFS over RDMA for high-speed, file-based access. Hybrid cloud requires a common platform that can provide access to data stored across multiple locations.

Custom metadata: This provides the ability to add context to the underlying data through enhanced metadata.

A metadata index and catalog: This allows you to easily find and discover relevant data for training (see the catalog sketch after this list).

Flash-based storage infrastructure: This addresses the lowest-latency requirements of both the training and inference stages. Consider high-density QLC NAND flash media for large data lakes.
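As an illustration of the custom metadata and catalog capabilities above, the sketch below tags objects with contextual metadata and builds a simple modality index on an S3-compatible object store via boto3. The bucket name, keys and tagging scheme are hypothetical assumptions, not part of the source roadmap.

```python
# A minimal sketch of custom metadata plus a searchable catalog on an
# S3-compatible object store, using boto3. Bucket, keys and the tagging
# scheme are hypothetical assumptions.
import boto3

s3 = boto3.client("s3")
BUCKET = "genai-training-data"  # hypothetical bucket

def put_with_context(key: str, body: bytes, modality: str, source: str) -> None:
    """Store an object with custom metadata that adds context to the data."""
    s3.put_object(Bucket=BUCKET, Key=key, Body=body,
                  Metadata={"modality": modality, "source": source})

def build_catalog() -> dict:
    """Index custom metadata so relevant training data is easy to discover."""
    catalog: dict = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            meta = s3.head_object(Bucket=BUCKET, Key=obj["Key"])["Metadata"]
            catalog.setdefault(meta.get("modality", "unknown"), []).append(obj["Key"])
    return catalog

# Example: find all image data earmarked for training.
# image_keys = build_catalog().get("image", [])
```

Calling head_object per key keeps the sketch readable; a production catalog would persist the index and update it incrementally on ingest rather than rescanning the bucket.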

2024-2028 Strategic Storage Roadmap Timeline

2024

Replace a minimum of 33% of IT storage administration and support activities/costs with managed AIOps storage tools.

Going forward, consolidate unstructured file and object data stores into a single, unified platform.

Replace a minimum of 50% of new storage CAPEX capacity with consumption-based STaaS usage.

Mitigate compliance risks with AI/ML-based, policy-aware data discovery and classification tools.

Source NIST framework-based cyber storage solutions with a 100% data recovery SLA guarantee.

Drivers:

Hybrid platform-native services architecture for consumption-based, SLA-backed offerings

Shift from sourcing product features to IT Ops SLAs

2025

Stop all CAPEX-based storage infrastructure sourcing by 2025 and replace it with STaaS.

Replace storage hardware SME skills with software AIOps specialists and tools by 2025.

Source cyber storage solutions with SLAs that guarantee a minimum of 99.99% cyber resilience.

Adopt sustainability SLA-based solutions that monitor, report and remediate power consumption and CO2e.

Drivers:

Shift from hardware to software-defined storage infrastructure

Disaggregated storage-compute architecture

NVMe-TCP

2027

75% of CAPEX infrastructure replaced by STaaS; over 90% by 2030.

Cyber storage systems that offer a 100% written guarantee preventing cyber data threats from impacting the business.

Disaggregated, software-defined storage-compute architecture replaces external controller-based arrays by 2028.

Drivers:

Autonomous storage

Cyber-resilient data storage systems

Majority of enterprise data storage at the data center edge

-----

Source: Jeff Vogel, Julia Palmer, Michael Hoeck, Chandra Mukhyala; 2024 Strategic Roadmap for Storage; 23 February 2024

 

Source: Andy730 public account
