Common Supercomputing Network Topologies
百川归海-Edward  2024-08-07 17:27

Reprinted from: Architect Technology Alliance


High-performance computing is sensitive to the static latency of traffic and requires ultra-large-scale networking. The traditional Clos architecture, as the mainstream network architecture, achieves generality at the expense of latency and cost-effectiveness. To address this issue, the industry has studied and designed various architectures and new topologies. Fat tree, dragonfly, and torus are common topologies featuring non-blocking forwarding, a small network diameter, and high scalability and cost-effectiveness, respectively.

1. Fat Tree

In a traditional tree topology, bandwidth converges layer by layer, and the bandwidth at the root is much less than the sum of the bandwidth at all the leaves. A fat tree is more like a real tree: the closer a branch is to the root, the thicker it is. In other words, bandwidth does not converge from the leaves to the root, which is the basis for building a non-blocking network. As one of the most widely used topologies, fat tree is a good choice for many applications because it provides low latency and supports a range of throughput options, from non-blocking connections to over-subscription, maximizing data throughput for various traffic patterns.


Fat tree adopts a 1:1 non-convergence design, where the bandwidth and quantity of uplink ports are the same as those of downlink ports of the switch. In addition, it uses data center–level switches that support non-blocking forwarding and allows users to increase the number of connected GPU nodes by extending network layers.

In essence, a fat tree has no bandwidth convergence. The spine-leaf architecture of a cloud data center can therefore be considered an instance of the same concept when it is built without oversubscription.

If the number of ports on each switch is n, a two-layer fat tree architecture can accommodate n²/2 GPUs. For example, InfiniBand switches with 40 ports can accommodate a maximum of 800 GPUs. A three-layer fat tree architecture can accommodate n × (n/2) × (n/2) GPUs. For example, InfiniBand switches with 40 ports can accommodate a maximum of 16,000 GPUs.
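As a quick sanity check, the capacity formulas above can be computed directly (a minimal Python sketch; the function name and the reuse of the 40-port InfiniBand example are just for illustration):

```python
def fat_tree_capacity(n: int) -> tuple[int, int]:
    """Maximum number of end nodes (GPUs) reachable by a non-blocking
    fat tree built from n-port switches, for two and three switch layers."""
    two_layer = n * n // 2            # n^2 / 2
    three_layer = n * (n // 2) ** 2   # n * (n/2) * (n/2) = n^3 / 4
    return two_layer, three_layer

print(fat_tree_capacity(40))  # (800, 16000), matching the 40-port example above
```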


However, fat tree has obvious disadvantages as follows:

1) It requires many switches and links relative to the number of servers, so costs are relatively high at large scale. A three-layer fat tree needs about 5M/n switches in total (where M is the number of servers and n is the number of ports per switch); see the sketch after this list. When n is relatively small, a large number of switches is needed to build the fat tree, which increases the complexity of cabling and configuration.

2) Due to the characteristics above, fat tree does not support one-to-all and all-to-all communication patterns well, which makes it less convenient for deploying high-performance distributed applications such as MapReduce and Dryad.

3) Theoretically, the expansion scale is limited by the number of ports on the core switches.
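To make the 5M/n figure from point 1 concrete, here is a small sketch, assuming the standard three-layer non-blocking fat-tree construction (n²/2 edge, n²/2 aggregation, and n²/4 core switches at full scale); the function name is illustrative:

```python
def fat_tree_switch_count(n: int) -> dict:
    """Switch budget of a full-scale three-layer non-blocking fat tree
    built from n-port switches (M = n^3/4 servers)."""
    servers = n ** 3 // 4
    edge = n ** 2 // 2
    aggregation = n ** 2 // 2
    core = n ** 2 // 4
    total = edge + aggregation + core   # = 5 * n^2 / 4 = 5 * servers / n
    return {"servers": servers, "switches": total, "per_server": total / servers}

print(fat_tree_switch_count(40))
# {'servers': 16000, 'switches': 2000, 'per_server': 0.125}  i.e. 5/n switches per server
```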

In essence, fat tree is a Clos architecture that focuses on universality and non-convergence at the expense of latency and cost-effectiveness. It needs more network layers, optical fibers, and switches to build a large-scale cluster network, increasing the costs. In addition, when the cluster expands, it has more network hops, which increase the communication latency and raise the possibility that low latency requirements cannot be fulfilled.

2. Dragonfly

Dragonfly is the most widely used direct-connection topology. It was proposed by John Kim et al. in the 2008 paper Technology-Driven, Highly-Scalable Dragonfly Topology. Dragonfly features a small network diameter and low cost and has been used in high-performance computing networks; it also applies to data center networks with diversified computing power. A dragonfly network is shown below.

[Figure: dragonfly network topology]

Dragonfly consists of three levels: switch level, group level, and system level.

1) Switch level: includes one switch and the p compute nodes connected to that switch.

2) Group level: contains a switch levels (that is, a switches) that are fully connected (all-to-all). In other words, each switch has a – 1 links connecting it to the other a – 1 switches in the group.

3) System level: contains g groups that are fully connected.

A single switch has p ports connected to compute nodes, a – 1 ports connected to other switches in the group, and h ports connected to switches in other groups. Therefore, the following attributes in the network can be calculated:

1) The number of ports on each switch is k = p + (a – 1) + h.

2) The number of groups is g = a × h + 1.

3) The total number of compute nodes on the network is N = a × p × (a × h + 1).

4) If all switches in a group are considered as one switch, the number of ports on this switch is k' = a × (p + h).

After p, a, h, and g are determined, a dragonfly topology is fully specified and can be represented as dfly(p, a, h, g). A relatively balanced configuration is recommended: a = 2 × p = 2 × h. Dragonfly provides the following routing algorithms:
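Putting the formulas together, a short sketch can derive all of these quantities from (p, a, h); the balanced 64-port example uses p = h = 16 and a = 32 (function and variable names are illustrative):

```python
def dragonfly_params(p: int, a: int, h: int) -> dict:
    """Basic quantities of a dfly(p, a, h, g) topology with full
    group-to-group connectivity, i.e. g = a*h + 1."""
    k = p + (a - 1) + h          # ports per switch
    g = a * h + 1                # number of groups
    n_nodes = a * p * g          # total compute nodes
    k_group = a * (p + h)        # a whole group viewed as one big switch
    return {"k": k, "g": g, "N": n_nodes, "k'": k_group}

# Balanced configuration a = 2p = 2h built from 64-port switches (one port left spare):
print(dragonfly_params(p=16, a=32, h=16))
# {'k': 63, 'g': 513, 'N': 262656, "k'": 1024} -- roughly the 270,000-node scale quoted below
```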

1) Minimal routing: traverses at most one global link and two local links by construction. That is, a maximum of three hops are needed. If there is only one direct connection between any two groups (that is, g = a × h + 1), there is only one shortest path.

2) Non-minimal routing (also called the Valiant algorithm [VAL] or Valiant load-balanced routing [VLB]): randomly selects an intermediate group, sends the message to that group, and then forwards it to the destination. A VAL path traverses at most two global links and three local links by construction, so a maximum of five hops are needed.

3) Adaptive routing: enables a switch to dynamically select between the shortest path and a non-shortest path based on the network load when a packet arrives. The shortest path is preferred for forwarding; when it is congested, a non-shortest path is used instead. Because global network status information is hard to obtain, in addition to Universal Globally Adaptive Load-balanced routing (UGAL), a series of variant adaptive routing algorithms have been proposed, such as UGAL-L and UGAL-G.

Adaptive routing provides better performance through dynamic adjustment of traffic forwarding paths based on the network link status.
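As a rough illustration of the UGAL idea (a simplified sketch, not the exact algorithm of any particular switch), the decision can be modeled as comparing the congestion-weighted cost of the minimal path against a randomly chosen Valiant detour, using local queue depth as the congestion estimate:

```python
def ugal_choose_path(min_queue: int, val_queue: int,
                     min_hops: int = 3, val_hops: int = 5) -> str:
    """Simplified UGAL-style choice: stay on the minimal path unless its
    queue-depth * hop-count cost exceeds that of the Valiant path.
    Hop counts default to the dragonfly worst cases (3 minimal, 5 Valiant)."""
    if min_queue * min_hops <= val_queue * val_hops:
        return "minimal"
    return "valiant"

print(ugal_choose_path(min_queue=2, val_queue=2))    # lightly loaded -> 'minimal'
print(ugal_choose_path(min_queue=50, val_queue=4))   # hot spot on the direct route -> 'valiant'
```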

Dragonfly provides good performance for various applications (or communication modes). Compared with other topologies, it shortens network paths and reduces the number of intermediate nodes. By using switches with 64 ports, Dragonfly supports a maximum of 270,000 nodes and reduces the number of hops for end-to-end switch forwarding to 3.

Dragonfly has significant advantages in performance and cost-effectiveness but needs effective congestion control and adaptive routing policies. In addition, each time capacity expansion is required in the dragonfly network, cables need to be re-routed, increasing the network complexity and management difficulty.

3. Torus

Ever-increasing model parameters and training data make it hard for a single machine to meet requirements for computing power and storage. Therefore, distributed machine learning is required. Collective communication is the basis for distributed machine learning. The difficulty of collective communication is that efficient communication needs to be performed under certain network interconnection structure constraints. As such, trade-offs must be made between efficiency and cost, bandwidth and latency, customer requirements and quality, and innovation and productization.

Torus is a completely symmetric topology that features a small network diameter, a simple structure, numerous paths, and good scalability, making it a great choice for collective communication. Sony proposed the 2D-torus algorithm, whose main idea is intra-group scatter-reduce → inter-group all-reduce → intra-group all-gather. IBM proposed the 3D-torus algorithm.

A torus is represented as a k-ary n-cube, where k is the number of nodes along each side (dimension) of the arrangement and n is the number of dimensions.
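For reference, the node count and network diameter of a k-ary n-cube torus follow directly from k and n (a small sketch; wraparound links halve the worst-case distance in each dimension):

```python
def k_ary_n_cube(k: int, n: int) -> tuple[int, int]:
    """Node count and diameter of a k-ary n-cube torus."""
    nodes = k ** n
    diameter = n * (k // 2)   # wraparound: at most floor(k/2) hops per dimension
    return nodes, diameter

print(k_ary_n_cube(3, 3))    # (27, 3): the 3-ary 3-cube shown below
print(k_ary_n_cube(16, 3))   # (4096, 24): a larger 3D torus
```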


The 3-ary 3-cube topology is as follows.

[Figure: 3-ary 3-cube topology]

Using the 2D torus topology as an example, the torus structure is as follows.

1) Horizontal: Each server has X GPU nodes, and these GPU nodes are interconnected through a proprietary protocol network (such as NVLink).

2) Vertical: Servers are interconnected through at least two RDMA NICs (NIC 0 and NIC 1) by using switches.


Step 1: Horizontally, a Ring Scatter-Reduce operation is performed within each host to split and reduce the gradients across the X GPUs on the host. After these iterations, each GPU holds one complete block of the gradient, and this block equals the sum of the corresponding blocks from all GPUs on the host.

Step 2: Vertically, X vertical Ring All-Reduce operations are performed between hosts to globally reduce the data held by the X GPUs of each server across the cluster.

Step 3: Horizontally, an All-Gather operation is performed within each host to copy the block held by GPU i (i = 0 to X – 1) to the other GPUs in the server, so every GPU ends up with the complete, globally reduced gradient.
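The three steps can be checked with a small NumPy simulation of the data flow (a sketch only: the host/GPU/block sizes are made up for illustration, and a real system would run these phases as NVLink and RDMA ring collectives rather than array sums):

```python
import numpy as np

HOSTS, GPUS, CHUNK = 4, 8, 5   # illustrative sizes: 4 servers, 8 GPUs each, 5 values per block
rng = np.random.default_rng(0)

# grads[h, g, b] is block b of the gradient held by GPU g on host h
# (each GPU's gradient is split into GPUS blocks of CHUNK values).
grads = rng.standard_normal((HOSTS, GPUS, GPUS, CHUNK))
expected = grads.sum(axis=(0, 1))          # the global sum every GPU should end up with

# Step 1 -- intra-host Ring Scatter-Reduce: GPU i on each host reduces block i
# over all GPUs of that host.
blocks = grads.sum(axis=1)                 # blocks[h, i] = host-local sum of block i

# Step 2 -- inter-host Ring All-Reduce along the vertical rings: block i is
# reduced across hosts, so GPU i on every host now holds the global sum of block i.
blocks = np.broadcast_to(blocks.sum(axis=0), blocks.shape)

# Step 3 -- intra-host All-Gather: each GPU collects all blocks from its peers,
# so every GPU on host h ends up with the full reduced gradient blocks[h].
assert np.allclose(blocks[0], expected)
print("hierarchical 2D-torus all-reduce matches the direct global sum")
```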

 

Torus has the following advantages:

1) Lower latency: The ring topology provides lower latency because it has short and direct links between adjacent nodes.

2) Better locality: In a ring network, nodes that are physically close are also logically close, which may bring better data locality and reduce communication overheads, thereby reducing latency and power consumption.

3) Smaller network diameter: For the same number of nodes, the network diameter of the ring topology is smaller than that of the Clos network. Therefore, fewer switches are required, saving a lot of costs.

Torus also has the following disadvantages:

1) Predictability: Performance predictability is harder to guarantee in a ring network.

2) Scalability: Scaling the ring network may require the entire topology to be reconfigured, which may be more complex and time-consuming.

3) Load balancing: The ring network provides multiple paths, but cannot provide as many alternative paths as the fat tree network.

4) Troubleshooting: Troubleshooting unexpected faults is more complex, although the flexibility of dynamically reconfigurable routes can mitigate many failures.

In addition to 2D/3D structures, torus is also developing towards higher dimensions. The basic unit of such a high-dimensional torus network is called a silicon element, and a silicon element itself uses the 3D torus topology. Multiple silicon elements can then be combined to build higher-dimensional 4D/5D/6D torus direct-connection networks.

 

Source: Xibeichuifeng
