
(WIP) Network Paradigm Fundamentals and Comparison

Preamble

Network is hard, but "AI" demands it

Network is hard. I did rather poorly in this subject in college, and in my defense, even at the time of writing there are very few network health providers capable of delivering thorough and accurate checkups, which says a lot about the complexity of the system when even the experts don't fully know what they are working with.

Still, as someone interested in deep learning and its acceleration, it's impossible not to look deeper into data communication systems when working with >100B-parameter models. More specifically, for large computation workloads we pull the oldest trick in the CS book and cut them into parallel or pipelined pieces; for large DNNs the usual schemes are expert parallelism (EP), data parallelism (DP), pipeline parallelism (PP) and tensor parallelism (TP), see

Paradigms of Parallelism | Colossal-AI

and 

Expert parallelism - Amazon SageMaker 

for briefings. These DNN parallelization schemes generate different synchronization patterns and memory-access workloads. For example, EP and DP run largely independent parallel workloads and have perhaps the highest compute-to-communication ratio, so they require the least bandwidth and the least stringent latency; PP and TP, on the other hand, cut the model itself and require frequent, high-speed synchronization to avoid stalling the tensor/vector processors.
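To make that contrast concrete, here is a rough back-of-envelope sketch in Python. Every number in it (parameter count, layer count, hidden size, group size, the ring all-reduce factor, the "~4 activation all-reduces per layer" for a Megatron-style TP split) is an illustrative assumption of mine, not a measurement from the sources above.

```python
# Back-of-envelope traffic model for DP vs TP on a transformer-like model.
# All configuration numbers below are made up for illustration.

def ring_allreduce_bytes(payload_bytes: float, world: int) -> float:
    """Approximate bytes sent per rank by a ring all-reduce: ~2*(N-1)/N of the payload."""
    return 2 * (world - 1) / world * payload_bytes

params         = 100e9      # 100B parameters
bytes_per_elem = 2          # bf16
layers         = 96
hidden         = 12288
batch_tokens   = 4096       # tokens per replica per step (micro-batch * sequence length)
world          = 8          # size of the DP or TP group

# DP: one big gradient all-reduce per step; it can largely overlap with the backward pass.
dp_bytes = ring_allreduce_bytes(params * bytes_per_elem, world)

# TP (Megatron-style split, roughly): a handful of activation all-reduces per layer per
# step, each of size batch_tokens * hidden elements, and each sits on the critical path.
tp_syncs = layers * 4
tp_bytes = tp_syncs * ring_allreduce_bytes(batch_tokens * hidden * bytes_per_elem, world)

print(f"DP : ~{dp_bytes / 1e9:6.1f} GB per GPU per step, in 1 overlappable collective")
print(f"TP : ~{tp_bytes / 1e9:6.1f} GB per GPU per step, in {tp_syncs} blocking collectives")
```

The absolute numbers matter less than the shape: DP moves a lot of data but rarely and overlappably, while TP moves it in hundreds of small, blocking synchronizations per step, which is exactly why it wants the fastest, lowest-latency links.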

Layered Communication and Main Schools

The different parallelism schemes, and the memory-access workloads and synchronization requirements they induce, are hard to serve with a single network design, so in practice we use a mixture of systems at different levels of the distribution hierarchy.

Starting from a single ALU (for our purposes it matters not whether the endpoint is an ALU, an accelerator or a fully functional processor; we can simply lump congestion into network latency, since the key is that we only need to see computation endpoints, memory and communication pathways), say a GPU SM core, we form layered groups, such as:

SM core (ALU + L0 cache) => GPU (ALU swarm + L1 cache) => GPU SoC ("chip" or further packaged "card", e.g. Grace + Hopper superchips in the H100 generation) (GPU*N + L2 cache + memory) => node and rack (GPU SoCs + NVLink/PCIe, e.g. NVIDIA HGX H100) => data center (LAN) => www

(more about GPU networking: 华尔街见闻)

Their communication efficiencies are dictated by physical scale and density; currently, as shown, we have a mish-mash of local networks handling communication at different levels, reflected in the co-existence of ld/st and rd/wr semantics.

TP roughly resides at the node level, PP at the rack level, and DP and EP across higher-level endpoints.
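A tiny latency/bandwidth (alpha-beta) sketch of why this mapping is natural. The per-level latency and bandwidth figures below are ballpark assumptions I chose for illustration, not vendor specifications:

```python
# Alpha-beta cost model: t = latency + bytes / bandwidth, evaluated at each hierarchy level.
# The (latency, bandwidth) pairs are ballpark assumptions for illustration only.

LEVELS = {
    "on-die (SM <-> L2)":          (1e-7, 2e12),   # ~100 ns, ~2 TB/s
    "intra-node (NVLink-class)":   (2e-6, 4e11),   # ~2 us,   ~400 GB/s
    "intra-rack (IB/RoCE-class)":  (5e-6, 5e10),   # ~5 us,   ~50 GB/s
    "inter-rack / DC (Ethernet)":  (5e-5, 1e10),   # ~50 us,  ~10 GB/s
}

def transfer_time(msg_bytes: float, latency: float, bandwidth: float) -> float:
    """Classic alpha-beta model: startup latency plus serialization time."""
    return latency + msg_bytes / bandwidth

for msg_bytes in (64e3, 64e6):   # a small TP-style sync vs a large DP-style gradient chunk
    print(f"message = {msg_bytes / 1e6:g} MB")
    for name, (lat, bw) in LEVELS.items():
        print(f"  {name:27s}: {transfer_time(msg_bytes, lat, bw) * 1e6:10.1f} us")
```

The point of the toy numbers: small, frequent TP synchronizations are dominated by latency and degrade sharply once they leave the node, while large, infrequent DP/EP transfers are bandwidth-bound and tolerate the higher-level links.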

There are attempts to unify communications under a single banner, a single philosophy:

either the "computation" view with a ld/st (system bus) approach, called "scaling up" for its central property of preserving communication speed and low latency by localizing data exchange (though not necessarily UMA);

or the "network" view with a rd/wr (RDMA) approach, called "scaling out" for its central property of device unanimity, or device blindness, which is the core principle of network design; the underlying fabric can be lossy, and it is quite expensive for RDMA to handle packet loss and out-of-order (OoO) delivery.

(more about RDMA, a heavy read, be warned: https://mp.weixin.qq.com/mp/appmsgalbum?__biz=MzUxNzQ5MTExNw==&action=getalbum&album_id=3398249338911260673#wechat_redirect)

The central question here is how to scale communication while preserving DMA throughput and synchronization latency.
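To make the ld/st vs. rd/wr split concrete without real NVLink or RDMA hardware, here is a standard-library-only analogy in Python. It illustrates programming model only, not performance; the shared-memory segment and the loopback socket are arbitrary stand-ins I chose for the two paradigms.

```python
import socket
from multiprocessing import shared_memory

# --- "Scale up" / ld-st flavor: peers share an address space and simply store/load. ---
shm = shared_memory.SharedMemory(create=True, size=16)   # one shared buffer
shm.buf[0] = 42                                          # producer: a plain store
peer = shared_memory.SharedMemory(name=shm.name)         # another "device" maps the same memory
print("ld/st peer loads :", peer.buf[0])                 # consumer: a plain load, no protocol stack
peer.close()
shm.close()
shm.unlink()

# --- "Scale out" / rd-wr flavor: peers are opaque endpoints exchanging explicit messages. ---
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))                            # loopback, OS-assigned port
server.listen(1)
client = socket.create_connection(server.getsockname())
conn, _ = server.accept()
client.sendall(b"\x2a")                                  # producer: an explicit write request
print("rd/wr peer reads :", conn.recv(1)[0])             # consumer: an explicit read
for s in (client, conn, server):
    s.close()
```

In the first half the "remote" data is just an address to dereference; in the second half every byte goes through an explicit request/response path that neither side's memory system knows anything about, which is exactly the device blindness that lets scale-out designs scale.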

Here we focus on the features of classic network designs.

Source and Resource

[1] Entry article: https://support.huawei.com/enterprise/zh/doc/EDOC1100203347

[2] IB vs. Ethernet: "Infiniband 和 以太网Ethernet 对比" (CSDN blog)

[3] Network protocol evolution and RDMA: "RDMA这十年的反思1:从协议演进的视角" (a decade of RDMA in retrospect, part 1: the protocol-evolution view)

[4] InfiniBand whitepaper: https://network.nvidia.com/pdf/whitepapers/IB_Intro_WP_190.pdf

Protocol Evolution [3]

Basics and Differences of RoCE, IB and TCP Networks [1]

In distributed storage networks, the protocols in use include RDMA over Converged Ethernet (RoCE), InfiniBand (IB) and TCP/IP. RoCE and IB are both RDMA (Remote Direct Memory Access) technologies. How do they differ from traditional TCP/IP? The following comparison goes through this in detail.

RDMA vs. TCP/IP

For I/O-intensive, high-concurrency, low-latency applications such as high-performance computing and big-data analytics, the existing TCP/IP software and hardware architecture cannot meet the requirements. The main reason is that traditional TCP/IP communication sends messages through the kernel, which incurs heavy data-movement and data-copying overhead. RDMA (Remote Direct Memory Access) was created precisely to remove this server-side data-processing latency from network transfers. As shown in Figure 1-1, RDMA accesses memory data directly through the network interface, without involving the operating-system kernel. This allows high-throughput, low-latency communication and is especially suitable for large-scale parallel computing clusters.
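The figure is not reproduced here, but the copy-overhead argument can be illustrated with a crude pure-Python experiment: materializing a copy of a large buffer stands in for the extra copies on the TCP/IP path, while handing over a zero-copy view stands in for RDMA's direct placement. The 256 MiB size is arbitrary, and this measures Python buffer handling, not real kernels or NICs.

```python
import time

# Crude stand-in for "extra kernel copies" vs "direct placement".
payload = bytearray(256 * 1024 * 1024)   # 256 MiB message (arbitrary size)

t0 = time.perf_counter()
staged = bytes(payload)                  # TCP/IP-like path: data is copied before it can be sent
copy_ms = (time.perf_counter() - t0) * 1e3

t0 = time.perf_counter()
view = memoryview(payload)               # RDMA-like path: hand over a reference, nothing is copied
view_us = (time.perf_counter() - t0) * 1e6

print(f"copy path : {copy_ms:8.2f} ms for {len(staged) >> 20} MiB")
print(f"zero-copy : {view_us:8.2f} us for {len(view) >> 20} MiB")
```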

Figure 1-1: RDMA vs. traditional TCP/IP

Types of RDMA

There are currently three kinds of RDMA network: InfiniBand, RoCE (RDMA over Converged Ethernet) and iWARP.

Among them, InfiniBand is a network designed specifically for RDMA. It guarantees reliable transport at the hardware level and is technically advanced, but it is expensive.

==> this is largely due to the fact that IB sidesteps the lossy-network issue by introducing credit-based flow control, which makes it less scalable; it has basically gone rogue as a network protocol.

RoCE and iWARP, on the other hand, are Ethernet-based RDMA technologies, which lets high-speed, ultra-low-latency, very-low-CPU-usage RDMA be deployed on Ethernet, the most widely used network today.

As shown in Figure 1-2, the RoCE protocol has two versions, RoCEv1 and RoCEv2. RoCEv1 implements RDMA on the Ethernet link layer (the switches must support flow-control techniques such as PFC to guarantee reliable transport at the physical layer), whereas RoCEv2 is implemented on top of the UDP layer of the TCP/IP stack and introduces IP to solve the scalability problem.

Figure 1-2: Types of RDMA networks
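Since Figure 1-2 itself is not reproduced here, the sketch below summarizes what it conveys, namely how each RDMA flavor is encapsulated on the wire. The layer names are my own shorthand, and details such as the RoCEv2 UDP port (4791) are recalled from the specs rather than taken from the cited article, so double-check before relying on them.

```python
# Rough wire encapsulation of each RDMA flavor (illustrative summary, not normative).
ENCAPSULATION = {
    "InfiniBand": ["IB link", "IB network", "IB transport", "RDMA payload"],
    "RoCEv1":     ["Ethernet L2", "IB transport", "RDMA payload"],                           # L2 only, not routable
    "RoCEv2":     ["Ethernet L2", "IP", "UDP (port 4791)", "IB transport", "RDMA payload"],  # routable
    "iWARP":      ["Ethernet L2", "IP", "TCP", "MPA/DDP/RDMAP", "RDMA payload"],
}

for name, stack in ENCAPSULATION.items():
    print(f"{name:10s}: " + " / ".join(stack))
```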

Table 1-1: Comparison of InfiniBand, iWARP and RoCE

|             | InfiniBand | iWARP                            | RoCE                     |
|-------------|------------|----------------------------------|--------------------------|
| Performance | Best       | Slightly worse (affected by TCP) | Comparable to InfiniBand |
| Cost        | High       | —                                | —                        |
| Stability   | Good       | —                                | Fairly good              |
| Switch      | IB switch  | Ethernet switch                  | Ethernet switch          |

As Table 1-1 shows, the characteristics of the three RDMA networks can be summarized as follows:

  • InfiniBand: designed for RDMA from the outset; it guarantees reliable transport at the hardware level and offers higher bandwidth and lower latency, but it is costly and requires IB NICs and switches.
  • RoCE: RDMA over Ethernet; it consumes fewer resources than iWARP and supports more features. Ordinary Ethernet switches can be used, but the NICs must support RoCE.
  • iWARP: an RDMA network based on TCP, relying on TCP for reliable transport. Compared with RoCE, the large number of TCP connections in big deployments consumes significant memory, so iWARP has higher system requirements. Ordinary Ethernet switches can be used, but the NICs must support iWARP.
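The bullet points above amount to a small decision table; the sketch below restates them as data plus a trivial helper. The field names and the reliability notes are my own phrasing of the source text, not an official taxonomy.

```python
# The comparison above restated as data (illustrative phrasing of the source text).
RDMA_OPTIONS = {
    "InfiniBand": {"switch": "IB switch",       "nic": "IB HCA",            "reliability": "hardware-level (credit-based, lossless fabric)"},
    "RoCE":       {"switch": "Ethernet switch", "nic": "RoCE-capable NIC",  "reliability": "lossless Ethernet (e.g. PFC) underneath"},
    "iWARP":      {"switch": "Ethernet switch", "nic": "iWARP-capable NIC", "reliability": "TCP (per-connection state costs memory at scale)"},
}

def reuses_ethernet_switches(option: str) -> bool:
    """True if the fabric can reuse ordinary Ethernet switches (per the summary above)."""
    return RDMA_OPTIONS[option]["switch"] == "Ethernet switch"

print({name: reuses_ethernet_switches(name) for name in RDMA_OPTIONS})
```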

Network protocols commonly used in distributed storage

  • IB: commonly used for the storage front-end network in DPC scenarios.
  • RoCE: commonly used for the storage back-end network.
  • TCP/IP: commonly used for the service network.
