PyTorch Tutorial 13.3: Automatic Parallelism

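Deep learning frameworks (e.g., MXNet and PyTorch) automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies and can selectively execute multiple non-interdependent tasks in parallel to improve speed. For instance, Fig. 13.2.2 in Section 13.2 initializes two variables independently. Consequently, the system can choose to execute them in parallel.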
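To make the idea of "non-interdependent tasks" concrete, here is a minimal sketch that groups the nodes of a small dependency graph into stages whose members could run in parallel. It uses Python's standard graphlib module (Python 3.9+) rather than any deep learning backend; the helper name parallel_stages and the three-node graph are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter

def parallel_stages(deps):
    """Group tasks into stages; tasks in one stage have no mutual dependencies.

    `deps` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        stages.append(ready)
        sorter.done(*ready)                 # mark the whole stage as finished
    return stages

# Two independent initializations (a, b) followed by an op that needs both.
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
stages = parallel_stages(deps)
print(stages)
```

Here a and b land in the same stage because neither depends on the other, which is the situation in which a backend is free to run them concurrently.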

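Typically, a single operator will use all the computational resources on all CPUs or on a single GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if there are multiple CPU processors on a single machine. The same applies to a single GPU. Hence parallelization is not very useful for single-device computers. With multiple devices, it matters more. While parallelization is typically most relevant between multiple GPUs, adding the local CPU will increase performance slightly. For example, see Hadjis et al. (2016), which focuses on training computer vision models combining a GPU and a CPU. With the convenience of an automatically parallelizing framework we can accomplish the same goal in a few lines of Python code. More broadly, our discussion of automatic parallel computation focuses on parallel computation using both CPUs and GPUs, as well as the parallelization of computation and communication.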

Note that we need at least two GPUs to run the experiments in this section.

# PyTorch version
import torch
from d2l import torch as d2l

# MXNet version
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()

13.3.1. Parallel Computation on GPUs

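Let's start by defining a reference workload to test: the run function below performs 50 matrix-matrix multiplications on the device of our choice, using data allocated into two variables: x_gpu1 and x_gpu2.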

# PyTorch version
devices = d2l.try_all_gpus()

def run(x):
    # 50 independent matrix-matrix multiplications on x's device
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])

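Now we apply the function to the data. To ensure that caching does not play a role in the results, we warm up the devices by performing a single pass on each of them prior to measuring. torch.cuda.synchronize() waits for all kernels in all streams on a CUDA device to complete. It takes a device argument, the device that we need to synchronize. If the device argument is None (the default), it uses the current device, as given by current_device().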
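As a minimal sketch of how the device argument might be used, the helper below synchronizes every CUDA device in a list. The name sync_devices is hypothetical (it is not part of PyTorch or d2l), and the guarded import is only there so that the snippet also loads on machines without PyTorch, where it simply does nothing.

```python
try:
    import torch
except ImportError:  # assumption: allow the sketch to load without PyTorch
    torch = None

def sync_devices(devices):
    """Wait for queued kernels on each CUDA device in `devices`.

    Returns the list of devices that were actually synchronized.
    Non-CUDA entries (e.g., 'cpu') are skipped.
    """
    synced = []
    if torch is None or not torch.cuda.is_available():
        return synced
    for d in devices:
        d = torch.device(d)
        if d.type == 'cuda':
            torch.cuda.synchronize(d)  # blocks until this device is idle
            synced.append(d)
    return synced

print(sync_devices(['cpu']))
```

With the devices list used in this section, sync_devices(devices) would stand in for one torch.cuda.synchronize call per GPU.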

run(x_gpu1)
run(x_gpu2)  # Warm up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])

GPU1 time: 0.4967 sec
GPU2 time: 0.5151 sec

If we remove the synchronize statement between both tasks, the system is free to parallelize computation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    torch.cuda.synchronize()

GPU1 & GPU2: 0.5000 sec

# MXNet version
devices = d2l.try_all_gpus()

def run(x):
    # 50 independent matrix-matrix multiplications on x's device
    return [x.dot(x) for _ in range(50)]

x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])

Now we apply the function to the data. To ensure that caching does not play a role in the results we warm up the devices by performing a single pass on either of them prior to measuring.

run(x_gpu1)  # Warm up both devices
run(x_gpu2)
npx.waitall()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    npx.waitall()

GPU1 time: 0.5233 sec
GPU2 time: 0.5158 sec

If we remove the waitall statement between both tasks the system is free to parallelize computation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    npx.waitall()

GPU1 & GPU2: 0.5214 sec

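In the above case, the total execution time is less than the sum of its parts, since the deep learning framework automatically schedules computation on both GPU devices without the need for sophisticated code on behalf of the user.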
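As a quick check of this claim, the arithmetic can be written out using the timings printed above. The helper name parallel_speedup is hypothetical, and the numbers are simply the ones reported in this section; other hardware will give different values.

```python
def parallel_speedup(separate_times, combined_time):
    """Ratio of the summed per-device times to the time of the combined run."""
    return sum(separate_times) / combined_time

# Timings (in seconds) as reported in this section; they vary across machines.
pytorch_speedup = parallel_speedup([0.4967, 0.5151], 0.5000)
mxnet_speedup = parallel_speedup([0.5233, 0.5158], 0.5214)
print(f"PyTorch: {pytorch_speedup:.2f}x, MXNet: {mxnet_speedup:.2f}x")
```

Both ratios come out close to 2, which is what one would hope for when two similarly fast GPUs each handle one of two equal workloads at the same time.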

13.3.2. Parallel Computation and Communication

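In many cases we need to move data between different devices, say between the CPU and GPU, or between different GPUs. For instance, this occurs when we want to perform distributed optimization where we need to aggregate the gradients over multiple accelerator cards. Let's simulate this by computing on the GPU and then copying the results back to the CPU.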

# PyTorch version
def copy_to_cpu(x, non_blocking=False):
    # Copy every tensor in the list x to the CPU
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()

Run on GPU1: 0.5019 sec
Copy to CPU: 2.7168 sec

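This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, e.g., when we compute the (backprop) gradient on a minibatch. The gradients of some of the parameters will be available earlier than those of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Setting non_blocking=True allows us to simulate this scenario.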

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, non_blocking=True)
    torch.cuda.synchronize()

Run on GPU1 and copy to CPU: 2.4682 sec

# MXNet version
def copy_to_cpu(x):
    # Copy every array in the list x to the CPU
    return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    npx.waitall()

Run on GPU1: 0.5796 sec
Copy to CPU: 3.0989 sec

This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, e.g., when we compute the gradient on a minibatch. The gradients of some of the parameters will be available earlier than that of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. Removing waitall between both parts allows us to simulate this scenario.

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    npx.waitall()

Run on GPU1 and copy to CPU: 3.3488 sec

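The total time required for both operations is (as expected) less than the sum of their parts. Note that this task is different from parallel computation as it uses a different resource: the bus between the CPU and GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between computation and communication: y[i] must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the total running time.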
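The effect of this overlap can be illustrated with a small timing model. This is only a sketch under simplifying assumptions (one GPU, one bus, known per-item durations, no other overhead); the function names and the example durations are hypothetical and are not measurements from this section.

```python
def serial_time(compute, copy):
    """Finish all computation first, then perform all copies."""
    return sum(compute) + sum(copy)

def pipelined_time(compute, copy):
    """Copy y[i-1] over the bus while the GPU is already computing y[i]."""
    gpu_free = 0.0  # time at which the GPU finishes its latest computation
    bus_free = 0.0  # time at which the bus finishes its latest copy
    for c, t in zip(compute, copy):
        gpu_free += c                    # y[i] becomes available
        start = max(gpu_free, bus_free)  # copy needs y[i] and a free bus
        bus_free = start + t
    return bus_free

compute = [1, 1, 1]  # hypothetical seconds per matrix multiplication
copy = [2, 2, 2]     # hypothetical seconds per copy to the CPU
print(serial_time(compute, copy), pipelined_time(compute, copy))
```

In this model the copy of each result still waits for that result to be computed, so the dependency is respected, yet the total time shrinks because the bus and the GPU are busy at the same time.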

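We conclude with an illustration of the computational graph and its dependencies for a simple two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 13.3.1. It would be quite painful to schedule the resulting parallel program manually. This is where it is advantageous to have a graph-based computing backend for optimization.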

https://file.elecfans.com/web2/M00/A9/CC/poYBAGR9OqiAEhMjADhnzKcOtWI169.svg

Fig. 13.3.1 The computational graph and its dependencies for a two-layer MLP on a CPU and two GPUs.

13.3.3. Summary

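Modern systems have a variety of devices, such as multiple GPUs and CPUs. They can be used in parallel, asynchronously.
Modern systems also have a variety of resources for communication, such as PCI Express, storage (typically solid-state drives or via networks), and network bandwidth. They can be used in parallel for peak efficiency.
The backend can improve performance through automatic parallel computation and communication.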

13.3.4. Exercises

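The run function defined in this section performs 50 operations. There are no dependencies between them. Design an experiment to see if the deep learning framework will automatically execute them in parallel.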
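When the workload of an individual operator is sufficiently small, parallelization can help even on a single CPU or GPU. Design an experiment to verify this.
Design an experiment that uses parallel computation on CPUs and GPUs, together with communication between both devices.
Use a debugger such as NVIDIA's Nsight to verify that your code is efficient.
Design computation tasks that include more complex data dependencies, and run experiments to see if you can obtain the correct results while improving performance.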
