PyTorch Tutorial 13.6: Concise Implementation for Multiple GPUs - 电子发烧友网

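Implementing parallelism from scratch for every new model is no fun. Moreover, there is significant benefit in optimizing synchronization tools for high performance. In the following we show how to do this using the high-level APIs of deep learning frameworks. The mathematics and the algorithms are the same as in Section 13.5. Unsurprisingly, you need at least two GPUs to run the code in this section.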

# PyTorch
import torch
from torch import nn
from d2l import torch as d2l

# MXNet
from mxnet import autograd, gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

13.6.1. A Toy Network

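Let's use a slightly more meaningful network than the LeNet from Section 13.5, one that is still easy and quick to train. We pick a ResNet-18 variant (He et al., 2016). Since the input images are tiny, we modify it slightly. In particular, the difference from Section 8.6 is that we use a smaller convolution kernel, stride, and padding at the beginning. Moreover, we remove the max-pooling layer.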

# PyTorch
#@save
def resnet18(num_classes, in_channels=1):
  """A slightly modified ResNet-18 model."""
  def resnet_block(in_channels, out_channels, num_residuals,
           first_block=False):
    blk = []
    for i in range(num_residuals):
      if i == 0 and not first_block:
        blk.append(d2l.Residual(out_channels, use_1x1conv=True,
                    strides=2))
      else:
        blk.append(d2l.Residual(out_channels))
    return nn.Sequential(*blk)

  # This model uses a smaller convolution kernel, stride, and padding and
  # removes the max-pooling layer
  net = nn.Sequential(
    nn.Conv2d(in_channels, 64, kernel_size=3, stride=1, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU())
  net.add_module("resnet_block1", resnet_block(64, 64, 2, first_block=True))
  net.add_module("resnet_block2", resnet_block(64, 128, 2))
  net.add_module("resnet_block3", resnet_block(128, 256, 2))
  net.add_module("resnet_block4", resnet_block(256, 512, 2))
  net.add_module("global_avg_pool", nn.AdaptiveAvgPool2d((1,1)))
  net.add_module("fc", nn.Sequential(nn.Flatten(),
                    nn.Linear(512, num_classes)))
  return net

# MXNet
#@save
def resnet18(num_classes):
  """A slightly modified ResNet-18 model."""
  def resnet_block(num_channels, num_residuals, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
      if i == 0 and not first_block:
        blk.add(d2l.Residual(
          num_channels, use_1x1conv=True, strides=2))
      else:
        blk.add(d2l.Residual(num_channels))
    return blk

  net = nn.Sequential()
  # This model uses a smaller convolution kernel, stride, and padding and
  # removes the max-pooling layer
  net.add(nn.Conv2D(64, kernel_size=3, strides=1, padding=1),
      nn.BatchNorm(), nn.Activation('relu'))
  net.add(resnet_block(64, 2, first_block=True),
      resnet_block(128, 2),
      resnet_block(256, 2),
      resnet_block(512, 2))
  net.add(nn.GlobalAvgPool2D(), nn.Dense(num_classes))
  return net
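
To see why the stem is modified for small inputs, it helps to trace the spatial size of a 28x28 input through both designs. The sketch below uses the usual output-size formula floor((n + 2p - k) / s) + 1 for a convolution or pooling layer. It assumes that the unmodified stem is the standard ResNet one (a 7x7 convolution with stride 2 and padding 3, followed by 3x3 max-pooling with stride 2 and padding 1) and that stages 2 to 4 each begin with a 3x3, stride-2, padding-1 convolution. The helper names are hypothetical and not part of d2l.

```python
def conv_out(n, kernel, stride, padding):
    """Output width of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def trace(n, stem):
    """Return the spatial size after the stem and after each of the four stages."""
    for kernel, stride, padding in stem:
        n = conv_out(n, kernel, stride, padding)
    sizes = [n]
    # Stage 1 keeps the resolution; stages 2-4 start with a stride-2 block.
    for stride in (1, 2, 2, 2):
        n = conv_out(n, kernel=3, stride=stride, padding=1)
        sizes.append(n)
    return sizes


# Assumed standard stem: 7x7 conv (stride 2, pad 3) + 3x3 max-pool (stride 2, pad 1)
standard_stem = [(7, 2, 3), (3, 2, 1)]
# Stem used in this section: 3x3 conv (stride 1, pad 1), no max-pooling
small_stem = [(3, 1, 1)]

standard_sizes = trace(28, standard_stem)
small_sizes = trace(28, small_stem)
print("standard stem:", standard_sizes)
print("modified stem:", small_sizes)
```

Under these assumptions the standard stem shrinks the feature map to a single pixel by the last stage, while the modified stem still leaves a 4x4 map for the global average pooling layer.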

							 

13.6.2. Network Initialization

In the PyTorch version, we initialize the network inside the training loop. For a refresher on initialization methods, see Section 5.4.

# PyTorch
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# We will initialize the network inside the training loop

In the MXNet version, the initialize function allows us to initialize parameters on a device of our choice. For a refresher on initialization methods, see Section 5.4. What is particularly convenient is that it also allows us to initialize the network on multiple devices simultaneously. Let’s see how this works in practice.

# MXNet
net = resnet18(10)
# Get a list of GPUs
devices = d2l.try_all_gpus()
# Initialize all the parameters of the network
net.initialize(init=init.Normal(sigma=0.01), ctx=devices)

Using the split_and_load function introduced in Section 13.5, we can divide a minibatch of data and copy the portions to the devices listed in the devices variable. The network instance automatically uses the appropriate GPU to compute the forward propagation. Here we generate 4 observations and split them over the GPUs.

x = np.random.uniform(size=(4, 1, 28, 28))
x_shards = gluon.utils.split_and_load(x, devices)
net(x_shards[0]), net(x_shards[1])

[08:00:43] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)

(array([[ 2.2610207e-06, 2.2045981e-06, -5.4046786e-06, 1.2869955e-06,
     5.1373163e-06, -3.8297967e-06, 1.4339059e-07, 5.4683451e-06,
     -2.8279192e-06, -3.9651104e-06],
    [ 2.0698672e-06, 2.0084667e-06, -5.6382510e-06, 1.0498458e-06,
     5.5506434e-06, -4.1065491e-06, 6.0830087e-07, 5.4521784e-06,
     -3.7365021e-06, -4.1891640e-06]], ctx=gpu(0)),
 array([[ 2.4629783e-06, 2.6015525e-06, -5.4362617e-06, 1.2938218e-06,
     5.6387889e-06, -4.1360108e-06, 3.5758853e-07, 5.5125256e-06,
     -3.1957325e-06, -4.2976326e-06],
    [ 1.9431673e-06, 2.2600434e-06, -5.2698201e-06, 1.4807417e-06,
     5.4830934e-06, -3.9678889e-06, 7.5751018e-08, 5.6764356e-06,
     -3.2530229e-06, -4.0943951e-06]], ctx=gpu(1)))
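
The output above has two rows per GPU because the four observations were divided evenly between two devices. As a rough illustration of that splitting step only (this is not the Gluon implementation), the following sketch divides a minibatch among a list of devices using plain Python lists. The function split_evenly and the device labels are hypothetical, and the sketch assumes the batch size is divisible by the number of devices.

```python
def split_evenly(batch, devices):
    """Split `batch` into len(devices) contiguous, equally sized shards."""
    if len(batch) % len(devices) != 0:
        raise ValueError("batch size must be divisible by the number of devices")
    shard_size = len(batch) // len(devices)
    return {
        device: batch[i * shard_size:(i + 1) * shard_size]
        for i, device in enumerate(devices)
    }


# Four observations (represented here by their indices) spread over two devices
shards = split_evenly([0, 1, 2, 3], ["gpu(0)", "gpu(1)"])
print(shards)
```

Each device receives a contiguous slice of the batch, so concatenating the shards in device order recovers the original minibatch.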

						

Once data passes through the network, the corresponding parameters are initialized on the device the data passed through. This means that initialization happens on a per-device basis. Since we picked GPU 0 and GPU 1 for initialization, the network is initialized only there, and not on the CPU. In fact, the parameters do not even exist on the CPU. We can verify this by printing out the parameters and observing any errors that might arise.

weight = net[0].params.get('weight')

try:
  weight.data()
except RuntimeError:
  print('not initialized on cpu')
weight.data(devices[0])[0], weight.data(devices[1])[0]

not initialized on cpu

(array([[[ 0.01382882, -0.01183044, 0.01417865],
     [-0.00319718, 0.00439528, 0.02562625],
     [-0.00835081, 0.01387452, -0.01035946]]], ctx=gpu(0)),
 array([[[ 0.01382882, -0.01183044, 0.01417865],
     [-0.00319718, 0.00439528, 0.02562625],
     [-0.00835081, 0.01387452, -0.01035946]]], ctx=gpu(1)))

						

Next, let’s replace the code that evaluates accuracy with a version that works in parallel across multiple devices. This serves as a replacement for the evaluate_accuracy_gpu function from Section 7.6. The main difference is that we split a minibatch before invoking the network. Everything else is essentially identical.

#@save
def evaluate_accuracy_gpus(net, data_iter, split_f=d2l.split_batch):
  """Compute the accuracy for a model on a dataset using multiple GPUs."""
  # Query the list of devices
  devices = list(net.collect_params().values())[0].list_ctx()
  # No. of correct predictions, no. of predictions
  metric = d2l.Accumulator(2)
  for features, labels in data_iter:
    X_shards, y_shards = split_f(features, labels, devices)
    # Run in parallel
    pred_shards = [net(X_shard) for X_shard in X_shards]
    metric.add(sum(float(d2l.accuracy(pred_shard, y_shard)) for
            pred_shard, y_shard in zip(
              pred_shards, y_shards)), labels.size)
  return metric[0] / metric[1]
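
The bookkeeping in the function above comes down to counting correct predictions per shard, summing those counts, and dividing by the total number of examples. A minimal framework-free sketch of that idea follows, with made-up predictions and labels and hypothetical helper names. It checks that evaluating on shards gives the same accuracy as evaluating on the whole minibatch.

```python
def num_correct(preds, labels):
    """Count positions where the predicted class equals the label."""
    return sum(int(p == y) for p, y in zip(preds, labels))


def sharded_accuracy(pred_shards, label_shards):
    """Accuracy over several shards: summed correct counts / total examples."""
    correct = sum(num_correct(p, y) for p, y in zip(pred_shards, label_shards))
    total = sum(len(y) for y in label_shards)
    return correct / total


# Made-up class predictions and labels for one minibatch of six examples
preds = [3, 1, 4, 1, 5, 9]
labels = [3, 1, 4, 2, 5, 8]

# Evaluate on the full batch, then on two shards of three examples each
full = sharded_accuracy([preds], [labels])
split = sharded_accuracy([preds[:3], preds[3:]], [labels[:3], labels[3:]])
print(full, split)
```

Because only counts are summed across shards, the result does not depend on how the minibatch is divided.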

							 

13.6.3. Training

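As before, the training code needs to perform several basic functions for efficient parallelism:

Network parameters need to be initialized across all devices.
While iterating over the dataset, minibatches are divided across all devices.
We compute the loss and its gradient in parallel across devices.
Gradients are aggregated and parameters are updated accordingly.

In the end we compute the accuracy (again in parallel) to report the final performance of the network. The training routine is quite similar to the implementations in previous chapters, except that we need to split and aggregate data.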
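
The four steps listed above can be illustrated without any framework. The example below is a minimal sketch, assuming a one-parameter linear model y = w * x with squared loss and two simulated devices. All names are hypothetical and no real GPUs are involved. It checks that aggregating per-shard gradients yields the same update as computing the gradient on the whole minibatch on a single device.

```python
def grad_sum(w, xs, ys):
    """Sum over examples of d/dw 0.5 * (w * x - y)^2 = (w * x - y) * x."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


def data_parallel_step(replicas, xs, ys, lr):
    """One SGD step with the minibatch split across the parameter replicas."""
    k = len(replicas)
    shard = len(xs) // k  # assumes len(xs) is divisible by k
    # Steps 2-3: split the minibatch and compute each shard's gradient locally
    local = [
        grad_sum(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i, w in enumerate(replicas)
    ]
    # Step 4: aggregate the gradients, then apply the same update on every replica
    mean_grad = sum(local) / len(xs)
    return [w - lr * mean_grad for w in replicas]


# Step 1: every simulated device starts from the same parameter value
replicas = [0.0, 0.0]
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]

replicas = data_parallel_step(replicas, xs, ys, lr=0.1)
# Reference: the same step computed on the whole minibatch by one device
single = 0.0 - 0.1 * (grad_sum(0.0, xs, ys) / len(xs))
print(replicas, single)
```

Both replicas end up with the same value as the single-device reference (about 1.5 here), which is why data parallelism leaves the mathematics of minibatch SGD unchanged.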

def train(net, num_gpus, batch_size, lr):
  train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
  devices = [d2l.try_gpu(i) for i in range(num_gpus)]
  def init_weights(module):
    if type(module) in [nn.Linear, nn.Conv2d
						
