PyTorch Tutorial - 2.5. Automatic Differentiation

jf_pJlTbmA9 2023-06-05 315

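Recall from Section 2.4 that calculating derivatives is the crucial step in all of the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.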

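Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.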
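
To make the idea of a computational graph concrete, here is a minimal sketch of reverse-mode differentiation on scalars in plain Python. The `Value` class and its fields are hypothetical names invented for this illustration; real frameworks are far more elaborate, so treat this as a toy model of the mechanism rather than a description of how PyTorch is implemented.

```python
# Minimal reverse-mode automatic differentiation on scalars (illustrative only).
class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry is (parent_node, local derivative of this node w.r.t. parent).
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order the graph so that every node comes after the nodes it depends on.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # derivative of the output with respect to itself
        # Walk the graph backwards, applying the chain rule along every edge.
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad


x = Value(3.0)
y = x * x + x   # y = x^2 + x, so dy/dx = 2x + 1
y.backward()
print(y.data, x.grad)  # 12.0 7.0
```

Each operation records its inputs together with a local derivative, and `backward` multiplies and sums those local derivatives from the output back to the inputs.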

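While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia programming language employs forward propagation (Revels et al., 2016). Before exploring methods, let's first master the autograd package.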

import torch

from mxnet import autograd, np, npx

npx.set_np()

from jax import numpy as jnp

import tensorflow as tf

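2.5.1. A Simple Function

Let's assume that we are interested in differentiating the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect to the column vector x. To start, we assign x an initial value.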

x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

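Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued and has the same shape as x.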

# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # The gradient is None by default

x = np.arange(4.0)
x

array([0., 1., 2., 3.])

Before we calculate the gradient of y with respect to x, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters thousands or millions of times, and we might risk running out of memory. Note that the gradient of a scalar-valued function with respect to a vector x is vector-valued and has the same shape as x.

# We allocate memory for a tensor's gradient by invoking `attach_grad`
x.attach_grad()
# After we calculate a gradient taken with respect to `x`, we will be able to
# access it via the `grad` attribute, whose values are initialized with 0s
x.grad

array([0., 0., 0., 0.])

x = jnp.arange(4.0)
x

Array([0., 1., 2., 3.], dtype=float32)

x = tf.range(4, dtype=tf.float32)
x

x = tf.Variable(x)

We now calculate our function of x and assign the result to y.

y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x's grad attribute.

y.backward()
x.grad

tensor([ 0., 4., 8., 12.])

# Our code is inside an `autograd.record` scope to build the computational
# graph
with autograd.record():
  y = 2 * np.dot(x, x)
y

array(28.)

We can now take the gradient of y with respect to x by calling its backward method. Next, we can access the gradient via x’s grad attribute.

y.backward()
x.grad

array([ 0., 4., 8., 12.])

y = lambda x: 2 * jnp.dot(x, x)
y(x)

Array(28., dtype=float32)

We can now take the gradient of y with respect to x by passing through the grad transform.

from jax import grad

# The `grad` transform returns a Python function that
# computes the gradient of the original function
x_grad = grad(y)(x)
x_grad

Array([ 0., 4., 8., 12.], dtype=float32)

# Record all computations onto a tape
with tf.GradientTape() as t:
  y = 2 * tf.tensordot(x, x, axes=1)
y

We can now calculate the gradient of y with respect to x by calling the gradient method.

x_grad = t.gradient(y, x)
x_grad

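We already know that the gradient of the function $y = 2\mathbf{x}^\top \mathbf{x}$ with respect to x should be $4\mathbf{x}$. We can now verify that the automatic gradient computation and the expected result are identical.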

x.grad == 4 * x

tensor([True, True, True, True])
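
For readers who want the intermediate step, the expected gradient can be worked out by writing the function componentwise. This is a short derivation added for illustration; the index names i and j are chosen here.

```latex
y = 2\mathbf{x}^\top \mathbf{x} = 2 \sum_{j} x_j^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4 x_i
\quad\Longrightarrow\quad
\nabla_{\mathbf{x}} y = 4\mathbf{x}.
```

With x = [0, 1, 2, 3] this gives [0, 4, 8, 12], which matches the values stored in x.grad.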

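Now let's calculate another function of x and take its gradient. Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient. Instead, the new gradient is added to the gradient already stored. This behavior comes in handy when we want to optimize the sum of multiple objective functions. To reset the gradient buffer, we can call x.grad.zero_() as follows: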

x.grad.zero_() # Reset the gradient
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
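
The accumulation behavior is easiest to see by skipping the reset once. The sketch below is a self-contained illustration; the variable names first, accumulated, and fresh are chosen here and do not appear in the original text.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of 2 * x.x is 4x.
(2 * torch.dot(x, x)).backward()
first = x.grad.clone()

# Second backward pass without resetting: the gradient of x.sum() (all ones)
# is added to what is already stored in x.grad.
x.sum().backward()
accumulated = x.grad.clone()

# Resetting the buffer first gives the gradient of x.sum() alone.
x.grad.zero_()
x.sum().backward()
fresh = x.grad.clone()

print(first)        # values 0, 4, 8, 12
print(accumulated)  # values 1, 5, 9, 13
print(fresh)        # values 1, 1, 1, 1
```

Forgetting the reset in a training loop silently mixes gradients from different steps, which is why the reset is usually done once per iteration.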

x.grad == 4 * x

array([ True, True, True, True])

Now let’s calculate another function of x and take its gradient. Note that MXNet resets the gradient buffer whenever we record a new gradient.

with autograd.record():
  y = x.sum()
y.backward()
x.grad # Overwritten by the newly calculated gradient

array([1., 1., 1., 1.])

x_grad == 4 * x

Array([ True, True, True, True], dtype=bool)

y = lambda x: x.sum()
grad(y)(x)

Array([1., 1., 1., 1.], dtype=float32)

x_grad == 4 * x

Now let’s calculate another function of x and take its gradient. Note that TensorFlow resets the gradient buffer whenever we record a new gradient.

with tf.GradientTape() as t:
  y = tf.reduce_sum(x)
t.gradient(y, x) # Overwritten by the newly calculated gradient

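2.5.2. Backward for Non-Scalar Variables

When y is a vector, the most natural interpretation of the derivative of y with respect to a vector x is a matrix called the Jacobian, which contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the differentiation result could be an even higher-order tensor.

While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x. For example, we often have a vector representing the value of our loss function calculated separately for each example in a batch of training examples. Here, we just want to sum up the gradients computed individually for each example.

Because deep learning frameworks vary in how they interpret gradients of non-scalar tensors, PyTorch takes some steps to avoid confusion. Invoking backward on a non-scalar raises an error unless we tell PyTorch how to reduce the object to a scalar. More formally, we need to provide some vector v such that backward will compute $\mathbf{v}^\top \partial_{\mathbf{x}} \mathbf{y}$ rather than $\partial_{\mathbf{x}} \mathbf{y}$. This next part may be confusing, but for reasons that will become clear later, this argument (representing v) is named gradient. For a more detailed description, see Yang Zhang's Medium post.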

x.grad.zero_()
y = x * x
y.backward(gradient=torch.ones(len(y))) # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
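
To see that the gradient argument really plays the role of the vector v, one can compare the result of backward with an explicitly computed Jacobian. The sketch below assumes an arbitrary weight vector v chosen purely for illustration and uses torch.autograd.functional.jacobian to materialize the full matrix, which is only practical for small inputs.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0, 0.0])  # arbitrary weights, for illustration only

# backward(gradient=v) stores v^T J in x.grad, where J is the Jacobian of y = x * x.
y = x * x
y.backward(gradient=v)

# For comparison, build the full 4x4 Jacobian (diagonal with entries 2 * x_i).
J = jacobian(lambda t: t * t, x.detach())

print(J)
print(x.grad)  # matches v @ J, i.e. values 0, 1, 8, 0
```

Choosing v as a vector of ones recovers the gradient of the sum, which is what the example above does.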

MXNet handles this problem by reducing all tensors to scalars by summing before computing a gradient. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with autograd.record():
  y = x * x
y.backward()
x.grad # Equals the gradient of y = sum(x * x)

array([0., 2., 4., 6.])

y = lambda x: x * x
# grad is only defined for scalar output functions
grad(lambda x: y(x).sum())(x)

Array([0., 2., 4., 6.], dtype=float32)

By default, TensorFlow returns the gradient of the sum. In other words, rather than returning the Jacobian $\partial_{\mathbf{x}} \mathbf{y}$, it returns the gradient of the sum $\partial_{\mathbf{x}} \sum_i y_i$.

with tf.GradientTape() as t:
  y = x * x
t.gradient(y, x) # Same as y = tf.reduce_sum(x * x)

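2.5.3. Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result. The following toy example makes this clearer: suppose we have z = x * y and y = x * x, but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph, and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield the result u (not 3 * x * x, as you might have expected since z = x * x * x).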

x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
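
A side-by-side comparison may help here. The following sketch computes the gradient once with the full graph and once with the detached intermediate; the names z_full, z_detached, grad_full, and grad_detached are chosen for this illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Without detaching, z = x * x * x and the gradient is 3 * x^2.
z_full = (x * x) * x
z_full.sum().backward()
grad_full = x.grad.clone()

# With detaching, u is treated as a constant, so the gradient of u * x is u = x^2.
x.grad.zero_()
u = (x * x).detach()
z_detached = u * x
z_detached.sum().backward()
grad_detached = x.grad.clone()

print(grad_full)      # values 0, 3, 12, 27
print(grad_detached)  # values 0, 1, 4, 9
```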

with autograd.record():
  y = x * x
  u = y.detach()
  z = u * x
z.backward()
x.grad == u

array([ True, True, True, True])

import jax

y = lambda x: x * x
# jax.lax primitives are Python wrappers around XLA operations
u = jax.lax.stop_gradient(y(x))
z = lambda x: u * x

grad(lambda x: z(x).sum())(x) == y(x)

Array([ True, True, True, True], dtype=bool)

# Set persistent=True to preserve the compute graph.
# This lets us run t.gradient more than once
with tf.GradientTape(persistent=True) as t:
  y = x * x
  u = tf.stop_gradient(y)
  z = u * x

x_grad = t.gradient(z, x)
x_grad == u

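Note that while this procedure detaches y's ancestors from the graph leading to z, the computational graph leading to y persists, and thus we can calculate the gradient of y with respect to x.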

x.grad.zero_()
y.sum().backward()
x.grad == 2 * x

tensor([True, True, True, True])

y.backward()
x.grad == 2 * x

array([ True, True, True, True])

grad(lambda x: y(x).sum())(x) == 2 * x

Array([ True, True, True, True], dtype=bool)

t.gradient(y, x) == 2 * x

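2.5.4. Gradients and Python Control Flow

So far we have reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To illustrate this, consider the following code snippet, where the number of iterations of the while loop and the evaluation of the if statement both depend on the value of the input a.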

def f(a):
  b = a * 2
  while b.norm() < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while np.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while jnp.linalg.norm(b) < 1000:
    b = b * 2
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c

def f(a):
  b = a * 2
  while tf.norm(b) < 1000:
    b = b * 2
  if tf.reduce_sum(b) > 0:
    c = b
  else:
    c = 100 * b
  return c

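Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. However, whenever we execute f(a) on a specific input, we realize a specific computational graph and can subsequently run backward.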

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()

a = np.random.normal()
a.attach_grad()
with autograd.record():
  d = f(a)
d.backward()

from jax import random

a = random.normal(random.PRNGKey(1), ())
d = f(a)
d_grad = grad(f)(a)

a = tf.Variable(tf.random.normal(shape=()))
with tf.GradientTape() as t:
  d = f(a)
d_grad = t.gradient(d, a)
d_grad

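Even though our function f is a bit contrived for demonstration purposes, its dependence on the input is quite simple: it is a linear function of a with a piecewise-defined scale. As such, f(a) / a is a vector of constant entries and, moreover, f(a) / a needs to match the gradient of f(a) with respect to a.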

a.grad == d / a

tensor(True)
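
The same claim can be checked numerically without any framework. The sketch below rewrites f for plain Python floats (replacing the norm with abs, which is an assumption made for this illustration) and compares f(a) / a against a central finite difference at a fixed, hand-picked input.

```python
# Scalar stand-in for the function f above, using plain Python floats
# (abs replaces the norm; this rewrite is an assumption made for illustration).
def f_scalar(a):
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b


a = 0.3
slope = f_scalar(a) / a   # the piecewise-constant scale factor

# Central finite difference; h is small enough that a - h and a + h
# stay on the same linear piece.
h = 1e-6
numeric = (f_scalar(a + h) - f_scalar(a - h)) / (2 * h)

print(slope, numeric)  # both are approximately 4096
```

Note that an input of exactly zero would make the while loop run forever, so the check assumes a nonzero a.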

a.grad == d / a

array(True)

d_grad == d / a

Array(True, dtype=bool)

d_grad == d / a

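Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.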
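
As a small illustration of a graph that depends on input length, consider the toy function below. The function score and its arguments are hypothetical and stand in for a model that processes one token at a time; they are not part of any library.

```python
import torch


def score(weight, tokens):
    # The loop runs once per token, so the recorded graph grows with the input length.
    out = torch.tensor(1.0)
    for token in tokens:
        out = out * weight * token
    return out


w = torch.tensor(2.0, requires_grad=True)

short = score(w, [1.0, 3.0])        # equals 3 * w^2
long = score(w, [1.0, 3.0, 0.5])    # equals 1.5 * w^3

(grad_short,) = torch.autograd.grad(short, w)
(grad_long,) = torch.autograd.grad(long, w)
print(grad_short.item(), grad_long.item())  # 12.0 18.0
```

The two calls build graphs of different depth, yet the same differentiation call handles both.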

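2.5.5. Discussion

You have now gotten a taste of the power of automatic differentiation. The development of libraries for calculating derivatives both automatically and efficiently has been a massive productivity booster for deep learning practitioners, freeing them to focus on higher-level problems. Moreover, autograd lets us design massive models for which pen-and-paper gradient computations would be prohibitively time-consuming. Interestingly, while we use autograd to optimize models (in a statistical sense), the optimization of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are leveraged to compute results in the most expedient and memory-efficient manner.

For now, try to remember these basics: (i) attach gradients to those variables with respect to which we desire derivatives; (ii) record the computation of the target value; (iii) execute the backpropagation function; and (iv) access the resulting gradient.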
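
The four steps can be sketched in PyTorch as follows. The choice of the exponential as the target function is arbitrary and made only for this illustration.

```python
import torch

# (i) Attach gradients to the variable we want derivatives for.
x = torch.arange(4.0, requires_grad=True)

# (ii) Record the computation of the target value.
y = torch.exp(x).sum()

# (iii) Execute the backpropagation function.
y.backward()

# (iv) Access the resulting gradient; for this function it equals exp(x).
print(x.grad)
```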

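2.5.6. Exercises

Why is the second derivative much more expensive to compute than the first derivative?

After running the function for backpropagation, immediately run it again and see what happens. Why?

In the control flow example where we calculate the derivative of d with respect to a, what would happen if we changed the variable a to a random vector or a matrix? At this point, the result of the calculation f(a) is no longer a scalar. What happens to the result? How do we analyze this?

Let $f(x) = \sin(x)$. Plot the graph of f and of its derivative $f'$. Do not exploit the fact that $f'(x) = \cos(x)$ but rather use automatic differentiation to get the result.

Let $f(x) = ((\log x^2) \cdot \sin x) + x^{-1}$. Write out a dependency graph tracing results from x to f(x).

Use the chain rule to compute the derivative $\frac{df}{dx}$ of the aforementioned function, placing each term on the dependency graph that you constructed previously.

Given the graph and the intermediate derivative results, you have a number of options when computing the gradient. Evaluate the result once starting from x to f and once from f tracing back to x. The path from x to f is commonly known as forward differentiation, whereas the path from f to x is known as backward differentiation.

When might you want to use forward differentiation, and when backward differentiation? Hint: consider the amount of intermediate data needed, the ability to parallelize steps, and the size of the matrices and vectors involved.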