With the BERT model implemented in Section 15.8 and the pretraining examples generated from the WikiText-2 dataset in Section 15.9, we will pretrain BERT on the WikiText-2 dataset in this section.
To begin, we load the WikiText-2 dataset as minibatches of pretraining examples for masked language modeling and next sentence prediction. The batch size is 512 and the maximum length of a BERT input sequence is 64. Note that in the original BERT model, the maximum length is 512.
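In code, this loading step can be sketched as follows, assuming the `load_data_wiki` function saved in Section 15.9 and the usual imports of the PyTorch tab:
import torch
from torch import nn
from d2l import torch as d2l

# Each minibatch packs 512 sentence pairs, truncated/padded to length 64
batch_size, max_len = 512, 64
train_iter, vocab = d2l.load_data_wiki(batch_size, max_len)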
15.10.1. Pretraining BERT
The original BERT comes in two versions of different model sizes (Devlin et al., 2018). The base model (BERT_BASE) uses 12 layers (Transformer encoder blocks) with 768 hidden units (hidden size) and 12 self-attention heads. The large model (BERT_LARGE) uses 24 layers with 1024 hidden units and 16 self-attention heads. Notably, the former has 110 million parameters while the latter has 340 million parameters. For ease of demonstration, we define a small BERT, using 2 layers, 128 hidden units, and 2 self-attention heads.
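As a sketch, the small model, the available GPUs, and the loss can be set up as follows; the long keyword-argument list assumes the `BERTModel` constructor signature from our Section 15.8 implementation, and `reduction='none'` is used so that the helper function defined next can weight out padded prediction positions:
net = d2l.BERTModel(len(vocab), num_hiddens=128, norm_shape=[128],
                    ffn_num_input=128, ffn_num_hiddens=256, num_heads=2,
                    num_layers=2, dropout=0.2, key_size=128, query_size=128,
                    value_size=128, hid_in_features=128, mlm_in_features=128,
                    nsp_in_features=128)
devices = d2l.try_all_gpus()
# Unreduced cross-entropy, so per-position masked-LM weights can be applied
loss = nn.CrossEntropyLoss(reduction='none')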
Before defining the training loop, we define a helper function `_get_batch_loss_bert`. Given a minibatch of training examples, this function computes the loss for both the masked language modeling and next sentence prediction tasks. Note that the final loss of BERT pretraining is just the sum of the masked language modeling loss and the next sentence prediction loss.
#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X,
                         segments_X, valid_lens_x,
                         pred_positions_X, mlm_weights_X,
                         mlm_Y, nsp_y):
    # Forward pass
    _, mlm_Y_hat, nsp_Y_hat = net(tokens_X, segments_X,
                                  valid_lens_x.reshape(-1),
                                  pred_positions_X)
    # Compute masked language model loss (`loss` uses reduction='none',
    # so padded prediction positions are zeroed out by `mlm_weights_X`)
    mlm_l = loss(mlm_Y_hat.reshape(-1, vocab_size), mlm_Y.reshape(-1)) *\
        mlm_weights_X.reshape(-1)
    mlm_l = mlm_l.sum() / (mlm_weights_X.sum() + 1e-8)
    # Compute next sentence prediction loss
    nsp_l = loss(nsp_Y_hat, nsp_y).mean()
    l = mlm_l + nsp_l
    return mlm_l, nsp_l, l
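The corresponding MXNet implementation below computes the same losses over the shards of a minibatch that are split across multiple GPUs; it assumes the MXNet imports from the earlier sections (e.g., `gluon`, `np`, and `npx` from `mxnet`).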
#@save
def _get_batch_loss_bert(net, loss, vocab_size, tokens_X_shards,
                         segments_X_shards, valid_lens_x_shards,
                         pred_positions_X_shards, mlm_weights_X_shards,
                         mlm_Y_shards, nsp_y_shards):
    mlm_ls, nsp_ls, ls = [], [], []
    for (tokens_X_shard, segments_X_shard, valid_lens_x_shard,
         pred_positions_X_shard, mlm_weights_X_shard, mlm_Y_shard,
         nsp_y_shard) in zip(
        tokens_X_shards, segments_X_shards, valid_lens_x_shards,
        pred_positions_X_shards, mlm_weights_X_shards, mlm_Y_shards,
        nsp_y_shards):
        # Forward pass
        _, mlm_Y_hat, nsp_Y_hat = net(
            tokens_X_shard, segments_X_shard, valid_lens_x_shard.reshape(-1),
            pred_positions_X_shard)
        # Compute masked language model loss
        mlm_l = loss(
            mlm_Y_hat.reshape((-1, vocab_size)), mlm_Y_shard.reshape(-1),
            mlm_weights_X_shard.reshape((-1, 1)))
        mlm_l = mlm_l.sum() / (mlm_weights_X_shard.sum() + 1e-8)
        # Compute next sentence prediction loss
        nsp_l = loss(nsp_Y_hat, nsp_y_shard)
        nsp_l = nsp_l.mean()
        mlm_ls.append(mlm_l)
        nsp_ls.append(nsp_l)
        ls.append(mlm_l + nsp_l)
        # Wait for all asynchronous computations to finish
        npx.waitall()
    return mlm_ls, nsp_ls, ls
Invoking the two aforementioned helper functions, the following `train_bert` function defines the procedure to pretrain BERT (`net`) on the WikiText-2 (`train_iter`) dataset. Training BERT can take very long. Instead of specifying the number of epochs for training as in the `train_ch13` function (see Section 14.1), the input `num_steps` of the following function specifies the number of iteration steps for training.
def train_bert(train_iter, net, loss, vocab_size, devices, num_steps):
    # A forward pass on one minibatch ensures all parameters are created
    # before the model is wrapped with DataParallel
    net(*next(iter(train_iter))[:4])
    net = nn.DataParallel(net, device_ids=devices).to(devices[0])
    trainer = torch.optim.Adam(net.parameters(), lr=0.01)
    step, timer = 0, d2l.Timer()
    animator = d2l.Animator(xlabel='step', ylabel='loss',
                            xlim=[1, num_steps], legend=['mlm', 'nsp'])
    # Sum of masked language modeling losses, sum of next sentence prediction
    # losses, no. of sentence pairs, count
    metric = d2l.Accumulator(4)
    num_steps_reached = False
    while step < num_steps and not num_steps_reached:
        for tokens_X, segments_X, valid_lens_x, pred_positions_X,\
                mlm_weights_X, mlm_Y, nsp_y in train_iter:
            tokens_X = tokens_X.to(devices[0])
            segments_X = segments_X.to(devices[0])
            valid_lens_x = valid_lens_x.to(devices[0])
            pred_positions_X = pred_positions_X.to(devices[0])
            mlm_weights_X = mlm_weights_X.to(devices[0])
            mlm_Y, nsp_y = mlm_Y.to(devices[0]), nsp_y.to(devices[0])
            trainer.zero_grad()
            timer.start()
            mlm_l, nsp_l, l = _get_batch_loss_bert(
                net, loss, vocab_size, tokens_X, segments_X, valid_lens_x,
                pred_positions_X, mlm_weights_X, mlm_Y, nsp_y)
            l.backward()
            trainer.step()
            metric.add(mlm_l, nsp_l, tokens_X.shape[0], 1)
            timer.stop()
            animator.add(step + 1,
                         (metric[0] / metric[3], metric[1] / metric[3]))
            step += 1
            if step == num_steps:
                num_steps_reached = True
                break
    print(f'MLM loss {metric[0] / metric[3]:.3f}, '
          f'NSP loss {metric[1] / metric[3]:.3f}')
    print(f'{metric[2] / timer.sum():.1f} sentence pairs/sec on '
          f'{str(devices)}')
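With the data iterator, the model, and the loss in place, pretraining can then be kicked off for a fixed number of steps, e.g., `train_bert(train_iter, net, loss, len(vocab), devices, 50)` (a usage sketch assuming the objects defined above). An MXNet/Gluon implementation of the same training loop follows.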
def train_bert(train_iter, net, loss, vocab_size, devices, num_steps):
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': 0.01})
    step, timer = 0, d2l.Timer()
    animator = d2l.Animator(xlabel='step', ylabel='loss',
                            xlim=[1, num_steps], legend=['mlm', 'nsp'])
    # Sum of masked language modeling losses, sum of next sentence prediction
    # losses, no. of sentence pairs, count
    metric = d2l.Accumulator(4)