PyTorch Tutorial 15.9: The Dataset for Pretraining BERT - 电子发烧友网

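To pretrain the BERT model implemented in Section 15.8, we need to generate the dataset in a format that supports the two pretraining tasks: masked language modeling and next sentence prediction. On the one hand, the original BERT model was pretrained on the concatenation of two huge corpora, BookCorpus and English Wikipedia (see Section 15.8.5), which makes it hard for most readers of this book to run. On the other hand, an off-the-shelf pretrained BERT model may not fit applications in specific domains such as medicine. Pretraining BERT on a customized dataset is therefore becoming increasingly popular. To make the demonstration of BERT pretraining practical, we use a smaller corpus, WikiText-2 (Merity et al., 2016).

Compared with the PTB dataset used for pretraining word2vec in Section 15.3, WikiText-2 (i) retains the original punctuation, which makes it suitable for next sentence prediction; (ii) retains the original case and numbers; and (iii) is more than twice as large.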

import os
import random
import torch
from d2l import torch as d2l

						 


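In the WikiText-2 dataset, each line represents a paragraph, and a space is inserted between any punctuation mark and its preceding token. Only paragraphs with at least two sentences are retained. To split sentences, we use only the period as the delimiter for simplicity. More complex sentence splitting techniques are left to the exercises at the end of this section.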

#@save
d2l.DATA_HUB['wikitext-2'] = (
    'https://s3.amazonaws.com/research.metamind.io/wikitext/'
    'wikitext-2-v1.zip', '3c914d17d80b1459be871a5039ac23e752a53cbe')

#@save
def _read_wiki(data_dir):
    file_name = os.path.join(data_dir, 'wiki.train.tokens')
    # An explicit encoding avoids platform-dependent defaults, since the
    # corpus contains non-ASCII characters
    with open(file_name, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Uppercase letters are converted to lowercase ones
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
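
To see what the filter and the split do without downloading the corpus, the same expressions can be applied to a few in-memory lines. This is only a sketch: the two sample lines are made up to imitate the WikiText-2 layout and are not taken from the dataset.

```python
import random

# Two made-up lines in the WikiText-2 style: one paragraph per line,
# with a space inserted before each punctuation mark.
lines = [
    'The Tower is 30 m tall . It was built in 1990 . Visitors may climb it .\n',
    ' = Heading without a full sentence = \n',
]

# Same filter and split as in `_read_wiki`: keep lines that contain at least
# one ' . ' separator, lowercase them, and split on that separator.
paragraphs = [line.strip().lower().split(' . ')
              for line in lines if len(line.split(' . ')) >= 2]

random.seed(0)  # fixed seed only so that the shuffle is repeatable
random.shuffle(paragraphs)
print(paragraphs)
```

The heading line is dropped because it contains no ' . ' separator. The final sentence keeps its trailing period, because the separator requires a space on both sides.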

						 


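15.9.1. Defining Helper Functions for Pretraining Tasks

In the following, we begin by implementing helper functions for the two BERT pretraining tasks: next sentence prediction and masked language modeling. These helper functions will be invoked later, when the raw text corpus is transformed into a dataset in the format needed to pretrain BERT.

15.9.1.1. Generating the Next Sentence Prediction Task

Following the description in Section 15.8.5.2, the _get_next_sentence function generates a training example for the binary classification task.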

#@save
def _get_next_sentence(sentence, next_sentence, paragraphs):
    if random.random() < 0.5:
        is_next = True
    else:
        # `paragraphs` is a list of lists of lists
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next
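
The sampling behavior can be checked on a toy corpus. The sketch below uses `sample_nsp_pair`, a hypothetical stand-in that repeats the logic of `_get_next_sentence` so that the block runs on its own; the toy sentences and the trial count are arbitrary choices for illustration.

```python
import random

def sample_nsp_pair(sentence, next_sentence, paragraphs):
    """Hypothetical stand-in that mirrors `_get_next_sentence`."""
    if random.random() < 0.5:
        is_next = True
    else:
        # Draw a random paragraph, then a random sentence from it
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next

# Toy corpus: a list of paragraphs, each a list of tokenized sentences
paragraphs = [
    [['hello', 'world'], ['how', 'are', 'you'], ['fine', 'thanks']],
    [['bert', 'is', 'a', 'model'], ['it', 'uses', 'transformers']],
]

random.seed(0)  # fixed seed only for repeatability
num_trials = 10000
num_positive = 0
num_accidental = 0  # negatives that happen to be the true next sentence
first, second = paragraphs[0][0], paragraphs[0][1]
for _ in range(num_trials):
    _, candidate, is_next = sample_nsp_pair(first, second, paragraphs)
    if is_next:
        num_positive += 1
    elif candidate == second:
        num_accidental += 1

positive_fraction = num_positive / num_trials
print(f'labeled is_next=True: {positive_fraction:.3f}')
print(f'negatives equal to the true next sentence: {num_accidental}')
```

Roughly half of the examples are labeled as true next sentences. Because the negative is drawn from the whole corpus, it can coincide with the actual next sentence; in this tiny corpus that happens for about one in six negatives, whereas in a large corpus the effect should be negligible.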

								 


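The following function generates training examples for next sentence prediction from the input paragraph by invoking the _get_next_sentence function. Here paragraph is a list of sentences, where each sentence is a list of tokens. The argument max_len specifies the maximum length of a BERT input sequence during pretraining.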

#@save
def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len):
    nsp_data_from_paragraph = []
    for i in range(len(paragraph) - 1):
        tokens_a, tokens_b, is_next = _get_next_sentence(
            paragraph[i], paragraph[i + 1], paragraphs)
        # Consider 1 '<cls>' token and 2 '<sep>' tokens
        if len(tokens_a) + len(tokens_b) + 3 > max_len:
            continue
        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        nsp_data_from_paragraph.append((tokens, segments, is_next))
    return nsp_data_from_paragraph
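
The "+ 3" in the length check accounts for the special tokens added when the two sentences are joined. The sketch below illustrates this with a local stand-in for `d2l.get_tokens_and_segments`; its body is an assumption about how that helper from Section 15.8 behaves, and the sentences and the `max_len` value are made up.

```python
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Local stand-in for `d2l.get_tokens_and_segments` (assumed behavior)."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # Segment 0 covers '<cls>', the first sentence, and its '<sep>'
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        # Segment 1 covers the second sentence and its '<sep>'
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments

tokens_a = ['this', 'movie', 'is', 'great']
tokens_b = ['i', 'like', 'it']
max_len = 10  # illustrative value, not from the text

tokens, segments = get_tokens_and_segments(tokens_a, tokens_b)
# Same test as in `_get_nsp_data_from_paragraph`, written as "fits"
fits = len(tokens_a) + len(tokens_b) + 3 <= max_len
print(tokens)
print(segments)
print(fits)
```

Here the pair has 4 + 3 word tokens plus 3 special tokens, so it exactly fills a sequence of length 10 and would be kept; one more word in either sentence would cause the pair to be skipped.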

								 


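15.9.1.2. Generating the Masked Language Modeling Task

To generate training examples for the masked language modeling task from a BERT input sequence, we define the following _replace_mlm_tokens function. In its inputs, tokens is a list of tokens representing a BERT input sequence, candidate_pred_positions is a list of token indices of the BERT input sequence excluding those of special tokens (special tokens are not predicted in the masked language modeling task), and num_mlm_preds indicates the number of predictions (recall that 15% of random tokens are predicted). Following the definition of the masked language modeling task in Section 15.8.5.1, at each prediction position the input may be replaced by a special “<mask>” token or a random token, or remain unchanged. In the end, the function returns the input tokens after possible replacement, the token indices where predictions take place, and the labels for these predictions.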

#@save
def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds,
                        vocab):
    # For the input of a masked language model, make a new copy of tokens and
    # replace some of them by '<mask>' or random tokens
    mlm_input_tokens = [token for token in tokens]
    pred_positions_and_labels = []
    # Shuffle for getting 15% random tokens for prediction in the masked
    # language modeling task
    random.shuffle(candidate_pred_positions)
    for mlm_pred_position in candidate_pred_positions:
        if len(pred_positions_and_labels) >= num_mlm_preds:
            break
        masked_token = None
        # 80% of the time: replace the word with the '<mask>' token
        if random.random() < 0.8:
            masked_token = '<mask>'
        else:
            # 10% of the time: keep the word unchanged
            if random.random() < 0.5:
                masked_token = tokens[mlm_pred_position]
            # 10% of the time: replace the word with a random word
            else:
                masked_token = random.choice(vocab.idx_to_token)
        mlm_input_tokens[mlm_pred_position] = masked_token
        pred_positions_and_labels.append(
            (mlm_pred_position, tokens[mlm_pred_position]))
    return mlm_input_tokens, pred_positions_and_labels
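
The nested random tests yield the 80%/10%/10% split because the second draw only happens in the remaining 20% of cases and divides them in half. A small simulation can confirm this empirically; `sample_mlm_action` is a hypothetical helper written for this check, and the seed and trial count are arbitrary.

```python
import random
from collections import Counter

def sample_mlm_action():
    """Draw one replacement action using the same nested test as above."""
    if random.random() < 0.8:
        return 'mask'
    # Reached about 20% of the time; the second draw splits it in half
    if random.random() < 0.5:
        return 'keep'
    return 'random'

random.seed(0)  # fixed seed only for repeatability
num_trials = 100000
counts = Counter(sample_mlm_action() for _ in range(num_trials))
fractions = {action: counts[action] / num_trials
             for action in ('mask', 'keep', 'random')}
print(fractions)
```

The printed fractions should land close to 0.8, 0.1, and 0.1, up to sampling noise.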

								 


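By invoking the aforementioned _replace_mlm_tokens function, the following function takes a BERT input sequence (tokens) as input and returns the indices of the input tokens (after possible token replacement, as described in Section 15.8.5.1), the token indices where predictions take place, and the label indices for these predictions.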

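#@save
def _get_mlm_data_from_tokens(tokens, vocab):
    candidate_pred_positions = []
    # `tokens` is a list of strings
    for i, token in enumerate(tokens):
        # Special tokens are not predicted in the masked language modeling
        # task
        if token in ['<cls>', '<sep>']:
            continue
        candidate_pred_positions.append(i)
    # 15% of random tokens are predicted in the masked language modeling task
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens(
        tokens, candidate_pred_positions, num_mlm_preds, vocab)
    pred_positions_and_labels = sorted(pred_positions_and_labels,
                                       key=lambda x: x[0])
    # The source text is cut off after `pred_positions`. The lines below are
    # reconstructed from the description above and assume that `vocab` maps a
    # list of tokens to a list of indices when indexed
    pred_positions = [v[0] for v in pred_positions_and_labels]
    mlm_pred_labels = [v[1] for v in pred_positions_and_labels]
    return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels]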
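
The first two steps of this function, collecting candidate positions and computing the prediction budget, can be traced by hand on a short sequence. The following sketch uses a made-up 12-token input and no vocabulary, so it only illustrates those two steps.

```python
# A made-up BERT input sequence of 12 tokens (9 words plus 3 special tokens)
tokens = ['<cls>', 'this', 'movie', 'is', 'really', 'great', '<sep>',
          'i', 'like', 'it', 'too', '<sep>']

# Special tokens are never chosen as prediction targets
candidate_pred_positions = [
    i for i, token in enumerate(tokens) if token not in ('<cls>', '<sep>')]

# 15% of the sequence length (special tokens included), but at least one
num_mlm_preds = max(1, round(len(tokens) * 0.15))

print(candidate_pred_positions)
print(num_mlm_preds)
```

Note that the budget is computed from the full sequence length, including special tokens, while the candidates exclude them. The `max(1, ...)` guard ensures that even very short sequences contribute at least one prediction.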
							
