PyTorch Tutorial 9.2: Converting Raw Text into Sequence Data - 电子发烧友网

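Throughout this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get started, we need some basic tools for converting raw text into sequences of the appropriate form. A typical preprocessing pipeline executes the following steps: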

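1. Load the text as a string into memory.
2. Split the string into tokens (e.g., words or characters).
3. Build a vocabulary dictionary that associates each vocabulary element with a numerical index.
4. Convert the text into a sequence of numerical indices.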
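
The four steps above can be sketched in plain Python without any deep learning framework. This is a minimal illustration rather than the implementation used later in this section: the sample string, the character-level split, and the names `tokens`, `token_to_idx`, and `indices` are assumptions chosen for the example.

```python
# Step 1: the text is already in memory as a string (a made-up sample).
sample = "the time machine"

# Step 2: split the string into tokens; here each character is one token.
tokens = list(sample)

# Step 3: build a vocabulary mapping each unique token to a numerical index.
token_to_idx = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Step 4: convert the text into a sequence of numerical indices.
indices = [token_to_idx[tok] for tok in tokens]

print(token_to_idx)
print(indices)
```

Sorting the unique tokens before numbering them makes the mapping deterministic, so the same text always yields the same indices.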

import collections
import random
import re
import torch
from d2l import torch as d2l

						 


9.2.1. Reading the Dataset

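Here, we will work with H. G. Wells' The Time Machine, a book of just over 30,000 words. While real applications typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.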

class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        # Fetch the text file into self.root and return its contents
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]

							 

'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'
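
The long hexadecimal string passed to d2l.download is 40 characters long, the length of a SHA-1 digest, which suggests it is used to verify the downloaded file. The sketch below shows how such a check could be done with the standard library. The helper name `sha1_of_file` and the temporary file are hypothetical and not part of d2l.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha1.update(chunk)
    return sha1.hexdigest()

# Demonstrate on a small temporary file rather than the real dataset.
content = b'The Time Machine, by H. G. Wells [1898]'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

digest = sha1_of_file(tmp_path)
expected = hashlib.sha1(content).hexdigest()
os.remove(tmp_path)

print(digest == expected)
```

Reading in chunks keeps memory use flat, which matters little for a 30,000-word book but helps with the larger datasets mentioned above.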


For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.

@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    # Replace each run of non-letter characters with one space, then lowercase
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]

							 

'the time machine by h g wells i the time traveller for so it'
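
To see what this regular expression does to punctuation, digits, and apostrophes, here is a small standalone sketch. The function name `preprocess` and the sample sentence are made up for illustration; only the `re.sub` pattern is taken from the code above.

```python
import re

def preprocess(text):
    # Replace each run of non-letter characters with one space, then lowercase.
    return re.sub('[^A-Za-z]+', ' ', text).lower()

sample = "Hello, World! It's 1898."
cleaned = preprocess(sample)
print(repr(cleaned))  # 'hello world it s '
```

Note two side effects: the apostrophe splits "It's" into "it" and "s", and the digits disappear entirely, leaving a trailing space. This explains fragments such as "h g wells" in the output above.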


9.2.2. Tokenization

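Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what exactly constitutes a token is a design choice. For example, we could represent the sentence “Baby needs a new pair of shoes” as a sequence of 7 words, where the set of all words forms a large vocabulary (typically tens or hundreds of thousands of words). Alternatively, we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (standard ASCII defines only 128 characters, and even an 8-bit extended character set has just 256). Below, we tokenize our preprocessed text into a sequence of characters.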

@d2l.add_to_class(TimeMachine) #@save
def _tokenize(self, text):
    # Character-level tokenization: each character (spaces included) is a token
    return list(text)

tokens = data._tokenize(text)
','.join(tokens[:30])

							 

't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
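
The trade-off between word-level and character-level tokens can be checked directly on the example sentence. This is a small sketch; the variable names are illustrative, and splitting on whitespace is only one simple way to obtain word tokens.

```python
sentence = "Baby needs a new pair of shoes"

word_tokens = sentence.split()  # split on whitespace
char_tokens = list(sentence)    # one token per character, spaces included

print(len(word_tokens))  # 7
print(len(char_tokens))  # 30

# Number of distinct tokens needed to represent this one sentence.
print(len(set(word_tokens)), len(set(char_tokens)))
```

Character tokens give a longer sequence drawn from a small, closed set of symbols, whereas word tokens give a shorter sequence whose vocabulary keeps growing as more text is seen.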


9.2.3. Vocabulary

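These tokens are still strings. However, the inputs to our models must ultimately consist of numerical values. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that had not been previously seen or was dropped from the vocabulary, we represent it by a special “<unk>” token, signifying that this is an unknown value.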

class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # NOTE: the source text is cut off partway through this method. The
        # fallback value and the list branch below are a reconstruction that
        # assumes unknown tokens map to the index of '<unk>'.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.token_to_idx['<unk>'])
        return [self.__getitem__(token) for token in tokens]
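
To make the mapping concrete, here is a minimal, self-contained sketch of the same idea using plain functions and dictionaries. It is not the Vocab class itself: the function names `build_vocab` and `encode` and the sample strings are assumptions for illustration. It follows the steps described in the text: count token frequencies, drop tokens below `min_freq`, and map unseen or dropped tokens to '<unk>'.

```python
import collections

def build_vocab(tokens, min_freq=0):
    """Map each sufficiently frequent token, plus '<unk>', to a unique index."""
    counter = collections.Counter(tokens)
    kept = {tok for tok, freq in counter.items() if freq >= min_freq}
    idx_to_token = sorted(kept | {'<unk>'})
    token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

def encode(tokens, token_to_idx):
    """Convert tokens to indices, falling back to the '<unk>' index."""
    unk = token_to_idx['<unk>']
    return [token_to_idx.get(tok, unk) for tok in tokens]

# Build a character vocabulary, keeping only characters seen at least twice.
tokens = list('the time machine')
idx_to_token, token_to_idx = build_vocab(tokens, min_freq=2)

# 'z' and 'o' never appear in the corpus, so they map to '<unk>'.
indices = encode(list('the zoo'), token_to_idx)

print(idx_to_token)  # [' ', '<unk>', 'e', 'h', 'i', 'm', 't']
print(indices)       # [6, 3, 2, 0, 1, 1, 1]
```

In this sample, 'a', 'c', and 'n' each occur only once and are dropped by `min_freq=2`, so they too would be encoded as '<unk>' if they appeared in new text.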
				
