This is Part 3 of a blog series on handling object-oriented datasets with Python and PyTorch.
Part 3: repetita iuvant(*): cats and dogs
(*) a Latin phrase meaning "repetition helps" — roughly, "practice makes perfect"
In this post we repeat, on the "cats and dogs" dataset, the process already carried out in Part 2, and we add a few extras.
Often, simple datasets are organized as folders: cats, dogs, and "train", "validation" and "test" folders for each class.
By organizing the dataset as a single object we avoid the complexity of the folder tree. In this application, all images are kept in one folder.
All we need is one label file stating which image is a dog and which is a cat.
Below is the code that creates the label file automatically. Even though each image name already contains its label, we deliberately create a dedicated labels.txt file, in which every line holds a filename and a label: cat = 0, dog = 1.
At the end of this example we will review two ways of splitting the dataset with PyTorch, and train a very simple model.
In [ ]:
data_path = './raw_data/dogs_cats/all'
import os
files = [f for f in os.listdir(data_path)]
#for f in files:
#    print(f)
with open(data_path + '/' + "labels.txt", "a") as myfile:
    for f in files:
        if f.split('.')[0] == 'cat':
            label = 0
        elif f.split('.')[0] == 'dog':
            label = 1
        else:
            print("ERROR in recognizing file " + f + " label")
            continue  # skip files whose name does not encode a label
        myfile.write(f + ' ' + str(label) + '\n')
In [106]:
from PIL import Image
import matplotlib.pyplot as plt

raw_data_path = './raw_data/dogs_cats/all'
im_example_cat = Image.open(raw_data_path + '/' + 'cat.1070.jpg')
im_example_dog = Image.open(raw_data_path + '/' + 'dog.1070.jpg')
fig, axs = plt.subplots(1, 2, figsize=(10, 3))
axs[0].set_title('should be a cat')
axs[0].imshow(im_example_cat)
axs[1].set_title('should be a dog')
axs[1].imshow(im_example_dog)
plt.show()
Remember to refresh the sample list:
In [ ]:
import csv
import functools

# drop the cached list so it gets rebuilt
del sample_list

@functools.lru_cache(1)
def getSampleInfoList(raw_data_path):
    # DataInfoTuple and myFunc are defined in Part 2 of this series
    sample_list = []
    with open(str(raw_data_path) + '/labels.txt', mode='r') as f:
        reader = csv.reader(f, delimiter=' ')
        for i, row in enumerate(reader):
            imgname = row[0]
            label = int(row[1])
            sample_list.append(DataInfoTuple(imgname, label))
    sample_list.sort(reverse=False, key=myFunc)
    # print("DataInfoTuple: samples list length = {}".format(len(sample_list)))
    return sample_list
Creating the dataset object takes a single line of code:
In [114]:
mydataset = MyDataset(isValSet_bool = None, raw_data_path = raw_data_path, norm = False, resize = True, newsize = (64, 64))
If normalization is needed, compute the mean and standard deviation, then regenerate a normalized dataset.
The code is reported below for completeness.
In [ ]:
imgs = torch.stack([img_t for img_t, _ in mydataset], dim = 3)
im_mean = imgs.view(3, -1).mean(dim=1).tolist()
im_std = imgs.view(3, -1).std(dim=1).tolist()
del imgs
normalize = transforms.Normalize(mean=im_mean, std=im_std)
mydataset = MyDataset(isValSet_bool = None, raw_data_path = raw_data_path, norm = True, resize = True, newsize = (64, 64))
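As a quick sanity check of this recipe, the same per-channel statistics can be computed on a random stand-in tensor (shaped like the stack built above, (3, H, W, N)) and verified to yield roughly zero mean and unit standard deviation after normalization. This is a minimal sketch that does not touch MyDataset; the shapes and scaling factors are illustrative assumptions:

```python
import torch

# Stand-in for the stacked dataset: 100 fake RGB 64x64 images, stacked on dim 3
imgs = torch.rand(3, 64, 64, 100) * 5 + 2

im_mean = imgs.view(3, -1).mean(dim=1)  # per-channel mean over all pixels of all images
im_std = imgs.view(3, -1).std(dim=1)    # per-channel std

# Normalization is (x - mean) / std, broadcast over the channel dimension
normed = (imgs - im_mean.view(3, 1, 1, 1)) / im_std.view(3, 1, 1, 1)

print(normed.view(3, -1).mean(dim=1))  # close to [0, 0, 0]
print(normed.view(3, -1).std(dim=1))   # close to [1, 1, 1]
```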
Splitting the dataset into training, validation and test sets
The next step is required for the training phase. Usually the whole sample dataset is shuffled and then split into three sets: training, validation and test.
If you have organized your dataset as a data tensor plus a label tensor, you can call "sklearn.model_selection.train_test_split" twice:
first split into "train" and "test", then split "train" again into "validation" and "train".
It would look like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)
But we want to keep the dataset as an object, and PyTorch makes this easy for us.
As an example, we create only the "train" and "validation" sets.
Method 1:
Here we shuffle the indices, then create the datasets.
In [ ]:
n_samples = len(mydataset)
# how many samples will go into the validation set
n_val = int(0.2 * n_samples)
# important! shuffle the dataset; we start by shuffling the indices
shuffled_indices = torch.randperm(n_samples)
# first step: split the indices
train_indices = shuffled_indices[:-n_val]
val_indices = shuffled_indices[-n_val:]
train_indices, val_indices
In [ ]:
from torch.utils.data.sampler import SubsetRandomSampler
batch_size = 64
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(val_indices)
train_loader = torch.utils.data.DataLoader(mydataset, batch_size=batch_size, sampler=train_sampler)
validation_loader = torch.utils.data.DataLoader(mydataset, batch_size=batch_size, sampler=valid_sampler)
Method 2
Here is an example of shuffling the dataset directly. The code style is more abstract:
In [116]:
train_size = int(0.9 * len(mydataset))
valid_size = int(0.1 * len(mydataset))
train_dataset, valid_dataset = torch.utils.data.random_split(mydataset, [train_size, valid_size])
# uncomment if a "test" dataset is needed as well
#test_size = valid_size
#train_size = train_size - test_size
#train_dataset, test_dataset = torch.utils.data.random_split(train_dataset, [train_size, test_size])
len(mydataset), len(train_dataset), len(valid_dataset)
Out[116]:
(25000, 22500, 2500)
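One caveat: `int(0.9 * n)` and `int(0.1 * n)` only sum to `n` when the truncation happens to work out (25000 does); for other lengths `random_split` raises an error. It is safer to derive one size from the other. A minimal sketch on a dummy dataset with a deliberately awkward length:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Dummy dataset with a length where int(0.9*n) + int(0.1*n) != n
dummy = TensorDataset(torch.arange(25001))

train_size = int(0.9 * len(dummy))
valid_size = len(dummy) - train_size  # remainder, so the sizes always sum to len(dummy)

train_ds, valid_ds = random_split(dummy, [train_size, valid_size])
print(len(train_ds), len(valid_ds))  # 22500 2501
```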
Model definition
In [41]:
import torch.nn as nn
import torch.nn.functional as F
n_out = 2
In [ ]:
# a very small NN
# expected accuracy around 0.66
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(8 * 16 * 16, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
        out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
        #print(out.shape)
        out = out.view(-1, 8 * 16 * 16)
        out = torch.tanh(self.fc1(out))
        out = self.fc2(out)
        return out
In [131]:
# a deeper model - but training times start to become painful on my CPU
class ResBlock(nn.Module):
    def __init__(self, n_chans):
        super(ResBlock, self).__init__()
        self.conv = nn.Conv2d(n_chans, n_chans, kernel_size=3, padding=1)
        self.batch_norm = nn.BatchNorm2d(num_features=n_chans)

    def forward(self, x):
        out = self.conv(x)
        out = self.batch_norm(out)
        out = torch.relu(out)
        return out + x
In [177]:
class Net(nn.Module):
    def __init__(self, n_chans1=32, n_blocks=10):
        super(Net, self).__init__()
        self.n_chans1 = n_chans1
        self.conv1 = nn.Conv2d(3, n_chans1, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(n_chans1, n_chans1, kernel_size=3, padding=1)
        self.resblocks = nn.Sequential(*[ResBlock(n_chans=n_chans1)] * n_blocks)
        self.fc1 = nn.Linear(n_chans1 * 8 * 8, 32)
        self.fc2 = nn.Linear(32, 2)

    def forward(self, x):
        out = F.max_pool2d(torch.relu(self.conv1(x)), 2)
        out = self.resblocks(out)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)
        out = F.max_pool2d(torch.relu(self.conv3(out)), 2)
        out = out.view(-1, self.n_chans1 * 8 * 8)
        out = torch.relu(self.fc1(out))
        out = self.fc2(out)
        return out

model = Net(n_chans1=32, n_blocks=5)
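One subtlety worth flagging: `[ResBlock(n_chans=n_chans1)] * n_blocks` repeats the same Python object, so every block in the Sequential shares one set of weights (PyTorch deduplicates shared parameters, which is also why the parameter count below stays small). If independent blocks are wanted, build them with a list comprehension instead. A small stand-alone sketch contrasting the two, using Linear layers for brevity:

```python
import torch.nn as nn

shared = nn.Sequential(*([nn.Linear(4, 4)] * 3))                    # one layer object, repeated
independent = nn.Sequential(*[nn.Linear(4, 4) for _ in range(3)])   # three distinct layers

# parameters() deduplicates shared tensors, so the shared version reports far fewer
n_shared = sum(p.numel() for p in shared.parameters())       # 20 (one 4x4 weight + bias)
n_indep = sum(p.numel() for p in independent.parameters())   # 60
print(n_shared, n_indep)
print(shared[0] is shared[1])   # True: the "layers" are the same object
```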
Let's display the model size:
In [178]:
model = Net()
numel_list = [p.numel() for p in model.parameters() if p.requires_grad]
sum(numel_list), numel_list
Out[178]:
(85090, [864, 32, 9216, 32, 9216, 32, 32, 32, 65536, 32, 64, 2])
A simple and clever trick for catching shape mismatches and errors: run one forward pass before training the model:
In [180]:
model(mydataset[0][0].unsqueeze(0))
# unsqueeze is needed to add a dimension and simulate a batch
Out[180]:
tensor([[0.7951, 0.6417]], grad_fn=<AddmmBackward>)
It works!
Model training
Although this is not the goal of this post, since we have a model, why not try to train it — especially since PyTorch gives us the DataLoader for free.
The job of the DataLoader is to sample mini-batches from a dataset with a flexible sampling strategy; it shuffles the data automatically before loading the mini-batches. For reference, see https://pytorch.org/docs/stable/data.html
In [181]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print("Training on device {}.".format(device))
Training on device cpu.
In [182]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=64, shuffle=False) # note: no shuffling needed here
In [183]:
import datetime

def training_loop(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            loss_train += loss.item()
        if epoch == 1 or epoch % 5 == 0:
            print('{} Epoch {}, Training loss {}'.format(
                datetime.datetime.now(), epoch, float(loss_train)))
In [184]:
model = Net()
# to start from the pretrained model instead:
# models_data_path = './raw_data/models'
# model.load_state_dict(torch.load(models_data_path + '/cats_dogs.pt'))
In [185]:
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
training_loop(
n_epochs = 20,
optimizer = optimizer,
model = model,
loss_fn = loss_fn,
train_loader = train_loader,
)
2020-09-15 19:33:03.105620 Epoch 1, Training loss 224.0338312983513
2020-09-15 20:01:35.993491 Epoch 5, Training loss 153.11289536952972
2020-09-15 20:36:51.486071 Epoch 10, Training loss 113.09166505932808
2020-09-15 21:11:37.375586 Epoch 15, Training loss 85.17814277857542
2020-09-15 21:46:05.792975 Epoch 20, Training loss 59.60428727790713
In [189]:
for loader in [train_loader, valid_loader]:
    correct = 0
    total = 0
    with torch.no_grad():
        for imgs, labels in loader:
            outputs = model(imgs)
            _, predicted = torch.max(outputs, dim=1)
            total += labels.shape[0]
            correct += int((predicted == labels).sum())
    print("Accuracy: %f" % (correct / total))
Accuracy: 0.956756
Accuracy: 0.830800
Performance is mediocre, but the purpose here was only to verify that a dataset organized as a Python object works, and that we can train an ordinary model on it.
Also note that, to speed up training on my CPU, all images were downsampled to 64x64.
In [187]:
models_data_path = './raw_data/models'
torch.save(model.state_dict(), models_data_path + '/cats_dogs.pt')
In [ ]:
# to load the previously saved model
model = Net()
model.load_state_dict(torch.load(models_data_path + '/cats_dogs.pt'))
Appendix
Understanding DIM
The way to understand "dim" in PyTorch's sum or mean is that it collapses the specified dimension. So when it collapses dimension 0 (the rows), only one row remains: the operation runs down each column.
In [ ]:
a = torch.randn(2, 3)
a
In [ ]:
torch.mean(a)
In [ ]:
torch.mean(a, dim=0) # now collapsing rows, only one row will result
In [ ]:
torch.mean(a, dim=1) # now collapsing columns, only one column will remain
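To make the collapsing concrete, here is the same pair of calls on a fixed tensor, with shapes and values spelled out (an illustrative sketch):

```python
import torch

a = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])  # shape (2, 3): 2 rows, 3 columns

col_means = torch.mean(a, dim=0)  # collapses dim 0 (rows): one value per column
row_means = torch.mean(a, dim=1)  # collapses dim 1 (columns): one value per row

print(col_means)  # tensor([2.5000, 3.5000, 4.5000])
print(row_means)  # tensor([2., 5.])
print(col_means.shape, row_means.shape)  # torch.Size([3]) torch.Size([2])
```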
Reviewed and edited by: Huang Haoyu