Recurrent Neural Networks

s_yangyang

License: CC 4.0 BY-SA

Tags: rnn, artificial intelligence, deep learning

Article link: https://blog.csdn.net/weixin_43460403/article/details/133846379

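I. Sequence Models

1. Applications of Sequence Models

Speech recognition
DNA sequence analysis
Music generation
Sentiment classification
Machine translation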

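1.1 Autoregressive Models

An autoregressive (AR) model is a method for handling time series. It describes how the current value relates to past values, using the variable's own history to predict itself.

$x_{t} = a_{1} x_{t-1} + a_{2} x_{t-2} + \dots + a_{t-1} x_{1} + c + \epsilon_{t}$

Here $c$ is a constant term and $\epsilon_{t}$ is a random error with mean 0 and standard deviation $\sigma$; $\sigma$ is assumed to be the same for every $t$.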
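To make the recurrence concrete, here is a minimal sketch that simulates an autoregressive series using only the Python standard library. The function name `simulate_ar` and its parameters are illustrative assumptions, not part of the original post; the noise term is drawn from a zero-mean Gaussian, and only a fixed number of past values is used.

```python
import random


def simulate_ar(coeffs, c, sigma, init, steps, seed=0):
    """Simulate x_t = c + sum_k coeffs[k-1] * x_{t-k} + noise_t.

    coeffs[0] multiplies the most recent value; init supplies the first
    len(coeffs) values, oldest first. All names here are illustrative.
    """
    if len(init) != len(coeffs):
        raise ValueError("init must provide one starting value per coefficient")
    rng = random.Random(seed)
    xs = list(init)
    for _ in range(steps):
        # Most recent value first, so it lines up with coeffs[0]
        recent = xs[::-1][:len(coeffs)]
        noise = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        xs.append(c + sum(a * x for a, x in zip(coeffs, recent)) + noise)
    return xs


# Noise-free run with two lags: each value is fixed by the two before it
series = simulate_ar(coeffs=[0.5, 0.25], c=1.0, sigma=0.0,
                     init=[0.0, 0.0], steps=4)
print(series)
```

Setting `sigma` to a positive value adds the random error term, so repeated runs with different seeds give different series around the same deterministic trend.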

Modeling the conditional probability:

$p(x_{t} \mid x_{1}, \dots, x_{t-1}) = p(x_{t} \mid f(x_{1}, \dots, x_{t-1}))$

1.2 The Markov Assumption

Assume the current value depends only on the $p$ most recent past values:

$p(x_{t} \mid x_{1}, \dots, x_{t-1}) = p(x_{t} \mid x_{t-p}, \dots, x_{t-1}) = p(x_{t} \mid f(x_{t-p}, \dots, x_{t-1}))$
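In practice, the Markov assumption turns a sequence into fixed-length training examples: each window of p consecutive values is paired with the value that follows it. The sketch below shows this on a toy list; `make_windows` is a hypothetical helper written for illustration, not a function from the post or from any library.

```python
def make_windows(series, p):
    """Pair every run of p consecutive values with the value that follows it."""
    if p < 1 or p >= len(series):
        raise ValueError("p must satisfy 1 <= p < len(series)")
    features = [list(series[t - p:t]) for t in range(p, len(series))]
    labels = [series[t] for t in range(p, len(series))]
    return features, labels


toy_features, toy_labels = make_windows([10, 20, 30, 40, 50], p=2)
print(toy_features)
print(toy_labels)
```

A series of length T yields T - p examples, because the first p values have no complete window in front of them.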

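1.3 Training

Under the Markov assumption, we train an MLP as the prediction model.

1.3.1 Data

We generate sequence data from a sine function plus additive noise, for time steps 1, 2, …, 1000.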

import torch
from torch import nn
from d2l import torch as d2l

T = 1000  # generate 1000 points
time = torch.arange(1, T + 1, dtype=torch.float32)
x = torch.sin(0.01 * time) + torch.normal(0, 0.2, (T,))
d2l.plot(time, [x], 'time', 'x', xlim=[1, 1000], figsize=(6, 3))

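Output: [figure: the noisy sine series x plotted against time]

Next, map the data to pairs of a label $y_{t} = x_{t}$ and a feature vector $X_{t} = [x_{t-p}, \dots, x_{t-1}]$. In the code, tau plays the role of $p$.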

tau = 4
features = torch.zeros((T - tau, tau))
for i in range(tau):
    features[:, i] = x[i: T - tau + i]
# Labels: the value that follows each window of tau observations
labels = x[tau:].reshape((-1, 1))

batch_size, n_train = 16, 600
# Only the first n_train examples are used for training
train_iter = d2l.load_array((features[:n_train], labels[:n_train]),
                            batch_size, is_train=True)

1.3.2 Model Architecture

We train a fairly simple architecture: a multilayer perceptron with two fully connected layers, ReLU activation, and squared loss.

# Function that initializes the network weights
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# A simple multilayer perceptron
def get_net():
    net = nn.Sequential(nn.Linear(4, 10),
                        nn.ReLU(),
                        nn.Linear(10, 1))
    net.apply(init_weights)
    return net

# Squared loss. Note: MSELoss computes the squared error without the 1/2 factor
loss = nn.MSELoss(reduction='none')

1.3.3 Training the Model

def train(net, train_iter, loss, epochs, lr):
    
    trainer = torch.optim.Adam(net.parameters(), lr)
    
    for epoch in range(epochs):
        for X, y in train_iter:
            trainer.zero_grad()
            l = loss(net(X), y)
            l.sum().backward()
            trainer.step()
        print(f'epoch {epoch + 1}, ', f'loss: {d2l.evaluate_loss(net, train_iter, loss):f}')

net = get_net()
train(net, train_iter, loss, 5, 0.01)

Output:

epoch 1,  loss: 0.060284
epoch 2,  loss: 0.057505
epoch 3,  loss: 0.061656
epoch 4,  loss: 0.055687
epoch 5,  loss: 0.052883

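1.3.4 Prediction

First we check how well the model predicts the next time step, i.e. one-step-ahead prediction. Each prediction uses 4 observed values to predict the next unknown value, so the inputs are always true observations.

# One-step-ahead predictions
onestep_predict = net(features)
# Plot the data against the predictions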
d2l.plot([time, time[tau:]],
         [x.detach().numpy(), onestep_predict.detach().numpy()], 'time',
         'x', legend=['data', '1-step_predict'], xlim=[1, 1000],
         figsize=(6, 3))

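Output: [figure: the data and the one-step-ahead predictions plotted against time]

Next, starting from index 604 (n_train + tau), the trained model predicts the rest of the sequence by feeding its own predictions back in as inputs.

multistep_predict = torch.zeros(T)
# The first n_train + tau = 604 entries are true observations; the rest are predicted
multistep_predict[: n_train + tau] = x[: n_train + tau]
for i in range(n_train + tau, T):
    # Predict the next value from the previous tau entries, which may themselves be predictions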
    multistep_predict[i] = net(multistep_predict[i - tau:i].reshape((1, -1)))

d2l.plot([time, time[tau:], time[n_train + tau:]],
         [x.detach().numpy(), onestep_predict.detach().numpy(),
          multistep_predict[n_train + tau:].detach().numpy()], 'time',
         'x', legend=['data', '1-step_predict', 'multistep_predict'],
         xlim=[1, 1000], figsize=(6, 3))

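Output: [figure: the data, the one-step-ahead predictions, and the multistep predictions plotted against time]

Multistep prediction: to predict the next 16 values from the first 4, say, first use values 1 to 4 to predict the 5th. Then feed the predicted 5th value back in (not the true one, otherwise it would be one-step-ahead prediction again) and use values 2 to 5 to predict the 6th, and so on. Here we compute k-step-ahead predictions over the whole sequence for k = 1, 4, 16, 64.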

max_steps = 64

features = torch.zeros((T - tau - max_steps + 1, tau + max_steps))
# Column i (i < tau) holds observations from x for time steps
# i to i + T - tau - max_steps + 1
for i in range(tau):
    features[:, i] = x[i: i + T - tau - max_steps + 1]

# Column i (i >= tau) holds the (i - tau + 1)-step-ahead predictions for time steps
# i to i + T - tau - max_steps + 1
for i in range(tau, tau + max_steps):
    features[:, i] = net(features[:, i - tau:i]).reshape(-1)

# Prediction horizons to plot
steps = (1, 4, 16, 64)
d2l.plot([time[tau + i - 1: T - max_steps + i] for i in steps],
         [features[:, (tau + i - 1)].detach().numpy() for i in steps], 'time', 'x',
         legend=[f'{i}-step preds' for i in steps], xlim=[5, 1000],
         figsize=(6, 3))

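Output: [figure: the 1-, 4-, 16-, and 64-step-ahead predictions plotted against time]

1.4 Summary

1. For an observed sequence up to time step t, the predicted output at time step t+k is the "k-step-ahead prediction". As k increases, errors accumulate quickly and prediction quality degrades sharply.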
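The feedback loop behind k-step-ahead prediction, and the way a small per-step error compounds, can be sketched without any trained network. Everything below is an illustrative assumption: `rollout` is a hypothetical helper, and `slightly_biased` stands in for an imperfect model of a toy process whose true rule is "double the last value".

```python
def rollout(history, predict_next, tau, k):
    """Predict k steps ahead, feeding each prediction back in as input."""
    window = list(history[-tau:])
    preds = []
    for _ in range(k):
        nxt = predict_next(window)
        preds.append(nxt)
        # Slide the window: drop the oldest value, append the prediction
        window = window[1:] + [nxt]
    return preds


def slightly_biased(window):
    # Hypothetical imperfect model: the true rule would return 2.0 * window[-1]
    return 1.9 * window[-1]


truth = [2.0 ** t for t in range(1, 9)]  # true continuation of the series [1.0]
preds = rollout([1.0], slightly_biased, tau=1, k=8)
rel_errors = [abs(p - t) / t for p, t in zip(preds, truth)]
print(rel_errors)
```

The model is only 5% off on a single step, yet the relative error keeps growing with every step because each prediction is built on the previous, already inaccurate one.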

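2. Text Preprocessing

2.1 Preprocessing Steps

1. Load the text into memory as strings.

2. Split the strings into tokens (such as words or characters).

3. Build a vocabulary that maps each token to a numeric index.

4. Convert the text into a sequence of numeric indices that a model can operate on.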
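The four steps above can be sketched end to end on a tiny in-memory string. This is a simplified illustration using only the standard library, not the d2l implementation: the function name `preprocess`, the alphabetical tie-breaking among equally frequent tokens, and the choice to reserve index 0 for unknown tokens are assumptions made for the example.

```python
from collections import Counter


def preprocess(text):
    """Run the four preprocessing steps on a string; returns (corpus, idx_to_token)."""
    # Step 1: the text is already in memory; split it into lowercase lines
    lines = text.lower().split("\n")
    # Step 2: split each line into word tokens
    tokens = [line.split() for line in lines]
    # Step 3: build the vocabulary, most frequent tokens first
    counts = Counter(tok for line in tokens for tok in line)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    idx_to_token = ["<unk>"] + [tok for tok, _ in ordered]
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    # Step 4: flatten the lines and map every token to its index
    corpus = [token_to_idx.get(tok, 0) for line in tokens for tok in line]
    return corpus, idx_to_token


corpus_demo, vocab_demo = preprocess(
    "the time machine by h g wells\nthe time traveller")
print(vocab_demo)
print(corpus_demo)
```

Giving frequent tokens small indices is a common convention; any one-to-one mapping from tokens to integers would serve for step 4.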

2.2 Reading the Dataset

import re

d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt',
                                '090b5e7e70c295757f55df93cb0a180b9691891a')

def read_time_machine():  #@save
    # Load the dataset into a list of text lines
    with open(d2l.download('time_machine'), 'r') as f:
        lines = f.readlines()
    # Keep letters only: replace every other character with a space, then lowercase
    return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]

lines = read_time_machine()

2.3 Tokenization

def tokenize(lines, token='word'):
    """Split text lines into word or character tokens."""
    if token == 'word':
        return [line.split() for line in lines]
    elif token == 'char':
        return [list(line) for line in lines]
    else:
        print('Error: unknown token type: ' + token)

# Each line becomes a list of tokens
tokens = tokenize(lines)
for i in range(11):
    print(tokens[i])

Output:

['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
[]
[]
[]
[]
['i']
[]
[]
['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']
['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']
['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']

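2.4 Vocabulary

First build a vocabulary using the dataset as the corpus, then print the first few high-frequency tokens with their indices.

# Build the vocabulary, which maps string tokens to numeric indices starting from 0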
vocab = d2l.Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])

Output:

[('<unk>', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]

Convert each text line into a list of numeric indices:

for i in [0, 10]:
    print('text:', tokens[i])
    print('indices:', vocab[tokens[i]])

Output:

text: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']
indices: [1, 19, 50, 40, 2183, 2184, 400]
text: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']
indices: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]

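2.5 Putting It All Together

We package everything into the load_corpus_time_machine function, which returns corpus (the list of token indices) and vocab (the vocabulary of the Time Machine corpus).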

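def load_corpus_time_machine(max_tokens=-1):  #@save
    """Return the list of token indices and the vocabulary of the dataset."""
    lines = read_time_machine()
    tokens = tokenize(lines, 'char')
    vocab = d2l.Vocab(tokens)
    # A text line is not necessarily a sentence or a paragraph,
    # so flatten all lines into a single list
    corpus = [vocab[token] for line in tokens for token in line]
    if max_tokens > 0:
        corpus = corpus[:max_tokens]
    # corpus is the list of vocabulary indices, one per token in the text;
    # vocab is the vocabulary, and len(vocab) is its size
    return corpus, vocab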

corpus, vocab = load_corpus_time_machine()
len(corpus), len(vocab)

Output:

(170580, 28)
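A vocabulary size of 28 is consistent with character-level tokens: after cleaning, a line can contain only the 26 lowercase letters and the space, and the vocabulary adds one entry for the unknown token. The sketch below checks this reasoning on a made-up sample line; `clean_line` is a hypothetical helper that copies the regular expression from read_time_machine, and the sample string is invented for illustration.

```python
import re
import string


def clean_line(line):
    # Same cleaning rule as read_time_machine: letters only, lowercased
    return re.sub('[^A-Za-z]+', ' ', line).strip().lower()


sample = "The Time Machine, by H. G. Wells [1898]"
cleaned = clean_line(sample)
chars = list(cleaned)  # character-level tokens

allowed = set(string.ascii_lowercase) | {' '}
# +1 for the '<unk>' entry that sits at index 0 of the vocabulary
max_vocab_size = len(allowed) + 1

print(cleaned)
print(set(chars) <= allowed, max_vocab_size)
```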

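2.6 Summary

1. Text is one of the most common forms of sequence data.

2. To preprocess text, we usually split it into tokens, build a vocabulary that maps token strings to numeric indices, and convert the text into token indices for a model to operate on.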

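3. Language Models

3.1 Introduction

Given a text sequence $x_{1}, \dots, x_{T}$, the goal of a language model is to estimate the joint probability $p(x_{1}, \dots, x_{T})$. Its applications include:

1. Pretraining models

2. Generating text: given the preceding words, generate the continuation

3. Judging which of several sequences is more common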

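3.2 N-grams

When a sequence is long, the corpus is usually not large enough, so the number of times the whole sequence occurs in the corpus is very likely to satisfy $n(x_{1}, \dots, x_{T}) \leqslant 1$, which makes count-based probability estimates unreliable. The Markov assumption alleviates this problem.

Unigram: $p(x_{1}, x_{2}, x_{3}, x_{4}) = p(x_{1}) p(x_{2}) p(x_{3}) p(x_{4})$, where each token is independent of the others

Bigram: $p(x_{1}, x_{2}, x_{3}, x_{4}) = p(x_{1}) p(x_{2} \mid x_{1}) p(x_{3} \mid x_{2}) p(x_{4} \mid x_{3})$, where each token depends only on the previous one

Trigram: $p(x_{1}, x_{2}, x_{3}, x_{4}) = p(x_{1}) p(x_{2} \mid x_{1}) p(x_{3} \mid x_{1}, x_{2}) p(x_{4} \mid x_{2}, x_{3})$, where each token depends only on the previous two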
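As a rough illustration of how the bigram factorization is estimated from counts, the sketch below uses the relative-frequency estimate p(x' | x) ≈ n(x, x') / n(x) on a toy corpus. The function `bigram_probability` and the toy sentence are assumptions for the example; no smoothing is applied, so any unseen pair drives the whole product to zero.

```python
from collections import Counter


def bigram_probability(tokens, sentence):
    """Estimate p(sentence) = p(x1) * prod_t p(x_t | x_{t-1}) from raw counts."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens[:-1], tokens[1:]))
    prob = unigram[sentence[0]] / len(tokens)
    for prev, cur in zip(sentence[:-1], sentence[1:]):
        if unigram[prev] == 0:
            return 0.0  # the conditioning token never occurs in the corpus
        prob *= bigram[(prev, cur)] / unigram[prev]
    return prob


toy = "the time machine the time traveller the machine".split()
p_seen = bigram_probability(toy, ["the", "time", "machine"])
p_unseen = bigram_probability(toy, ["time", "the"])
print(p_seen, p_unseen)
```

The zero for the unseen pair shows the sparsity problem in miniature: longer n-grams are seen even less often, which is why the counts shrink quickly from unigrams to trigrams.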

3.3 Natural Language Statistics

Build a vocabulary from the dataset used earlier and print the 10 most frequent words.

import random
import torch
from d2l import torch as d2l

tokens = d2l.tokenize(d2l.read_time_machine())
# A text line is not necessarily a sentence or a paragraph, so concatenate all lines
corpus = [token for line in tokens for token in line]
# Build the vocabulary
vocab = d2l.Vocab(corpus)
# The ten most frequent tokens
vocab.token_freqs[:10]

Output:

[('the', 2261),
 ('i', 1267),
 ('and', 1245),
 ('of', 1155),
 ('a', 816),
 ('to', 695),
 ('was', 552),
 ('in', 541),
 ('that', 443),
 ('my', 440)]

Plot the token frequencies:

# Token frequencies in decreasing order, plotted on log-log axes
freqs = [freq for token, freq in vocab.token_freqs]
d2l.plot(freqs, xlabel='token: x', ylabel='frequency: n(x)',
         xscale='log', yscale='log')

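Output: [figure: token frequency n(x) against token rank x on log-log axes]

Bigram frequencies:

# Bigrams: pairs of adjacent tokens
bigram_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
# Vocabulary over bigrams
bigram_vocab = d2l.Vocab(bigram_tokens)
# The ten most frequent bigrams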
bigram_vocab.token_freqs[:10]

Output:

[(('of', 'the'), 309),
 (('in', 'the'), 169),
 (('i', 'had'), 130),
 (('i', 'was'), 112),
 (('and', 'the'), 109),
 (('the', 'time'), 102),
 (('it', 'was'), 99),
 (('to', 'the'), 85),
 (('as', 'i'), 78),
 (('of', 'a'), 73)]

Trigram frequencies:

# Trigrams: triples of adjacent tokens
trigram_tokens = [triple for triple in zip(
    corpus[:-2], corpus[1:-1], corpus[2:])]
# Vocabulary over trigrams
trigram_vocab = d2l.Vocab(trigram_tokens)
# The ten most frequent trigrams
trigram_vocab.token_freqs[:10]

Output:

[(('the', 'time', 'traveller'), 59),
 (('the', 'time', 'machine'), 30),
 (('the', 'medical', 'man'), 24),
 (('it', 'seemed', 'to'), 16),
 (('it', 'was', 'a'), 15),
 (('here', 'and', 'there'), 15),
 (('seemed', 'to', 'me'), 14),
 (('i', 'did', 'not'), 14),
 (('i', 'saw', 'the'), 13),
 (('i', 'began', 'to'), 13)]
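The zip-based constructions for bigrams and trigrams generalize to any n. A minimal sketch, assuming a hypothetical helper named `ngrams` and counting with the standard library's Counter rather than d2l.Vocab:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


toy_tokens = "the time machine by h g wells".split()
print(ngrams(toy_tokens, 2)[:3])
print(Counter(ngrams(toy_tokens * 2, 3)).most_common(1))
```

With n = 2 and n = 3 this produces the same tuples as the explicit slices `corpus[:-1], corpus[1:]` and `corpus[:-2], corpus[1:-1], corpus[2:]`.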

Compare the token frequencies of the three models visually: unigrams, bigrams, and trigrams.

# Extract the frequencies
bigram_freqs = [freq for token, freq in bigram_vocab.token_freqs]
trigram_freqs = [freq for token, freq in trigram_vocab.token_freqs]

d2l.plot([freqs, bigram_freqs, trigram_freqs], xlabel='token: x',
         ylabel='frequency: n(x)', xscale='log', yscale='log',
         legend=['unigram', 'bigram', 'trigram'])

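Output: [figure: unigram, bigram, and trigram frequencies against rank on log-log axes]

The plots show that word frequency decays rapidly in a well-defined way. After setting aside the first few words as exceptions, the remaining words roughly follow a straight line on a log-log plot. This means word frequencies satisfy Zipf's law: the frequency $n_{i}$ of the $i$-th most frequent word satisfies $n_{i} \propto 1 / i^{\alpha}$, or equivalently $\log n_{i} = -\alpha \log i + c$, where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant.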
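Since Zipf's law is a straight line in log-log coordinates, the exponent can be estimated with an ordinary least-squares fit of log frequency against log rank. The sketch below is an illustration on synthetic frequencies generated to follow the law exactly; `fit_zipf_exponent` is a hypothetical helper, and on real data one would typically drop the first few ranks before fitting, as the text suggests.

```python
import math


def fit_zipf_exponent(freqs):
    """Fit log n_i = -alpha * log i + c by least squares; returns (alpha, c)."""
    if len(freqs) < 2 or min(freqs) <= 0:
        raise ValueError("need at least two strictly positive frequencies")
    xs = [math.log(i) for i in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic frequencies that follow the law exactly with alpha = 1.2
synthetic = [1000.0 / i ** 1.2 for i in range(1, 201)]
alpha, c = fit_zipf_exponent(synthetic)
print(alpha, c)
```

Applied to a list such as `freqs` from the unigram plot, the same function would give a rough empirical exponent for the corpus.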