
Hung-yi Lee Machine Learning 2023 - HW5 - Machine Translation

Task

Machine translation: translate English to Traditional Chinese

This assignment is not difficult; just follow the hints in the course slides.

Data

  • Paired data
    • TED2020: TED Talk
      • Raw: 400,726 (sentences)
      • Processed: 394,052 (sentences)
    • en, zh
  • Monolingual data
    • Traditional Chinese (zh)

Workflow


  1. Preprocessing

    a. download raw data
    b. clean and normalize
    c. remove bad data (too long/short)
    d. tokenization

  2. Training

    a. initialize a model
    b. train it with training data

  3. Testing

    a. generate translation of test data
    b. evaluate the performance

Model

Encoder

  • The Encoder is an RNN or a Transformer Encoder. The following description is for the RNN. For every input token, the Encoder generates an output vector and a hidden state vector, and the hidden state vector is passed on to the next step. In other words, the Encoder sequentially reads in the input sequence, outputs a single vector at each timestep, and finally outputs the final hidden states, or content vector, at the last timestep.

  • Parameters:

    • args
      • encoder_embed_dim: the dimension of the embeddings; this compresses the one-hot vectors into a fixed dimension, achieving dimensionality reduction
      • encoder_ffn_embed_dim: the dimension of the hidden states and output vectors
      • encoder_layers: the number of layers of the Encoder RNN
      • dropout: the probability that a neuron's activation is set to 0, used to prevent overfitting. It is generally applied during training and removed during testing.
    • dictionary: the dictionary provided by fairseq. It is used to obtain the padding index, and in turn the encoder padding mask.
    • embed_tokens: an instance of token embeddings (nn.Embedding)
  • Inputs:

    • src_tokens: integer sequence representing English, e.g. 1, 28, 29, 205, 2
  • Outputs:

    • outputs: the outputs of the RNN at each timestep, which can be further processed by Attention
    • final_hiddens: the hidden states of each timestep, which will be passed to the decoder for decoding
    • encoder_padding_mask: this tells the decoder which positions to ignore
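
For illustration, here is a minimal, simplified sketch of such an encoder (an assumption for clarity, not the notebook's exact RNNEncoder class): embed the source tokens, run a GRU, and return the per-timestep outputs, the final hidden states, and the padding mask.

import torch.nn as nn

class TinyRNNEncoder(nn.Module):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__()
        self.embed_tokens = embed_tokens            # nn.Embedding instance
        self.padding_idx = dictionary.pad()         # used to build the padding mask
        self.dropout = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            input_size=args.encoder_embed_dim,
            hidden_size=args.encoder_ffn_embed_dim,
            num_layers=args.encoder_layers,
        )

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer ids, e.g. [1, 28, 29, 205, 2]
        x = self.dropout(self.embed_tokens(src_tokens))          # (batch, src_len, embed_dim)
        x = x.transpose(0, 1)                                    # GRU expects (src_len, batch, embed_dim)
        outputs, final_hiddens = self.rnn(x)                     # per-timestep outputs, last hidden states
        encoder_padding_mask = src_tokens.eq(self.padding_idx)   # True at the positions the decoder should ignore
        return outputs, final_hiddens, encoder_padding_mask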

Attention

  • When the input sequence is long, the “content vector” alone cannot accurately represent the whole sequence; the attention mechanism can provide the Decoder with more information.

  • According to the Decoder embeddings of the current timestep, match the Encoder outputs against the decoder embeddings to determine their correlation, then take the sum of the Encoder outputs weighted by this correlation as the input to the Decoder RNN.

  • Common attention implementations use a neural network / dot product as the correlation between the query (decoder embeddings) and the key (Encoder outputs), followed by a softmax to obtain a distribution, and finally the values (Encoder outputs) are weighted-summed with said distribution.

  • Parameters:

    • input_embed_dim: dimensionality of the key; should be that of the vector in the decoder used to attend to others
    • source_embed_dim: dimensionality of the query; should be that of the vectors to be attended to (encoder outputs)
    • output_embed_dim: dimensionality of the value; should be that of the vector after attention, as expected by the next layer
  • Inputs:

    • inputs: the key, the vector used to attend to others
    • encoder_outputs: the query/value, the vectors to be attended to
    • encoder_padding_mask: this tells the decoder which positions to ignore
  • Outputs:

    • output: the context vector after attention
    • attention score: the attention distribution
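
A minimal sketch of the dot-product variant described above (an illustration only; the notebook's AttentionLayer differs in details such as learned projections):

import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values, encoder_padding_mask=None):
    # query: (batch, dim) decoder embedding at the current timestep
    # keys / values: (batch, src_len, dim) encoder outputs
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)             # correlation: (batch, src_len)
    if encoder_padding_mask is not None:
        scores = scores.masked_fill(encoder_padding_mask, float('-inf'))  # ignore padded positions
    attn = F.softmax(scores, dim=-1)                                      # attention distribution
    context = torch.bmm(attn.unsqueeze(1), values).squeeze(1)             # weighted sum: (batch, dim)
    return context, attn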

Decoder

  • The hidden states of the Decoder will be initialized by the final hidden states of the Encoder (the content vector)
  • At the same time, the Decoder changes its hidden states based on the input of the current timestep (the outputs of previous timesteps) and generates an output
  • Attention improves the performance
  • The seq2seq steps are implemented in the decoder, so that the Seq2Seq class can later accept an RNN or a Transformer without further modification.
  • Parameters:

    • args
      • decoder_embed_dim: the dimensionality of the decoder embeddings, similar to encoder_embed_dim
      • decoder_ffn_embed_dim: the dimensionality of the decoder RNN hidden states, similar to encoder_ffn_embed_dim
      • decoder_layers: the number of layers of the RNN decoder
      • share_decoder_input_output_embed: usually, the output projection matrix of the decoder shares weights with the decoder input embeddings
    • dictionary: the dictionary provided by fairseq
    • embed_tokens: an instance of token embeddings (nn.Embedding)
  • Inputs:

    • prev_output_tokens: integer sequence representing the right-shifted target, e.g. 1, 28, 29, 205, 2
    • encoder_out: encoder’s output.
    • incremental_state: in order to speed up decoding during test time, we will save the hidden state of each timestep. see forward() for details.
  • Outputs:

    • outputs: the logits (before softmax) output by the decoder at each timestep
    • extra: unused
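
For illustration, a minimal, simplified decoder sketch (an assumption for clarity, not the notebook's exact RNNDecoder; attention and incremental_state are omitted, and the encoder and decoder hidden sizes / layer counts are assumed to match so the encoder's final hidden states can initialize the decoder):

import torch.nn as nn

class TinyRNNDecoder(nn.Module):
    def __init__(self, args, dictionary, embed_tokens):
        super().__init__()
        self.embed_tokens = embed_tokens
        self.dropout = nn.Dropout(args.dropout)
        self.rnn = nn.GRU(
            input_size=args.decoder_embed_dim,
            hidden_size=args.decoder_ffn_embed_dim,
            num_layers=args.decoder_layers,
        )
        # map hidden states back to the embedding dimension before the output projection
        self.hidden_to_embed = nn.Linear(args.decoder_ffn_embed_dim, args.decoder_embed_dim)
        self.output_projection = nn.Linear(args.decoder_embed_dim, len(dictionary), bias=False)
        if args.share_decoder_input_output_embed:
            # tie the output projection weights to the input embedding weights
            self.output_projection.weight = self.embed_tokens.weight

    def forward(self, prev_output_tokens, encoder_final_hiddens):
        # prev_output_tokens: (batch, tgt_len) right-shifted target ids
        x = self.dropout(self.embed_tokens(prev_output_tokens)).transpose(0, 1)
        # initialize the hidden states with the encoder's final hidden states (the content vector)
        outputs, hiddens = self.rnn(x, encoder_final_hiddens)
        logits = self.output_projection(self.hidden_to_embed(outputs)).transpose(0, 1)
        return logits, None    # logits: (batch, tgt_len, vocab); extra is unused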

Evaluation

BLEU (Bilingual Evaluation Understudy)

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-generated translations, usually by comparing them against one or more reference translations. It was proposed by Kishore Papineni et al. in the 2002 paper "BLEU: a Method for Automatic Evaluation of Machine Translation".

Key Concepts

  1. N-gram matching

    • BLEU evaluates translation quality by counting the N-gram matches between the machine translation and the reference translation. An N-gram is a sequence of N consecutive words.
    • Common N-grams include:
      • 1-gram (unigram): single-word matches. “Thank”, “you”, “so”, “much”
      • 2-gram (bigram): matches of two consecutive words. “Thank you”, “you so”, “so much”
      • 3-gram (trigram): matches of three consecutive words. “Thank you so”, “you so much”
      • 4-gram (quadgram): matches of four consecutive words. “Thank you so much”
  2. Precision

    • Precision is the proportion of N-grams in the machine translation that also appear in the reference translation.
    • Formula:
      $
      \text{Precision} = \frac{\text{number of matched N-grams in the machine translation}}{\text{total number of N-grams in the machine translation}}
      $
  3. Modified Precision

  • To avoid counting repeated matches in the machine translation, BLEU introduces the modified precision.

  • The modified precision takes into account the maximum number of times each N-gram appears in the reference translation.
    Suppose we have a machine translation (MT) and a reference translation (Ref), and we want to compute the modified N-gram precision.

    Let $\text{Count}_{\text{MT}}(n\text{-gram})$ be the number of times a given N-gram appears in the machine translation, and let $\text{Count}_{\text{Ref}}(n\text{-gram})$ be the maximum number of times it appears in the reference translation. The clipped count $\text{Count}_{\text{clip}}(n\text{-gram})$ is the smaller of the two. Formula:
    $
    \text{Count}_{\text{clip}}(n\text{-gram}) = \min\left(\text{Count}_{\text{MT}}(n\text{-gram}), \text{Count}_{\text{Ref}}(n\text{-gram})\right)
    $

    The modified precision $p_n$ is the sum of the clipped counts over all N-grams in the machine translation, divided by the total count of N-grams in the machine translation. Formula:
    $
    p_n = \frac{\sum_{n\text{-gram} \in \text{MT}} \text{Count}_{\text{clip}}(n\text{-gram})}{\sum_{n\text{-gram} \in \text{MT}} \text{Count}_{\text{MT}}(n\text{-gram})}
    $

  4. Brevity Penalty
  • Since shorter translations can more easily obtain high precision, BLEU introduces a brevity penalty to counteract this.
  • Brevity penalty formula, where $c$ is the length of the machine translation and $r$ is the length of the reference translation:
    $
    \text{BP} = \begin{cases}
    1 & \text{if } c > r \\
    e^{(1 - r/c)} & \text{if } c \le r
    \end{cases}
    $
  5. BLEU Score
  • The BLEU score combines the modified precisions for 1-grams through 4-grams with the brevity penalty.
  • Formula:
    $
    \text{BLEU} = \text{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
    $
    where $p_n$ is the modified N-gram precision and $w_n$ is its weight, usually $w_n = \frac{1}{N}$.
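
As a worked example of the formulas above, here is a minimal BLEU sketch (illustrative only; in practice the homework is scored with a library scorer such as sacrebleu, which also handles tokenization and smoothing):

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Count_clip = min(Count_MT, Count_Ref) for every n-gram of the candidate
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:                      # log(0) is undefined; real scorers smooth instead
        return 0.0
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)    # brevity penalty
    return bp * math.exp(sum(math.log(p) / max_n for p in precisions))

print(bleu("the cat sat on the mat today".split(),
           "the cat sat on the mat".split()))     # ≈ 0.81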

Baseline

Simple

Just run the sample code. I have added detailed explanations of each component in the corresponding ipynb file.

Medium

Add learning rate scheduler and train longer

  • learning rate scheduler

$
lrate = d_{\text{model}}^{-0.5} \cdot \min\left(step\_num^{-0.5},\ step\_num \cdot warmup\_steps^{-1.5}\right)
$

import math

def get_rate(d_model, step_num, warmup_step):
    # TODO: Change lr from constant to the equation shown above 
    # medium
    lr = 1.0 / math.sqrt(d_model) * min(1.0 / math.sqrt(step_num), step_num / (warmup_step * math.sqrt(warmup_step)))
    return lr
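
To sanity-check the schedule, it can be plotted (illustrative; d_model=256 and warmup_step=4000 are assumed values, not necessarily the ones used in the notebook):

import matplotlib.pyplot as plt

steps = list(range(1, 30000))
plt.plot(steps, [get_rate(256, s, 4000) for s in steps])  # linear warmup, then 1/sqrt(step) decay
plt.xlabel('step_num')
plt.ylabel('lrate')
plt.show()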


  • change maximum epoch for training (15->30)
config = Namespace(
    
    ...
    
    # maximum epochs for training (Medium)
    max_epoch=30,
    start_epoch=1,
    
    ...
    
)

Strong

Switch to a Transformer and tune the hyperparameters

  • Change the model to a Transformer (Model Initialization)
# HINT: transformer architecture
from fairseq.models.transformer import (
    TransformerEncoder,
    TransformerDecoder,
)

def build_model(args, task):
    """ build a model instance based on hyperparameters """
    src_dict, tgt_dict = task.source_dictionary, task.target_dictionary

    # token embeddings
    encoder_embed_tokens = nn.Embedding(len(src_dict), args.encoder_embed_dim, src_dict.pad())
    decoder_embed_tokens = nn.Embedding(len(tgt_dict), args.decoder_embed_dim, tgt_dict.pad())

    # encoder decoder
    # HINT: TODO: switch to TransformerEncoder & TransformerDecoder
    # encoder = RNNEncoder(args, src_dict, encoder_embed_tokens)
    # decoder = RNNDecoder(args, tgt_dict, decoder_embed_tokens)
    # strong
    encoder = TransformerEncoder(args, src_dict, encoder_embed_tokens)
    decoder = TransformerDecoder(args, tgt_dict, decoder_embed_tokens)
    
    ...
    
    
  • Modify the model hyperparameters (Architecture Related Configuration)

    Refer to the transformer-base hyperparameters in Table 3 of Attention Is All You Need.

    To avoid excessively long training, you can set max_epoch to a smaller value; 10 or 15 also achieves good results.

# strong
arch_args = Namespace(
    encoder_embed_dim=512,
    encoder_ffn_embed_dim=2048,
    encoder_layers=6,  
    decoder_embed_dim=512,
    decoder_ffn_embed_dim=2048,
    decoder_layers=6,  
    share_decoder_input_output_embed=True,
    dropout=0.3,
)

Note: you can also simply use the hint from the slides and only change encoder_layers and decoder_layers to 4; the result is just not as good. In that case the model has roughly the same number of parameters as the earlier RNN, and with max_epoch = 30 the BLEU can reach 23.59, as follows:

arch_args = Namespace(
    encoder_embed_dim=256,
    encoder_ffn_embed_dim=512,
    encoder_layers=1, # recommend to increase -> 4
    decoder_embed_dim=256,
    decoder_ffn_embed_dim=1024,
    decoder_layers=1, # recommend to increase -> 4
    share_decoder_input_output_embed=True,
    dropout=0.3,
)

The savedir folder in config needs to be changed

# the savedir folder in config needs to be changed, otherwise the checkpointed model cannot be loaded
config = Namespace(
    datadir = "./DATA/data-bin/ted2020",
    #savedir = "./checkpoints/rnn",
    savedir = "./checkpoints/transformer",

Boss

Complete all the TODOs in the Back-translation section

Approach:

  1. Swap the target and source languages and train a backward translation model from Chinese to English
    • source_lang="zh"
    • target_lang = "en"
  2. Use the backward translation model to translate the monolingual data and obtain synthetic data
    • Complete TODOs in the sample code
    • All the TODOs can be completed by using commands from earlier cells
  3. Train a stronger forward translation model with the new synthetic data (max_epoch=30 is enough to pass the boss baseline)

The complete run procedure is:

  1. Set BACK_TRANSLATION to True in the "Configuration for experiments" cell and run: this trains a back-translation model and prepares the corresponding corpus.
  2. Set BACK_TRANSLATION to False in the "Configuration for experiments" cell and run: this trains on the combined ted2020 and mono (back-translation) corpora.
config = Namespace(
    datadir = "./DATA/data-bin/ted2020_with_mono", #boss
    savedir = "./checkpoints/transformer-bt",

    ...

BACK_TRANSLATION = False # Boss
if BACK_TRANSLATION:
    config.datadir = "./DATA/data-bin/ted2020"
    config.savedir = "./checkpoints/transformer-back"
    config.source_lang, config.target_lang= tgt_lang, src_lang

Generate Prediction

if not BACK_TRANSLATION: # boss
    raise RuntimeError("the cells below are only for the back-translation phase (BACK_TRANSLATION = True)")

TODO: clean corpus

  1. remove sentences that are too long or too short
  2. unify punctuation

hint: you can use clean_s() defined above to do this

def clean_mono_corpus(prefix, l, ratio=9, max_len=1000, min_len=1):
    if Path(f'{prefix}.clean.zh').exists():
        print(f'{prefix}.clean.zh exists. skipping clean.')
        return
    with open(f'{prefix}', 'r') as l_in_f:
        with open(f'{prefix}.clean.zh', 'w') as l_out_f:
            for s in l_in_f:
                s = s.strip()
                s = clean_s(s, l)
                s_len = len_s(s, l)
                if min_len > 0: # remove short sentence
                    if s_len < min_len:
                        continue
                if max_len > 0: # remove long sentence
                    if s_len > max_len:
                        continue
                print(s, file=l_out_f)
                

mono_data_prefix = f'{mono_prefix}/ted_zh_corpus.deduped'
clean_mono_corpus(mono_data_prefix, 'zh')

Generate pseudo translation

import os

zh_path = mono_prefix / 'ted_zh_corpus.deduped.clean.zh'
en_path = mono_prefix / 'ted_zh_corpus.deduped.clean.en'

if en_path.exists():
    print(f"{en_path} exists. skipping the generation of pseudo translation.")
else:
    with open(zh_path, 'r') as zh_f:
        with open(en_path, 'w') as en_f:
            for line in zh_f:
                line = line.strip()
                if line:
                    # Write the placeholder '_.' as a pseudo translation, so the
                    # monolingual zh data has a paired en file for later preprocessing
                    pseudo = '_.'
                    print(pseudo, file=en_f)
                else:
                    # If the line is empty, just write an empty line
                    print(file=en_f)

TODO: Subword Units

Use the spm model of the backward model to tokenize the data into subword units

hint: the spm model is located at DATA/raw-data/[dataset]/spm[vocab_num].model

def spm_encode(prefix, vocab_size, mono_prefix):
    spm_model = spm.SentencePieceProcessor(model_file=str(prefix/f'spm{vocab_size}.model'))

    in_path = mono_prefix / 'ted_zh_corpus.deduped.clean'

    for lang in [src_lang, tgt_lang]:
        out_path = mono_prefix / f'mono.tok.{lang}'
        # if out_path.exists():
        #     print(f"{out_path} exists. skipping spm_encode.")
        # else:
        with open(out_path, 'w') as out_f:
            with open(f'{in_path}.{lang}', 'r') as in_f:
                for line in in_f:
                    line = line.strip()
                    tok = spm_model.encode(line, out_type=str)
                    print(' '.join(tok), file=out_f)

spm_encode(prefix, vocab_size, mono_prefix)

TODO: Generate synthetic data with backward model

Add the binarized monolingual data to the original data directory, and name it with “split_name”

ex. ./DATA/data-bin/ted2020/[split_name].zh-en.[“en”, “zh”].[“bin”, “idx”]

then you can use 'generate_prediction(model, task, split="split_name")' to generate translation prediction

# hint: do prediction on split='mono' to create prediction_file
task.load_dataset(split="mono", epoch=1)
generate_prediction(model, task, split='mono', outfile='./prediction.txt' )

TODO: Create new dataset

  1. Combine the prediction data with monolingual data
  2. Use the original spm model to tokenize data into Subword Units
  3. Binarize data with fairseq
# Combine prediction_file (.en) and mono.zh (.zh) into a new dataset.
#
# hint: tokenize prediction_file with the spm model
!cp ./prediction.txt {mono_prefix}/'ted_zh_corpus.deduped.clean.en'
spm_encode(prefix, vocab_size, mono_prefix)
# output: ./DATA/rawdata/mono/mono.tok.en & mono.tok.zh
#
# hint: use fairseq to binarize these two files again
binpath = Path('./DATA/data-bin/synthetic')
src_dict_file = './DATA/data-bin/ted2020/dict.en.txt'
tgt_dict_file = src_dict_file
monopref = './DATA/rawdata/mono/mono.tok' # or whatever path after applying subword tokenization, w/o the suffix (.zh/.en)
if binpath.exists():
    print(binpath, "exists, will not overwrite!")
else:
    !python -m fairseq_cli.preprocess\
        --source-lang 'zh'\
        --target-lang 'en'\
        --trainpref {monopref}\
        --destdir {binpath}\
        --srcdict {src_dict_file}\
        --tgtdict {tgt_dict_file}\
        --workers 2
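
What remains is to place the binarized synthetic data next to the original ted2020 data as an extra training split, so the forward model trained with datadir = "./DATA/data-bin/ted2020_with_mono" sees both corpora. A sketch of that step is shown below; the directory layout and the split name train1 are assumptions, so adjust them to your own setup:

!cp -r ./DATA/data-bin/ted2020/ ./DATA/data-bin/ted2020_with_mono/

# the synthetic files were binarized in the zh-en direction above; rename them to
# en-zh so they match the forward translation task
!cp ./DATA/data-bin/synthetic/train.zh-en.zh.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.bin
!cp ./DATA/data-bin/synthetic/train.zh-en.zh.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.zh.idx
!cp ./DATA/data-bin/synthetic/train.zh-en.en.bin ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.bin
!cp ./DATA/data-bin/synthetic/train.zh-en.en.idx ./DATA/data-bin/ted2020_with_mono/train1.en-zh.en.idx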

Gradescope

Problem 1: Visualize Positional Embedding


Place this after the “Confirm model weights used to generate submission” block.

def get_cosine_similarity_matrix(x):
    x = x / x.norm(dim=1, keepdim=True)
    sim = torch.mm(x, x.t())
    return sim


# Get the positional embeddings from the decoder of the model
pos_emb = model.decoder.embed_positions.weights.cpu().detach()
sim = get_cosine_similarity_matrix(pos_emb)
# sim = F.cosine_similarity(pos_emb.unsqueeze(1), pos_emb.unsqueeze(0), dim=-1) # same

# Plot the heatmap of the cosine similarity matrix of the positional embeddings
plt.imshow(sim, cmap="hot", vmin=0, vmax=1)
plt.colorbar()

plt.show()

(figure: cosine similarity heatmap of the decoder positional embeddings)

Problem 2: Gradient Explosion

Clipping Gradient Norm

In config = Namespace(...), set use_wandb=True to log and visualize the gradient norm during training.
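
For reference, a minimal sketch of how gradient-norm clipping is typically applied in a training step (a simplified illustration, not the notebook's exact training loop), recording the pre-clipping norm so an explosion is visible when plotted:

import torch.nn as nn

def training_step(model, optimizer, loss, clip_norm=1.0, gnorm_log=None):
    optimizer.zero_grad()
    loss.backward()
    # returns the total gradient norm *before* clipping; gradients are rescaled in place
    gnorm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    if gnorm_log is not None:
        gnorm_log.append(gnorm.item())   # plot this list to spot gradient explosions
    optimizer.step()
    return gnorm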

Code Link

github


Original post: https://blog.csdn.net/qq_42875127/article/details/143031087
