Summary
The paper presents a large-scale hyperparameter sweep over NMT architectures, searching for the best architectural choices and training tricks.
Experimental Results
Unless noted otherwise, scores are BLEU on newstest2013, reported as mean ± standard deviation across runs, with the best single run in parentheses.
1. Embedding Dimensionality
Dim | newstest2013 | Params |
---|---|---|
128 | 21.50 ± 0.16 (21.66) | 36.13M |
256 | 21.73 ± 0.09 (21.85) | 46.20M |
512 | 21.78 ± 0.05 (21.83) | 66.32M |
1024 | 21.36 ± 0.27 (21.67) | 106.58M |
2048 | 21.86 ± 0.17 (22.08) | 187.09M |
Larger embedding dimensions are not always better: the 128-dimensional embeddings reach almost the same BLEU while converging nearly twice as fast.
2. RNN Cell Variant
Cell | newstest2013 | Params |
---|---|---|
LSTM | 22.22 ± 0.08 (22.33) | 68.95M |
GRU | 21.78 ± 0.05 (21.83) | 66.32M |
Vanilla-Dec | 15.38 ± 0.28 (15.73) | 63.18M |
LSTM cells consistently outperformed GRU cells
In these experiments the LSTM was also not noticeably slower than the GRU, so using LSTM cells is a safe default.
3. Encoder and Decoder Depth
Residual connections were added between the stacked RNN layers:
$$x_t^{(l+1)} = h_t^{(l)}(x_t^{(l)},\; h_{t-1}^{(l)}) + x_t^{(l)}$$
where \(h_t^{(l)}(x_t^{(l)}, h_{t-1}^{(l)})\) is the output of layer \(l\) at time step \(t\), so each layer's input is added to its output before being passed to the next layer. Residual connections give a clear improvement for deeper models (compare Dec-8 with Dec-8-Res/ResD below).
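A minimal numpy sketch of this residual wiring around a toy RNN cell (the cell and function names are illustrative, not the paper's implementation):
```python
import numpy as np

# Residual step following x_t^(l+1) = h_t^(l)(x_t^(l), h_{t-1}^(l)) + x_t^(l):
# apply the layer's cell, then add the layer input to its output.
def residual_step(cell_step, x_t, h_prev):
    h_t = cell_step(x_t, h_prev)   # output of layer l at step t
    return h_t + x_t, h_t          # the sum feeds layer l+1; h_t is carried to step t+1

# Toy vanilla tanh-RNN cell; input and hidden sizes must match for the sum.
dim = 8
W_x = np.random.randn(dim, dim) * 0.1
W_h = np.random.randn(dim, dim) * 0.1

def rnn_cell(x_t, h_prev):
    return np.tanh(x_t @ W_x + h_prev @ W_h)

x_next, h_t = residual_step(rnn_cell, np.random.randn(dim), np.zeros(dim))
```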
Depth | newstest2013 | Params |
---|---|---|
Enc-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Enc-4 | 21.85 ± 0.32 (22.23) | 69.47M |
Enc-8 | 21.32 ± 0.14 (21.51) | 75.77M |
Enc-8-Res | 19.23 ± 1.96 (21.97) | 75.77M |
Enc-8-ResD | 17.30 ± 2.64 (21.03) | 75.77M |
Dec-1 | 21.76 ± 0.12 (21.93) | 64.75M |
Dec-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Dec-4 | 22.37 ± 0.10 (22.51) | 69.47M |
Dec-4-Res | 17.48 ± 0.25 (17.82) | 68.69M |
Dec-4-ResD | 21.10 ± 0.24 (21.43) | 68.69M |
Dec-8 | 1.42 ± 0.23 (1.66) | 75.77M |
Dec-8-Res | 16.99 ± 0.42 (17.47) | 75.77M |
Dec-8-ResD | 20.97 ± 0.34 (21.42) | 75.77M |
We expected deep models to perform better across the board
In these experiments a depth of 4 worked best for both encoder and decoder, though the right depth should still be tuned per dataset. The authors nevertheless expect deeper models to perform better, and attribute the current gap to the lack of robust optimization techniques for deep sequence-to-sequence models.
4. Unidirectional vs. Bidirectional Encoder
Encoder | newstest2013 | Params |
---|---|---|
Bidi-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Uni-1 | 20.54 ± 0.16 (20.73) | 63.44M |
Uni-1R | 21.16 ± 0.35 (21.64) | 63.44M |
Uni-2 | 20.98 ± 0.10 (21.07) | 65.01M |
Uni-2R | 21.76 ± 0.21 (21.93) | 65.01M |
Uni-4 | 21.47 ± 0.22 (21.70) | 68.16M |
Uni-4R | 21.32 ± 0.42 (21.89) | 68.16M |
The bidirectional encoder performs best, but the margin is small; for unidirectional encoders, reversing the source sequence (the R variants) helps.
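For reference, a toy numpy sketch of what a bidirectional encoder does: the source sequence is processed left-to-right and right-to-left, and the two hidden-state sequences are concatenated at each position (a plain tanh RNN stands in for the LSTM; all names are illustrative):
```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Return the hidden state at every time step for one direction."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)                      # shape (T, hidden_dim)

T, in_dim, hid = 5, 4, 6
xs = np.random.randn(T, in_dim)
fwd = (np.random.randn(in_dim, hid) * 0.1, np.random.randn(hid, hid) * 0.1)
bwd = (np.random.randn(in_dim, hid) * 0.1, np.random.randn(hid, hid) * 0.1)

h_fwd = run_rnn(xs, *fwd)                        # left-to-right pass
h_bwd = run_rnn(xs[::-1], *bwd)[::-1]            # right-to-left pass, re-aligned
encoder_states = np.concatenate([h_fwd, h_bwd], axis=-1)   # shape (T, 2 * hid)
```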
5. Attention Mechanism
There are two common ways to compute attention scores, an additive variant and a multiplicative variant; the multiplicative one is cheaper to compute. The dimensionality of \(W_1 h_j\) and \(W_2 s_i\) is called the attention dimension, typically between 128 and 1024.
$$\mathrm{score}(h_j, s_i) = \langle v,\; \tanh(W_1 h_j + W_2 s_i) \rangle \quad \text{(additive)}$$
$$\mathrm{score}(h_j, s_i) = \langle W_1 h_j,\; W_2 s_i \rangle \quad \text{(multiplicative)}$$
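A small numpy sketch of the two score functions above, with toy dimensions (illustrative only, not the paper's code):
```python
import numpy as np

def additive_score(h_j, s_i, W1, W2, v):
    """Additive (Bahdanau-style) score: <v, tanh(W1 h_j + W2 s_i)>."""
    return v @ np.tanh(W1 @ h_j + W2 @ s_i)

def multiplicative_score(h_j, s_i, W1, W2):
    """Multiplicative (Luong-style) score: <W1 h_j, W2 s_i>."""
    return (W1 @ h_j) @ (W2 @ s_i)

# h_j: encoder state at source position j, s_i: decoder state at target step i.
enc_dim, dec_dim, attn_dim = 6, 6, 4        # attn_dim is the "attention dimension"
h_j, s_i = np.random.randn(enc_dim), np.random.randn(dec_dim)
W1, W2 = np.random.randn(attn_dim, enc_dim), np.random.randn(attn_dim, dec_dim)
v = np.random.randn(attn_dim)

print(additive_score(h_j, s_i, W1, W2, v), multiplicative_score(h_j, s_i, W1, W2))
# In the full model these scores are computed for every source position j,
# softmax-normalized, and used as weights to form the context vector.
```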
Attention | newstest2013 | Params |
---|---|---|
Mul-128 | 22.03 ± 0.08 (22.14) | 65.73M |
Mul-256 | 22.33 ± 0.28 (22.64) | 65.93M |
Mul-512 | 21.78 ± 0.05 (21.83) | 66.32M |
Mul-1024 | 18.22 ± 0.03 (18.26) | 67.11M |
Add-128 | 22.23 ± 0.11 (22.38) | 65.73M |
Add-256 | 22.33 ± 0.04 (22.39) | 65.93M |
Add-512 | 22.47 ± 0.27 (22.79) | 66.33M |
Add-1024 | 22.10 ± 0.18 (22.36) | 67.11M |
None-State | 9.98 ± 0.28 (10.25) | 64.23M |
None-Input | 11.57 ± 0.30 (11.85) | 64.49M |
Attention clearly helps (the None-* rows are non-attentional baselines that only pass the final encoder state to the decoder), and the additive variant performs slightly better than the multiplicative one.
6. Beam Search Strategies
Beam | newstest2013 | Params |
---|---|---|
B1 | 20.66 ± 0.31 (21.08) | 66.32M |
B3 | 21.55 ± 0.26 (21.94) | 66.32M |
B5 | 21.60 ± 0.28 (22.03) | 66.32M |
B10 | 21.57 ± 0.26 (21.91) | 66.32M |
B25 | 21.47 ± 0.30 (21.77) | 66.32M |
B100 | 21.10 ± 0.31 (21.39) | 66.32M |
B10-LP-0.5 | 21.71 ± 0.25 (22.04) | 66.32M |
B10-LP-1.0 | 21.80 ± 0.25 (22.16) | 66.32M |
LP denotes a length penalty. A beam width of around 10 combined with a length penalty works well; very large beams (B25, B100) actually hurt.
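A toy sketch of how a length penalty changes hypothesis ranking. It uses simple length normalization (log-probability divided by length to the power alpha); the exact penalty formula in the paper may differ, so treat this as an illustration of the idea only:
```python
# Re-score finished beam-search hypotheses with a length penalty.
def rescore(hypotheses, alpha=1.0):
    """hypotheses: list of (tokens, total_log_prob); returns best-first order."""
    return sorted(hypotheses, key=lambda h: h[1] / len(h[0]) ** alpha, reverse=True)

# Without a penalty the shorter hypothesis wins; with alpha=1.0 the longer,
# slightly lower-probability hypothesis is preferred.
hyps = [(["a", "short", "one"], -3.0),
        (["a", "longer", "translation", "here"], -3.6)]
print(rescore(hyps, alpha=0.0)[0][0])   # ['a', 'short', 'one']
print(rescore(hyps, alpha=1.0)[0][0])   # ['a', 'longer', 'translation', 'here']
```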
7. Final System Comparison
The hyperparameters of the final model:
Hyperparameter | Value |
---|---|
embedding dim | 512 |
rnn cell variant | LSTMCell |
encoder depth | 4 |
decoder depth | 4 |
attention dim | 512 |
attention type | Bahdanau |
encoder | bidirectional |
beam size | 10 |
length penalty | 1.0 |
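The same configuration written out as a plain Python dict for reference (the key names are made up for this note; the paper's released tf-seq2seq code uses its own YAML-based configuration format):
```python
final_config = {
    "embedding_dim": 512,
    "rnn_cell": "LSTMCell",
    "encoder_depth": 4,
    "decoder_depth": 4,
    "attention_dim": 512,
    "attention_type": "Bahdanau",    # the additive variant
    "encoder": "bidirectional",
    "beam_size": 10,
    "length_penalty": 1.0,
}
```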
Comparison with other models (BLEU on newstest14 / newstest15):
Model | newstest14 | newstest15 |
---|---|---|
Ours (experimental) | 22.03 | 24.75 |
Ours (combined) | 22.19 | 25.23 |
OpenNMT | 19.34 | - |
Luong | 20.9 | - |
BPE-Char | 21.5 | 23.9 |
BPE | - | 20.5 |
RNNSearch-LV | 19.4 | - |
RNNSearch | - | 16.5 |
Deep-Att* | 20.6 | - |
GNMT* | 24.61 | - |
Deep-Conv* | - | 24.3 |
Links
Paper: Massive Exploration of Neural Machine Translation Architectures
Dataset: WMT14
Notes
All numbers and tables are taken from the paper.