Massive Exploration of Neural Machine Translation Architectures

Abstract

  The paper runs a large-scale tuning study of NMT systems, searching for the best architectural hyperparameters and practical tricks.

Experimental Results

1. Embedding Dimensionality

| Dim  | newstest2013 BLEU    | Params  |
|------|----------------------|---------|
| 128  | 21.50 ± 0.16 (21.66) | 36.13M  |
| 256  | 21.73 ± 0.09 (21.85) | 46.20M  |
| 512  | 21.78 ± 0.05 (21.83) | 66.32M  |
| 1024 | 21.36 ± 0.27 (21.67) | 106.58M |
| 2048 | 21.86 ± 0.17 (22.08) | 187.09M |

  Scores here and below are BLEU on newstest2013, reported as mean ± standard deviation over runs, with the best run in parentheses. Larger embedding dimensions are not necessarily better; the 128-dimensional embeddings converge about twice as fast.

2. RNN Cell Variant

| Cell        | newstest2013 BLEU    | Params |
|-------------|----------------------|--------|
| LSTM        | 22.22 ± 0.08 (22.33) | 68.95M |
| GRU         | 21.78 ± 0.05 (21.83) | 66.32M |
| Vanilla-Dec | 15.38 ± 0.28 (15.73) | 63.18M |

LSTM cells consistently outperformed GRU cells

  The experiments show LSTM outperforming GRU, with little difference in speed, so defaulting to LSTM should not hurt.

3. Encoder and Decoder Depth

Residual connections were added to the deeper models:
$$x_t^{(l+1)}=h_t^{(l)}(x_t^{(l)}, h_{t-1}^{(l)})+x_t^{(l)}$$
  where \(h_t^{(l)}(x_t^{(l)}, h_{t-1}^{(l)})\) is the output of layer \(l\) at time step \(t\). Adding residual connections markedly improves the deep decoders (compare Dec-8 with Dec-8-Res and Dec-8-ResD in the table below).
(Figure: residual and dense connection variants)
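Below is a minimal Python sketch of this residual stacking, assuming generic per-layer step functions as stand-ins for the LSTM/GRU cells actually used; it only illustrates the formula, not the paper's implementation:

```python
def stacked_rnn_step_with_residuals(x_t, prev_states, rnn_steps):
    """One time step through a stack of RNN layers with residual connections.

    x_t:         input vector at step t (e.g. a NumPy array)
    prev_states: list holding the previous hidden state of each layer
    rnn_steps:   list of callables; rnn_steps[l](x, h) -> (output, new_h),
                 stand-ins for the LSTM/GRU cells used in the paper
    """
    layer_input = x_t
    new_states = []
    for l, step in enumerate(rnn_steps):
        output, new_h = step(layer_input, prev_states[l])
        # Residual connection: x_t^(l+1) = h_t^(l)(x_t^(l), h_{t-1}^(l)) + x_t^(l)
        # (requires the layer output and input to have the same dimension)
        layer_input = output + layer_input
        new_states.append(new_h)
    return layer_input, new_states
```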

| Depth      | newstest2013 BLEU    | Params |
|------------|----------------------|--------|
| Enc-2      | 21.78 ± 0.05 (21.83) | 66.32M |
| Enc-4      | 21.85 ± 0.32 (22.23) | 69.47M |
| Enc-8      | 21.32 ± 0.14 (21.51) | 75.77M |
| Enc-8-Res  | 19.23 ± 1.96 (21.97) | 75.77M |
| Enc-8-ResD | 17.30 ± 2.64 (21.03) | 75.77M |
| Dec-1      | 21.76 ± 0.12 (21.93) | 64.75M |
| Dec-2      | 21.78 ± 0.05 (21.83) | 66.32M |
| Dec-4      | 22.37 ± 0.10 (22.51) | 69.47M |
| Dec-4-Res  | 17.48 ± 0.25 (17.82) | 68.69M |
| Dec-4-ResD | 21.10 ± 0.24 (21.43) | 68.69M |
| Dec-8      | 01.42 ± 0.23 (1.66)  | 75.77M |
| Dec-8-Res  | 16.99 ± 0.42 (17.47) | 75.77M |
| Dec-8-ResD | 20.97 ± 0.34 (21.42) | 75.77M |

We expected deep models to perform better across the board

  In these experiments a depth of 4 works best, though the right depth should still be tuned to the data. The authors nevertheless expect deeper models to perform better; what is missing are better optimization techniques for deep sequence models.

4. Unidirectional vs. Bidirectional Encoder

| Encoder | newstest2013 BLEU    | Params |
|---------|----------------------|--------|
| Bidi-2  | 21.78 ± 0.05 (21.83) | 66.32M |
| Uni-1   | 20.54 ± 0.16 (20.73) | 63.44M |
| Uni-1R  | 21.16 ± 0.35 (21.64) | 63.44M |
| Uni-2   | 20.98 ± 0.10 (21.07) | 65.01M |
| Uni-2R  | 21.76 ± 0.21 (21.93) | 65.01M |
| Uni-4   | 21.47 ± 0.22 (21.70) | 68.16M |
| Uni-4R  | 21.32 ± 0.42 (21.89) | 68.16M |

  The bidirectional encoder performs better, but the gap is not large (the R suffix denotes a reversed source sequence).
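As a rough illustration of what a bidirectional encoder does, the sketch below assumes a generic run_rnn helper and simply concatenates the forward and backward states at each source position; the real system uses stacked LSTM cells in tf-seq2seq:

```python
import numpy as np

def bidirectional_encode(run_rnn, cell_fw, cell_bw, inputs, h0_fw, h0_bw):
    """Encode a source sentence with a forward and a backward RNN.

    run_rnn: callable (cell, inputs, h0) -> list of hidden states, one per position
    inputs:  list of embedded source tokens x_1 ... x_T (NumPy arrays)
    """
    fw_states = run_rnn(cell_fw, inputs, h0_fw)        # reads x_1 .. x_T
    bw_states = run_rnn(cell_bw, inputs[::-1], h0_bw)  # reads x_T .. x_1
    bw_states = bw_states[::-1]                        # re-align with source positions
    # Each encoder state now carries both left and right context of its position.
    return [np.concatenate([f, b]) for f, b in zip(fw_states, bw_states)]
```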

5. Attention Mechanism

  There are two ways to compute the attention score, an additive variant and a multiplicative variant; the multiplicative one is faster to compute. The dimensionality of \(W_1 h_j\) and \(W_2 s_i\) is called the attention dimension and is typically set between 128 and 1024.
$$\mathrm{score}(h_j,s_i)=\langle v,\ \tanh(W_1 h_j+W_2 s_i)\rangle$$ $$\mathrm{score}(h_j,s_i)=\langle W_1 h_j,\ W_2 s_i\rangle$$
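A small NumPy sketch of the two scoring variants and the softmax normalization over source positions; here W1, W2, and v stand for the learned parameters, and the number of rows of W1/W2 is the attention dimension discussed above:

```python
import numpy as np

def additive_score(h_j, s_i, W1, W2, v):
    """Additive (Bahdanau-style) variant: <v, tanh(W1 h_j + W2 s_i)>."""
    return float(np.dot(v, np.tanh(W1 @ h_j + W2 @ s_i)))

def multiplicative_score(h_j, s_i, W1, W2):
    """Multiplicative (Luong-style) variant: <W1 h_j, W2 s_i>."""
    return float(np.dot(W1 @ h_j, W2 @ s_i))

def attention_weights(encoder_states, s_i, score_fn):
    """Softmax-normalize the scores over all source positions j."""
    scores = np.array([score_fn(h_j, s_i) for h_j in encoder_states])
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()
```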

| Attention  | newstest2013 BLEU    | Params |
|------------|----------------------|--------|
| Mul-128    | 22.03 ± 0.08 (22.14) | 65.73M |
| Mul-256    | 22.33 ± 0.28 (22.64) | 65.93M |
| Mul-512    | 21.78 ± 0.05 (21.83) | 66.32M |
| Mul-1024   | 18.22 ± 0.03 (18.26) | 67.11M |
| Add-128    | 22.23 ± 0.11 (22.38) | 65.73M |
| Add-256    | 22.33 ± 0.04 (22.39) | 65.93M |
| Add-512    | 22.47 ± 0.27 (22.79) | 66.33M |
| Add-1028   | 22.10 ± 0.18 (22.36) | 67.11M |
| None-State | 9.98 ± 0.28 (10.25)  | 64.23M |
| None-Input | 11.57 ± 0.30 (11.85) | 64.49M |

As the table shows, attention improves results dramatically (compare the None-* rows), and the additive variant performs slightly better than the multiplicative one.

6. Beam Search Strategies

| Beam       | newstest2013 BLEU    | Params |
|------------|----------------------|--------|
| B1         | 20.66 ± 0.31 (21.08) | 66.32M |
| B3         | 21.55 ± 0.26 (21.94) | 66.32M |
| B5         | 21.60 ± 0.28 (22.03) | 66.32M |
| B10        | 21.57 ± 0.26 (21.91) | 66.32M |
| B25        | 21.47 ± 0.30 (21.77) | 66.32M |
| B100       | 21.10 ± 0.31 (21.39) | 66.32M |
| B10-LP-0.5 | 21.71 ± 0.25 (22.04) | 66.32M |
| B10-LP-1.0 | 21.80 ± 0.25 (22.16) | 66.32M |

  LP denotes the length penalty; a beam width of about 10 is generally sufficient.
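As an illustration, the sketch below re-ranks finished hypotheses with a GNMT-style length penalty (Wu et al., 2016); whether this is the exact penalty form used in the paper's implementation is an assumption here:

```python
def length_penalty(length, alpha):
    """GNMT-style penalty lp(Y) = ((5 + |Y|) / 6) ** alpha (Wu et al., 2016).
    alpha = 0 disables the penalty; larger alpha favors longer hypotheses."""
    return ((5.0 + length) / 6.0) ** alpha

def pick_best(hypotheses, alpha):
    """Re-rank finished beam hypotheses, given as (sum_log_prob, tokens) pairs."""
    return max(hypotheses, key=lambda h: h[0] / length_penalty(len(h[1]), alpha))
```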

7. Final System Comparison

The hyperparameters of the final model are as follows:

| Hyperparameter   | Value         |
|------------------|---------------|
| embedding dim    | 512           |
| rnn cell variant | LSTMCell      |
| encoder depth    | 4             |
| decoder depth    | 4             |
| attention dim    | 512           |
| attention type   | Bahdanau      |
| encoder          | bidirectional |
| beam size        | 10            |
| length penalty   | 1.0           |
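For quick reference, the same settings restated as a plain Python dict; the key names are illustrative only and do not correspond to the actual tf-seq2seq configuration format:

```python
# Illustrative restatement of the table above, not a real tf-seq2seq config.
final_config = {
    "embedding_dim": 512,
    "rnn_cell": "LSTMCell",
    "encoder_depth": 4,
    "decoder_depth": 4,
    "attention_dim": 512,
    "attention_type": "bahdanau",   # the additive variant
    "encoder": "bidirectional",
    "beam_width": 10,
    "length_penalty": 1.0,
}
```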

Comparison with other published systems:

| Model               | newstest14 BLEU | newstest15 BLEU |
|---------------------|-----------------|-----------------|
| Ours (experimental) | 22.03           | 24.75           |
| Ours (combined)     | 22.19           | 25.23           |
| OpenNMT             | 19.34           | -               |
| Luong               | 20.9            | -               |
| BPE-Char            | 21.5            | 23.9            |
| BPE                 | -               | 20.5            |
| RNNSearch-LV        | 19.4            | -               |
| RNNSearch           | -               | 16.5            |
| Deep-Att*           | 20.6            | -               |
| GNMT*               | 24.61           | -               |
| Deep-Conv*          | -               | 24.3            |

Links

Paper: Massive Exploration of Neural Machine Translation Architectures
Dataset: WMT14

Notes

All data and tables are taken from the paper.