Summary
The paper presents a large-scale hyperparameter sweep over NMT architectures, searching for the best architectural choices and training tricks.
Experimental Results
Unless noted otherwise, scores are BLEU on newstest2013, reported as mean ± standard deviation across runs, with the best single run in parentheses.
1. Embedding Dimensionality
Dim | newstest2013 | Params |
---|---|---|
128 | 21.50 ± 0.16 (21.66) | 36.13M |
256 | 21.73 ± 0.09 (21.85) | 46.20M |
512 | 21.78 ± 0.05 (21.83) | 66.32M |
1024 | 21.36 ± 0.27 (21.67) | 106.58M |
2048 | 21.86 ± 0.17 (22.08) | 187.09M |
Larger embedding dimensions are not always better: the 128-dimensional embeddings reach almost the same BLEU while converging nearly twice as fast.
2. RNN Cell Variant
Cell | newstest2013 | Params |
---|---|---|
LSTM | 22.22 ± 0.08 (22.33) | 68.95M |
GRU | 21.78 ± 0.05 (21.83) | 66.32M |
Vanilla-Dec | 15.38 ± 0.28 (15.73) | 63.18M |
LSTM cells consistently outperformed GRU cells
In these experiments the LSTM was also not noticeably slower than the GRU, so using LSTM cells is a safe default.
3. Encoder and Decoder Depth
Residual connections were added between the stacked RNN layers:
$$x_t^{(l+1)} = h_t^{(l)}(x_t^{(l)},\; h_{t-1}^{(l)}) + x_t^{(l)}$$
where \(h_t^{(l)}(x_t^{(l)}, h_{t-1}^{(l)})\) is the output of layer \(l\) at time step \(t\), so each layer's input is added to its output before being passed to the next layer. Residual connections give a clear improvement for deeper models (compare Dec-8 with Dec-8-Res/ResD below).
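A minimal numpy sketch of this residual wiring around a toy RNN cell (the cell and function names are illustrative, not the paper's implementation):
```python
import numpy as np

# Residual step following x_t^(l+1) = h_t^(l)(x_t^(l), h_{t-1}^(l)) + x_t^(l):
# apply the layer's cell, then add the layer input to its output.
def residual_step(cell_step, x_t, h_prev):
    h_t = cell_step(x_t, h_prev)   # output of layer l at step t
    return h_t + x_t, h_t          # the sum feeds layer l+1; h_t is carried to step t+1

# Toy vanilla tanh-RNN cell; input and hidden sizes must match for the sum.
dim = 8
W_x = np.random.randn(dim, dim) * 0.1
W_h = np.random.randn(dim, dim) * 0.1

def rnn_cell(x_t, h_prev):
    return np.tanh(x_t @ W_x + h_prev @ W_h)

x_next, h_t = residual_step(rnn_cell, np.random.randn(dim), np.zeros(dim))
```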
Depth | newstest2013 | Params |
---|---|---|
Enc-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Enc-4 | 21.85 ± 0.32 (22.23) | 69.47M |
Enc-8 | 21.32 ± 0.14 (21.51) | 75.77M |
Enc-8-Res | 19.23 ± 1.96 (21.97) | 75.77M |
Enc-8-ResD | 17.30 ± 2.64 (21.03) | 75.77M |
Dec-1 | 21.76 ± 0.12 (21.93) | 64.75M |
Dec-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Dec-4 | 22.37 ± 0.10 (22.51) | 69.47M |
Dec-4-Res | 17.48 ± 0.25 (17.82) | 68.69M |
Dec-4-ResD | 21.10 ± 0.24 (21.43) | 68.69M |
Dec-8 | 1.42 ± 0.23 (1.66) | 75.77M |
Dec-8-Res | 16.99 ± 0.42 (17.47) | 75.77M |
Dec-8-ResD | 20.97 ± 0.34 (21.42) | 75.77M |
We expected deep models to perform better across the board
In these experiments a depth of 4 worked best for both encoder and decoder, though the right depth should still be tuned per dataset. The authors nevertheless expect deeper models to perform better, and attribute the current gap to the lack of robust optimization techniques for deep sequence-to-sequence models.
4. Unidirectional vs. Bidirectional Encoder
Encoder | newstest2013 | Params |
---|---|---|
Bidi-2 | 21.78 ± 0.05 (21.83) | 66.32M |
Uni-1 | 20.54 ± 0.16 (20.73) | 63.44M |
Uni-1R | 21.16 ± 0.35 (21.64) | 63.44M |
Uni-2 | 20.98 ± 0.10 (21.07) | 65.01M |
Uni-2R | 21.76 ± 0.21 (21.93) | 65.01M |
Uni-4 | 21.47 ± 0.22 (21.70) | 68.16M |
Uni-4R | 21.32 ± 0.42 (21.89) | 68.16M |
The bidirectional encoder performs best, but the margin is small; for unidirectional encoders, reversing the source sequence (the R variants) helps.
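For reference, a toy numpy sketch of what a bidirectional encoder does: the source sequence is processed left-to-right and right-to-left, and the two hidden-state sequences are concatenated at each position (a plain tanh RNN stands in for the LSTM; all names are illustrative):
```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Return the hidden state at every time step for one direction."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(x @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)                      # shape (T, hidden_dim)

T, in_dim, hid = 5, 4, 6
xs = np.random.randn(T, in_dim)
fwd = (np.random.randn(in_dim, hid) * 0.1, np.random.randn(hid, hid) * 0.1)
bwd = (np.random.randn(in_dim, hid) * 0.1, np.random.randn(hid, hid) * 0.1)

h_fwd = run_rnn(xs, *fwd)                        # left-to-right pass
h_bwd = run_rnn(xs[::-1], *bwd)[::-1]            # right-to-left pass, re-aligned
encoder_states = np.concatenate([h_fwd, h_bwd], axis=-1)   # shape (T, 2 * hid)
```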
5. Attention Mechanism
There are two common ways to compute attention scores, an additive variant and a multiplicative variant; the multiplicative one is cheaper to compute. The dimensionality of \(W_1 h_j\) and \(W_2 s_i\) is called the attention dimension, typically between 128 and 1024.
$$\mathrm{score}(h_j, s_i) = \langle v,\; \tanh(W_1 h_j + W_2 s_i) \rangle \quad \text{(additive)}$$
$$\mathrm{score}(h_j, s_i) = \langle W_1 h_j,\; W_2 s_i \rangle \quad \text{(multiplicative)}$$
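A small numpy sketch of the two score functions above, with toy dimensions (illustrative only, not the paper's code):
```python
import numpy as np

def additive_score(h_j, s_i, W1, W2, v):
    """Additive (Bahdanau-style) score: <v, tanh(W1 h_j + W2 s_i)>."""
    return v @ np.tanh(W1 @ h_j + W2 @ s_i)

def multiplicative_score(h_j, s_i, W1, W2):
    """Multiplicative (Luong-style) score: <W1 h_j, W2 s_i>."""
    return (W1 @ h_j) @ (W2 @ s_i)

# h_j: encoder state at source position j, s_i: decoder state at target step i.
enc_dim, dec_dim, attn_dim = 6, 6, 4        # attn_dim is the "attention dimension"
h_j, s_i = np.random.randn(enc_dim), np.random.randn(dec_dim)
W1, W2 = np.random.randn(attn_dim, enc_dim), np.random.randn(attn_dim, dec_dim)
v = np.random.randn(attn_dim)

print(additive_score(h_j, s_i, W1, W2, v), multiplicative_score(h_j, s_i, W1, W2))
# In the full model these scores are computed for every source position j,
# softmax-normalized, and used as weights to form the context vector.
```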
Attention | newstest2013 | Params |
---|---|---|
Mul-128 | 22.03 ± 0.08 (22.14) | 65.73M |
Mul-256 | 22.33 ± 0.28 (22.64) | 65.93M |
Mul-512 | 21.78 ± 0.05 (21.83) | 66.32M |
Mul-1024 | 18.22 ± 0.03 (18.26) | 67.11M |
Add-128 | 22.23 ± 0.11 (22.38) | 65.73M |
Add-256 | 22.33 ± 0.04 (22.39) | 65.93M |
Add-512 | 22.47 ± 0.27 (22.79) | 66.33M |
Add-1024 | 22.10 ± 0.18 (22.36) | 67.11M |
None-State | 9.98 ± 0.28 (10.25) | 64.23M |
None-Input | 11.57 ± 0.30 (11.85) | 64.49M |
Attention clearly helps (the None-* rows are non-attentional baselines that only pass the final encoder state to the decoder), and the additive variant performs slightly better than the multiplicative one.
6. Beam Search Strategies
Beam | newstest2013 | Params |
---|---|---|
B1 | 20.66 ± 0.31 (21.08) | 66.32M |
B3 | 21.55 ± 0.26 (21.94) | 66.32M |
B5 | 21.60 ± 0.28 (22.03) | 66.32M |
B10 | 21.57 ± 0.26 (21.91) | 66.32M |
B25 | 21.47 ± 0.30 (21.77) | 66.32M |
B100 | 21.10 ± 0.31 (21.39) | 66.32M |
B10-LP-0.5 | 21.71 ± 0.25 (22.04) | 66.32M |
B10-LP-1.0 | 21.80 ± 0.25 (22.16) | 66.32M |
LP denotes a length penalty. A beam width of around 10 combined with a length penalty works well; very large beams (B25, B100) actually hurt.
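A toy sketch of how a length penalty changes hypothesis ranking. It uses simple length normalization (log-probability divided by length to the power alpha); the exact penalty formula in the paper may differ, so treat this as an illustration of the idea only:
```python
# Re-score finished beam-search hypotheses with a length penalty.
def rescore(hypotheses, alpha=1.0):
    """hypotheses: list of (tokens, total_log_prob); returns best-first order."""
    return sorted(hypotheses, key=lambda h: h[1] / len(h[0]) ** alpha, reverse=True)

# Without a penalty the shorter hypothesis wins; with alpha=1.0 the longer,
# slightly lower-probability hypothesis is preferred.
hyps = [(["a", "short", "one"], -3.0),
        (["a", "longer", "translation", "here"], -3.6)]
print(rescore(hyps, alpha=0.0)[0][0])   # ['a', 'short', 'one']
print(rescore(hyps, alpha=1.0)[0][0])   # ['a', 'longer', 'translation', 'here']
```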
7. Final System Comparison
The hyperparameters of the final model:
Hyperparameter | Value |
---|---|
embedding dim | 512 |
rnn cell variant | LSTMCell |
encoder depth | 4 |
decoder depth | 4 |
attention dim | 512 |
attention type | Bahdanau |
encoder | bidirectional |
beam size | 10 |
length penalty | 1.0 |
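The same configuration written out as a plain Python dict for reference (the key names are made up for this note; the paper's released tf-seq2seq code uses its own YAML-based configuration format):
```python
final_config = {
    "embedding_dim": 512,
    "rnn_cell": "LSTMCell",
    "encoder_depth": 4,
    "decoder_depth": 4,
    "attention_dim": 512,
    "attention_type": "Bahdanau",    # the additive variant
    "encoder": "bidirectional",
    "beam_size": 10,
    "length_penalty": 1.0,
}
```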
Comparison with other models (BLEU on newstest14 / newstest15):
Model | newstest14 | newstest15 |
---|---|---|
Ours (experimental) | 22.03 | 24.75 |
Ours (combined) | 22.19 | 25.23 |
OpenNMT | 19.34 | - |
Luong | 20.9 | - |
BPE-Char | 21.5 | 23.9 |
BPE | - | 20.5 |
RNNSearch-LV | 19.4 | - |
RNNSearch | - | 16.5 |
Deep-Att* | 20.6 | - |
GNMT* | 24.61 | - |
Deep-Conv* | - | 24.3 |
Links
Paper: Massive Exploration of Neural Machine Translation Architectures
Dataset: WMT14
Notes
All numbers and tables are taken from the paper.