
subword-nmt #2

Open
528031207 opened this issue Jul 10, 2019 · 10 comments

@528031207

Why does subword-nmt get-vocab --input tmp/raw-train.zh-en.en --output en.vocab not generate any output for me? Is this a version incompatibility?
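
For reference, if the installed version's get-vocab does not accept --input/--output (some older releases read only stdin and write to stdout, which would leave the output file empty), the redirection form below should still work; this is a sketch, not tested against that exact version:

subword-nmt get-vocab < tmp/raw-train.zh-en.en > en.vocab
# newer releases also accept explicit flags:
subword-nmt get-vocab --input tmp/raw-train.zh-en.en --output en.vocab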

@yanwii
Owner

yanwii commented Jul 10, 2019

For English tokenization I'd suggest following subword-nmt, and switching the problem over to the native SubwordEncoder as well; with those changes the Chinese-to-English results will be much better.
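
A minimal sketch of the suggested English-side preprocessing with subword-nmt (the number of BPE merges and the file names here are placeholders, not the exact setup used in this repo):

subword-nmt learn-bpe -s 32000 < tmp/raw-train.zh-en.en > en.codes
subword-nmt apply-bpe -c en.codes < tmp/raw-train.zh-en.en > train.bpe.en
subword-nmt get-vocab < train.bpe.en > en.vocab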

@528031207
Author

528031207 commented Jul 10, 2019 via email

@528031207
Author

Following the training code, English-to-Chinese works fine for me, but when I train Chinese-to-English I keep getting the error below, even after lowering batch_size to 128. How should I deal with this?
(0) Resource exhausted: Ran out of GPU memory when allocating 688855104 bytes for
[[{{node transformer/parallel_0_5/transformer/transformer/padded_cross_entropy/smoothing_cross_entropy/softmax_cross_entropy_with_logits}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[training/control_dependency/_6751]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

@528031207
Author

Hi, I changed --decode_hparams="batch_size=1024" to --hparams="batch_size=1024" and now it runs normally. What is the difference between these two flags, and how do they affect the results? Each 100 steps now takes half as long as before.

@yanwii
Owner

yanwii commented Jul 11, 2019

hparams sets the training-time hyperparameters, while decode_hparams sets the decoding-time ones. The default batch_size is 2048, which is why your step time has halved. Also, the OOM means you ran out of GPU memory.
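
In other words (a sketch reusing the flags from this thread; the values and file names are illustrative, not the exact commands used here), --hparams is read by t2t-trainer at training time, while --decode_hparams only takes effect when decoding with t2t-decoder:

# training-time batch size
t2t-trainer --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --t2t_usr_dir=user_dir \
    --hparams="batch_size=1024"
# decoding-time batch size and beam width
t2t-decoder --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --t2t_usr_dir=user_dir \
    --decode_hparams="batch_size=32,beam_size=4" \
    --decode_from_file=test.zh --decode_to_file=test.en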

@528031207
Author

With the setup you provided I get very good results for English-to-Chinese, but for Chinese-to-English the loss gets stuck around 4.9 and stops decreasing, and BLEU is below 2. Here are my parameters:
export CUDA_VISIBLE_DEVICES=0
t2t-trainer --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --train_steps=200000 --eval_steps=100 \
    --t2t_usr_dir=user_dir --tmp_dir=tmp/ --hparams="batch_size=2048" \
    --worker_gpu_memory_fraction=0.92 --decode_hparams="batch_size=1024"
When I lower the learning rate I get OOM, even with batch_size down to 512. Is something wrong somewhere? I'd really appreciate some pointers!

@yanwii
Owner

yanwii commented Jul 11, 2019

Chinese-to-English performance depends a lot on how you tokenize. At the time I split Chinese into characters and used BPE for English; the loss didn't come down much and BLEU stayed around 20, but in actual testing the translations were still decent.
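
A rough sketch of that kind of preprocessing (the file names are illustrative; the sed command assumes GNU sed running in a UTF-8 locale so that each Chinese character matches one "."):

# character-level Chinese side
sed 's/./& /g; s/  */ /g; s/ *$//' tmp/raw-train.zh-en.zh > train.char.zh
# BPE English side, reusing codes learned with subword-nmt as shown above
subword-nmt apply-bpe -c en.codes < tmp/raw-train.zh-en.en > train.bpe.en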

@cfwin

cfwin commented Sep 9, 2019

Why not use jieba for Chinese word segmentation?
Does it not work well?

@yanwii
Owner

yanwii commented Sep 9, 2019

With Chinese word-level segmentation the vocabulary size is hard to control and OOVs appear easily; working at the character level is much better.
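
A quick, rough way to compare the two vocabulary sizes (illustrative only; assumes jieba is installed and GNU sed runs in a UTF-8 locale):

python -m jieba -d ' ' tmp/raw-train.zh-en.zh | tr ' ' '\n' | sort -u | wc -l   # word-level types
sed 's/./& /g' tmp/raw-train.zh-en.zh | tr ' ' '\n' | sort -u | wc -l           # character-level types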

@cfwin

cfwin commented Sep 26, 2019

How do you preprocess English words and special fields (e.g. URLs) that appear in the Chinese text?
Do they need special handling?
And when the translation system is serving online, how should OOVs be handled?
