
subword-nmt #2

Open
528031207 opened this issue Jul 10, 2019 · 10 comments

@528031207

Why does subword-nmt get-vocab --input tmp/raw-train.zh-en.en --output en.vocab not generate any output for me? Is this a version incompatibility?
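
For reference, if the installed version's get-vocab does not accept --input/--output (some older releases read only stdin and write to stdout, which would leave the output file empty), the redirection form below should still work; this is a sketch, not tested against that exact version:

subword-nmt get-vocab < tmp/raw-train.zh-en.en > en.vocab
# newer releases also accept explicit flags:
subword-nmt get-vocab --input tmp/raw-train.zh-en.en --output en.vocab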

@yanwii
Owner

yanwii commented Jul 10, 2019

For English tokenization I'd suggest following subword-nmt, and switching the problem over to the native SubwordEncoder as well; with those changes the Chinese-to-English results will be much better.
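
A minimal sketch of the suggested English-side preprocessing with subword-nmt (the number of BPE merges and the file names here are placeholders, not the exact setup used in this repo):

subword-nmt learn-bpe -s 32000 < tmp/raw-train.zh-en.en > en.codes
subword-nmt apply-bpe -c en.codes < tmp/raw-train.zh-en.en > train.bpe.en
subword-nmt get-vocab < train.bpe.en > en.vocab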

@528031207
Author

528031207 commented Jul 10, 2019 via email

@528031207
Author

Following the training code, English-to-Chinese works fine for me, but when I train Chinese-to-English I keep getting the error below, even after lowering batch_size to 128. How should I deal with this?
(0) Resource exhausted: Ran out of GPU memory when allocating 688855104 bytes for
[[{{node transformer/parallel_0_5/transformer/transformer/padded_cross_entropy/smoothing_cross_entropy/softmax_cross_entropy_with_logits}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[training/control_dependency/_6751]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

@528031207
Author

Hi, I changed --decode_hparams="batch_size=1024" to --hparams="batch_size=1024" and now it runs normally. What is the difference between these two flags, and how do they affect the results? Each 100 steps now takes half as long as before.

@yanwii
Owner

yanwii commented Jul 11, 2019

hparams sets the training-time hyperparameters, while decode_hparams sets the decoding-time ones. The default batch_size is 2048, which is why your step time has halved. Also, the OOM means you ran out of GPU memory.
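
In other words (a sketch reusing the flags from this thread; the values and file names are illustrative, not the exact commands used here), --hparams is read by t2t-trainer at training time, while --decode_hparams only takes effect when decoding with t2t-decoder:

# training-time batch size
t2t-trainer --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --t2t_usr_dir=user_dir \
    --hparams="batch_size=1024"
# decoding-time batch size and beam width
t2t-decoder --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --t2t_usr_dir=user_dir \
    --decode_hparams="batch_size=32,beam_size=4" \
    --decode_from_file=test.zh --decode_to_file=test.en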

@528031207
Author

With the setup you provided I get very good results for English-to-Chinese, but for Chinese-to-English the loss gets stuck around 4.9 and stops decreasing, and BLEU is below 2. Here are my parameters:
export CUDA_VISIBLE_DEVICES=0
t2t-trainer --data_dir=data --output_dir=model_rev --problem=translate_enzh_sub50k_rev \
    --model=transformer --hparams_set=transformer_big --train_steps=200000 --eval_steps=100 \
    --t2t_usr_dir=user_dir --tmp_dir=tmp/ --hparams="batch_size=2048" \
    --worker_gpu_memory_fraction=0.92 --decode_hparams="batch_size=1024"
When I lower the learning rate I get OOM, even with batch_size down to 512. Is something wrong somewhere? I'd really appreciate some pointers!

@yanwii
Owner

yanwii commented Jul 11, 2019

Chinese-to-English performance depends a lot on how you tokenize. At the time I split Chinese into characters and used BPE for English; the loss didn't come down much and BLEU stayed around 20, but in actual testing the translations were still decent.
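
A rough sketch of that kind of preprocessing (the file names are illustrative; the sed command assumes GNU sed running in a UTF-8 locale so that each Chinese character matches one "."):

# character-level Chinese side
sed 's/./& /g; s/  */ /g; s/ *$//' tmp/raw-train.zh-en.zh > train.char.zh
# BPE English side, reusing codes learned with subword-nmt as shown above
subword-nmt apply-bpe -c en.codes < tmp/raw-train.zh-en.en > train.bpe.en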

@cfwin

cfwin commented Sep 9, 2019

Why not use jieba for Chinese word segmentation?
Does it not work well?

@yanwii
Owner

yanwii commented Sep 9, 2019

With Chinese word-level segmentation the vocabulary size is hard to control and OOVs appear easily; working at the character level is much better.
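
A quick, rough way to compare the two vocabulary sizes (illustrative only; assumes jieba is installed and GNU sed runs in a UTF-8 locale):

python -m jieba -d ' ' tmp/raw-train.zh-en.zh | tr ' ' '\n' | sort -u | wc -l   # word-level types
sed 's/./& /g' tmp/raw-train.zh-en.zh | tr ' ' '\n' | sort -u | wc -l           # character-level types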

@cfwin

cfwin commented Sep 26, 2019

How do you preprocess English words and special fields (e.g. URLs) that appear in the Chinese text?
Do they need special handling?
And when the translation system is serving online, how should OOVs be handled?
