Compute the probability of a text using a transformer MLM (masked language model).
Data available at: http://poleval.pl/tasks/task1/
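The core idea is pseudo-log-likelihood scoring: mask each token in turn and sum the log-probabilities the MLM assigns to the original tokens. Below is a minimal sketch of this technique using HuggingFace `transformers`; it illustrates the idea only and is not the repo's `predict_lm.py`:

```python
# Sketch of pseudo-log-likelihood scoring with an MLM (illustration only;
# the actual predict_lm.py implementation may differ).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

def pseudo_log_likelihood(text: str) -> float:
    """Sum log P(token | rest of the sentence), masking one token at a time."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the special tokens at positions 0 and -1 (<s> ... </s>).
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total
```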
```
python3 predict_lm.py --model xlm-roberta-base --nbest data/test2/nbest.txt --output data/test2/xlm-roberta-base_lm.jsonl
```
- For PolishRoberta, the `--opi` argument is needed.
- Add `--reference data/test2/reference.txt` to calculate WER (see the sketch below).
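For reference, WER is the word-level edit distance between hypothesis and reference, divided by the reference length. A self-contained sketch of the computation (not the repo's implementation):

```python
# Word error rate: word-level Levenshtein distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)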
```
python3 chose_lm.py data/test2/xlm-roberta-base_lm.jsonl 2>/dev/null > predictions.txt
```
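`chose_lm.py` presumably emits, for each utterance, the n-best hypothesis with the best MLM score. A hypothetical sketch; the jsonl field names (`hypotheses`, `text`, `score`) are assumptions, not the actual schema:

```python
# Hypothetical sketch: pick the highest-scoring hypothesis per utterance.
# Field names are assumed; the real jsonl schema may differ.
import json
import sys

with open(sys.argv[1]) as f:
    for line in f:
        entry = json.loads(line)
        best = max(entry["hypotheses"], key=lambda h: h["score"])
        print(best["text"])
```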
```
python3 correct_spaces_lm.py --model xlm-roberta-base --path predictions.txt > predictions_spaces.txt
```
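`correct_spaces_lm.py` likely rescores spacing variants with the MLM and keeps the better-scoring one. A sketch of one direction only (merging adjacent words when the model prefers the joined form); the `score` argument is any text-to-log-probability function, e.g. the `pseudo_log_likelihood` sketch above:

```python
# Hypothetical sketch of space correction: merge two adjacent words whenever
# the MLM assigns the merged text a higher score than the original.
def correct_spaces(line: str, score) -> str:
    """score: function mapping text -> log-probability."""
    words = line.split()
    i = 0
    while i < len(words) - 1:
        merged = words[:i] + [words[i] + words[i + 1]] + words[i + 2:]
        if score(" ".join(merged)) > score(" ".join(words)):
            words = merged  # the model prefers the two words joined
        else:
            i += 1
    return " ".join(words)
```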
Word error rate (WER) on each split:

| system | dev | test | test2 |
|---|---|---|---|
| oracle from 100-best | 6.13% | 5.92% | |
| 1-best | 12.09% | 12.22% | |
| xlm-roberta-large | 9.25% | 8.86% | |
| xlm-roberta-large ppl | 9.68% | 9.44% | |
| xlm-roberta-large + preprocess | 8.53% | 8.14% | |
| polishroberta-base + preprocess | 8.15% | 7.99% | |
| polishroberta-large + preprocess | 7.77% | 7.81% | |
| polishroberta-large + preprocess + correct spaces | 7.85% | 7.84% | |
| xlm-roberta-large + preprocess + correct spaces | X | | |
- TODO: support for longer texts