Polish RoBERTa model trained on Polish Wikipedia, Polish literature and the Oscar corpus. The major assumption is that good-quality text will give a good model.
We believe in open science and knowledge sharing, so we decided to share the complete code, parameters, experiment details and TensorBoard logs.
- Experiments setup and goals
- Data
- Training Polish RoBERTa protocol with Fairseq
- Pretrained models and vocabs
- Used libraries
- Acknowledgements
- About Ermlab Software
During the experiments, we want to examine:
- the impact of different learning-rate schedulers on training speed and accuracy (the schedules are compared in the sketch below); tested:
  - linear schedule with warmup
  - cyclic schedules: cosine, triangular
- the impact of training time on the final accuracy
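As a rough illustration of how these schedules differ, the sketch below computes the per-update learning rate for a linear schedule with warmup, a triangular cycle, and a cosine cycle. It is a standalone toy, not the fairseq configuration: the peak LR, warmup length, cycle lengths and step counts are illustrative values (0.0005 is the peak logged for cosine run #4 in the research log below).

```python
import math

TOTAL_STEPS = 50_000   # illustrative; 50k / 125k runs were used in the experiments
WARMUP_STEPS = 10_000  # assumed warmup length, not the exact training setting
PEAK_LR = 0.0005       # peak LR logged for cosine run #4; other runs may differ

def linear_with_warmup(step):
    """Linear warmup to PEAK_LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)

def triangular(step, cycle=25_000):
    """Symmetric triangular cycles between 0 and PEAK_LR."""
    pos = step % cycle
    half = cycle / 2
    return PEAK_LR * (pos / half if pos < half else (cycle - pos) / half)

def cosine(step, cycle=50_000):
    """Cosine annealing from PEAK_LR down to 0 over each cycle."""
    pos = step % cycle
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * pos / cycle))

# Print a few sample points to compare the three schedules.
for step in range(0, TOTAL_STEPS + 1, 12_500):
    print(step, round(linear_with_warmup(step), 6),
          round(triangular(step), 6), round(cosine(step), 6))
```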
- Polish Wikipedia dump 03.2020 - archive link https://dumps.wikimedia.org/plwiki/20200301 (no longer available)
- Polish private book corpus (6GB)
- Cleaned Polish Oscar corpus (non-Polish sentences removed, only valid sentences kept, etc.; see Cleaned Polish Oscar details)
Our main assumption is that good-quality text should produce a good language model. So far the most popular Polish dataset has been the Polish Wikipedia dump, but its text is characterized by formal language. The second source of text is the Polish part of the Oscar corpus, i.e. text crawled from the Polish internet. When we investigated this corpus in more detail, it turned out to contain a lot of foreign sentences (in Russian, English, German, etc.), sentences that are too short, and ungrammatical sentences (such as word enumerations).
We prepared a few cleaning heuristics:
- remove sentences shorter than a minimum length
- remove non-Polish sentences
- remove ungrammatical sentences (without verbs or with too many nouns)
- perform sentence tokenization and save each sentence on a new line, with an empty line added after each document
The data was cleaned with the process_sentences.py script; the whole process is presented in the polish_process_data.ipynb notebook, and a simplified sketch of the filters is shown below.
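The sketch below is a minimal approximation of these filters, not the actual process_sentences.py. It assumes sentence tokenization has already been applied (one sentence per line, blank line between documents), uses langdetect (from the libraries section) for language identification, and reduces the grammar check to a hypothetical looks_grammatical placeholder, since the real pipeline tags sentences with the dockerized KRNNT tagger. The file names and the minimum-length threshold are illustrative.

```python
from langdetect import detect

MIN_WORDS = 4  # illustrative threshold; the real value lives in process_sentences.py

def looks_grammatical(sentence: str) -> bool:
    """Placeholder for the verb/noun check.

    The real pipeline tags each sentence with the dockerized KRNNT tagger and
    rejects sentences without a verb or with too many nouns.
    """
    return True

def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if len(words) < MIN_WORDS:           # too short
        return False
    try:
        if detect(sentence) != "pl":     # non-Polish
            return False
    except Exception:                    # langdetect failed (e.g. digits only)
        return False
    return looks_grammatical(sentence)   # grammar heuristic

# Hypothetical input/output file names.
with open("corpus_oscar_raw.txt", encoding="utf-8") as src, \
     open("corpus_oscar_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        sentence = line.strip()
        if not sentence:                 # document boundary: keep the empty line
            dst.write("\n")
        elif keep_sentence(sentence):
            dst.write(sentence + "\n")
```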
- Polish Wikipedia dump (03.2020)
- Cleaned Polish Oscar corpus
Summary of Cleaned Polish Oscar corpus
File | All lines | All sentences | Invalid length sent. | Non-Polish sent. | Ungrammatical sent. | Valid sentences |
---|---|---|---|---|---|---|
corpus_oscar_2020-04-10_32M_lines.txt | 32 000 506 | 94 332 394 | 1 796 371 | 296 093 | 8 100 750 | 84 139 180 |
corpus_oscar_2020-04-10_64M_lines.txt | 32 000 560 | 96 614 563 | 1 777 586 | 491 789 | 7 869 507 | 86 475 681 |
corpus_oscar_2020-04-10_96M_lines.txt | 32 001 738 | 96 457 553 | 1 796 083 | 302 598 | 7 908 090 | 86 450 782 |
corpus_oscar_2020-04-10_128M_lines.txt | 32 002 212 | 97 761 040 | 1 919 071 | 305 924 | 7 891 846 | 87 644 199 |
corpus_oscar_2020-04-10_128M_above_lines.txt | 17 519 467 | 53 446 884 | 1 090 714 | 212 657 | 4 343 296 | 47 800 217 |
Train Corpus | Lines | Words | Characters |
---|---|---|---|
Polish Wikipedia (2020-03) | 11 748 343 | 181 560 313 | 1 309 416 493 |
Books | 81 140 395 | 829 404 801 | 5 386 053 287 |
Oscar (32M part, cleaned) | 112 466 497 | 1 198 735 834 | 8 454 177 161 |
Total | 205 355 235 | 2 209 700 948 | 15 149 646 941 |
For testing, we take ~10% of each corpus (a simple split sketch is shown after the table):
Test Corpus | Lines | Words | Characters |
---|---|---|---|
Polish Wikipedia (2020-03) | 1 305 207 | 21 333 280 | 155 403 453 |
Books | 9 007 716 | 93 141 853 | 610 111 989 |
Oscar (32M part, cleaned) | 14 515 735 | 157 303 490 | 1 104 855 397 |
Total | 24 828 658 | 271 778 623 | 1 870 370 839 |
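The exact held-out selection is not described here; the sketch below shows one simple way to carve out roughly 10% of a corpus file for testing, sampling whole blank-line-separated documents so that article boundaries are preserved. The file names, split ratio handling and fixed seed are assumptions.

```python
import random

random.seed(0)  # reproducible split; the real procedure may differ

def split_corpus(path, train_path, test_path, test_frac=0.1):
    """Split a blank-line-delimited corpus into train/test by whole documents."""
    # Reading the whole file is fine for a sketch; stream it for huge corpora.
    with open(path, encoding="utf-8") as f:
        documents = f.read().split("\n\n")
    with open(train_path, "w", encoding="utf-8") as train, \
         open(test_path, "w", encoding="utf-8") as test:
        for doc in documents:
            if not doc.strip():
                continue
            target = test if random.random() < test_frac else train
            target.write(doc.strip() + "\n\n")

# Hypothetical file names.
split_corpus("corpus_wikipedia.txt", "wiki.train.txt", "wiki.test.txt")
```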
General recipe for the final data preparation and model training process:
- Prepare a huge text file data.txt (e.g. Wikipedia text), where each sentence is on a new line and each article is separated by two new lines.
- Take 10-15M lines and prepare another file for sentencepiece (the vocabulary builder) - again, one sentence per line.
- Train the sentencepiece vocabulary and save it in the fairseq format as vocab.fairseq.txt.
- Encode data.txt with the trained sentencepiece model into data.sp.txt.
- Preprocess data.sp.txt with fairseq-preprocess.
- Run training.
Detailed data preparation steps for fairseq (vocabulary generation and binarization) are available in a separate notebook, polish_roberta_vocab.ipynb.
The commands needed to reproduce the fairseq models with the various training protocols can be found in polish_roberta_training.ipynb; a rough sketch of the vocabulary and encoding steps follows.
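Steps 2-4 of the recipe correspond roughly to the sketch below, which uses the sentencepiece Python API. The training options (apart from the 32k vocabulary size used by most PoLitBert models), the file names, and the conversion to vocab.fairseq.txt are assumptions; the exact commands are in polish_roberta_vocab.ipynb.

```python
import sentencepiece as spm

# 1. Train a subword vocabulary on a ~10-15M line sample (one sentence per line).
#    vocab_size=32000 matches the v32k models; model_type is an assumption.
spm.SentencePieceTrainer.train(
    input="data_sample.txt",      # hypothetical sample file
    model_prefix="polish_sp32k",  # produces polish_sp32k.model / polish_sp32k.vocab
    vocab_size=32000,
    model_type="bpe",
)

# 2. Encode the full corpus into subword pieces, one encoded sentence per line.
sp = spm.SentencePieceProcessor(model_file="polish_sp32k.model")
with open("data.txt", encoding="utf-8") as src, \
     open("data.sp.txt", "w", encoding="utf-8") as dst:
    for line in src:
        pieces = sp.encode(line.strip(), out_type=str)
        dst.write(" ".join(pieces) + "\n")

# 3. Convert polish_sp32k.vocab into vocab.fairseq.txt and binarize with
#    `fairseq-preprocess --only-source --srcdict vocab.fairseq.txt ...`
#    before launching training -- see polish_roberta_vocab.ipynb for details.
```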
- PoLitBert_v32k_linear_50k
- PoLitBert_v32k_tri_50k
- PoLitBert_v32k_cos1_2_50k
- PoLitBert_v32k_tri_125k
- PoLitBert_32k_cos1_5
- PoLitBert_v32k_cos1_5_50k
- PoLitBert_v50k_linear_50k
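A released checkpoint can be loaded through fairseq's standard RoBERTa interface, roughly as in the sketch below. The directory layout, file names and the sentencepiece-related handling are assumptions (argument names differ slightly between fairseq versions), so treat it as a starting point rather than the exact loading code.

```python
from fairseq.models.roberta import RobertaModel

# Hypothetical local directory holding the downloaded checkpoint, the fairseq
# dictionary (dict.txt) and the sentencepiece model used to build the vocab.
roberta = RobertaModel.from_pretrained(
    "PoLitBert_v32k_linear_50k",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=".",
    bpe="sentencepiece",  # the models use a sentencepiece vocabulary
)
roberta.eval()  # disable dropout for inference

# Masked-token prediction via the hub interface.
print(roberta.fill_mask("Wikipedia to wolna <mask> internetowa.", topk=3))
```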
All models were evaluated on 26.07.2020 against the 9 KLEJ benchmark tasks. The results below were obtained with the fine-tuning scripts from Polish RoBERTa without any further tweaks, which suggests that the potential of the models may not have been fully exploited yet.
Model | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR | Avg |
---|---|---|---|---|---|---|---|---|---|---|
PoLitBert_v32k_linear_50k | 92.3 | 91.5 | 92.2 | 64 | 89.8 | 76.1 | 60.2 | 97.9 | 87.6 | 83.51 |
PoLitBert_v32k_linear_50k_2ep | 91.9 | 91.8 | 90.9 | 64.6 | 89.1 | 75.9 | 59.8 | 97.9 | 87.9 | 83.31 |
PoLitBert_v32k_tri_125k | 93.6 | 91.7 | 91.8 | 62.4 | 90.3 | 75.7 | 59 | 97.4 | 87.2 | 83.23 |
PoLitBert_v32k_linear_125k_2ep | 94.3 | 92.1 | 92.8 | 64 | 90.6 | 79.1 | 51.7 | 94.1 | 88.7 | 83.04 |
PoLitBert_v32k_tri_50k | 93.9 | 91.7 | 92.1 | 57.6 | 88.8 | 77.9 | 56.6 | 96.5 | 87.7 | 82.53 |
PoLitBert_v32k_linear_125k | 94 | 91.3 | 91.8 | 61.1 | 90.4 | 78.1 | 50.8 | 95.8 | 88.2 | 82.39 |
PoLitBert_v50k_linear_50k | 92.8 | 92.3 | 91.7 | 57.7 | 90.3 | 80.6 | 42.2 | 97.4 | 88.5 | 81.50 |
PoLitBert_v32k_cos1_2_50k | 92.5 | 91.6 | 90.7 | 60.1 | 89.5 | 73.5 | 49.1 | 95.2 | 87.5 | 81.08 |
PoLitBert_v32k_cos1_5_50k | 93.2 | 90.7 | 89.5 | 51.7 | 89.5 | 74.3 | 49.1 | 97.1 | 87.5 | 80.29 |
A comparison with other published models is available in the continuously updated leaderboard of the evaluation tasks.
Link to PoLitBert research log (same as below).
Experiment | Model name | Vocab size | Scheduler | Batch size (bsz) | Words per batch (wpb) | Steps | Train tokens | Train loss | Valid loss | Best (test) loss |
---|---|---|---|---|---|---|---|---|---|---|
#1 | PoLitBert_v32k_linear_50k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.502 | 1.460 | 1.422 |
#2 | PoLitBert_v32k_tri_50k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.473 | 1.436 | 1.402 |
#3 | PoLitBert_v32k_cos1_50k (tensorboard) | 32k | cosine mul=1 | 8 192 | 4.07E+06 | 23 030 | 9.37E+10 | 10.930 | 11.000 | 1.832 |
#4 | PoLitBert_v32k_cos1_2_50k (tensorboard) | 32k | cosine mul=1 peak=0.0005 | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.684 | 1.633 | 1.595 |
#5 | PoLitBert_v32k_cos1_3_50k (tensorboard) | 32k | cosine mul=2 | 8 192 | 4.07E+06 | 3 735 | 1.52E+10 | 10.930 | | |
#6 | PoLitBert_v32k_cos1_4_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4.07E+06 | 4 954 | 2.02E+10 | 10.910 | 10.940 | 2.470 |
#8 | PoLitBert_v32k_tri_125k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.435 | 1.313 | 1.363 |
#9 | PoLitBert_v32k_cos1_5_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.502 | 1.358 | 1.426 |
#10 | PoLitBert_v32k_linear_125k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.322 | 1.218 | 1.268 |
#11 | PoLitBert_v50k_linear_50k (tensorboard) | 50k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.04E+11 | 1.546 | 1.439 | 1.480 |
- KRNNT - Polish morphological tagger (we use the dockerized version)
- langdetect - for detecting sentence language
- polyglot - for detecting sentence language
- sentencepiece
- Fairseq v0.9
- polyglot needs an additional system package - install it with sudo apt-get install libicu-dev
- sentencepiece was installed from source code
This is the joint work of the companies Ermlab Software and Literacka.
Part of the work was financed by grant no. POIR.01.01.01-00-1213/19 from the Polish National Centre for Research and Development, whose beneficiary was Literacka. Project title: "Asystent wydawniczy - oprogramowanie do analizy treści, wykorzystujące algorytmy sztucznej inteligencji w celu zautomatyzowania procesu wydawniczego i predykcji sukcesów rynkowych publikacji" ("Publishing assistant - content analysis software using artificial intelligence algorithms to automate the publishing process and predict the market success of publications").
We would like to express our gratitude to the NVidia Inception Programme and Amazon AWS for providing free GPU credits - thank you!
- simonefrancia from Musixmatch for his detailed explanations of how they trained the Italian RoBERTa model Umberto
Ermlab - a Polish machine learning company
🦉 Website | Repository