Skip to content

Polish RoBERTA model trained on Polish literature, Wikipedia, and Oscar. The major assumption is that quality text will give a good model.

License

Notifications You must be signed in to change notification settings

Ermlab/PoLitBert

Repository files navigation

PoLitBert - Polish RoBERTa model

Polish RoBERTa model trained on Polish Wikipedia, Polish literature and Oscar. Major assumption is that good quality text will give good model.

We believe in open science and knowledge sharing, thus we decided to share complete code, params, experiment details and tensorboards.

Table of Contents

Experiments setup and goals

During experiments, we want to examine:

  • impact of different learning schedulers for training speed and accuracy, tested:
    • linear schedule with warmup
    • cyclic schedule: cosine, triangular
  • impact of training time on final accuracy

Data

Data processing for training

Our main assumption is that good quality text should produce good language model. So far the most popular polish dataset was "Polish wikipedia dump" however this text characterize with formal language. Second source of text is polish part of Oscar corpus - crawled text from the polish internet. When we investigate this corpus with more details it appears that it contains a lot of: foreign sentences (in Russian, English, German etc.), too short sentences and not grammatical sentences (as words enumerations).

We prepared a few cleaning heuristics:

  • remove sentences shorter than
  • remove non polish sentences
  • remove ungrammatical sentences (without verbs and with too many nouns)
  • perform sentence tokenization and save each sentence in new line, after each document the new line was added

Data was cleaned with use of process_sentences.py script, the whole process is presented in the polish_process_data.ipynb notebook.

Summary of Cleaned Polish Oscar corpus

File All lines All sentences Invalid length sent. Non-polish sent. Ungrammatical sent. Valid sentences
corpus_oscar_2020-04-10_32M_lines.txt 32 000 506 94 332 394 1 796 371 296 093 8 100 750 84 139 180
corpus_oscar_2020-04-10_64M_lines.txt 32 000 560 96 614 563 1 777 586 491 789 7 869 507 86 475 681
corpus_oscar_2020-04-10_96M_lines.txt 32 001 738 96 457 553 1 796 083 302 598 7 908 090 86 450 782
corpus_oscar_2020-04-10_128M_lines.txt 32 002 212 97 761 040 1 919 071 305 924 7 891 846 87 644 199
corpus_oscar_2020-04-10_128M_above_lines.txt 17 519 467 53 446 884   1 090 714 212 657 4 343 296 47 800 217

Training, testing dataset stats

Train Corpus Lines Words Characters
Polish Wikipedia (2020-03) 11 748 343 181 560 313 1 309 416 493
Books 81 140 395 829 404 801 5 386 053 287
Oscar (32M part, cleared) 112 466 497 1 198 735 834 8 454 177 161
Total 205 355 235 2 209 700 948 15 149 646 941

For testing we take ~10% of each corpus

Test Corpus Lines Words Characters
Polish Wikipedia (2020-03) 1 305 207 21 333 280 155 403 453
Books 9 007 716 93 141 853 610 111 989
Oscar (32M part, cleared) 14 515 735 157 303 490 1 104 855 397
Total 24 828 658 271 778 623 1 870 370 839

Training Polish RoBERTA protocol with Fairseq

General recipe of the final data preparation and model training process:

  1. Prepare huge text file data.txt e.g. Wikipedia text, where each sentence is in a new line and each article is separated by two new lines.
  2. Take 10-15M lines and prepare another file for sentencepiece (vocabulary builder) - again, each sentence is in one line.
  3. Train sentencepiece vocabulary and save it in fairseq format vocab.fairseq.txt.
  4. Encode data.txt with trained sentencepiece model to data.sp.txt.
  5. Preprocess data.sp.txt with fairseq-preprocess.
  6. Run training.

Detailed data preparation steps for fairseq (vocab gen and binarization) are available in separate notebook polish_roberta_vocab.ipynb.

Commands needed to reproduce fairseq models with various training protocols may be found in polish_roberta_training.ipynb.

Pretrained models and vocabs

KLEJ evaluation

All models were evaluated at 26.07.2020 with 9 KLEJ benchmark tasks . Below results were achieved with use of fine-tuning scripts from Polish RoBERTa without any further tweaks. which suggests that the potential of the models may not been fully utilized yet.

Model NKJP-NER CDSC-E CDSC-R CBD PolEmo2.0-IN PolEmo2.0-OUT DYK PSC AR Avg
PoLitBert_v32k_linear_50k 92.3 91.5 92.2 64 89.8 76.1 60.2 97.9 87.6 83.51
PoLitBert_v32k_linear_50k_2ep 91.9 91.8 90.9 64.6 89.1 75.9 59.8 97.9 87.9 83.31
PoLitBert_v32k_tri_125k 93.6 91.7 91.8 62.4 90.3 75.7 59 97.4 87.2 83.23
PoLitBert_v32k_linear_125k_2ep 94.3 92.1 92.8 64 90.6 79.1 51.7 94.1 88.7 83.04
PoLitBert_v32k_tri_50k 93.9 91.7 92.1 57.6 88.8 77.9 56.6 96.5 87.7 82.53
PoLitBert_v32k_linear_125k 94 91.3 91.8 61.1 90.4 78.1 50.8 95.8 88.2 82.39
PoLitBert_v50k_linear_50k 92.8 92.3 91.7 57.7 90.3 80.6 42.2 97.4 88.5 81.50
PoLitBert_v32k_cos1_2_50k 92.5 91.6 90.7 60.1 89.5 73.5 49.1 95.2 87.5 81.08
PoLitBert_v32k_cos1_5_50k 93.2 90.7 89.5 51.7 89.5 74.3 49.1 97.1 87.5 80.29

A comparison with other developed models is available in the continuously updated leaderboard of evaluation tasks.

Details of models training

We believe in open science and knowledge sharing, thus we decided to share complete code, params, experiment details and tensorboards.

Link to PoLitBert research log (same as below).

Experiment Model name Vocab size Scheduler BSZ WPB Steps Train tokens Train loss Valid loss Best (test) loss
#1 PoLitBert_v32k_linear_50k (tensorboard) 32k linear decay 8 192 4,07E+06 50 000 2,03E+11 1,502 1,460 1,422
#2 PoLitBert_v32k_tri_50k (tensorboard) 32k triangular 8 192 4,07E+06 50 000 2,03E+11 1,473 1,436 1,402
#3 PoLitBert_v32k_cos1_50k (tensorboard) 32k cosine mul=1 8 192 4,07E+06 23 030 9,37E+10 10,930 11,000 1,832
#4 PoLitBert_v32k_cos1_2_50k (tensorboard) 32k cosine mul=1 peak=0.0005 8 192 4,07E+06 50 000 2,03E+11 1,684 1,633 1,595
#5 PoLitBert_v32k_cos1_3_50k (tensorboard) 32k cosine mul=2 8 192 4,07E+06 3 735 1,52E+10 10,930
#6 PoLitBert_v32k_cos1_4_50k (tensorboard) 32k cosine mul=2 grad-clip=0.9 8 192 4,07E+06 4 954 2,02E+10 10,910 10,940 2,470
#8 PoLitBert_v32k_tri_125k (tensorboard) 32k triangular 8 192 4,07E+06 125 000 5,09E+11 1,435 1,313 1,363
#9 PoLitBert_v32k_cos1_5_50k (tensorboard) 32k cosine, mul=2, grad-clip=0.9 8 192 4,07E+06 125 000 5,09E+11 1,502 1,358 1,426
#10 PoLitBert_v32k_linear_125k (tensorboard) 32k linear decay 8 192 4,07E+06 125 000 5,09E+11 1,322 1,218 1,268
#11 PoLitBert_v50k_linear_50k (tensorboard) 50k linear decay 8 192 4,07E+06 50 000 2,04E+11 1,546 1,439 1,480

Used libraries

Instalation dependecies and problems

  • langdetect needs additional package
    • install sudo apt-get install libicu-dev
  • sentencepiece was installed from source code

Acknowledgements

This is the joint work of companies Ermlab Software and Literacka

Part of the work was financed from the grant of The Polish National Centre for Research and Development no. POIR.01.01.01-00-1213/19, the beneficiary of which was Literacka. Project title "Asystent wydawniczy - oprogramowanie do analizy treści, wykorzystujące algorytmy sztucznej inteligencji w celu zautomatyzowania procesu wydawniczego i predykcji sukcesów rynkowych publikacji."

We would like to express ours gratitude to NVidia Inception Programme and Amazon AWS for providing the free GPU credits - thank you!

Authors:

Also appreciate the help from

About Ermlab Software

Ermlab - Polish machine learning company

🦉 Website | :octocat: Repository

.

About

Polish RoBERTA model trained on Polish literature, Wikipedia, and Oscar. The major assumption is that quality text will give a good model.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published