An evaluation study of lemmatizers on different German-language corpora. The branch `ba-lk` contains the code for the Bachelor's thesis of Lydia Körber.
- Download the datasets:

  ```bash
  bash dataset-download.sh
  ```
- To avoid Python dependency conflicts, each lemmatizer is installed in a separate virtual environment:

  ```bash
  for dir in algorithms/*; do
      bash "${dir}/install.sh"
  done
  ```
- If you wish to track the CO2 emissions during the computation, first make the RAPL energy counters readable (a tracking sketch follows the setup steps):

  ```bash
  sudo chmod -R a+r /sys/class/powercap/intel-rapl
  ```

  Then start the computations with the following command:

  ```bash
  bash run.sh
  ```
The study was conducted on a Debian GNU/Linux 10 machine with 72 CPUs and 188 GB RAM, using Python 3.7.3.
- To run the evaluation scripts in Jupyter, execute the following commands:

  ```bash
  cd nbs
  python3 -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  jupyter lab
  ```
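For the CO2 tracking mentioned above, the `emissions` folder under `nbs` holds the energy-consumption logs of the experiments. Below is a minimal sketch of how such tracking can be attached to a run, assuming the codecarbon package; the helper `run_with_tracking`, the `run_fn` callback, and the output directory are placeholders, and the actual wiring in `run.py`/`run.sh` may differ.

```python
# Illustrative only: wrap a lemmatizer run with codecarbon's EmissionsTracker,
# which reads the intel-rapl counters made readable above.
# run_with_tracking, run_fn and the output directory are placeholders.
from codecarbon import EmissionsTracker

def run_with_tracking(run_fn, output_dir="nbs/emissions"):
    tracker = EmissionsTracker(output_dir=output_dir)  # writes emissions.csv to output_dir
    tracker.start()
    try:
        return run_fn()
    finally:
        emissions_kg = tracker.stop()  # estimated emissions in kg CO2-equivalents
        print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```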
Data set (paper) | Format | Era | Genre | Language area | Guidelines | Annotation | Pre-processing |
---|---|---|---|---|---|---|---|
Empirist 2019 | tab-separated | 21st c | a) dialogue (CMC): chat (social, professional), tweets, WhatsApp chats, blog comments, Wikipedia threads; b) web articles (Web) | DE | link, based on TIGER | manual | Normalized and original tokens used as input. |
GerManC-GS | XML | 1650 - 1800 (Early Modern German) | drama, humanities, legal texts, letters, narrative prose, newspapers, scientific texts, sermons | DE, AT, CH | link | manual | Normalized and original tokens used as input. Captions and stage directions ignored. |
NoSta-D | TCF, XML | 14th - 21st c | historical (anselm), chat (unicum), spoken (bematac), learner (falko), literary prose (kafka), newspaper (tueba-dz) | DE | | semi-automatic (TreeTagger) | Normalized and original tokens used as input. |
RUB 2019, balanced | CoNLL-U | 20th - 21st c | novelette, movie subtitles, sermon, TED talks, Wikipedia | DE | TIGER with some modifications | manual | UPOS tags are not available and need to be converted from XPOS tags (STTS). |
TGermaCorp | CoNLL-U | 16th - 21st c | literature, Wikipedia | DE | | semi-automatic (TreeTagger) | |
UD GSD, v2.10 (TIGER Korpus) | CoNLL-U | 21st c | daily newspaper (Frankfurter Rundschau) | DE | link | manual | |
UD HDT, v2.10 (Hamburg Treebank) | CoNLL-U | 20th c | IT magazine (Heise) | DE | link | manual | |
UD PUD, v2.10 | CoNLL-U | 21st c | Wikipedia articles | DE | | manual | |
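Several of the corpora above ship in CoNLL-U, where each token line carries the surface form, lemma, and UPOS tag in fixed tab-separated columns. The following sketch is illustrative only; the repository's own read functions live in `src/reader.py` and may differ, and the example file name is hypothetical.

```python
# Illustrative CoNLL-U reader: yields (form, lemma, upos) per token.
# CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
def read_conllu(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):   # skip blank lines and sentence metadata
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:   # skip multi-word and empty tokens
                continue
            yield cols[1], cols[2], cols[3]        # FORM, LEMMA, UPOS

# Example (hypothetical file name):
# for form, lemma, upos in read_conllu("de_gsd-ud-test.conllu"):
#     print(form, lemma, upos)
```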
- algorithms - separate directory for each algorithm, each containing an install script (`install.sh`), an overview of the installed third-party libraries (`requirements.txt`), and a run script (`run.py`)
  - baseline - baseline algorithm, lemma = surface form
  - gpt3: no need to run `run_api_queries.py` yourself; you can also just execute `run.py` to evaluate the outputs of the OpenAI queries from 20.-22.02.2023 (outputs listed here)
    - run_api_queries.py - GPT-3 queries via the OpenAI API; an OpenAI account and API key are needed, so run `export OPEN_AI_KEY=INSERT_KEY_HERE` before executing the run script (see the query sketch after the directory overview)
  - germalemma
  - hanta
  - rnntagger
  - simplemma
  - smorlemma
  - spacy2
  - spacy3
  - spacy3.3+
  - stanza
  - trankit
  - treetagger
- logs - log file
- nbs - evaluation results and notebooks
  - gpt3_outputs - outputs of the OpenAI API queries from 20.-22.02.2023 with text-davinci-003
  - formats.json - overview of the different output formats of the GPT-3 experiments
  - emissions - energy consumption of the experiments
  - lemmata - lemmatizer outputs for each corpus for the qualitative evaluation
  - all_lemmata.csv - outputs of all lemmatizers on all corpora
  - evaluation.ipynb - quantitative evaluation of the results
  - evaluation-gpt3.ipynb - evaluation of the outputs of the GPT-3 queries
  - evaluation-grauzonen.csv - extracts of all_lemmata.csv to analyze compound words, nominalized participles and adjective comparison
  - evaluation-qualitative.ipynb - preparing the qualitative evaluation of the results
  - results-*.json - results of an algorithm, with metrics calculated overall and by PoS tag (see the metrics sketch after this overview)
- src - source code scripts
  - loader.py - load the datasets
  - metrics.py - evaluation metrics
  - reader.py - read functions for the different types of datasets
  - run.py - central run function to run an algorithm on a dataset and output the results to nbs
  - stts_to_upos.txt - convert STTS to UPOS tags, based on table
- dataset-download.sh - download datasets
- README.md
- run.sh - run all algorithms
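As referenced above, `run_api_queries.py` sends the lemmatization prompts to the OpenAI API with the `text-davinci-003` completion model. The sketch below is a minimal, hypothetical query assuming the pre-1.0 `openai` Python package; the prompt wording, example tokens, and request parameters are illustrative and may differ from the actual script.

```python
# Illustrative only: a single completion request for lemmatizing German tokens.
# Assumes the legacy (pre-1.0) openai package; prompt and parameters are placeholders.
import os
import openai

openai.api_key = os.environ["OPEN_AI_KEY"]  # set via `export OPEN_AI_KEY=...` as described above

tokens = ["Die", "Kinder", "spielten", "im", "Garten", "."]
prompt = "Lemmatize the following German words:\n" + " ".join(tokens) + "\nLemmata:"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,      # deterministic output for evaluation
    max_tokens=64,
)
print(response["choices"][0]["text"].strip())
```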
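The results-*.json files report metrics both overall and per PoS tag. A minimal sketch of how such a lemma-accuracy breakdown can be computed is shown below; it is illustrative only, the actual implementation lives in `src/metrics.py`, and the helper name `lemma_accuracy` is a placeholder.

```python
# Illustrative only: lemmatization accuracy overall and broken down by UPOS tag.
from collections import defaultdict

def lemma_accuracy(gold_lemmata, pred_lemmata, upos_tags):
    """Return (overall_accuracy, {upos: accuracy}) for aligned token lists."""
    correct = 0
    per_tag = defaultdict(lambda: [0, 0])           # upos -> [correct, total]
    for gold, pred, upos in zip(gold_lemmata, pred_lemmata, upos_tags):
        hit = gold == pred
        correct += hit
        per_tag[upos][0] += hit
        per_tag[upos][1] += 1
    overall = correct / len(gold_lemmata) if gold_lemmata else 0.0
    by_tag = {tag: c / n for tag, (c, n) in per_tag.items()}
    return overall, by_tag

# Example: the baseline (lemma = surface form) is only correct where form == lemma.
# overall, by_tag = lemma_accuracy(["Kind", "spielen"], ["Kinder", "spielten"], ["NOUN", "VERB"])
```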