An evaluation study of lemmatizers on different German-language corpora. The branch `ba-lk` contains the code for the Bachelor's thesis of Lydia Körber.
- Download the datasets:

  ```bash
  bash dataset-download.sh
  ```
- To avoid Python dependency conflicts, each lemmatizer is installed in a separate virtual environment:

  ```bash
  for dir in algorithms/*; do
      bash "${dir}/install.sh"
  done
  ```
- If you wish to track the CO2 emissions during the computation, first make the RAPL energy counters readable (a tracking sketch follows the setup steps):

  ```bash
  sudo chmod -R a+r /sys/class/powercap/intel-rapl
  ```

  Then start the computations with the following command:

  ```bash
  bash run.sh
  ```
The study was conducted on a Debian GNU/Linux 10 machine with 72 CPUs and 188 GB RAM, using Python 3.7.3.
- To run the evaluation scripts in Jupyter, execute the following commands:

  ```bash
  cd nbs
  python3 -m venv .venv
  source .venv/bin/activate
  pip install --upgrade pip
  pip install -r requirements.txt
  jupyter lab
  ```
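For the CO2 tracking mentioned above, the `emissions` folder under `nbs` holds the energy-consumption logs of the experiments. Below is a minimal sketch of how such tracking can be attached to a run, assuming the codecarbon package; the helper `run_with_tracking`, the `run_fn` callback, and the output directory are placeholders, and the actual wiring in `run.py`/`run.sh` may differ.

```python
# Illustrative only: wrap a lemmatizer run with codecarbon's EmissionsTracker,
# which reads the intel-rapl counters made readable above.
# run_with_tracking, run_fn and the output directory are placeholders.
from codecarbon import EmissionsTracker

def run_with_tracking(run_fn, output_dir="nbs/emissions"):
    tracker = EmissionsTracker(output_dir=output_dir)  # writes emissions.csv to output_dir
    tracker.start()
    try:
        return run_fn()
    finally:
        emissions_kg = tracker.stop()  # estimated emissions in kg CO2-equivalents
        print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```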
Data set (paper) | Format | Era | Genre | Language area | Guidelines | Annotation | Pre-processing |
---|---|---|---|---|---|---|---|
Empirist 2019 | tab-separated | 21st c | a) dialogue (CMC): chat (social, professional), tweets, WhatsApp chats, blog comments, Wikipedia threads; b) web articles (Web) | DE | link, based on TIGER | manual | Normalized and original tokens used as input. |
GerManC-GS | XML | 1650 - 1800 (Early Modern German) | drama, humanities, legal texts, letters, narrative prose, newspapers, scientific texts, sermons | DE, AT, CH | link | manual | Normalized and original tokens used as input. Captions and stage directions ignored. |
NoSta-D | TCF, XML | 14th - 21st c | historical (anselm), chat (unicum), spoken (bematac), learner (falko), literary prose (kafka), newspaper (tueba-dz) | DE | | semi-automatic (TreeTagger) | Normalized and original tokens used as input. |
RUB 2019, balanced | CoNLL-U | 20th - 21st c | novelette, movie subtitles, sermon, TED talks, Wikipedia | DE | TIGER with some modifications | manual | UPOS tags are not available and need to be converted from XPOS tags (STTS). |
TGermaCorp | CoNLL-U | 16th - 21st c | literature, Wikipedia | DE | | semi-automatic (TreeTagger) | |
UD GSD, v2.10 (TIGER Korpus) | CoNLL-U | 21st c | daily newspaper (Frankfurter Rundschau) | DE | link | manual | |
UD HDT, v2.10 (Hamburg Treebank) | CoNLL-U | 20th c | IT magazine (Heise) | DE | link | manual | |
UD PUD, v2.10 | CoNLL-U | 21st c | Wikipedia articles | DE | | manual | |
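Several of the corpora above ship in CoNLL-U, where each token line carries the surface form, lemma, and UPOS tag in fixed tab-separated columns. The following sketch is illustrative only; the repository's own read functions live in `src/reader.py` and may differ, and the example file name is hypothetical.

```python
# Illustrative CoNLL-U reader: yields (form, lemma, upos) per token.
# CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
def read_conllu(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):   # skip blank lines and sentence metadata
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:   # skip multi-word and empty tokens
                continue
            yield cols[1], cols[2], cols[3]        # FORM, LEMMA, UPOS

# Example (hypothetical file name):
# for form, lemma, upos in read_conllu("de_gsd-ud-test.conllu"):
#     print(form, lemma, upos)
```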
- algorithms - separate directory for each algorithm, each containing an install script (`install.sh`), an overview of the installed third-party libraries (`requirements.txt`), and a run script (`run.py`)
  - baseline - baseline algorithm, lemma = surface form
  - gpt3: no need to run `run_api_queries.py` yourself; you can also just execute `run.py` to evaluate the outputs of the OpenAI queries from 20.-22.02.2023 (outputs listed here)
    - run_api_queries.py - GPT-3 queries via the OpenAI API; an OpenAI account and API key are needed, so run `export OPEN_AI_KEY=INSERT_KEY_HERE` before executing the run script (see the query sketch after the directory overview)
  - germalemma
  - hanta
  - rnntagger
  - simplemma
  - smorlemma
  - spacy2
  - spacy3
  - spacy3.3+
  - stanza
  - trankit
  - treetagger
- logs - log file
- nbs - evaluation results and notebooks
  - gpt3_outputs - outputs of the OpenAI API queries from 20.-22.02.2023 with text-davinci-003
  - formats.json - overview of the different output formats of the GPT-3 experiments
  - emissions - energy consumption of the experiments
  - lemmata - lemmatizer outputs for each corpus for the qualitative evaluation
  - all_lemmata.csv - outputs of all lemmatizers on all corpora
  - evaluation.ipynb - quantitative evaluation of the results
  - evaluation-gpt3.ipynb - evaluation of the outputs of the GPT-3 queries
  - evaluation-grauzonen.csv - extracts of all_lemmata.csv to analyze compound words, nominalized participles and adjective comparison
  - evaluation-qualitative.ipynb - preparing the qualitative evaluation of the results
  - results-*.json - results of an algorithm, with metrics calculated overall and by PoS tag (see the metrics sketch after this overview)
- src - source code scripts
  - loader.py - load the datasets
  - metrics.py - evaluation metrics
  - reader.py - read functions for the different types of datasets
  - run.py - central run function to run an algorithm on a dataset and output the results to nbs
  - stts_to_upos.txt - convert STTS to UPOS tags, based on table
- dataset-download.sh - download datasets
- README.md
- run.sh - run all algorithms
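As referenced above, `run_api_queries.py` sends the lemmatization prompts to the OpenAI API with the `text-davinci-003` completion model. The sketch below is a minimal, hypothetical query assuming the pre-1.0 `openai` Python package; the prompt wording, example tokens, and request parameters are illustrative and may differ from the actual script.

```python
# Illustrative only: a single completion request for lemmatizing German tokens.
# Assumes the legacy (pre-1.0) openai package; prompt and parameters are placeholders.
import os
import openai

openai.api_key = os.environ["OPEN_AI_KEY"]  # set via `export OPEN_AI_KEY=...` as described above

tokens = ["Die", "Kinder", "spielten", "im", "Garten", "."]
prompt = "Lemmatize the following German words:\n" + " ".join(tokens) + "\nLemmata:"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0,      # deterministic output for evaluation
    max_tokens=64,
)
print(response["choices"][0]["text"].strip())
```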
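The results-*.json files report metrics both overall and per PoS tag. A minimal sketch of how such a lemma-accuracy breakdown can be computed is shown below; it is illustrative only, the actual implementation lives in `src/metrics.py`, and the helper name `lemma_accuracy` is a placeholder.

```python
# Illustrative only: lemmatization accuracy overall and broken down by UPOS tag.
from collections import defaultdict

def lemma_accuracy(gold_lemmata, pred_lemmata, upos_tags):
    """Return (overall_accuracy, {upos: accuracy}) for aligned token lists."""
    correct = 0
    per_tag = defaultdict(lambda: [0, 0])           # upos -> [correct, total]
    for gold, pred, upos in zip(gold_lemmata, pred_lemmata, upos_tags):
        hit = gold == pred
        correct += hit
        per_tag[upos][0] += hit
        per_tag[upos][1] += 1
    overall = correct / len(gold_lemmata) if gold_lemmata else 0.0
    by_tag = {tag: c / n for tag, (c, n) in per_tag.items()}
    return overall, by_tag

# Example: the baseline (lemma = surface form) is only correct where form == lemma.
# overall, by_tag = lemma_accuracy(["Kind", "spielen"], ["Kinder", "spielten"], ["NOUN", "VERB"])
```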