This repository contains a prototype morphological analyzer for the Bezhta language (< Tsezic < Avar-Andic-Tsezic < Nakh-Daghestanian; Glottolog: bezh1248). It is part of a larger project by students of the School of Linguistics and the Linguistic Convergence Laboratory at NRU HSE that aims to provide digital tools for endangered languages.
The project is distributed under the GNU General Public License v3.0.
The parser follows the descriptions of Bezhta Proper in Comrie et al. (2015) and Madieva (1965), with the lexicon taken from the dictionary by Khalilov (2015). The digitized version of the dictionary is available at bezhta_dict.
For evaluation, I use the Bezhta translations of the Gospel of Luke and the Book of Proverbs, a text from Madieva's grammar (1964), and two annotated texts. The texts are available in the corpora directory.
The project requires lexd and hfst. You can install them with the following commands:
curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
sudo apt install lexd
sudo apt install hfst
make
Analyze a word:
echo 'соралила' | hfst-lookup bezhta.analyzer.hfst
The transliterator converts Bezhta words from Cyrillic to Latin script.
make cy2lat.transliterator.disam.hfst
Transliterate a word:
echo 'соралила' | hfst-lookup cy2lat.transliterator.disam.hfst
Build the transliterated analyzer:
make bezhta.tr.analyzer.hfst
Look up a word in Latin script:
echo 'soralila' | hfst-lookup bezhta.tr.analyzer.hfst
The segmenter marks the morpheme boundaries in the input word.
make bezhta.segm.hfst
Segment a word:
echo 'нисойо' | hfst-lookup bezhta.segm.hfst
Result:
нисойо нисо>йо
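The segmenter output can also be post-processed in a script. The following is a minimal Python sketch, assuming that hfst-lookup prints tab-separated lines of the form input, output, weight and that > marks morpheme boundaries as in the result above. The function name split_segmentation and the weight in the sample line are illustrative and not part of this repository.

```python
def split_segmentation(line, boundary=">"):
    """Parse one hfst-lookup output line into (word, list of morphemes)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError(f"unexpected lookup line: {line!r}")
    word, segmented = fields[0], fields[1]
    # Split the segmented form on the boundary symbol.
    return word, segmented.split(boundary)


# Sample line modelled on the result above; the weight is an assumed value.
sample = "нисойо\tнисо>йо\t0.000000"
word, morphemes = split_segmentation(sample)
print(word, morphemes)
```

A word with several analyses would appear on several lines, so in practice the function would be applied to each non-empty line of the output.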
Analyzer:
make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol coverage
cd coverage
make check-coverage
Additionally, make check-unrecog
can be used to get a list of unrecognized tokens. Note that the names of all text files must start with text-
Current performance: ~75% naive coverage
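Naive coverage is the share of tokens that receive at least one analysis, whether or not that analysis is correct. The following is a minimal Python sketch of that computation, not the code behind make check-coverage. It assumes hfst-lookup output in which the results for each token are separated by a blank line and an unrecognized token is reported with an output ending in +?. The helper name naive_coverage and the placeholder tokens and tags in the sample are hypothetical.

```python
def naive_coverage(lookup_output):
    """Fraction of tokens with at least one analysis in hfst-lookup output."""
    total = recognized = 0
    # Results for each input token are assumed to be separated by a blank line.
    for block in lookup_output.strip().split("\n\n"):
        lines = [ln for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        total += 1
        # The second tab-separated field holds the analysis.
        analyses = [ln.split("\t")[1] for ln in lines if "\t" in ln]
        # A token counts as recognized if any analysis is not the "+?" fallback.
        if any(not a.endswith("+?") for a in analyses):
            recognized += 1
    return recognized / total if total else 0.0


# Placeholder tokens and tags, not real Bezhta analyses.
sample = (
    "tokA\ttokA<tag>\t0.000000\n"
    "\n"
    "tokB\ttokB+?\tinf\n"
    "\n"
    "tokC\ttokC<tag1>\t0.000000\n"
    "tokC\ttokC<tag2>\t0.000000\n"
    "\n"
    "tokD\ttokD<tag>\t0.000000\n"
)
print(f"{naive_coverage(sample):.0%}")
```

A token with several analyses is counted once, which is why the computation works per block rather than per line.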
Transliterator:
make bezhta.tr.analyzer.hfst
mv bezhta.tr.analyzer.hfst transliterator
make check-coverage
Note: some symbols may be recognized incorrectly, so I recommend using transliterator_coverage.ipynb
instead.
make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol accuracy
cd accuracy
To analyze the texts with the parser, use:
hfst-proc bezhta.analyzer.hfstol text-annotated-1.txt > FILENAME-1.txt
hfst-proc bezhta.analyzer.hfstol text-annotated-2.txt > FILENAME-2.txt
Then compute accuracy:
python3 accuracy.py FILENAME-1.txt text-1-gold.txt
python3 accuracy.py FILENAME-2.txt text-2-gold.txt
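The exact metric is defined in accuracy.py. Purely as an illustration, token-level accuracy can be sketched as follows, assuming the parser output and the gold file have each been reduced to one analysis per token and aligned token by token. The function name token_accuracy and the placeholder analyses are hypothetical and do not reflect the script's actual input format.

```python
def token_accuracy(predicted, gold):
    """Share of positions where the predicted analysis equals the gold one."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Placeholder analyses, not real Bezhta data.
predicted = ["a<x>", "b<y>", "c<z>", "d<x>"]
gold = ["a<x>", "b<y>", "c<w>", "d<x>"]
print(token_accuracy(predicted, gold))
```

When the analyzer returns several analyses for a token, a choice has to be made before comparison, for example taking the first analysis or counting a hit if any analysis matches the gold one; the two choices give different scores.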
cd guesser
make bezhta.guesser.hfst
Guessing a token:
echo 'войъис' | hfst-lookup bezhta.guesser.hfst
For evaluation, see guesser_evaluation.ipynb.