ELTE Poetry Corpus is a continuously expanding database developed by the Department of Digital Humanities at Eötvös Loránd University. It currently contains the complete poems of 52 canonical Hungarian poets, annotated for the sound devices of the poems and the grammatical features of words, in both TEI and non-TEI XML formats.
- number of poets: 52
- number of poems: 13 436
- number of words: 2 740 826
- number of tokens: 3 473 102
For more information on the sizes of the subcorpora and the authors' years of birth and death, see the files subcorpus_sizes.tsv and poets_birth_and_death.tsv.
The source of the corpus was the collection of the Hungarian Electronic Library, which contains numerous poetic oeuvres in digitized form.
- The texts from the Hungarian Electronic Library were converted into TEI XML, the format defined by the Text Encoding Initiative.
- The automatically converted poems containing the annotations of structural units were checked manually (level1).
- Then, we tokenized the poems and annotated the grammatical features of words using e-magyar, an NLP toolchain for Hungarian texts. The level2 folder contains the TEI XML files in which the morphosyntactic features (the values of the msd attributes) follow the Universal Dependencies format, while the level2_emMorph folder contains the same files with the morphosyntactic features in emMorph, e-magyar's own format.
- After the grammatical annotation, we also annotated the rhyme patterns, the rhyme pairs, the rhythm of lines, the alliterations and the phonological features of words (level3).
- Finally, we added further annotations of poetic features to the corpus and changed the name and the position of some elements and attributes, using a non-TEI XML format defined for the project (level4).
The poem_texts folder contains the poems in TXT format, without the XML annotations. This version of the corpus was generated from the level1 files. The TXT files keep the editorial notes on date and place, which are encoded as elements in the TEI versions.
- `<head>`: title
- `<lg>`: stanza
- `<l>`: line
- `<p>`: subtitle, epigraph, separator, editorial note

- `<w>`: word
- `<pc>`: punctuation mark
- `@lemma`: lemma
- `@pos`: part of speech
- `@msd`: morphosyntactic features (Universal Dependencies)
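As a minimal sketch of how the word-level annotation could be read, the following Python snippet collects the form, `@lemma`, `@pos` and `@msd` of every `<w>` element. The sample fragment, its attribute values, the TEI namespace URI and the helper names (`local_name`, `parse_msd`, `iter_words`) are assumptions made for illustration; the real level2 files may be nested differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical level2-style fragment; values and nesting are illustrative only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <head><w lemma="dal" pos="NOUN" msd="Case=Nom|Number=Sing">Dal</w></head>
    <lg>
      <l>
        <w lemma="szép" pos="ADJ" msd="Degree=Pos">Szép</w>
        <w lemma="nap" pos="NOUN" msd="Case=Nom|Number=Sing">nap</w>
        <pc>.</pc>
      </l>
    </lg>
  </div></body></text>
</TEI>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so the code works with or without a namespace."""
    return tag.rsplit("}", 1)[-1]


def parse_msd(msd: str) -> dict:
    """Split a feature string such as 'Case=Nom|Number=Sing' into a dict."""
    if not msd or msd == "_":
        return {}
    return dict(pair.split("=", 1) for pair in msd.split("|") if "=" in pair)


def iter_words(root):
    """Yield (form, lemma, pos, features) for each <w> element in document order."""
    for el in root.iter():
        if local_name(el.tag) == "w":
            yield (el.text or "", el.get("lemma"), el.get("pos"),
                   parse_msd(el.get("msd", "")))


root = ET.fromstring(SAMPLE)
words = list(iter_words(root))
for form, lemma, pos, feats in words:
    print(form, lemma, pos, feats)
```

Matching on local element names rather than on a fixed namespace keeps the sketch usable whether or not the files declare the TEI namespace.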
- `@met`: meter
  - `Qual`: qualitative meter based on stressed and unstressed syllables
  - `Quan`: quantitative meter based on long and short syllables (possible values: iambic, trochaic, dactylic, anapestic)
  - `QuanScore`: score of the quantitative meter (below 0.5, the poem does not really have any intended quantitative meter)
- `@rhyme`: rhyme pattern
- `@real`: rhythm (0: short syllable; 1: long syllable)
- `<spanGrp type="phonStructures">`: standoff annotation of the phonological features of words
- `<span>`: standoff annotation of the phonological features of a word
  - content of `<span>`: phonological representation of the word
    - `c`: consonant
    - `b`: short back vowel
    - `B`: long back vowel
    - `f`: short front vowel
    - `F`: long front vowel
  - `@subtype`: syllable number
  - `@type`: type of vowels
    - `low`: only back vowels in the word
    - `high`: only front vowels in the word
    - `mixed`: front and back vowels in the word
  - `@target`: the `xml:id` of the annotated word
- `<linkGrp type="rhymePairs">`: standoff annotation of rhyme pairs
- `<link>`: standoff annotation of a rhyme pair
  - `@target`: `xml:id` of the two words in a rhyme pair
- `<spanGrp type="alliterations">`: standoff annotation of alliterations
- `<span>`: standoff annotation of an alliteration
  - `@target`: `xml:id` of the words in the alliteration
  - `@type`: structure of the alliteration
    - `a`: alliterating word
    - `n`: non-alliterating word (only one non-alliterating word can be between two alliterating words)
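Because the level3 annotations are standoff, each `@target` has to be resolved against the `xml:id` of the annotated words. The sketch below shows one way to do this for rhyme pairs, and also derives the vowel type (`low`, `high`, `mixed`) from a phonological representation written with the symbols `c`, `b`, `B`, `f`, `F`. It assumes that `@target` holds whitespace-separated references with an optional leading `#`, and that the `xml:id` sits on the `<w>` elements; the sample fragment and the helper names (`vowel_type`, `target_ids`, `rhyme_pairs`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Key under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical level3-style fragment; the target syntax is an assumption.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <lg>
    <l><w xml:id="w1">kék</w> <w xml:id="w2">ég</w></l>
    <l><w xml:id="w3">szép</w> <w xml:id="w4">vég</w></l>
  </lg>
  <linkGrp type="rhymePairs">
    <link target="#w2 #w4"/>
  </linkGrp>
</div>"""


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def vowel_type(phon_struct: str):
    """Classify a c/b/B/f/F string as 'low', 'high' or 'mixed' (None if no vowel)."""
    has_back = any(ch in "bB" for ch in phon_struct)
    has_front = any(ch in "fF" for ch in phon_struct)
    if has_back and has_front:
        return "mixed"
    if has_back:
        return "low"
    if has_front:
        return "high"
    return None


def target_ids(target: str) -> list:
    """Split a target value into ids, removing an optional leading '#'."""
    return [ref.lstrip("#") for ref in target.split()]


def rhyme_pairs(root) -> list:
    """Return the word forms of every <link> inside <linkGrp type='rhymePairs'>."""
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    pairs = []
    for grp in root.iter():
        if local_name(grp.tag) == "linkGrp" and grp.get("type") == "rhymePairs":
            for link in grp:
                if local_name(link.tag) != "link":
                    continue
                ids = target_ids(link.get("target", ""))
                pairs.append(tuple(by_id[i].text for i in ids if i in by_id))
    return pairs


root = ET.fromstring(SAMPLE)
pairs = rhyme_pairs(root)
print(pairs)
print(vowel_type("cFc"))
```

Building the `xml:id` index once per file keeps the lookup cheap when a poem has many rhyme pairs or alliterations; the same index can serve the `<span>` elements in the other two standoff groups.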
Level4 changes the name and the position of certain level3 elements and attributes and adds further annotations to the corpus. The result is easier to process, but it cannot be expressed in valid TEI XML format.
- `@met_qual`: qualitative meter based on stressed and unstressed syllables (conversion of level3's `@met` in `<div>`)
- `@met_quan`: quantitative meter based on long and short syllables; possible values: iambic, trochaic, dactylic, anapestic (conversion of level3's `@met` in `<div>`)
- `@met_quanScore`: score of the quantitative meter; below 0.5, the poem does not really have any intended quantitative meter (conversion of level3's `@met` in `<div>`)
- `@div_numStanza`: number of stanzas in the poem
- `@div_numLine`: number of lines in the poem
- `@div_numWord`: number of words in the poem
- `@div_numSyll`: number of syllables in the poem
- `@div_numShortSyll`: number of short syllables in the poem
- `@div_numLongSyll`: number of long syllables in the poem
- `@div_rhyme`: the rhyme pattern of the poem
- `@div_syllPattern`: syllable numbers of lines in the poem
- `@lg_numLine`: number of lines in the stanza
- `@lg_numWord`: number of words in the stanza
- `@lg_numSyll`: number of syllables in the stanza
- `@lg_numShortSyll`: number of short syllables in the stanza
- `@lg_numLongSyll`: number of long syllables in the stanza
- `@lg_syllPattern`: syllable numbers of lines in the stanza
- `@l_numWord`: number of words in the line
- `@l_numSyll`: number of syllables in the line
- `@l_numShortSyll`: number of short syllables in the line
- `@l_numLongSyll`: number of long syllables in the line
- `@w_numSyll`: syllable number of the word (conversion of level3's `@subtype` in `<span>`)
- `@phonType`: type of vowels in the word (conversion of level3's `@type` in `<span>`)
- `@phonStruct`: phonological representation of the word (conversion of level3's `<span>` content)
- `<rhymePairs>`: standoff annotation of rhyme pairs (conversion of level3's `<linkGrp type="rhymePairs">`)
- `<rhymePair>`: standoff annotation of a rhyme pair (conversion of level3's `<link>`)
- `<firstRhyme>`, `<secondRhyme>`: standoff annotation of the first and second word of a rhyme pair
  - content of `<firstRhyme>` and `<secondRhyme>`: the rhyming word form
  - `@rhyme_lemma`: the lemma of the rhyming word
  - `@rhyme_pos`: the part of speech of the rhyming word
  - `@rhyme_msd`: the morphosyntactic features of the rhyming word (Universal Dependencies)
  - `@rhyme_numSyll`: the syllable number of the rhyming word
  - `@rhyme_phonType`: the type of vowels in the rhyming word
  - `@rhyme_phonStruct`: the phonological representation of the rhyming word
- `<alliterations>`: standoff annotation of alliterations (conversion of level3's `<spanGrp type="alliterations">`)
- `<alliteration>`: standoff annotation of an alliteration (conversion of level3's `<span>`)
  - content of `<alliteration>`: the alliterating word forms
  - `@allStruct`: structure of the alliteration (conversion of level3's `@type` in `<span>`)
  - `@posTags`: the parts of speech of the alliterating words
  - `@msdTags`: the morphosyntactic features of the alliterating words
  - `@lemmas`: the lemmas of the alliterating words
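The 0.5 threshold on the quantitative meter score can be applied directly to the level4 attributes. The following sketch keeps only the poems whose `@met_quanScore` reaches the threshold and reports their `@met_quan` value. Which element carries these attributes is not assumed here: the code inspects every element that has a `met_quanScore` attribute. The sample fragment, the wrapper element `poems` and the function name `metered_poems` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical level4-style fragment; only the attribute names come from the corpus.
SAMPLE = """<poems>
  <div met_quan="iambic" met_quanScore="0.82"/>
  <div met_quan="trochaic" met_quanScore="0.31"/>
</poems>"""

# Below this score a poem is treated as having no intended quantitative meter.
THRESHOLD = 0.5


def metered_poems(root, threshold: float = THRESHOLD) -> list:
    """Return (meter, score) for elements whose met_quanScore is >= threshold."""
    result = []
    for el in root.iter():
        raw = el.get("met_quanScore")
        if raw is None:
            continue
        try:
            score = float(raw)
        except ValueError:
            # Skip values that are not plain decimal numbers.
            continue
        if score >= threshold:
            result.append((el.get("met_quan"), score))
    return result


root = ET.fromstring(SAMPLE)
selected = metered_poems(root)
print(selected)
```

If the attribute turns out to be repeated on several elements of the same poem, the results would need to be deduplicated per poem before counting.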
- Gábor Palkó (design, data model)
- Péter Horváth (design, annotation scripts)
- Balázs Indig (linguistic analysis)
- Zsófia Fellegi (TEI XML specification)
- Eszter Szlávich (data checking)
- Bajzát Tímea Borbála (data checking)
- Zsófia Sárközi-Lindner (data checking)
- Bence Vida (data checking)
- Aslihan Karabulut (data checking)
- Mária Timári (data checking)
If you use ELTE Poetry Corpus, please cite one of the following articles:
Horváth Péter – Kundráth Péter – Indig Balázs – Fellegi Zsófia – Szlávich Eszter – Bajzát Tímea Borbála – Sárközi-Lindner Zsófia – Vida Bence – Karabulut Aslihan – Timári Mária – Palkó Gábor 2022. ELTE Verskorpusz – a magyar kanonikus költészet gépileg annotált adatbázisa. In: Berend Gábor – Gosztolya Gábor – Vincze Veronika (szerk.): XVIII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: Szegedi Tudományegyetem TTIK, Informatikai Intézet. 375–388.
Horváth, Péter – Kundráth, Péter – Indig, Balázs – Fellegi, Zsófia – Szlávich, Eszter – Bajzát, Tímea Borbála – Sárközi-Lindner, Zsófia – Vida, Bence – Karabulut, Aslihan – Timári, Mária – Palkó, Gábor 2022. ELTE Poetry Corpus: A Machine Annotated Database of Canonical Hungarian Poetry. In: Calzolari, Nicoletta – Béchet, Frédéric – Blache, Philippe – Choukri, Khalid – Cieri, Christopher – Declerck, Thierry – Goggi, Sara – Isahara, Hitoshi – Maegaard, Bente – Mariani, Joseph – Mazo, Hélène – Odijk, Jan – Piperidis, Stelios (eds.): Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022). Paris: European Language Resources Association (ELRA). 3471–3478.
The content of the repository is licensed under the CC BY-NC-ND license.
All texts of the corpus are in the public domain.