Terms and model training
"Terms" are primarily used to represent proper nouns found in the Bible, for example names and places. It is a concept from Paratext.
At a high level, term data from Paratext projects is used to help models understand how to translate proper nouns from one language to another.
The extract_corpora.py script reads a Paratext project and creates extract files. It can also construct up to four term-related files, which are placed under MT/terms in the standard NLP data directory.
Below is example output from running on an ABP Paratext project:
MT/terms/
├── abp-ABP-Project-renderings.txt
├── ABP-metadata.txt
├── ABP-vrefs.txt
└── en-ABP-glosses.txt
The four files have the same number of lines, and line x in each file relates to the same term. For example, line 6 in the example above corresponds to the term Μεσσίας-1:
// ABP-metadata.txt (line 6)
Μεσσίας-1 ? ? (line 6)
// en-ABP-glosses.txt (line 6)
Christ
// ABP-vrefs.txt (line 6)
JHN 1:41 JHN 4:25
// abp-ABP-Project-renderings.txt (line 6)
Mesias
The combination of the above tells us:
- the unique id for the term (Μεσσίας-1)
- the English gloss (Christ)
- where this particular term appears in the Bible
- the translation of the term in the target language ABP (Mesias)
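Because the files are line-aligned, one record per term can be assembled by reading them in lockstep. The following is a minimal sketch, not code from extract_corpora.py: the helper name load_terms, its parameters, and the dictionary keys are hypothetical, and the file-name pattern is taken from the ABP example above.

```python
from pathlib import Path


def load_terms(terms_dir, project="ABP", iso="abp", gloss_iso="en", list_type="Project"):
    """Zip the four line-aligned term files into one dict per term.

    Hypothetical helper: file names follow the ABP example layout.
    """
    terms_dir = Path(terms_dir)
    names = {
        "metadata": f"{project}-metadata.txt",
        "gloss": f"{gloss_iso}-{project}-glosses.txt",
        "vrefs": f"{project}-vrefs.txt",
        "rendering": f"{iso}-{project}-{list_type}-renderings.txt",
    }
    columns = {}
    for key, name in names.items():
        lines = (terms_dir / name).read_text(encoding="utf-8").split("\n")
        # A final newline yields one empty trailing element; drop it,
        # but keep empty lines in the middle so alignment is preserved.
        if lines and lines[-1] == "":
            lines.pop()
        columns[key] = lines
    lengths = {len(lines) for lines in columns.values()}
    if len(lengths) != 1:
        raise ValueError(f"term files are not line-aligned: {lengths}")
    keys = list(columns)
    return [dict(zip(keys, row)) for row in zip(*(columns[k] for k in keys))]
```

With the ABP files in place, load_terms("MT/terms")[5] would hold the four values shown above for line 6 (index 5, since Python lists are zero-based).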
The preprocess step appends terms to the end of the training data if the terms files exist for both the source and target languages.
Using the ABP example, here are snippets from the detokenized training files where it transitions from the end of Revelation to the start of the terms. In this context we are translating from English (NIV11R) to ABP:
// train.src.detok.txt (English)
He who testifies to these things says, “Yes, I am coming soon.” Amen. Come, Lord Jesus.
The grace of the Lord Jesus be with Godʼs people. Amen.
Mizraim
Assyria
Gad
Egyptian
...
// train.trg.detok.txt (ABP)
Hiyay Apo Jesus ye ampamapteg ha kaganaan a naihulat kananyatin libdo. Hinabi na po, <<Peteg a madanon akoynan lumateng!>> Hinabi ko met, <<Palyadiyen mo dayi Apo! Apo, makew kayna dayi ihti!>>
Mikakaanti ya dayi kanyon kaganaan ye kangedan nan Apo Jesus.
Egipsio
Asiria
Gad
taga Egipto
...
Pairing up the terms line by line, the model is trained on these correspondences:
English ABP
--------------------
Mizraim -> Egipsio
Assyria -> Asiria
Gad -> Gad
Egyptian -> taga Egipto
In this example there are about 175 terms.
Note that terms aren't appended to the validation or test data.
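The behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual preprocess code: append_terms is a hypothetical name, the term lists are assumed to be line-aligned already, and skipping pairs with an empty side is a choice made for this sketch.

```python
def append_terms(train_src, train_trg, src_terms, trg_terms):
    """Return copies of the training lists with term pairs appended.

    Terms are added only when both sides are available, mirroring the
    rule that term files must exist for source and target.
    """
    out_src, out_trg = list(train_src), list(train_trg)
    if not src_terms or not trg_terms:
        return out_src, out_trg  # one side missing: leave the data unchanged
    if len(src_terms) != len(trg_terms):
        raise ValueError("term lists must be line-aligned")
    for src, trg in zip(src_terms, trg_terms):
        src, trg = src.strip(), trg.strip()
        if src and trg:  # assumption: skip pairs with an empty side
            out_src.append(src)
            out_trg.append(trg)
    return out_src, out_trg


# Only the training split is extended; validation and test are left alone.
train_src, train_trg = append_terms(
    ["Come, Lord Jesus."], ["Apo, makew kayna dayi ihti!"],
    ["Mizraim", "Assyria"], ["Egipsio", "Asiria"],
)
```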
In this example, the four term files were generated from the ABP Paratext project by running extract_corpora. No terms were generated from the NIV11R Paratext project because it is missing the term files described below.
If the Paratext project has the file TermRenderings.xml, then these files will be generated:
{project name}-metadata.txt
{iso}-{project name}-{list type}-renderings.txt
In the ABP example, TermRenderings.xml has this block for "Μεσσίας-1":
<TermRendering Id="Μεσσίας-1" Guess="false">
<Renderings>Mesias (Messiah)</Renderings>
<Glossary>Mesias</Glossary>
<Changes>
<Change UserName="Roger Stone" TermId="Μεσσίας-1" Date="2016-11-30T05:59:17.3479807+08:00">
<Before>Mesias</Before>
<After>Mesias, 'Messiah'</After>
</Change>
<Change UserName="Roger Stone" Date="2019-04-24T14:06:49.5618139-05:00">
<Before>'Messiah'||Mesias</Before>
<After>Mesias</After>
</Change>
</Changes>
<Notes>Usage:
Other terms considered:</Notes>
<Denials />
</TermRendering>
This term entry from the example TermRenderings.xml is reformatted and written to the ABP-metadata.txt and abp-ABP-Project-renderings.txt files, rendering the term "Μεσσίας-1" as "Mesias" for use in the training text.
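As a rough illustration of how such an entry could be read, the sketch below parses a trimmed copy of the block with Python's standard xml.etree.ElementTree. Several things here are assumptions rather than facts about extract_corpora.py: the wrapper root element in SAMPLE, the helper names, and the cleaning rule (dropping a parenthetical comment such as "(Messiah)"), which is only inferred from the "Mesias (Messiah)" to "Mesias" example.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the example entry; the root element is an assumed wrapper.
SAMPLE = """<TermRenderingsList>
  <TermRendering Id="Μεσσίας-1" Guess="false">
    <Renderings>Mesias (Messiah)</Renderings>
    <Glossary>Mesias</Glossary>
  </TermRendering>
</TermRenderingsList>"""


def clean_rendering(text):
    """Drop parenthetical comments and outer whitespace (assumed rule)."""
    return re.sub(r"\s*\([^)]*\)", "", text or "").strip()


def read_renderings(xml_text):
    """Map each term id to its cleaned rendering."""
    root = ET.fromstring(xml_text)
    renderings = {}
    # iter() finds TermRendering elements regardless of the root's name.
    for node in root.iter("TermRendering"):
        renderings[node.get("Id")] = clean_rendering(node.findtext("Renderings"))
    return renderings


print(read_renderings(SAMPLE))  # prints {'Μεσσίας-1': 'Mesias'}
```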
If the Paratext project has the file ProjectBiblicalTerms.xml, then the gloss and verse references for each term can be determined, generating the following files:
{project name}-vrefs.txt
{iso}-{project name}-glosses.txt
For the ABP example, the term Μεσσίας-1 has this entry:
<Term Id="Μεσσίας-1">
<Transliteration>Μεσσίας-1</Transliteration>
<Language>Greek</Language>
<Gloss>[Christ] 2. As a title for Jesus as the Messiah.</Gloss>
<References>
<Verse>04300104100</Verse>
<Verse>04300402500</Verse>
</References>
</Term>
This term entry from the example ProjectBiblicalTerms.xml is reformatted: the <References> section is decoded and written to ABP-vrefs.txt, and "Christ" is written to en-ABP-glosses.txt.
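The sketch below shows one way the gloss and the raw verse codes could be pulled out of such an entry. It assumes that the short gloss is the text inside the first pair of square brackets, which matches the "[Christ] 2. ..." example but is not confirmed as the rule used by extract_corpora.py; short_gloss is a hypothetical helper.

```python
import re
import xml.etree.ElementTree as ET

TERM_XML = """<Term Id="Μεσσίας-1">
  <Transliteration>Μεσσίας-1</Transliteration>
  <Language>Greek</Language>
  <Gloss>[Christ] 2. As a title for Jesus as the Messiah.</Gloss>
  <References>
    <Verse>04300104100</Verse>
    <Verse>04300402500</Verse>
  </References>
</Term>"""


def short_gloss(gloss_text):
    """Return the text inside the first [...] pair, else the whole gloss."""
    match = re.search(r"\[([^\]]*)\]", gloss_text)
    return match.group(1).strip() if match else gloss_text.strip()


term = ET.fromstring(TERM_XML)
term_id = term.get("Id")
gloss = short_gloss(term.findtext("Gloss", default=""))
verse_codes = [verse.text for verse in term.iter("Verse")]
print(term_id, gloss, verse_codes)
```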
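Each 11-digit verse reference in <References> is decoded as a three-digit book number, a three-digit chapter, a three-digit verse, and two trailing digits that are not used here: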
043 001 041 00
John 1 : 41 --
043 004 025 00
John 4 : 25 --
From this, the file ABP-vrefs.txt gets this line:
JHN 1:41 JHN 4:25
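The decoding above can be sketched in a few lines. The function names are hypothetical, the book table covers only the 66 books of the Protestant canon using the standard three-letter book codes, and the separator written between references in the vrefs file is an assumption exposed as a parameter.

```python
# Standard three-letter book codes; book number 1 is GEN, 43 is JHN, 66 is REV.
BOOK_IDS = (
    "GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB "
    "PSA PRO ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP "
    "HAG ZEC MAL MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI "
    "2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV"
).split()


def decode_verse(code):
    """Turn an 11-digit code such as 04300104100 into 'JHN 1:41'."""
    if len(code) != 11 or not code.isdigit():
        raise ValueError(f"expected 11 digits, got {code!r}")
    book, chapter, verse = int(code[0:3]), int(code[3:6]), int(code[6:9])
    # The last two digits (code[9:11]) are ignored, as in the table above.
    if not 1 <= book <= len(BOOK_IDS):
        raise ValueError(f"book number {book} is outside this sketch's table")
    return f"{BOOK_IDS[book - 1]} {chapter}:{verse}"


def vrefs_line(codes, sep="\t"):
    """Join decoded references into one line; the separator is assumed."""
    return sep.join(decode_verse(code) for code in codes)


print(vrefs_line(["04300104100", "04300402500"], sep=" "))  # JHN 1:41 JHN 4:25
```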
Terms are not exclusively proper nouns (like names and places). Sometimes a term is a short phrase or object.
Below are some examples from S:/terms/en-Major-glosses.txt (note that some are tab delimited):
- (line 2) papyrus reeds
- (line 46) bull stallion
- (line 49) strong-willed brave stubborn
- (line 68) Stone of Bohan the son of Reuben
- (line 180) miraculous sign miracle
- (line 343) cow head of cattle ox