Terms and model training

rminsil edited this page Feb 9, 2025 · 3 revisions

Terms

"Terms" are primarily used to represent proper nouns found in the Bible, for example names and places. The concept comes from Paratext.

At a high level, term data from Paratext projects is used to help models understand how to translate proper nouns from one language to another.

extract_corpora

The extract_corpora.py script reads a Paratext project and creates extract files. It can also construct up to 4 term-related files, which are placed under MT/terms in the standard nlp data directory.

Below is example output from running on an ABP Paratext project:

MT/terms/
├── abp-ABP-Project-renderings.txt
├── ABP-metadata.txt
├── ABP-vrefs.txt
└── en-ABP-glosses.txt

Understanding the files

The 4 files have the same number of lines, and line x in each file relates to the same term. For example, line 6 of each file in the example above corresponds to the term Μεσσίας-1:

// ABP-metadata.txt (line 6)
Μεσσίας-1	?	?
// en-ABP-glosses.txt (line 6)
Christ
// ABP-vrefs.txt (line 6)
JHN 1:41	JHN 4:25
// abp-ABP-Project-renderings.txt (line 6)
Mesias

The combination of the above tells us:

  • the unique id for the term (Μεσσίας-1)
  • the English gloss (Christ)
  • where this particular term appears in the Bible
  • the translation of the term in the target language ABP (Mesias)
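The line-by-line alignment can be illustrated with a minimal Python sketch. It is not part of extract_corpora.py: the names TermRecord and combine_terms are hypothetical, and it assumes that the term id is the first tab-separated field of a metadata line and that verse references are tab-separated, as in the line 6 example.

```python
from typing import List, NamedTuple


class TermRecord(NamedTuple):
    """One term, assembled from the same line of the 4 term files (hypothetical type)."""
    term_id: str
    gloss: str
    vrefs: List[str]
    rendering: str


def combine_terms(metadata_lines, gloss_lines, vref_lines, rendering_lines):
    """Zip the 4 parallel line lists into one record per term."""
    lengths = {len(metadata_lines), len(gloss_lines), len(vref_lines), len(rendering_lines)}
    if len(lengths) != 1:
        raise ValueError("term files must have the same number of lines")
    records = []
    for meta, gloss, vrefs, rendering in zip(
        metadata_lines, gloss_lines, vref_lines, rendering_lines
    ):
        # Assumption: the term id is the first tab-separated metadata field.
        term_id = meta.split("\t")[0]
        # Assumption: verse references on a line are tab-separated.
        vref_list = vrefs.split("\t") if vrefs else []
        records.append(TermRecord(term_id, gloss, vref_list, rendering))
    return records


# Line 6 of each file from the ABP example.
line6 = combine_terms(
    ["Μεσσίας-1\t?\t?"], ["Christ"], ["JHN 1:41\tJHN 4:25"], ["Mesias"]
)[0]
print(line6.term_id, line6.gloss, line6.vrefs, line6.rendering)
```

In practice each list would come from reading the corresponding file with splitlines(), which keeps empty lines and so preserves the alignment.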

How the terms are used

The preprocess step appends terms to the end of the training data if the terms files exist for both the source and target languages.

Using the ABP example, here are snippets from the detokenized training files where it transitions from the end of Revelation to the start of the terms. In this context we are translating from English (NIV11R) to ABP:

// train.src.detok.txt (English)
He who testifies to these things says, “Yes, I am coming soon.” Amen. Come, Lord Jesus.
The grace of the Lord Jesus be with Godʼs people. Amen.
Mizraim
Assyria
Gad
Egyptian
...

// train.trg.detok.txt (ABP)
Hiyay Apo Jesus ye ampamapteg ha kaganaan a naihulat kananyatin libdo. Hinabi na po, <<Peteg a madanon akoynan lumateng!>> Hinabi ko met, <<Palyadiyen mo dayi Apo! Apo, makew kayna dayi ihti!>>
Mikakaanti ya dayi kanyon kaganaan ye kangedan nan Apo Jesus.
Egipsio
Asiria
Gad
taga Egipto
...

Pairing up the terms line by line, the appended data teaches the model these correspondences:

English      ABP
--------------------
Mizraim  ->  Egipsio
Assyria  ->  Asiria
Gad      ->  Gad
Egyptian ->  taga Egipto

In this example there are about 175 terms.

Note that terms aren't appended to the validation or test data.
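As a rough illustration of the appending behaviour, the sketch below adds term pairs to the end of the training sentences only. It is a simplified stand-in rather than the actual preprocess code: the function name append_terms is hypothetical, and it assumes one gloss and one rendering per line and that pairs with an empty side are skipped.

```python
def append_terms(train_src, train_trg, src_terms, trg_terms):
    """Return copies of the training lists with aligned term pairs appended.

    Assumptions (not confirmed by the source): each term line holds a single
    gloss or rendering, and a pair is skipped if either side is empty.
    """
    if len(src_terms) != len(trg_terms):
        raise ValueError("source and target term lists must be aligned")
    new_src = list(train_src)
    new_trg = list(train_trg)
    for src_term, trg_term in zip(src_terms, trg_terms):
        src_term, trg_term = src_term.strip(), trg_term.strip()
        if src_term and trg_term:
            new_src.append(src_term)
            new_trg.append(trg_term)
    return new_src, new_trg


src, trg = append_terms(
    ["The grace of the Lord Jesus be with Godʼs people. Amen."],
    ["Mikakaanti ya dayi kanyon kaganaan ye kangedan nan Apo Jesus."],
    ["Mizraim", "Assyria", "Gad", "Egyptian"],
    ["Egipsio", "Asiria", "Gad", "taga Egipto"],
)
print(src[-4:])
print(trg[-4:])
```

The validation and test lists are deliberately not passed through this function, matching the note above.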

How term files are constructed

In this example, the 4 term files came from the Paratext project for ABP by running extract_corpora. No terms were generated from the NIV11R Paratext project because it lacks the term files described below.

TermRenderings.xml

If the Paratext project has the file TermRenderings.xml, then these files will be generated:

  • {project name}-metadata.txt
  • {iso}-{project name}-{list type}-renderings.txt

In the ABP example, TermRenderings.xml has this block for "Μεσσίας-1":

<TermRendering Id="Μεσσίας-1" Guess="false">
  <Renderings>Mesias (Messiah)</Renderings>
  <Glossary>Mesias</Glossary>
  <Changes>
    <Change UserName="Roger Stone" TermId="Μεσσίας-1" Date="2016-11-30T05:59:17.3479807+08:00">
      <Before>Mesias</Before>
      <After>Mesias, 'Messiah'</After>
    </Change>
    <Change UserName="Roger Stone" Date="2019-04-24T14:06:49.5618139-05:00">
      <Before>'Messiah'||Mesias</Before>
      <After>Mesias</After>
    </Change>
  </Changes>
  <Notes>Usage:
                                                                                                
Other terms considered:</Notes>
  <Denials />
</TermRendering>

This term entry from the example TermRenderings.xml is reformatted and put into the ABP-metadata.txt and abp-ABP-Project-renderings.txt files, rendering the term "Μεσσίας-1" as "Mesias" for use in the training text.
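The sketch below shows one way such an entry could be read with Python's standard xml.etree.ElementTree module. It is not the extract_corpora implementation. The wrapper element name in SAMPLE, the function names, the removal of parenthesized text, and the treatment of "||" as a separator between alternative renderings are all assumptions inferred from the example entry.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the example entry; the wrapper element name is an assumption.
SAMPLE = """<TermRenderingsList>
  <TermRendering Id="Μεσσίας-1" Guess="false">
    <Renderings>Mesias (Messiah)</Renderings>
    <Glossary>Mesias</Glossary>
  </TermRendering>
</TermRenderingsList>"""


def clean_renderings(raw):
    """Split a <Renderings> value into a list of cleaned renderings.

    Assumptions: "||" separates alternatives, and text in parentheses
    is a comment that should not reach the training data.
    """
    cleaned = [re.sub(r"\s*\([^)]*\)", "", part).strip() for part in raw.split("||")]
    return [c for c in cleaned if c]


def extract_renderings(xml_text):
    """Map each term id to its cleaned renderings."""
    root = ET.fromstring(xml_text)
    result = {}
    # iter() finds TermRendering elements regardless of the wrapper's name.
    for elem in root.iter("TermRendering"):
        term_id = elem.get("Id")
        if term_id is not None:
            raw = elem.findtext("Renderings", default="") or ""
            result[term_id] = clean_renderings(raw)
    return result


print(extract_renderings(SAMPLE))
```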

ProjectBiblicalTerms.xml

If the Paratext project has the file ProjectBiblicalTerms.xml, then the gloss and verse references for the term can be determined, generating the following files:

  • {project name}-vrefs.txt
  • {iso}-{project name}-glosses.txt

For the ABP example, the term Μεσσίας-1 has this entry:

<Term Id="Μεσσίας-1">
  <Transliteration>Μεσσίας-1</Transliteration>
  <Language>Greek</Language>
  <Gloss>[Christ] 2. As a title for Jesus as the Messiah.</Gloss>
  <References>
    <Verse>04300104100</Verse>
    <Verse>04300402500</Verse>
  </References>
</Term>

This term entry from the example ProjectBiblicalTerms.xml is reformatted, with the <References> section being decoded and put into ABP-vrefs.txt and "Christ" being put into en-ABP-glosses.txt.
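A minimal sketch of reading such an entry with xml.etree.ElementTree follows. The wrapper element name in SAMPLE and the function names are hypothetical, and taking the bracketed text as the short gloss is an assumption based on "[Christ]" becoming "Christ" in this example; the real extraction rules may differ.

```python
import re
import xml.etree.ElementTree as ET

# Copy of the example entry; the wrapper element name is an assumption.
SAMPLE = """<BiblicalTermsList>
  <Term Id="Μεσσίας-1">
    <Transliteration>Μεσσίας-1</Transliteration>
    <Language>Greek</Language>
    <Gloss>[Christ] 2. As a title for Jesus as the Messiah.</Gloss>
    <References>
      <Verse>04300104100</Verse>
      <Verse>04300402500</Verse>
    </References>
  </Term>
</BiblicalTermsList>"""


def extract_gloss(raw):
    """Return the text inside the first [...] pair, or the whole string if none.

    Assumption: the bracketed text is the short gloss.
    """
    match = re.search(r"\[([^\]]+)\]", raw)
    return match.group(1).strip() if match else raw.strip()


def parse_terms(xml_text):
    """Return (term id, gloss, raw verse codes) for each <Term> element."""
    root = ET.fromstring(xml_text)
    terms = []
    for term in root.iter("Term"):
        gloss = extract_gloss(term.findtext("Gloss", default="") or "")
        verses = [v.text.strip() for v in term.findall("References/Verse") if v.text]
        terms.append((term.get("Id"), gloss, verses))
    return terms


print(parse_terms(SAMPLE))
```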

The verse references in <References> are decoded in this way:

           043  001  041  00
          John   1  : 41  --

           043  004  025  00
          John   4  : 25  --

From this the file ABP-vrefs.txt gets this line:

JHN 1:41	JHN 4:25
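The decoding can be sketched in Python as follows. The function names are hypothetical and the book table is deliberately limited to the four Gospels; a real implementation would need the full book list. The final two digits are ignored here, since this page does not say what they encode.

```python
# Partial book table for illustration only (book number -> book id).
BOOK_IDS = {40: "MAT", 41: "MRK", 42: "LUK", 43: "JHN"}


def decode_verse(code):
    """Decode an 11-digit reference: 3 digits book, 3 chapter, 3 verse, 2 ignored."""
    if len(code) != 11 or not code.isdigit():
        raise ValueError(f"unexpected verse code: {code!r}")
    book = int(code[0:3])
    chapter = int(code[3:6])
    verse = int(code[6:9])
    # code[9:11] is not interpreted in this sketch.
    return f"{BOOK_IDS[book]} {chapter}:{verse}"


def vrefs_line(codes):
    """Build one tab-separated line in the style of the vrefs file."""
    return "\t".join(decode_verse(code) for code in codes)


print(vrefs_line(["04300104100", "04300402500"]))
```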

Other kinds of terms

Terms are not exclusively proper nouns (like names and places). Sometimes a term is a short phrase or object.

Below are some examples from S:/terms/en-Major-glosses.txt (note some are tab delimited):

  • (line 2) papyrus reeds
  • (line 46) bull stallion
  • (line 49) strong-willed brave stubborn
  • (line 68) Stone of Bohan the son of Reuben
  • (line 180) miraculous sign miracle
  • (line 343) cow head of cattle ox