Tibetan Corpus on Terminus #970

mikkokotila · 2021-06-15T07:51:07Z

mikkokotila
Jun 15, 2021

I was not sure where to put this kind of topic, so posting it here.

We have a non-profit project where we are looking for ways to make roughly 200,000 Tibetan language texts available in the most meaningful way. It is about 20,000,000 pages of text in total.

These are basically the texts that were transmitted from India to Tibet (from Sanskrit to Tibetan) over a few centuries about a thousand years ago. It almost entirely consists of various mind training and meditation manuals, and philosophical treatises. Most of the texts are no longer available in Sanskrit.

We want to store the text in tokens instead of full text because from tokens we get the full text, but from the full text, it takes time to get tokens.

The body of the text is static i.e. reading and any preprocessing must be only done once ever.

There is rich meta-data available from several sources.

Given the way, the Tibetan language follows a particular way of encoding meaning to words, similar to Sanskrit, where the words themselves concretely connect things and topics, and the way the body of knowledge the language is used to describe is actually an honest ontology looks like a really interesting use-case for Terminus.

A typical downstream use case is where the translator wants to understand the context for a given word or a scholar wants to find documents that are related semantically or otherwise.

At the moment we are planning to build our own datastore, but it dawned on me that there might be some really interesting opportunity in building on top of Terminus.

luke-feeney · 2021-06-15T09:25:53Z

luke-feeney
Jun 15, 2021
Maintainer

Interesting @mikkokotila - similar to one of the suggestions that came up during a recent brainstorming session (@matko was the proposer). I think it would be an interesting use case for terminus.

0 replies

mikkokotila · 2021-06-15T11:24:58Z

mikkokotila
Jun 15, 2021
Author

Beautiful proposal! How wonderful.

What would be the best way to explore this further? For example, if at some point there was an RFC or some other "the thing to be built" definition I think the resources for building, at least the majority of it, could come from us.

0 replies

matko · 2021-06-15T12:57:10Z

matko
Jun 15, 2021
Maintainer

Hey there!

This is a very exciting proposal. I was actually intending to build something like this, except for the Pali canon, which is the Theravada Buddhist corpus. Maybe we can work together on a common core, though I suspect the Tibetan and Pali languages are sufficiently different that we'll need slightly different things in the end.

Right now, we're working very hard on getting a new schema language and document interface released. I'd definitely wait for this to land before attempting to implement this, as it'll make your life a lot easier. This should happen somewhere in the next few weeks.

I agree on the benefit of dealing with tokens instead of raw strings. A string is just a value. A token is a node which, besides its written representation, can have various sorts of metadata associated with it, such as a grammatical case, a translation, an etymology, etc.

The Pali Canon use case and similarities with the Tibetan Corpus

Since I put some thought into this use case from the Pali Canon perspective, I'd like to share some of my thoughts on the requirements of a graph representation of a corpus. Not everything will be relevant for your use case I imagine, but hopefully, enough will be.

Pali is a highly inflected language. This means that the same concept will be expressed with a different word depending on the grammatical role in a sentence. It'd be good if it was possible to identify the various inflections of a word and group them together, so that we can quickly query not just for the exact way a particular word is written, but also all the other inflections. I believe Tibetan is far less inflected, but I'm sure there are some similar concerns there.

As you mention in your proposal, we're essentially dealing with a non-changing body of texts. Just as in the Pali canon, I imagine the Tibetan corpus employs a lot of stock phrases which you'll find throughout different texts, and it'd be very useful if we could somehow identify instances of this, and be able to provide some scholarly annotations for such occasions. In he most extreme case, entire suttas are equivalent except for one or two words. It'd be good to have a sutta template description that makes can make such occasions explicit.

Even though the Pali canon it is a non-changing body of texts, there are slight disagreements in the Theravada Buddhist world on some aspects of the canon. Different traditions group their texts slightly different, resulting in the same text being divided over a different amount of suttas. Furthermore, there are occasionally small disagreements on the exact word used in a particular sentence between the Thai, Burmese and Sri Lankan versions of the canon. For me, it'd be good to be able to register such discrepancies in a neutral way, not picking any sides.

I do not know what the situation is with classical Tibetan, but for Pali, the Pali Canon is pretty much the authoritative collection of texts that define what the language actually is. While there is secondary Pali literature, historically grammarians have considered the canon to be leading in what correct use of Pali is, and as far as I am aware there has been no major development of the language in secondary literature. This means that the Pali Canon is not just an interesting body of texts for the content, but also a linguistic study object for the Pali language itself. It would therefore be very interesting if a graph representation of the Pali Canon facilitated grammatical analysis.
I imagine that on top of the bare tokenized sentences, a researcher could build a grammatical analysis - an alternative description of the same sentence which is deeply structured along the grammar. A similar concern is there for individual words. Words in Pali are generally formed out of some root combined with a bunch of affixes, and possibly further combined into a compound word. It should be possible for a researcher to annotate a word with a description of how this construction happened.

it'd be good if a graph representation of a corpus facilitated translation efforts. There's different sorts of ways to translate a text. One can translate word for word, one can translate a sentence at a time, or one can take a whole range of text and write a more free-form translation. It'd be good to support translation annotations for all these sorts of translations.

Finally, it'd be good to be able to classify texts in various ways. An obvious one would be a classification based on their topics, so that it'd be easy to get a list of all texts that have to do with jhana for example. Another classification could be a hypothetical moment of composition, to differentiate earlier texts from later additions.

One graph, multiple tools

There are many use cases for a graph version of a corpus. Some people may just want to browse the texts. Some people may want to do active translation. Some people may be actively searching for parallels, and will be wanting to annotate the texts when they find them. With TerminusDB, it should be possible to have multiple tools all use the same rich data in different ways. Therefore, iIn my opinion, the first task of a project like this should be figuring out a high quality data model that is able to support a wide range of use cases, rather than designing the data model around one particular use case.

Final thoughts

I'm not sure if any of these thoughts are useful to you, but I strongly suspect that our use cases are similar enough that we should at least have a conversation on this. We can discuss this further in this issue, or if you like, you can reach me directly at [email protected].

0 replies

mikkokotila · 2021-06-15T14:08:44Z

mikkokotila
Jun 15, 2021
Author

@matko wonderful :)

I suspect that Panini Sanskrit was modeled closely on Pali for the part of the vocabulary that communicates Buddhadharma. This would be wonderful, as in Mahavyutpatti [1] the objective seems to be to transmit Sanskrit into Tibetan "perfectly". For example, in Sanskrit you would have Sadhana (means of accomplishment) and Sadhaka (the one who is accomplished), which in Tibetan became Drubtop and Drubtop (because "ka" from Sanskrit in Tibetan is handled without affecting the sound). So if we are lucky, Tibetan for this part is basically a "carbon copy" of Pali. I will research to find out more about the relationship between Pali and Sanskrit has for the Buddhadharma language.

In terms "canonical" vs "non-canonical", in the Tibetan corpus this is quite different. There is basically no other literature than Buddhadharma literature.

As a point of reference, we use https://github.com/OpenPecha as a source for the texts.

More broadly, the sharing above is very useful :) I will definitely be in touch directly. Will update status here for others to see as needed.

[1] https://en.wikipedia.org/wiki/Mah%C4%81vyutpatti

0 replies

mikkokotila · 2021-06-18T13:13:24Z

mikkokotila
Jun 18, 2021
Author

Maybe we can work together on a common core

It looks like from TerminusDB standpoint, most useful is "common core" which might then extend readily to other languages as well (assuming that downstream something else is done, which could later become part of the "common core"). I think the point I'm making is that simpler the better. But yes, this is very valuable I believe.

I'd definitely wait for this to land before attempting to implement this

Sure. This update sounds amazing, really look forward to this one!

Pali is a highly inflected language.

Yes :) It appears that in language processing there is an incredible number of things that create complications. My learning with language tech is that the simpler the implementation is, the better.

In the case of inflection, Tibetan way can be clearly witnessed here: : https://en.wikipedia.org/wiki/Classical_Tibetan#Inflection

It is a beautiful language for deterministic NLP (in the graph). See the previous message in the thread for more on possible Pali <--> Tibetan connection. There is a comprehensive corpus available for Tibetan verbs in: https://raw.githubusercontent.com/Esukhia/tibetan-verbs-database/master/db/db.csv

so that we can quickly query not just for the exact way a particular word is written, but also all the other inflections.

Yes, I think this level of ability is "core of the core". Then on top of that, downstream one is able to create further connections through meta-data.

I imagine the Tibetan corpus employs a lot of stock phrases which you'll find throughout different texts

Yes. Many expressions and such are there repeating frequently. There are many more that are specific to a certain practice lineage or certain kind of meditation.

The really interesting thing here, which is apparent in the Pali Canon as well, is that topics are presented as a description of an ontology (model) or a way to provide details about the topic. Everything in the corpus has been incredibly well thought out, and the way everything can be connected to everything else based on that.

For example, consider the following topics:

Four Noble Truths
Three Poisons
Karma
Ten Non-Virtues
Six Realms

Then let's consider how these connect.

In Four Noble Truths we learn about suffering. We learn that suffering is caused by Three Poisons. From there we learn that because of the Three Poisons we create negative Karma, more specifically the Ten Non-Virtues. All the negative Karmas of speech and body are explained to be caused by the non-virtues of the mind, which are the Three Poisons. The lower three of the Six Realms are explained as the psychological states we experience as a result of the Three Poisons.

In other words, the terms "Three Poisons", "Three Negative Karmas of Mind", and "Three Lower Realms" form a thermodynamically sound construct of knowledge. My understanding is that the entire Buddhist Canon "clicks" in a similar way. The entire thing is already a data model. In fact, the mother of all data models.

This is just a vague illustration of the incredible potential there is in terms of representing these Canons as a graph.

and it'd be very useful if we could somehow identify instances of this, and be able to provide some scholarly annotations for such occasions.

Yes, we are able to do this within our team. We have several topically specialized translators and one Khenpo (same as professor) working with us :) Also, there are many others with very high level understanding of these topics, that we can ask questions.

For me, it'd be good to be able to register such discrepancies in a neutral way, not picking any sides.

I think that is really interesting idea!

I do not know what the situation is with classical Tibetan, but for Pali, the Pali Canon is pretty much the authoritative collection of texts that define what the language actually is.

There are virtually no other literature in Tibetan. In the body of texts we will be working with, there will be none. It is 100% Buddhadharma.

It should be possible for a researcher to annotate a word with a description of how this construction happened.

Yes, this is very important. There should be the ability for different translators and scholars to create alternative meanings to words, and then those words should be optionally available.

Finally, it'd be good to be able to classify texts in various ways.

Yes, for the Tibetan texts, we already have quite a rich meta-data structure. For example, the era, the yana, the cycle, and so forth.

Therefore, iIn my opinion, the first task of a project like this should be figuring out a high quality data model that is able to support a wide range of use cases, rather than designing the data model around one particular use case.

Yes for sure. I think the nuance here is that maybe it is better to first implement a basic datastore capability, and then gradually add generic language features.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TerminusDB

Tibetan Corpus on Terminus #970

{{title}}

Replies: 5 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

TerminusDB

Tibetan Corpus on Terminus #970

mikkokotila Jun 15, 2021

Replies: 5 comments

luke-feeney Jun 15, 2021 Maintainer

mikkokotila Jun 15, 2021 Author

matko Jun 15, 2021 Maintainer

The Pali Canon use case and similarities with the Tibetan Corpus

One graph, multiple tools

Final thoughts

mikkokotila Jun 15, 2021 Author

mikkokotila Jun 18, 2021 Author

mikkokotila
Jun 15, 2021

luke-feeney
Jun 15, 2021
Maintainer

mikkokotila
Jun 15, 2021
Author

matko
Jun 15, 2021
Maintainer

mikkokotila
Jun 15, 2021
Author

mikkokotila
Jun 18, 2021
Author