Per-country translations in taxonomies #880

Open
TomazErjavec opened this issue Oct 13, 2024 · 6 comments

@TomazErjavec
Collaborator

Currently we have and support multilingual taxonomies but it can happen (esp. in the legislature taxonomy) that the translation of a term for one country differs from the translation for another, even though both use the same language (e.g. AT and DE).

For this reason the language of a translation should be, when needed, extended by the country code, e.g. de-AT vs. de-DE. For this, all the code dealing with @xml:lang (at least in connection with taxonomies, but ideally everywhere) would have to be revised, as well as the taxonomies themselves.

This issue becomes even more relevant if we were to add country-specific hyperlinks to Wikipedia as part of the description for particular categories.

@TomazErjavec added the bug and Taxonomy labels on Oct 13, 2024
@TomazErjavec added this to the Future milestone on Oct 13, 2024
@matyaskopp
Collaborator

I checked the language tag documentation: https://www.rfc-editor.org/rfc/rfc5646.html#page-5, and it contains a more detailed structure than I expected.
I can see a problem with using the -region part of the language tag because it allows only 2-letter values (ISO 3166-1 code) or some 3-digit encoding that I don't know (UN M.49 code), so the value es-ES-CT is invalid.

It can probably be hacked with extensions es-ES-a-CT (https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.6) or private subtags es-ES-x-CT (https://www.rfc-editor.org/rfc/rfc5646.html#section-2.2.7). I hope this situation will not happen because it will complicate things even more...
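
As a rough illustration of the point above, here is a minimal Python sketch of a well-formedness check for language tags. It covers only a simplified subset of the RFC 5646 grammar (no extlang, no grandfathered tags, no primary language subtags longer than three letters), and it checks syntax only, not whether subtags are registered. The names LANGTAG and is_well_formed are made up for this example.

```python
import re

# Simplified subset of the RFC 5646 langtag grammar (an assumption of this
# sketch): language, optional script, optional region, variants, extensions,
# and an optional private-use sequence introduced by "x".
LANGTAG = re.compile(
    r"[a-z]{2,3}"                                # primary language, e.g. "de"
    r"(?:-[a-z]{4})?"                            # script, e.g. "Latn"
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"               # region: 2 letters or 3 digits
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"  # variants
    r"(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*"      # extensions (singleton is not "x")
    r"(?:-x(?:-[a-z0-9]{1,8})+)?",               # private use
    re.IGNORECASE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the whole tag matches the simplified grammar above."""
    return LANGTAG.fullmatch(tag) is not None


for tag in ("de-AT", "es-ES-CT", "es-ES-a-CT", "es-ES-x-CT"):
    print(tag, is_well_formed(tag))
```

Under these assumptions es-ES-CT is rejected, while de-AT and the private-use form es-ES-x-CT are accepted. es-ES-a-CT also passes this purely syntactic check, but as far as I know the singleton "a" has not been assigned, so a validating processor might still object to it.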

@TomazErjavec
Collaborator Author

> I can see a problem with using the -region part of the language tag because it allows only 2-letter values (ISO 3166-1 code) or some 3-digit encoding that I don't know (UN M.49 code), so the value es-ES-CT is invalid.

Interesting, and somewhat disappointing...

However, it now occurred to me that maybe it is wrong to try to extend the language in this way because, e.g., de-AT means "the kind of German used in Austria". But if Austria has, say, a different name for their parliament than Germany does, this does not mean that they name the same parliament differently but that they are talking about a different thing.

So it is not a question of the language at all but of what a term refers to. To put it another way, if a German person talks about the Austrian parliament, they will use the term Austrians use, and not the term they would use for the German parliament.

If this logic holds, then we don't need to do anything with the languages, but rather refer from the description of a category to the appropriate corpus or country. We even have something similar already:

<desc n="ParlaMint-AT" xml:lang="de"><term>Legislative</term></desc>

@n is probably not the ideal attribute, and it might be that the value is not ideal either (maybe "AT" would be better, as otherwise we have to distinguish between #ParlaMint-AT and #ParlaMint-AT.ana and, also, these are dead pointers from the perspective of the common taxonomy), but something along these lines.

We would then, of course, also need to allow several descriptions for the same language, and, of course, still have to modify scripts.

@matyaskopp , what do you think?

@matyaskopp
Collaborator

> So it is not a question of the language at all but of what a term refers to. To put it another way, if a German person talks about the Austrian parliament, they will use the term Austrians use, and not the term they would use for the German parliament.

Good point! Agree!
So the xml:lang attribute should describe the language itself, not the domain of a given country.

> If this logic holds, then we don't need to do anything with the languages, but rather refer from the description of a category to the appropriate corpus or country. We even have something similar already:
>
> <desc n="ParlaMint-AT" xml:lang="de"><term>Legislative</term></desc>
>
> @n is probably not the ideal attribute, and it might be that the value is not ideal either (maybe "AT" would be better, as otherwise we have to distinguish between #ParlaMint-AT and #ParlaMint-AT.ana and, also, these are dead pointers from the perspective of the common taxonomy), but something along these lines.

I very often have problems with the @n attribute, as it is commonly abused for misc/unsure values (I have even seen a URL in @n in some TEI-like examples), but the @type attribute is also strange, so maybe @n is better...

I am not sure whether we might want multi-valued attributes, because some terms can be shared by two countries but differ for a third.
But using something like @corresp="#ParlaMint-AT #ParlaMint-DE" can cause multiple issues that need to be resolved:

  1. links pointing outside the taxonomy (it seems that only the validation scripts need to be fixed)
  2. #ParlaMint-AT does not exist in the ParlaMint-CZ or ParlaMint-AT.ana contexts
  3. maintenance of multiple translations (overwriting rules, automatic pregenerating empty/translated taxonomy for new ParlaMint ...)

The second issue can be fixed by introducing a new taxonomy and using different IDs, something like

<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-domains" xml:lang="mul">
  <desc xml:lang="en"><term>ParlaMint</term></desc>
  <category xml:id="ParlaMint-CZ-domain">
    <catDesc xml:lang="en"><term>Parliament of the Czech Republic</term></catDesc>
  </category>
  <!-- ... -->
</taxonomy>

really not sure, just an idea...
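
To make the second issue more concrete, here is a minimal Python sketch of a pointer check against such a domain taxonomy. It only assumes that pointers are local references of the form #ID and that the targets are category elements with an xml:id. The function name dangling_pointers and the ID ParlaMint-AT-domain are hypothetical and only follow the naming pattern of the example above.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The domain taxonomy from the example above, without the elided categories.
DOMAINS = """<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-domains" xml:lang="mul">
  <desc xml:lang="en"><term>ParlaMint</term></desc>
  <category xml:id="ParlaMint-CZ-domain">
    <catDesc xml:lang="en"><term>Parliament of the Czech Republic</term></catDesc>
  </category>
</taxonomy>"""


def dangling_pointers(taxonomy, corresp):
    """Return the pointers of a space-separated @corresp value that do not
    resolve to the xml:id of any category in the given taxonomy."""
    known = {
        "#" + cat.get(XML_ID)
        for cat in taxonomy.iter(TEI + "category")
        if cat.get(XML_ID)
    }
    return [pointer for pointer in corresp.split() if pointer not in known]


taxonomy = ET.fromstring(DOMAINS)
# "#ParlaMint-AT-domain" is a hypothetical ID that the sample does not define.
print(dangling_pointers(taxonomy, "#ParlaMint-CZ-domain #ParlaMint-AT-domain"))
```

Because the domain taxonomy would be shared, a check like this could run the same way in every corpus, which is the property the per-corpus #ParlaMint-AT pointers lack.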

Or we can ignore multi-valued attributes and duplicate translations in the common taxonomy (probably the safest option), i.e. make all translations domain-specific, at least in the problematic legislature taxonomy.

> We would then, of course, also need to allow several descriptions for the same language, and, of course, still have to modify scripts.

Yes, several descriptions, and also the setup of how the proper description/term is chosen. Should it be: first language-domain-specific, then language fall-back value if the translation is missing?
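
One possible reading of this fallback order, sketched in Python over plain tuples rather than XML so that it does not depend on how the domain ends up being encoded. The function name choose_translation and the tuple layout (language, domain or None, term) are assumptions made for this sketch, and the terms are the placeholders used elsewhere in this thread.

```python
def choose_translation(translations, lang, domain=None):
    """Pick a term from (lang, domain_or_None, term) tuples.

    Preference order: a translation specific to both language and domain,
    then a domain-independent translation in that language, then None.
    """
    # 1) language- and domain-specific translation
    if domain is not None:
        for t_lang, t_domain, term in translations:
            if t_lang == lang and t_domain == domain:
                return term
    # 2) language-only fallback
    for t_lang, t_domain, term in translations:
        if t_lang == lang and t_domain is None:
            return term
    # 3) no usable translation
    return None


translations = [
    ("de", None, "Legislative"),   # generic German translation
    ("de", "AT", "Legislative1"),  # placeholder for an Austria-specific term
]
print(choose_translation(translations, "de", "AT"))
print(choose_translation(translations, "de", "DE"))
```

With this order, a request for a country without its own term silently gets the generic translation; whether that is desirable or should be reported is a separate decision.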

@TomazErjavec
Collaborator Author

I agree that @n is bad, and that pointer attributes are worse.
So how about using elements? desc can contain country and region, which is exactly what we need.
Both elements also have @key explicitly meant for the country (/region) code.
With this, we would have something like

 <desc xml:lang="de"><country key="AT">Österreich</country>: <term>Legislative1</term></desc> 
 <desc xml:lang="de"><country key="DE">Deutschland</country>: <term>Legislative2</term></desc> 

I wouldn't make the country/region element mandatory, as it is not needed for most countries; there are also taxonomies where it is not relevant at all (like NER), and this way we keep backward compatibility.

As for using such descriptions: I'd propose that if desc[@xml:lang=$lang][2] is true, then it is disambiguated using the (country|region)/@key, otherwise use desc[@xml:lang=$lang].

How does this sound?
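
The selection rule proposed above could be sketched in Python roughly as follows. The sketch leaves out the TEI namespace and wraps the two desc elements in a bare category element for brevity; both are simplifications, and the function name pick_term is hypothetical. The English desc is added only to show the unambiguous case.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<category>
  <desc xml:lang="de"><country key="AT">Österreich</country>: <term>Legislative1</term></desc>
  <desc xml:lang="de"><country key="DE">Deutschland</country>: <term>Legislative2</term></desc>
  <desc xml:lang="en"><term>Legislature</term></desc>
</category>"""


def pick_term(category, lang, country=None):
    """Return the term for `lang`, using (country|region)/@key only when
    more than one desc exists for that language."""
    descs = [d for d in category.findall("desc") if d.get(XML_LANG) == lang]
    if not descs:
        return None
    if len(descs) == 1:
        # Unambiguous: corresponds to plain desc[@xml:lang=$lang].
        return descs[0].findtext("term")
    # Ambiguous: corresponds to desc[@xml:lang=$lang][2] being true.
    for desc in descs:
        keys = {e.get("key") for e in desc if e.tag in ("country", "region")}
        if country in keys:
            return desc.findtext("term")
    return None  # several candidates, none matching the requested key


category = ET.fromstring(SAMPLE)
print(pick_term(category, "de", "AT"))
print(pick_term(category, "en"))
```

Note that in this reading an ambiguous language without a matching key yields nothing at all; falling back to the first desc instead would be an equally defensible choice.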

@matyaskopp
Collaborator

> So how about using elements? desc can contain country and region, which is exactly what we need. Both elements also have @key explicitly meant for the country (/region) code. With this, we would have something like
>
> <desc xml:lang="de"><country key="AT">Österreich</country>: <term>Legislative1</term></desc>
> <desc xml:lang="de"><country key="DE">Deutschland</country>: <term>Legislative2</term></desc>
>
> I wouldn't make the country/region element mandatory, as it is not needed for most countries; there are also taxonomies where it is not relevant at all (like NER), and this way we keep backward compatibility.
>
> As for using such descriptions: I'd propose that if desc[@xml:lang=$lang][2] is true, then it is disambiguated using the (country|region)/@key, otherwise use desc[@xml:lang=$lang].
>
> How does this sound?

Well, that sounds better. Maybe we can skip the text form and just add country/@key or region/@key:

<desc xml:lang="de"><country key="AT"/><term>Legislative1</term></desc> 
<desc xml:lang="de"><country key="DE"/><term>Legislative2</term></desc> 

This would simply allow a single translation for multiple countries:

<desc xml:lang="de"><country key="AT"/><country key="DE"/><term>Legislative</term></desc> 

Not sure which approach makes these operations easier to maintain:

  • adding a new translation
  • updating old translation
  • separating two countries' translations
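
To give a feel for the third operation, here is a minimal Python sketch that splits a desc listing several countries into one desc per country, after which each copy can be edited separately. As before, the TEI namespace is left out and the category wrapper is a simplification; the function name split_by_country is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def split_by_country(parent):
    """Replace every desc that lists several countries with one copy per
    country, keeping the original position, attributes and term."""
    for desc in list(parent.findall("desc")):
        countries = desc.findall("country")
        if len(countries) < 2:
            continue  # nothing to separate
        position = list(parent).index(desc)
        parent.remove(desc)
        for offset, country in enumerate(countries):
            clone = copy.deepcopy(desc)
            # Keep only the country this copy is meant for.
            for other in clone.findall("country"):
                if other.get("key") != country.get("key"):
                    clone.remove(other)
            parent.insert(position + offset, clone)


category = ET.fromstring(
    '<category><desc xml:lang="de"><country key="AT"/><country key="DE"/>'
    "<term>Legislative</term></desc></category>"
)
split_by_country(category)
print(ET.tostring(category, encoding="unicode"))
```

If the country elements were only present in ambiguous cases, the same operation would first need to work out which countries the unmarked desc applied to, which is the extra effort discussed in this thread.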

@TomazErjavec
Collaborator Author

> Maybe we can skip the text form and just add country/@key or region/@key

Yes, why not. It is then easier to insert in any case, as it is not necessary to figure out the name of the country/region in the local language. On the other hand, it doesn't hurt to have the name, as what is output in "normal" contexts is just the term, so it makes the text content of the desc more informative.

> This would simply allow a single translation for multiple countries

My idea was to use country/region only for ambiguous cases, rather than having it everywhere (like in the NER taxonomy...).
But it is true that this makes separating two countries' translations more difficult (you would need to retrieve, post hoc, the key for the previously unambiguous country). So, not sure. In any case, even if we always have the country/region there, it seems better to me to have two descriptions for unambiguous terms, rather than one with several country/region elements.

> Not sure which approach makes these operations easier to maintain:
>
>   • adding a new translation
>   • updating old translation
>   • separating two countries' translations

Hard to say (esp. as you were doing the coding for this) but my guess would be that whichever way we choose, the effort to implement this will be similar...
