Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid result after merge operation (related to Element#mergeText) #28

Closed
ajacob opened this issue Nov 1, 2023 · 10 comments
Closed

Invalid result after merge operation (related to Element#mergeText) #28

ajacob opened this issue Nov 1, 2023 · 10 comments

Comments

@ajacob
Copy link

ajacob commented Nov 1, 2023

Hello, thank you for sharing this tool.

I found a problem related to merge operation, I'll try to describe it in this issue

How to reproduce

test.skl

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>%%%1%%%
</NIV1>
</ROOT>

test.xlf

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" srcLang="fr" version="2.1" trgLang="en-US">
  <file original="test.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.15.0 20230913_0710</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>
      <segment state="final" id="1-0">
        <source xml:space="preserve"> Bonjour. </source>
        <target> Hello. </target>
      </segment>
      <ignorable>
        <source xml:space="preserve">
          <ph id="ph1"/>
        </source>
      </ignorable>
      <segment state="final" id="1-1">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
        <target> This text should be translated </target>
      </segment>
    </unit>
  </file>
</xliff>

command

./merge.sh -xliff test.xlf -target result.xml

actual xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1>
          <G>
         Bonjour. 
           Hello. 
          </G>
         Ce text devrait être traduit  This text should be translated </NIV1>
</ROOT>

expected xml result

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Hello. </G> This text should be translated </NIV1>
</ROOT>

Investigation

First, I observed that this issue appears because I formatted the xliff file.
If I rollback some changes, the result is correct. Rolled back changes are:

      <ignorable>
        <source xml:space="preserve">
          <ph id="ph0"/>
        </source>
      </ignorable>

      <ignorable>
        <source xml:space="preserve"><ph id="ph0"/></source>
      </ignorable>

(needed as well for ph1)

Looking into the code, I think I understand the problem.

In https://github.com/rmraya/OpenXLIFF/blob/v3.15.0/src/com/maxprograms/xliff2/FromXliff2.java#L294-L326

We have:

  • l. 301
joinedSource.addContent(src.getContent());
  • l. 308
joinedTarget.addContent(src.getContent());

At this point joinedSource and joinedTarget point to the same content (same java objects per references)

Then l. 326 we have:

src.setContent(harvestContent(joinedSource, tags, attributes));

Which internally calls joinedSource.getContent(); which itself calls Element#mergeText.

And this is precisely were we have an issue.
When harvesting for the source we call the mergeText function that mutates some XMLnodes that are also referenced by the target. After harvesting for the source, the content of target becomes invalid.

To confirm this hypothesis, I simply removed the call to #mergeText in Element#getContent and it actually fixes the problem.

Solution

I don't know what would be the best to fix this issue, I think there are different possibilities like:

  • Not mutating state in a method that apparently only reads data (we don't expect #getContent to mutate the state of Element)
    • I saw that this method was also called in the equals method, which can be dangerous as well
  • Not sharing data between 2 Element objets, in which case that would requires to copy XMLNodes

If you want me to share a PR with a fix proposal, feel free to ask.

By the way, I think it could be very helpful to introduce a pom.xml in order to have maven dependencies (in which case it would be necessary to publish XMLJava in maven central)

@rmraya
Copy link
Owner

rmraya commented Nov 1, 2023

Please provide the original XML file you are trying to translate so I can reproduce the issue.

There are no external dependencies for this project, so there is no need for Maven or a pom.xml file.

@rmraya
Copy link
Owner

rmraya commented Nov 2, 2023

FWIW, the lines of code that you mention are correct. They do exactly what they are supposed to do.

Please provide the source file for testing or close this issue.

@ajacob
Copy link
Author

ajacob commented Nov 2, 2023

Hello @rmraya,

Everything you need to reproduce this issue is provided above.

Here are step by step instructions:

  • Create a folder
  • In the folder, create 2 files: test.skl and test.xlf with the content from my first comment (you can simply copy/paste)
  • Then, simply run merge.sh utility like this: ~/OpenXLIFF/dist/merge.sh -xliff test.xlf -target result.xml

This will produce a result.xml file with duplicated text (as shown above).

To my understanding the produced result is not valid but if you think everything behaves as expected, feel free to close this issue

@rmraya
Copy link
Owner

rmraya commented Nov 2, 2023

@ajacob I need the original XML used to create the XLIFF. I want to perform a full roundtrip: generate XLIFF, translate and merge.

@rmraya
Copy link
Owner

rmraya commented Nov 2, 2023

@ajacob You mention that you made changes to the XLIFF, I want to see the version generated by OpenXLIFF

@rmraya
Copy link
Owner

rmraya commented Nov 2, 2023

@ajacob:

  1. Created a source file like this
<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Bonjour. </G> Ce text devrait être traduit </NIV1>
</ROOT>
  1. generated XLIFF with this command:

convert.sh -file original.xml -srcLang fr-FR -tgtLang en-US -2.0 -type XMLG

and obtained this:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr-FR" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/original.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/original.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment id="1-0">
        <source xml:space="preserve"> Bonjour.</source>
      </segment>
      <ignorable id="1-1">
        <source xml:space="preserve"> </source>
      </ignorable>
    </unit>
    <unit id="2">
      <segment id="2">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
      </segment>
    </unit>
  </file>
</xliff>
  1. Translated like this:
<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr-FR" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/original.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/original.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment state="final" id="1-0">
        <source xml:space="preserve"> Bonjour.</source>
        <target xml:space="preserve"> Hello.</target>
      </segment>
      <ignorable id="1-1">
        <source xml:space="preserve"> </source>
      </ignorable>
    </unit>
    <unit id="2">
      <segment state="final" id="2">
        <source xml:space="preserve"> Ce text devrait être traduit </source>
        <target xml:space="preserve"> This text should be translated </target>
      </segment>
    </unit>
  </file>
</xliff>
  1. Merged the translated file with this command:

merge.sh -xliff original.xml.xlf

and obtained this output file:

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <NIV1><G> Hello. </G> This text should be translated </NIV1>
</ROOT>

I don't know what you did, but the code seems to work.

@rmraya rmraya closed this as completed Nov 2, 2023
@ajacob
Copy link
Author

ajacob commented Nov 2, 2023

I don't know what you did, but the code seems to work.

I'll try to give you more information / STR, and hopefully help you reproduce this issue.

I've created a very simple sample that involves another tool that add translations into the xlf file.

SETUP

input.xml

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1><G>Bonjour.</G> Ceci est un text multi-ligne
qui a pour objectif de mettre en évidence
un simple bug. J'espère que ça aidera.</NIV1>
</ROOT>

config_ROOT.xml

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE ini-file PUBLIC "-//MAXPROGRAMS//Converters 2.0.0//EN" "configuration.dtd" >
<ini-file>
    <tag hard-break="segment">NIV1</tag>
    <tag ctype="x-bold" hard-break="inline">G</tag>
</ini-file>

CONVERT

./convert.sh -type XML -file input.xml -xliff test.xlf -srcLang fr -tgtLang en-US -skl test.skl -xmlfilter . -2.1

test.xlf

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.1" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/input.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve"><ph id="ph0"/></source>
      </ignorable>
      <segment id="1-0">
        <source xml:space="preserve">Bonjour.</source>
      </segment>
      <ignorable>
        <source xml:space="preserve"><ph id="ph1"/></source>
      </ignorable>
      <segment id="1-1">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
      </segment>
      <segment id="1-2">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
      </segment>
    </unit>
  </file>
</xliff>

test.skl

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1>%%%1%%%
</NIV1>
</ROOT>

TRANSLATION

This is an important step as the xlf file is modified by another tool that:

  • adds <target> elements
  • but also reformats the content

test.xlf

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.1" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en-US" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/home/alexandre/hello/input.xml" id="1">
    <skeleton href="test.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <mda:metadata id="1">
        <mda:metaGroup category="attributes" id="ph0">
          <mda:meta type="ctype">x-bold</mda:meta>
        </mda:metaGroup>
      </mda:metadata>
      <originalData>
        <data id="ph0">&lt;G&gt;</data>
        <data id="ph1">&lt;/G&gt;</data>
      </originalData>
      <ignorable>
        <source xml:space="preserve">
            <ph id="ph0"/>
        </source>
      </ignorable>
      <segment id="1-0">
        <source xml:space="preserve">Bonjour.</source>
        <target>Hello.</target>
      </segment>
      <ignorable>
        <source xml:space="preserve">
            <ph id="ph1"/>
        </source>
      </ignorable>
      <segment id="1-1">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
        <target>This is a multi-line text that is supposed to demo a simple bug.</target>
      </segment>
      <segment id="1-2">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
        <target> I hope this helps.</target>
      </segment>
    </unit>
  </file>
</xliff>

MERGE

./merge.sh -xliff test.xlf -target result.xml

result.xml

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1>
			<G>
		Bonjour.
			Hello.
			</G>
		 Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug. J'espère que ça aidera.This is a multi-line text that is supposed to demo a simple bug. I hope this helps.</NIV1>
</ROOT>

And as you can see, we now have duplicated content.

To help you better understand the changes in the xlf file, here is the diff (generated by another tool)

--- test.xlf	2023-11-02 23:50:14.943475711 +0100
+++ test.xlf	2023-11-02 23:50:19.660208235 +0100
@@ -26,20 +26,27 @@
         <data id="ph1">&lt;/G&gt;</data>
       </originalData>
       <ignorable>
-        <source xml:space="preserve"><ph id="ph0"/></source>
+        <source xml:space="preserve">
+            <ph id="ph0"/>
+        </source>
       </ignorable>
       <segment id="1-0">
         <source xml:space="preserve">Bonjour.</source>
+        <target>Hello.</target>
       </segment>
       <ignorable>
-        <source xml:space="preserve"><ph id="ph1"/></source>
+        <source xml:space="preserve">
+            <ph id="ph1"/>
+        </source>
       </ignorable>
       <segment id="1-1">
         <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
+        <target>This is a multi-line text that is supposed to demo a simple bug.</target>
       </segment>
       <segment id="1-2">
         <source xml:space="preserve"> J'espère que ça aidera.</source>
+        <target> I hope this helps.</target>
       </segment>
     </unit>
   </file>

@rmraya
Copy link
Owner

rmraya commented Nov 2, 2023

@ajacob

Your sample does not need a custom filter configuration.

I used this command to convert input.xml to XLIFF:

convert.sh -file input.xml -srcLang fr -tgtLang en -type XMLG -2.0

and got this XLIFF:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/input.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/input.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment id="1">
        <source xml:space="preserve">Bonjour.</source>
      </segment>
    </unit>
    <unit id="2">
      <segment id="2-0">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
      </segment>
      <segment id="2-1">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
      </segment>
    </unit>
  </file>
</xliff>

I translated the XLIFF with Swordfish and got this translated version:

<?xml version="1.0" encoding="UTF-8" ?>
<xliff srcLang="fr" xmlns:mtc="urn:oasis:names:tc:xliff:matches:2.0" version="2.0" xmlns:mda="urn:oasis:names:tc:xliff:metadata:2.0" trgLang="en" xmlns="urn:oasis:names:tc:xliff:document:2.0">
  <file original="/Users/rmraya/Desktop/input.xml" id="1">
    <skeleton href="/Users/rmraya/Desktop/input.xml.skl"/>
    <mda:metadata>
      <mda:metaGroup category="format">
        <mda:meta type="datatype">xml</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="tool">
        <mda:meta type="tool-id">OpenXLIFF</mda:meta>
        <mda:meta type="tool-name">OpenXLIFF Filters</mda:meta>
        <mda:meta type="tool-version">3.16.0 20231029_1141</mda:meta>
      </mda:metaGroup>
      <mda:metaGroup category="PI">
        <mda:meta type="encoding">UTF-8</mda:meta>
      </mda:metaGroup>
    </mda:metadata>
    <unit id="1">
      <segment state="final" id="1">
        <source xml:space="preserve">Bonjour.</source>
        <target xml:space="preserve">Hello.</target>
      </segment>
    </unit>
    <unit id="2">
      <segment state="final" id="2-0">
        <source xml:space="preserve"> Ceci est un text multi-ligne qui a pour objectif de mettre en évidence un simple bug.</source>
        <target xml:space="preserve"> This is a multi-line text that aims to highlight a simple bug.</target>
      </segment>
      <segment state="final" id="2-1">
        <source xml:space="preserve"> J'espère que ça aidera.</source>
        <target xml:space="preserve"> I hope it will help.</target>
      </segment>
    </unit>
  </file>
</xliff>

Then, I merged the file using this command:

merge.sh -xliff translated.xlf -target translated.xml

and obtained this translated file:

<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
    <NIV1><G>Hello.</G> This is a multi-line text that aims to highlight a simple bug. I hope it will help.</NIV1>
</ROOT>

Note that the segments of the XLIFF that you merged are not confirmed.

@ajacob
Copy link
Author

ajacob commented Nov 3, 2023

@rmraya

Your sample does not need a custom filter configuration.

The sample doesn't, the 200KB+ XML file I was working on and on which I noticed the bug does
The goal of the sample is to demonstrate the bug with very concise and simple data

When I created this issue, my objective was not to find a workaround or find a way to make the sample work, the purpose was to show you that the merge operation can fail with valid data (not produced by Swordfish).

@rmraya
Copy link
Owner

rmraya commented Nov 3, 2023

@ajacob

As far as I can see, the merge operation works as designed. I don't see a bug. The lines of code that you mentioned are doing exactly what is intended, recover text and tags from <source> when <ignorable> does not have <target>.

If you want me to fix anything, provide a real test case that I can use to reproduce the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants