Releases: bitextor/bitextor-testing-output
Bitextor testing output
These testing output files differ from v11 and were generated using a commit very close to this commit.
The changes were caused by:
- New version of bicleaner-hardrules
Bitextor testing output
These testing output files differ from v10 and were generated using a commit very close to this commit.
The changes were caused by:
- Fixed bug in documents output rule
- Fixed dictionary-based docalign feature: mutually linked documents
Bitextor testing output
These testing output files differ from v9 and were generated using a commit very close to this commit.
The changes were caused by:
- New method for scoring in the TF-IDF MT-based document aligner
Bitextor testing output
These testing output files differ from v8 and were generated using a commit very close to this commit.
The changes were caused by:
- New model for the dictionary-based document aligner, retrained because of the Scikit-learn version bump.
- New bicleaner models due to the same Scikit-learn version bump.
- New version of bicleaner-hardrules (lowercasing before scoring, new fastspell version, etc.), which changes the Bicleaner and Bicleaner-AI scores too.
- New document output test (number 40 in run-tests-min and 80 in run-tests)
Bitextor testing output
These testing output files differ from v7 and were generated using a commit very close to this commit.
The changes were caused by:
- New text2prevertical change (avoiding strip to preserve original WARC spaces) introduced in bitextor/bitextor#245 modifies test 11 results.
- New metadata added to the output files in tests 13 and 73.
- Tests 70, 71, 72, 73 and 102 in run-tests.gz have been run on CPU and on older-architecture Nvidia GPUs (which give the same result), instead of on a newer-architecture GPU (Nvidia A100), which has different precision.
- Test 102 in run-tests.gz uses --disable_minimal_length in Bicleaner through the new Bitextor option --bicleanerExtraArgs. This modified three sentence pairs that previously had a Bicleaner score of 0 because of the minimal length hard rule (source, target, or both were 2 tokens long). They are still filtered out because their score remains lower than 0.5.
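The effect described for test 102 can be illustrated with a minimal sketch of threshold filtering. The function name `keep_pair`, the inclusive comparison, and the sample score 0.31 are assumptions made for illustration; they are not Bicleaner's API or real test data. Only the 0.5 threshold and the score of 0 for hard-rule rejections come from the note above.

```python
THRESHOLD = 0.5  # threshold mentioned in the release note


def keep_pair(score: float, threshold: float = THRESHOLD) -> bool:
    """Return True if a sentence pair survives score-based filtering.

    Assumption: pairs scoring at or above the threshold are kept.
    """
    return score >= threshold


# Hypothetical scores for one short sentence pair (not real test data).
score_with_hardrule = 0.0      # forced to 0 by the minimal length hard rule
score_without_hardrule = 0.31  # hard rule disabled, classifier score used instead

kept_before = keep_pair(score_with_hardrule)
kept_after = keep_pair(score_without_hardrule)
print(kept_before, kept_after)
```

In both cases the pair is dropped, so the score column changes while the set of kept sentence pairs does not.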
Bitextor testing output
These testing output files differ from v6 and were generated using a commit very close to this commit.
The changes were caused by:
- New bicleaner-hardrules version 2.5, now disabling URL filtering by default, given the new, more aggressive URL filtering.
25/11/2022: Changed run-test-min.tgz to add dir2warc test outputs.
Bitextor testing output
These testing output files differ from v5 and were generated using a commit very close to this commit.
The changes were apparently caused by:
- Adding the number of the final paragraph in a document, if the option is enabled.
Bitextor testing output
These testing output files differ from v4 and were generated using a commit very close to this commit.
The changes were apparently caused by:
- Removing the default tokenizer from Bicleaner (a tokenizer is now used only if the user provides one).
- Because the Bicleaner scores differ, the number of sentences in some tests has changed as a result of the configured threshold.
- Bicleaner AI submodule was updated, and scores might have been altered for this reason as well.
- Some output files have a different order since the sorting condition has been slightly changed (e.g. run-deferred-tests.tgz).
Update (after the release was published):
- Tests 40 and 50 have been enabled again: bitextor/bicleaner#72
- Test 40.1 was failing, which led us to think that hunalign was returning non-deterministic values depending on the machine on which the tests were executed. Actually, we had not noticed that a different dictionary was being used, and that was the reason for the different values. The underlying cause was that in GHA the tests are executed concurrently on separate machines, while locally all the tests were executed concurrently on the same machine. As a result, the dictionary was being replaced in local runs. Fix: bitextor/bitextor@2a69167
- Older tests had been uploaded for the run-tests.tgz file. This has been fixed.
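The dictionary problem in test 40.1 is a typical shared-file race between concurrent tests. The sketch below shows one generic way to avoid it, giving each test its own working directory. It is not taken from the referenced commit, and the test ids, file name, and dictionary contents are all hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_test(test_id: str, dictionary_text: str) -> str:
    """Write and read back a dictionary inside a directory private to this test."""
    with tempfile.TemporaryDirectory(prefix=f"{test_id}-") as workdir:
        dic_path = Path(workdir) / "dictionary.dic"  # hypothetical file name
        dic_path.write_text(dictionary_text, encoding="utf-8")
        # A real test would run the pipeline here; this sketch only reads the file back.
        return dic_path.read_text(encoding="utf-8")


# Hypothetical tests, each needing a different dictionary.
tests = {"test-a": "house\tmaison\n", "test-b": "house\tcasa\n"}

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {tid: pool.submit(run_test, tid, text) for tid, text in tests.items()}
    results = {tid: fut.result() for tid, fut in futures.items()}

print(results == tests)
```

Because no path is shared, running the tests on one machine or on several gives each test the dictionary it wrote.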
Bitextor testing output
These testing output files differ from v3 and were generated using a commit very close to this commit.
The changes were apparently caused by:
- The sentence splitter was printing the total number of paragraphs found when paragraph identification was enabled. This was an issue because the input might not contain every paragraph (e.g. paragraphs removed by boilerplate removal), so the total paragraph count might be lower than the last paragraph id. This count has been removed.
- Vecalign was printing the target URL as the source URL (commit).
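The paragraph count issue can be seen in a tiny example. The paragraph ids and texts below are made up for illustration and do not reflect the sentence splitter's real input format.

```python
# (paragraph_id, text) pairs left after boilerplate removal.
# Hypothetical data: paragraphs 2 and 4 were removed upstream.
paragraphs = [(1, "First paragraph."), (3, "Third paragraph."), (5, "Fifth paragraph.")]

total_seen = len(paragraphs)                 # what a naive total would report
last_id = max(pid for pid, _ in paragraphs)  # highest paragraph id in the input

print(total_seen, last_id)
```

Here the count of paragraphs seen is lower than the last paragraph id, so a printed total would be misleading.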
Bitextor testing output
These testing output files differ from v2 and were generated using a commit very close to this commit.
The changes were apparently caused by: