Releases · bitextor/bitextor-testing-output

New model for the dictionary-based document aligner, retrained because of the Scikit-learn version bump.
New Bicleaner models, due to the same Scikit-learn version bump.
New version of bicleaner-hardrules (lowercasing before scoring, new fastspell version, etc.), which also changes the Bicleaner and Bicleaner AI scores.
New document output test (number 40 in run-tests-min and 80 in run-tests).


30 Nov 10:41

lpla

19b0958

Bitextor testing output

Testing output files that differ from v7. They have been generated using a commit very close to this commit.

The changes are due to:

A new text2prevertical change (avoiding strip in order to preserve the original WARC spaces), introduced in bitextor/bitextor#245, modifies the results of test 11.
New metadata has been added to the output files in tests 13 and 73.
Tests 70, 71, 72, 73 and 102 in run-tests.gz have been run on CPU and on older-architecture Nvidia GPUs (which give the same results), instead of on a new-architecture GPU (Nvidia A100), which has different precision.
Test 102 in run-tests.gz uses --disable_minimal_length in Bicleaner through the new Bitextor option --bicleanerExtraArgs. This changed three sentence pairs that previously received a Bicleaner score of 0 from the minimal length hardrule (source, target or both were 2 tokens long); they are still filtered out because their score remains lower than 0.5.
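The effect on the three sentence pairs in test 102 can be sketched roughly as follows. This is an illustration only, not Bicleaner's implementation: the function names, the `min_tokens` value and the sample classifier score are hypothetical; only the 0.5 threshold and the 2-token condition come from the note above.

```python
THRESHOLD = 0.5  # score threshold mentioned in the release note


def final_score(src, trg, classifier_score, minimal_length_rule=True, min_tokens=3):
    """Return the score a pair would receive (illustrative only).

    min_tokens=3 is an assumption, chosen so that 2-token segments trip the rule.
    """
    too_short = len(src.split()) < min_tokens or len(trg.split()) < min_tokens
    if minimal_length_rule and too_short:
        return 0.0  # the hardrule short-circuits the classifier
    return classifier_score


def keep_pair(score, threshold=THRESHOLD):
    """A pair survives filtering only if its score reaches the threshold."""
    return score >= threshold


pair = ("Hello world", "Hola mundo")  # both sides are 2 tokens long
classifier_score = 0.31               # hypothetical classifier output

with_rule = final_score(*pair, classifier_score)
without_rule = final_score(*pair, classifier_score, minimal_length_rule=False)

# The score changes once the rule is disabled, but the pair is dropped either way.
print(with_rule, keep_pair(with_rule))
print(without_rule, keep_pair(without_rule))
```

Under these assumptions the pair's score in the output changes, while the set of pairs that pass the threshold stays the same.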


23 Nov 11:56

lpla

19b0958

Bitextor testing output

Testing output files that differ from v6. They have been generated using a commit very close to this commit.

The changes are due to:

New bicleaner-hardrules version 2.5; URL filtering is now disabled by default because the new URL filtering is more aggressive.

25/11/2022: Changed run-test-min.tgz to add dir2warc test outputs.


23 Nov 08:34

lpla

19b0958

Bitextor testing output

Testing output files that differ from v5. They have been generated using a commit very close to this commit.

The changes are apparently due to:

Adding the number of the final paragraph in a document, if the option is enabled.


13 Oct 14:29

cgr71ii

19b0958

Bitextor testing output

Testing output files that differ from v4. They have been generated using a commit very close to this commit.

The changes are apparently due to:

Removing the default tokenizer from Bicleaner (a tokenizer is now passed only if the user provides one).
- Because the Bicleaner scores differ, the number of sentences in some tests has changed as a result of the configured threshold.
The Bicleaner AI submodule was updated, and scores might have changed for this reason as well.
Some output files are ordered differently since the sorting condition has been slightly changed (e.g. run-deferred-tests.tgz).

Update (after the release was published):

Tests 40 and 50 have been enabled again: bitextor/bicleaner#72
Test 40.1 was failing, which led us to think that hunalign was returning non-deterministic values depending on the machine on which the tests were executed. In fact, we had not noticed that a different dictionary was being used, which explained the different values. The underlying cause was that in GHA the tests are executed concurrently on separate machines, while locally all the tests were executed concurrently but on the same machine, so the dictionary was being replaced locally. Fix: bitextor/bitextor@2a69167
Older tests had been uploaded for the run-tests.tgz file. This has been fixed.
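The kind of interference described for test 40.1 (concurrent tests on one machine replacing a shared dictionary) is commonly avoided by giving each test its own working directory. The sketch below illustrates that general idea only; it does not reproduce the fix in the commit referenced above, and the file name `en-fr.dic`, the helper names and the sample dictionary contents are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_test(test_id, dictionary_text, workdir):
    """Write the dictionary a test needs, then read it back as an aligner would."""
    dic = Path(workdir) / "en-fr.dic"  # hypothetical file name
    dic.write_text(dictionary_text)
    return test_id, dic.read_text()


def run_isolated(tests):
    """Run tests concurrently, each inside its own temporary directory."""
    def job(item):
        test_id, text = item
        with tempfile.TemporaryDirectory() as tmp:
            return run_test(test_id, text, tmp)

    with ThreadPoolExecutor(max_workers=len(tests)) as pool:
        return dict(pool.map(job, tests))


# Two tests that need different dictionaries no longer overwrite each other.
results = run_isolated([("40", "house\tmaison\n"), ("40.1", "dog\tchien\n")])
print(results)
```

If both tests wrote to the same path instead, whichever ran last would determine the dictionary seen by the other, which matches the symptom of results that appear to depend on the machine.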


23 Sep 11:13

cgr71ii

19b0958

Bitextor testing output

Testing output files that differ from v3. They have been generated using a commit very close to this commit.

The changes are apparently due to:

The sentence splitter was printing the total number of paragraphs found while processing paragraph identification. This was an issue since we might not receive all the paragraphs as input (e.g. paragraphs removed by boilerplate removal), so the total paragraph count might be lower than the last paragraph id. This count has been removed.
Vecalign was printing the target URL as the source URL (commit).
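The paragraph-count issue in the sentence splitter can be seen with a small, made-up example. The paragraph ids, texts and the set of removed ids below are hypothetical and only illustrate why a count of the paragraphs received can be lower than the last paragraph id.

```python
# Paragraph ids as assigned before boilerplate removal (hypothetical sample).
paragraphs = {
    1: "Header",
    2: "Cookie banner",
    3: "Main text",
    4: "Footer links",
    5: "Closing remarks",
}

# Ids dropped by boilerplate removal before the sentence splitter runs.
boilerplate_ids = {2, 4}

remaining = {pid: text for pid, text in paragraphs.items() if pid not in boilerplate_ids}

total_seen = len(remaining)  # what a "total paragraphs" counter would report
last_id = max(remaining)     # highest paragraph id still present in the input

print(total_seen, last_id)
```

Here the splitter only sees three paragraphs while the last id is 5, so reporting the count alongside the ids would be misleading.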


06 Sep 10:48

cgr71ii

19b0958

Bitextor testing output

Testing output files that differ from v2. They have been generated using a commit very close to this commit.

The changes are apparently due to:

The Bifixer submodule has been updated (log).
