    Repositories list

    • scrawl

      Public
      Playwright-based web crawler
      Python
      GNU General Public License v3.0
      Updated Nov 14, 2024
    • bitextor

      Public
      Bitextor generates translation memories from multilingual websites
      Python
      GNU General Public License v3.0
      Updated Nov 11, 2024
    • Pre-filtering step for bicleaner
      Python
      GNU General Public License v3.0
      Updated Oct 8, 2024
    • PDF parser and converter to HTML
      Java
      GNU General Public License v3.0
      Updated Oct 3, 2024
    • bifixer

      Public
      Tool to fix bitexts and tag near-duplicates for removal
      Python
      GNU General Public License v3.0
      Updated Aug 19, 2024
    • warc2text

      Public
      Extracts plain text, language identification and more metadata from WARC records
      C++
      MIT License
      Updated Aug 8, 2024
    • Bicleaner fork that uses neural networks
      Python
      GNU General Public License v3.0
      Updated Jul 26, 2024
    • bicleaner

      Public
      Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
      Python
      GNU General Public License v3.0
      Updated Jun 18, 2024
    • biroamer

      Public
      Utility that will help you to ROAM (Random Omit Anonymize and Mix) your parallel corpus.
      Python
      GNU General Public License v3.0
      Updated Feb 26, 2024
    • Python
      GNU General Public License v3.0
      Updated Sep 6, 2023
    • Python
      GNU General Public License v3.0
      Updated May 31, 2023
    • Repository for storing testing outputs from Bitextor
      GNU General Public License v3.0
      Updated May 29, 2023
    • Extracts plain text, language identification and more metadata from Spiderling prevertical files
      C++
      MIT License
      Updated May 17, 2023
    • fastText

      Public
      Library for fast text representation and classification.
      HTML
      MIT License
      Updated May 4, 2023
    • Reconstructs sentences using deferred crawling standoff annotations from Bitextor
      Python
      MIT License
      Updated May 4, 2023
    • Repository of Bicleaner AI models
      Other
      Updated Mar 28, 2023
    • C++
      GNU General Public License v3.0
      Updated Mar 10, 2023
    • Repository for data models, dictionaries and more resources for Bicleaner
      GNU General Public License v3.0
      Updated Dec 15, 2022
    • vecalign

      Public
      Improved Sentence Alignment in Linear Time and Space
      Python
      Apache License 2.0
      Updated Dec 4, 2022
    • Python interface to Apache Tika, HTML extraction from PDF
      Python
      Other
      Updated Nov 30, 2022
    • Python module to interface with Java Loomchild sentence segmenter
      Python
      GNU General Public License v3.0
      Updated Nov 28, 2022
    • Fork of glove-python to distribute binary builds
      Python
      Apache License 2.0
      Updated Aug 12, 2022
    • Document aligner which uses neural technologies to search matches across bilingual documents
      Python
      GNU General Public License v3.0
      Updated Jun 9, 2022
    • bitextor-neural

      Public archive
      Bitextor Neural generates translation memories from multilingual websites using state-of-the-art Machine Learning tools
      Python
      GNU General Public License v3.0
      Updated Jun 3, 2022
    • Monocleaner models repository
      GNU General Public License v3.0
      Updated Nov 18, 2021
    • hunalign

      Public
      Sentence aligner
      C++
      GNU Lesser General Public License v3.0
      Updated May 21, 2021
    • cld2

      Public
      Compact Language Detector 2
      C++
      Apache License 2.0
      Updated May 4, 2021
    • Python interface to pdf-extract, HTML extraction from PDF
      Python
      Other
      Updated Sep 3, 2020
    • Repository for data models, dictionaries and more resources for Bitextor
      GNU General Public License v3.0
      Updated Feb 7, 2020