wikiplaintext

Get plain text from Wikipedia pages, as clean as possible.

Working from the latest Wikimedia dumps, the tool parses the HTML pages and produces the cleanest possible version of the text, using markdown formatting for headers, lists, and tables.

Examples of output can be found in the folder tests/examples_markdown:

This code was used to generate the HuggingFace datasets:

These datasets are intended to be cleaner and more complete than the French subsets of the Wikimedia datasets:


Documentation

Installation

git clone git@github.com:OpenLLM-France/wikiplaintext.git
cd wikiplaintext
pip install -r requirements.txt

All the scripts mentioned below are in the wikiplaintext subfolder.

Dump Wikipedia

Download the latest version available

The following command will:

  1. Download the latest version of the Wikipedia dump from the Wikimedia Enterprise HTML dumps
  2. Extract the ndjson files from the dump
  3. Extract one HTML file per Wikipedia page
  4. Parse each HTML file to get clean plain text, and save it in a file

python dump_wiki_html.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --source wiki

This will generate plain text files in the subfolder /path/to/Wikipedia/{YYYYMMDD}/frwiki_txt/frwiki_namespace_0_* where {YYYYMMDD} is the latest version available.

One file is written per Wikipedia page, with the page id and title as the filename.
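
To make steps 2 and 3 above more concrete, here is a minimal sketch of how one extracted ndjson file could be read page by page. It assumes that each line is a JSON object with identifier, name and article_body.html fields, which is our understanding of the Wikimedia Enterprise HTML dump layout but is not confirmed by this README. The helpers iter_pages and page_filename are hypothetical and are not part of dump_wiki_html.py; the real naming scheme may differ.

```python
import io
import json
import re


def iter_pages(ndjson_stream):
    """Yield (page_id, title, html) for each non-empty line of an ndjson stream."""
    for line in ndjson_stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Field names are assumptions about the dump layout.
        html = record.get("article_body", {}).get("html", "")
        yield record["identifier"], record["name"], html


def page_filename(page_id, title, extension=".txt"):
    """Build a filesystem-friendly name from the page id and title (hypothetical scheme)."""
    safe_title = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return f"{page_id}_{safe_title}{extension}"


# Tiny in-memory example standing in for one extracted ndjson file.
sample = io.StringIO(
    json.dumps({"identifier": 42, "name": "Tour Eiffel",
                "article_body": {"html": "<p>Bonjour</p>"}}) + "\n"
)
pages = list(iter_pages(sample))
```

Reading the file line by line keeps memory usage low, which matters because each ndjson file can hold a large number of pages.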

Download a given version

python dump_wiki_html.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --source wiki \
    --date 20231201

This will generate plain text files in the subfolder /path/to/Wikipedia/20231201/frwiki_txt/frwiki_namespace_0_*.

How to parallelize

The process can be parallelized by launching the same command several times with the option --subset {i}/{n}. For example, 5 processes can be launched with the following commands:

python dump_wiki_html.py ... --subset 1/5 &
python dump_wiki_html.py ... --subset 2/5 &
python dump_wiki_html.py ... --subset 3/5 &
python dump_wiki_html.py ... --subset 4/5 &
python dump_wiki_html.py ... --subset 5/5 &

We recommend running these commands in separate windows of a tmux session (or screen session).
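
As an alternative to typing each command by hand, the command lines can be generated programmatically. The sketch below only uses the flags shown in this README; subset_commands is a hypothetical helper, not something shipped with the repository, and the base arguments are an example.

```python
import shlex


def subset_commands(base_args, n):
    """Return one command (as a list of arguments) per subset 1/n .. n/n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [
        ["python", "dump_wiki_html.py", *base_args, "--subset", f"{i}/{n}"]
        for i in range(1, n + 1)
    ]


# Example base arguments, taken from the commands above.
base_args = ["--output_dir", "/path/to/Wikipedia", "--language", "fr", "--source", "wiki"]
commands = subset_commands(base_args, 5)
for command in commands:
    print(shlex.join(command))

# To actually launch the processes in parallel and wait for them
# (requires `import subprocess`):
# processes = [subprocess.Popen(c) for c in commands]
# exit_codes = [p.wait() for p in processes]
```

Printing the commands first makes it easy to check them before launching anything; the commented lines show one way to start all subsets from a single Python process.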

Dump Wiktionary

Download the latest version available

The process is very similar to the one for Wikipedia (see above).

python dump_wiki_html.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --source wiktionary

This will generate plain text files in the subfolder /path/to/Wikipedia/{YYYYMMDD}/frwiktionary_txt/frwiktionary_namespace_0_* where {YYYYMMDD} is the latest version available.

One file is written per Wiktionary page, with the page id and title as the filename.

Download a given version

python dump_wiki_html.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --source wiktionary \
    --date 20231201

This will generate plain text files in the subfolder /path/to/Wikipedia/20231201/frwiktionary_txt/frwiktionary_namespace_0_*.
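
The output locations for Wikipedia and Wiktionary follow the same pattern, which can be written down as a small helper. This is a sketch inferred only from the paths quoted in this README; text_output_dir and list_text_subfolders are hypothetical names, not functions of the repository.

```python
from pathlib import Path


def text_output_dir(output_dir, date, language, source):
    """Return the folder expected to hold the text files, e.g. .../20231201/frwiki_txt."""
    return Path(output_dir) / date / f"{language}{source}_txt"


def list_text_subfolders(output_dir, date, language, source):
    """List the {language}{source}_namespace_0_* entries that exist on disk."""
    root = text_output_dir(output_dir, date, language, source)
    return sorted(root.glob(f"{language}{source}_namespace_0_*"))


wiki_dir = text_output_dir("/path/to/Wikipedia", "20231201", "fr", "wiki")
wiktionary_dir = text_output_dir("/path/to/Wikipedia", "20231201", "fr", "wiktionary")
```

Such a helper can be handy for checking, after a run, which subfolders were actually produced for a given date.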

Dump Wikisource

For Wikisource, the process is a bit different because the Wikimedia dump is quite incomplete.

The process therefore consists of the following steps:

  1. Get all the page titles from the latest HuggingFace dataset from Wikimedia
  2. Download the HTML pages from the Wikimedia API
  3. Parse the HTML pages and get the plain text
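
The second step in the list above can be illustrated with a minimal sketch. It assumes the Wikimedia REST endpoint /api/rest_v1/page/html/{title}; this README does not say which endpoint dump_wikisource_api.py actually calls, so treat the URL scheme as an assumption. The functions page_html_url and fetch_page_html, and the User-Agent string, are hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def page_html_url(title, language="fr", project="wikisource"):
    """Build the REST URL serving the rendered HTML of a page (assumed endpoint)."""
    # MediaWiki titles use underscores instead of spaces; "/" must be escaped too.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{language}.{project}.org/api/rest_v1/page/html/{encoded}"


def fetch_page_html(title, language="fr", user_agent="wikiplaintext-example/0.1"):
    """Download the HTML of one page. Performs a network request."""
    request = Request(page_html_url(title, language), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


url = page_html_url("Les Misérables/Tome 1")
```

Keeping the URL construction separate from the download makes it possible to check the encoding of titles with accents or slashes without touching the network.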

The whole process can be run with the following command:

python dump_wikisource_api.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --version 20231201 \
    --dump_html

This will generate plain text files in the folder /path/to/Wikipedia/20231201/frwikisource_txt/frwikisource_namespace_0_0.

In addition, the --dump_html option saves all HTML pages in the folder /path/to/Wikipedia/20231201/frwikisource_html/frwikisource_namespace_0_0. This makes it possible to rerun the cleaning later, if the cleaning code evolves, using:

python dump_wikisource_api2.py \
    --output_dir /path/to/Wikipedia \
    --language fr \
    --version 20231201
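
To give an idea of what rerunning the cleaning on cached HTML involves, here is a deliberately simplified sketch based on the standard library. It is not the cleaning code of this repository: it only renders headings, paragraphs and list items, and ignores tables and everything else. The names HeadingAwareTextExtractor, html_to_text and reclean_folder, as well as the *.html file pattern, are assumptions made for the example.

```python
from html.parser import HTMLParser
from pathlib import Path


class HeadingAwareTextExtractor(HTMLParser):
    """Collect text, rendering <h1>..<h6> as markdown headers and <li> as list items."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = HEADINGS | {"p", "li"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self.prefix = ""
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "
        elif tag == "p":
            self.prefix = ""

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCKS:
            # Join the buffered fragments and normalize whitespace.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.buffer = []
            self.prefix = ""


def html_to_text(html):
    """Convert an HTML string to simplified markdown-like text."""
    parser = HeadingAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


def reclean_folder(html_dir, text_dir, pattern="*.html"):
    """Re-run the cleaning on every cached HTML file (file pattern is an assumption)."""
    text_dir = Path(text_dir)
    text_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(Path(html_dir).glob(pattern)):
        target = text_dir / (html_path.stem + ".txt")
        target.write_text(html_to_text(html_path.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(target)
    return written


text = html_to_text(
    "<h2>Chapitre I</h2><p>Il était <b>une</b> fois.</p><ul><li>un</li><li>deux</li></ul>"
)
```

Because the HTML is already on disk, a loop like reclean_folder needs no network access, which is the main benefit of dumping the HTML in the first place.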

Acknowledgements