A tool to help with the DTU course Computational Tools For Big Data, but it can be freely used. The tool consists of two parts: a Wikipedia Reader and a Wikipedia Splitter. If you are only interested in converting the dump to JSON, just use the Wikipedia Splitter.
This has only been tested on the dump `enwiki-20170820-pages-articles-multistream.xml`.
Run `pip install -r requirements.txt` to install the dependencies.
When you download Wikipedia you receive about 60GB of XML data. It is impossible to keep all of this in memory on most machines, so you need to stream it. Streaming makes accessing the data a bit harder, since you can't simply iterate over the tree for each page. We have therefore created a reader for the XML file, which parses each page into a Python dictionary. It is built on Python iterators, which means you can easily iterate over pages with the following code:
```python
from WikiReader import WikiReader  # adjust the import path to where WikiReader lives in this repo

input_file = "enwiki-20170820-pages-articles-multistream.xml"
wiki_reader = WikiReader(input_file)
for page_dict in wiki_reader:
    ...  # do something with the page
```
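The reader's internals are not shown here, but the general streaming idea can be sketched with `xml.etree.ElementTree.iterparse`. The snippet below is only an illustration of the approach (assuming Python 3.8+ for the `{*}` namespace wildcard), not the actual `WikiReader` implementation:

```python
import xml.etree.ElementTree as ET

# Illustrative sketch only, not the actual WikiReader implementation.
def iter_pages(input_file):
    """Stream <page> elements from a MediaWiki XML dump one at a time."""
    for _, elem in ET.iterparse(input_file, events=("end",)):
        if elem.tag.endswith("}page") or elem.tag == "page":
            yield {
                "title": elem.findtext("{*}title", default=""),
                "text": elem.findtext("{*}revision/{*}text", default=""),
            }
            elem.clear()  # drop the parsed element so memory use stays bounded
```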
Having all of Wikipedia in a single 60GB file is not very easy to work with; we'd much rather have it split into several smaller files.
To do this we have created a Wikipedia Splitter. It uses the Wikipedia Reader to read the XML data page by page into dictionaries, then takes a number of pages at a time and writes them to JSON files. At the same time it writes a JSON index file, which maps each page title to the JSON file that contains it. This way we can pick out individual pages for testing.
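As a rough illustration of that idea (this is not the actual `split-wiki` code; the chunk size and output file names below are made-up assumptions):

```python
import itertools
import json

# Rough sketch of the splitting idea; chunk size and file names are assumptions.
def split_pages(page_iter, pages_per_file=1000, prefix="wiki"):
    index = {}
    pages = iter(page_iter)
    for file_no in itertools.count():
        chunk = list(itertools.islice(pages, pages_per_file))
        if not chunk:
            break
        out_name = f"{prefix}_{file_no:05d}.json"
        with open(out_name, "w", encoding="utf-8") as f:
            json.dump(chunk, f)
        for page in chunk:
            index[page["title"]] = out_name  # remember which file holds each title
    with open(f"{prefix}_index.json", "w", encoding="utf-8") as f:
        json.dump(index, f)
```

With an index like this, looking up a single page only requires opening the one JSON file that contains it.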
The Splitter consists of three commands: `split-wiki`, `get-page`, and `get-pages`.
- The `split-wiki` command splits a Wikipedia XML file into several JSON files.
- The `get-page` command gets the JSON version of a single page. You must run the `split-wiki` command first.
- The `get-pages` command collects several pages, listed in a txt file, into a single JSON file. The file should just contain the name of each article on its own line (see the example after this list). This is useful for testing.
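For example, an article list file for `get-pages` could look like this (the article names are just placeholders):

```
Denmark
Python (programming language)
Big data
```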
To use any of the commands, run `python src/WikiSplitter.py {command}`.