Feat/wiki scraper #51
Conversation
Which wikis have you tried this on? Can it be easily parallelized to get more than 35 pages at once?
I think it's a good idea to replace Wikipedia's custom mathematics syntax with LaTeX. Does it make sense to do it at this stage of the pipeline, or later?
@craffel Yeah, it should be easy to parallelize this. It runs off files which list page titles (one per line), so you can parallelize over the files (we are already parallelizing over wikis), and we can split the inputs pretty easily for more parallelism. @StellaAthena I think it would be best to have that happen later. I was thinking that after this export there would be a step that converts from these xml files to dolma, which would have the raw wiki markup as the text.
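A minimal sketch of that file-level parallelism, assuming each input file lists one page title per line; the `export_pages` function and the `page_titles/` layout are hypothetical stand-ins for the PR's actual export step.

```python
# Sketch only: parallelize over files of page titles, one process per file.
import glob
from multiprocessing import Pool


def export_pages(title_file: str) -> str:
    """Placeholder for the per-file export step (read titles, request an export)."""
    with open(title_file, encoding="utf-8") as f:
        titles = [line.strip() for line in f if line.strip()]
    # ... call the wiki's export endpoint for these titles ...
    return f"{title_file}: {len(titles)} titles"


if __name__ == "__main__":
    title_files = glob.glob("page_titles/*.txt")  # hypothetical layout
    with Pool() as pool:
        for result in pool.imap_unordered(export_pages, title_files):
            print(result)
```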
- Add tools to scrape mediawiki wikis that don't publish dumps
- Add tool that exports the xml based on the list of pages
- Add the ability to convert wikis to dolma
- Download and extract script supports multiple workers
- Create WTF Wikipedia parsing server which uses a worker pool to allow for timeouts
- Create script that removes html tags we found in many wiki dumps
- Add Shadow Paging to the creation of wikitext dolma files
- Add Shadow Paging to dolma preprocessing
- Add script that removes `None` lines from dolma files (see the sketch after this list)
- Add script that can combine dolma shards while tracking what was used where, to allow for aligned combinations of later versions
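As a rough illustration of the `None`-line removal step, here is a hedged sketch assuming dolma shards are gzip-compressed JSONL and a bad line is either the literal string `None` or a record whose `text` field is null; the file names and field choice are assumptions, not the PR's actual script.

```python
# Sketch: copy a dolma shard, dropping None lines; assumptions noted above.
import gzip
import json


def filter_none_lines(src: str, dst: str) -> int:
    """Copy src to dst, dropping None lines; return how many were removed."""
    removed = 0
    with gzip.open(src, "rt", encoding="utf-8") as inp, \
         gzip.open(dst, "wt", encoding="utf-8") as out:
        for line in inp:
            line = line.strip()
            if not line or line == "None":
                removed += 1
                continue
            record = json.loads(line)
            if record.get("text") is None:
                removed += 1
                continue
            out.write(json.dumps(record) + "\n")
    return removed


if __name__ == "__main__":
    # Illustrative file names only.
    print(filter_none_lines("wiki_shard_00.jsonl.gz", "wiki_shard_00.clean.jsonl.gz"))
```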
Datasets have been uploaded to https://huggingface.co/datasets/blester125/wiki-dolma. WikiMedia + Talk pages are cleaner and have 14.6 billion tokens.
This PR adds scripts that can be used to get an xml export of mediawiki sites that don't provide dumps. The resulting dump will contain a list of `<page>` elements, one for each exported page. Each page has multiple `<revision>` elements which can be used to create an author list. The most recent `<revision>`'s `<text>` can be used to get the mediawiki markup representation of the page to use as the document text.

An index of pages is built using the `Special:AllPages` query url and then exports are made using `Special:Export`.
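For illustration, a hedged sketch of what requesting a batch export via `Special:Export` and pulling the latest revision text plus an author list out of the resulting XML might look like. The endpoint path, form parameters, and revision ordering are assumptions about typical MediaWiki behavior, not this PR's code.

```python
# Sketch: POST page titles to Special:Export, then walk <page>/<revision> elements.
import xml.etree.ElementTree as ET

import requests


def export_xml(wiki_base: str, titles: list[str]) -> str:
    """POST a batch of page titles to Special:Export and return the raw XML."""
    resp = requests.post(
        f"{wiki_base}/index.php",
        data={
            "title": "Special:Export",
            "pages": "\n".join(titles),  # one title per line
            # Parameter names for pulling full history vary across MediaWiki
            # versions; treating "history" this way is an assumption here.
            "history": "1",
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text


def local_name(tag: str) -> str:
    """Drop the XML namespace so we can match on <page>, <revision>, etc."""
    return tag.rsplit("}", 1)[-1]


def parse_export(xml_text: str):
    """Yield (title, latest_text, authors) for each <page> in the export."""
    root = ET.fromstring(xml_text)
    for page in root:
        if local_name(page.tag) != "page":
            continue
        title, revisions = None, []
        for child in page:
            if local_name(child.tag) == "title":
                title = child.text
            elif local_name(child.tag) == "revision":
                revisions.append(child)
        authors = set()
        latest_text = None
        for rev in revisions:
            for field in rev:
                if local_name(field.tag) == "contributor":
                    for c in field:
                        if local_name(c.tag) in ("username", "ip") and c.text:
                            authors.add(c.text)
                elif local_name(field.tag) == "text":
                    # Assumes revisions appear oldest-to-newest, so the last
                    # <text> seen belongs to the most recent revision.
                    latest_text = field.text
        yield title, latest_text, sorted(authors)


if __name__ == "__main__":
    # Hypothetical wiki base URL and titles, for illustration only.
    xml_text = export_xml("https://example-wiki.org/w", ["Main Page"])
    for title, text, authors in parse_export(xml_text):
        print(title, len(text or ""), authors)
```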