read-gooder-wikiparse

Parses Wikipedia content into plain text and creates small reading comprehension questions.

Running

usage: simplify_wiki_html.py [-h] [-t TITLE | -f FILE]

Convert a wikipedia page into OpenMind JSON format

optional arguments:
  -h, --help            show this help message and exit
  -t TITLE, --title TITLE
                        fetch the wikipedia article with this title from the
                        Wikimedia REST API
  -f FILE, --file FILE  convert the local HTML file at this path

The -t option fetches the HTML version of the Wikipedia article from the Wikimedia REST API and thus requires internet connectivity. For development, it's probably nicer to use the -f option, which works offline on a saved HTML file.
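
As a rough illustration of how the two options relate, here is a minimal argparse sketch that reproduces the usage text above. It is not taken from simplify_wiki_html.py itself: the function name build_parser and the file name train.html are assumptions made up for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the command-line interface shown in the usage text (a sketch)."""
    parser = argparse.ArgumentParser(
        prog="simplify_wiki_html.py",
        description="Convert a wikipedia page into OpenMind JSON format",
    )
    # -t and -f are alternatives, so they live in a mutually exclusive group.
    source = parser.add_mutually_exclusive_group()
    source.add_argument(
        "-t", "--title",
        help="fetch the wikipedia article with this title from the Wikimedia REST API",
    )
    source.add_argument(
        "-f", "--file",
        help="convert the local HTML file at this path",
    )
    return parser


# Hypothetical invocation: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["-f", "train.html"])
print(args.title, args.file)
```

Passing both -t and -f to a parser built this way makes argparse exit with an error, which matches the `[-t TITLE | -f FILE]` notation in the usage line.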

Dependencies

nltk

Install nltk and the nltk data.

Stanford Parser

Text is parsed using the Stanford parser. Follow the instructions to install the Stanford parser and use it through the nltk interface.

Wikimedia REST API

This utility makes heavy use of the Wikimedia REST API. In particular, we use the HTML endpoint, which returns the latest HTML for a given Wikipedia page title.
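
A minimal sketch of calling the HTML endpoint with only the standard library follows. It assumes the English Wikipedia host, and the names REST_BASE, build_html_url, fetch_html and the User-Agent string are made up for this example rather than taken from this repository.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: English Wikipedia; other language editions use a different host.
REST_BASE = "https://en.wikipedia.org/api/rest_v1"


def build_html_url(title: str) -> str:
    """Return the REST API URL for the latest HTML of a page title."""
    # Page titles use underscores for spaces; percent-encode everything else.
    return f"{REST_BASE}/page/html/{quote(title.replace(' ', '_'), safe='')}"


def fetch_html(title: str, timeout: float = 10.0) -> str:
    """Download the page HTML (requires internet connectivity)."""
    request = Request(
        build_html_url(title),
        # Hypothetical User-Agent string; substitute one that identifies your tool.
        headers={"User-Agent": "wikiparse-example/0.1"},
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


print(build_html_url("Rail transport"))
```

Keeping URL construction separate from the network call makes the former easy to check offline, which fits the suggestion to prefer local files during development.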

JSON Format

I'll document this in a couple of ways to see which one makes more sense.

Grammar-like documentation

<document> ::=
	"header": STRING,
	"sections": <section> | <paragraphs>

<section> ::=
	?"header": STRING,
	<paragraphs> | <section> | <paragraphs>,<section>

<paragraphs> ::=
	"sentences": [<sentences>]

<sentences> ::=
	[<sentence>]

<sentence> ::=
	"num_words": INT,
	"sentence_parts": [<sentence_parts>]

<sentence_parts> ::=
	"indent": INT,
	"text": STRING
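
The innermost rules of the grammar can be sketched as Python type hints. This is only an illustration: the class names and make_sentence are hypothetical, and it assumes num_words is the whitespace-separated word count over all parts, which agrees with the 12-word "Types" sentence in the English-like example but is not confirmed by the source.

```python
from typing import List, TypedDict


class SentencePart(TypedDict):
    indent: int
    text: str


class Sentence(TypedDict):
    num_words: int
    sentence_parts: List[SentencePart]


class Paragraph(TypedDict):
    sentences: List[Sentence]


def make_sentence(parts: List[str], indent: int = 0) -> Sentence:
    """Build a sentence record from already-split chunks of text."""
    sentence_parts: List[SentencePart] = [
        {"indent": indent, "text": part} for part in parts
    ]
    # Assumption: num_words counts whitespace-separated words across all parts.
    num_words = sum(len(part.split()) for part in parts)
    return {"num_words": num_words, "sentence_parts": sentence_parts}


sentence = make_sentence(["There are various types", "of trains that are"])
paragraph: Paragraph = {"sentences": [sentence]}
print(paragraph)
```

How the real tool chooses where to split a sentence into parts and which indent to assign is not described here, so the sketch takes the parts as given.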

English-like documentation

{
    "header": "Train", # Title of document
    "section": { # Sections are made up of paragraphs or subsections or both
        "paragraphs": [ # Paragraphs is a list of paragraph
            { # A paragraph
                "sentences": [ # Sentences is a list of sentence
                    { # sentence has num_words and a list of sentence_parts
                        "num_words": 26, 
                        "sentence_parts": [
                            { # sentence_parts have an indent amount and text
                                "indent": 0, 
                                "text": "A train is a"
                            }, 
                            {
                                "indent": 0, 
                                "text": "form of rail transport"
                            }, 
                            {
                                "indent": 0, 
                                "text": "consisting of a series"
                            },
                            .
                            .
                            .
                        ] # End sentence_parts
                    } # End sentence
                ] # End sentences
            }, # End paragraph
            .
            .
            .
        ], # End paragraphs
        "section": [
            {
                "header": "Types", # Title of the section
                "paragraphs": [ # Paragraphs in that section
                    {
                        "sentences": [
                            {
                                "num_words": 12, 
                                "sentence_parts": [
                                    {
                                        "indent": 0, 
                                        "text": "There are various types"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "of trains that are"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "designed for particular purposes."
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
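
To show how a consumer might read this structure back, here is a small sketch that walks a document and rejoins each sentence's parts into plain text. It follows the key names used in the example above ("section", "paragraphs", "sentences", "sentence_parts"). The function name iter_sentence_texts and the choice to join parts with single spaces are assumptions for illustration.

```python
from typing import Any, Dict, Iterator


def iter_sentence_texts(section: Dict[str, Any]) -> Iterator[str]:
    """Yield each sentence as plain text, paragraphs first, then subsections."""
    for paragraph in section.get("paragraphs", []):
        for sentence in paragraph["sentences"]:
            # Assumption: parts are rejoined with single spaces.
            yield " ".join(part["text"] for part in sentence["sentence_parts"])
    subsections = section.get("section", [])
    # The example shows "section" as an object at the top level and a list below it.
    if isinstance(subsections, dict):
        subsections = [subsections]
    for subsection in subsections:
        yield from iter_sentence_texts(subsection)


# A cut-down version of the example document, keeping only the "Types" section.
document = {
    "header": "Train",
    "section": {
        "paragraphs": [],
        "section": [
            {
                "header": "Types",
                "paragraphs": [
                    {"sentences": [{"num_words": 12, "sentence_parts": [
                        {"indent": 0, "text": "There are various types"},
                        {"indent": 0, "text": "of trains that are"},
                        {"indent": 0, "text": "designed for particular purposes."},
                    ]}]}
                ],
            }
        ],
    },
}

for text in iter_sentence_texts(document):
    print(text)
```

The recursion mirrors the nesting in the format: a section may hold paragraphs, subsections, or both, so the walker handles each key only when it is present.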

