Skip to content

Parsing wikipedia content into plain text and creating small reading comprehension questions

Notifications You must be signed in to change notification settings

zimmeee/read-gooder-wikiparse

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

85 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

read-gooder-wikiparse

Parsing wikipedia content into plain text and creating small reading comprehension questions

Running

usage: simplify_wiki_html.py [-h] [-t TITLE | -f FILE]

Convert a wikipedia page into OpenMind JSON format

optional arguments:
  -h, --help            show this help message and exit
  -t TITLE, --title TITLE
                        fetch the wikipedia article with this title from the
                        Wikimedia REST API
  -f FILE, --file FILE  convert the local HTML file at this path

The t option fetches the HTML version of the wikipedia article from the Wikimedia REST API and thus requires internet connectivity. For development, it's probably nicer to use the -f option.

Dependencies

nltk

Install nltk and nltk data

Stanford Parser

Text is parsed using the Stanford parser. Follow instructions to install the Stanford parser and use the nltk interface nltk interface.

Wikimedia REST API

This utility makes heavy use of the Wikimedia REST API. In particular, we use the HTML endpoint which allows you to retrieve the latest html for a wikipedia page title.

JSON Format

I'll try to document this a couple ways to see which one makes more sense.

Grammar-like documentation

<document> ::=
	"header": STRING,
	"sections": <sections> | <paragraphs>

<section> ::=
	?"header": STRING,
	<paragraphs> | <section> | <paragraphs>,<section>

<paragraphs> ::=
	"sentences": [<sentences>]

<sentences> ::=
	[<sentence>]

<sentence> ::=
	"num_words": INT,
	"sentence_parts": [<sentence_parts>]

<sentence_parts> ::=
	"indent": INT,
	"text": STRING

English-like documentation

{
    "header": "Train", # Title of document
    "section": { # Sections are made up of paragraphs or subsections or both
        "paragraphs": [ # Paragraphs is a list of paragraph
            { # A paragraph
                "sentences": [ # Sentences is a list of sentence
                    { # sentence has num_words and a list of sentence_parts
                        "num_words": 26, 
                        "sentence_parts": [
                            { # sentence_parts have an indent amount and text
                                "indent": 0, 
                                "text": "A train is a"
                            }, 
                            {
                                "indent": 0, 
                                "text": "form of rail transport"
                            }, 
                            {
                                "indent": 0, 
                                "text": "consisting of a series"
                            },
                            .
                            .
                            .
                        ] # End sentence_part
                    } # End sentence
                ] # End sentences
            }, # End paragraph
            .
            .
            .
        ], # End paragraphs
        "section": [
            {
                "header": "Types", # Title of the section
                "paragraphs": [ # Paragraphs in that section
                    {
                        "sentences": [
                            {
                                "num_words": 12, 
                                "sentence_parts": [
                                    {
                                        "indent": 0, 
                                        "text": "There are various types"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "of trains that are"
                                    }, 
                                    {
                                        "indent": 0, 
                                        "text": "designed for particular purposes."
                                    }
                                ]
                            }, 
                        ]
                    }
                ]
            }
        ]
    }
}

About

Parsing wikipedia content into plain text and creating small reading comprehension questions

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages