Improve remove_markup handling of Wikipedia headers #2622

Open
wants to merge 3 commits into
base: develop
15 changes: 12 additions & 3 deletions gensim/corpora/wikicorpus.py
@@ -84,6 +84,8 @@
r'(^.{0,2}((bgcolor)|(\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))',
re.UNICODE
)
+"""Remove headings """
+RE_P18 = re.compile(r'(#{1,6}) +|^\s*=*|\=*$|^\s*-*|\-*$', re.UNICODE)
"""Table markup"""
IGNORED_NAMESPACES = [
'Wikipedia', 'Category', 'File', 'Portal', 'Template',
@@ -184,7 +186,7 @@ def find_interlinks(raw):
return legit_interlinks


-def filter_wiki(raw, promote_remaining=True, simplify_links=True):
+def filter_wiki(raw, promote_remaining=True, simplify_links=True, retain_heading_markup=True):
"""Filter out wiki markup from `raw`, leaving only text.

Parameters
@@ -195,6 +197,8 @@ def filter_wiki(raw, promote_remaining=True, simplify_links=True):
Whether uncaught markup should be promoted to plain text.
simplify_links : bool
Whether links should be simplified keeping only their description text.
+retain_heading_markup: bool
+    Whether heading markups should be preserved or removed. The heading text itself is retained in either case.

Returns
-------
@@ -206,10 +210,10 @@ def filter_wiki(raw, promote_remaining=True, simplify_links=True):
# contributions to improving this code are welcome :)
text = utils.to_unicode(raw, 'utf8', errors='ignore')
text = utils.decode_htmlentities(text) # '&nbsp;' --> '\xa0'
-return remove_markup(text, promote_remaining, simplify_links)
+return remove_markup(text, promote_remaining, simplify_links, retain_heading_markup)


-def remove_markup(text, promote_remaining=True, simplify_links=True):
+def remove_markup(text, promote_remaining=True, simplify_links=True, retain_heading_markup=True):
"""Filter out wiki markup from `text`, leaving only text.

Parameters
@@ -220,6 +224,8 @@ def remove_markup(text, promote_remaining=True, simplify_links=True):
Whether uncaught markup should be promoted to plain text.
simplify_links : bool
Whether links should be simplified keeping only their description text.
+retain_heading_markup: bool
+    Whether heading markups should be preserved or removed. The heading text itself is retained in either case.

Returns
-------
@@ -256,6 +262,9 @@ def remove_markup(text, promote_remaining=True, simplify_links=True):
text = re.sub(RE_P13, '\n', text) # leave only cell content
text = re.sub(RE_P17, '\n', text) # remove formatting lines

+if not retain_heading_markup:
+    text = re.sub(RE_P18, '', text) # remove headings
Collaborator

Does this retain the words of the heading? (I don't think so.)

Hi @gojomo,
The re.sub call with RE_P18 does retain the words in the heading. Please see the attached screenshot, where I tested this in a Jupyter notebook.
image

A comment below asks for unit tests. If you could guide me, I can add them and fix any issues the changes may have introduced. I would appreciate any pointers on what you expect the tests to cover.
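
A minimal sketch of a possible starting point for such tests, using only the pattern from the diff. The sample strings, the helper strip_heading_markup, and the RE_P18_MULTILINE variant are illustrative assumptions and not part of the PR. Because RE_P18 is compiled without re.MULTILINE, ^ and $ anchor only at the boundaries of the whole string, so it seems worth checking a multi-line article as well as a single heading line.

```python
import re

# Pattern copied from the diff (compiled with re.UNICODE only).
RE_P18 = re.compile(r'(#{1,6}) +|^\s*=*|\=*$|^\s*-*|\-*$', re.UNICODE)

# Hypothetical variant, NOT part of the PR: the same pattern plus re.MULTILINE,
# so that ^ and $ also match at every line boundary inside the text.
RE_P18_MULTILINE = re.compile(RE_P18.pattern, re.UNICODE | re.MULTILINE)


def strip_heading_markup(text, pattern=RE_P18):
    """Apply the substitution the same way the new branch in remove_markup does."""
    return re.sub(pattern, '', text)


# Illustrative inputs (assumptions): one heading on its own, and the same
# heading embedded in a three-line article.
single_line = '== History =='
article = 'Intro\n== History ==\nBody'

single_result = strip_heading_markup(single_line)
article_result = strip_heading_markup(article)
multiline_result = strip_heading_markup(article, RE_P18_MULTILINE)

print(repr(single_result))
print(repr(article_result))
print(repr(multiline_result))
```

On the single heading line the heading text survives and the = markers are stripped. On the multi-line sample the pattern from the diff leaves the text unchanged, while the re.MULTILINE variant strips the markers from the embedded heading. With re.MULTILINE, the \s* after ^ can also consume blank lines, so that variant would need tests of its own.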


# remove empty mark-up
text = text.replace('[]', '')
# stop if nothing changed between two iterations or after a fixed number of iterations