Deduplicate webaratas #45
base: master
@@ -0,0 +1,90 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
Takes a list of urls from a file and transforms them using one of the
predefined transformations, writing the results out to another txt file.
"""

from argparse import ArgumentParser
import logging
from pathlib import Path
from urllib import parse


def parse_arguments():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument('--input', '-i', type=Path, required=True,
                        help='the input file.')
    parser.add_argument('--output', '-o', type=Path, required=True,
                        help='the output file.')
    parser.add_argument('--transformation', '-t', type=str, required=True,
                        help='the transformation pattern to apply')
    parser.add_argument('--log-level', '-L', type=str, default='info',
                        choices=['debug', 'info', 'warning',
                                 'error', 'critical'],
                        help='the logging level.')
    args = parser.parse_args()
    if not args.input.is_file():
        parser.error('The input file must exist.')
    return args


def nepszava_transformation(input_url: str):
    """
    Transformation example:
    input:
        https://nepszava.hu/json/cikk.json?id=1001322_elindult-a-bekemenet
    output:
        http://nepszava.hu/1001322_elindult-a-bekemenet
    """
    parsed = parse.urlparse(input_url)
    article_title = parsed.query.split('=', 1)[1]
    new_parsed = ('http', parsed.netloc, article_title, '', '', '')
    output_url = parse.urlunparse(new_parsed)
    return output_url


def hu888_transformation(input_url: str):
    """
    Transformation example:
    input:
        https://888.hu/ketharmad/orban-viktor-5-pontjat-vitatjak-meg-visegradon-4089046/
    output:
        http://888.hu/article-orban-viktor-5-pontjat-vitatjak-meg-visegradon
    """
    parsed = parse.urlparse(input_url)
    new_title = parsed.path.split('/')[2]
    new_title = new_title.split('-')[:-1]
    new_title.insert(0, 'article')
    new_title = '-'.join(new_title)
    new_parsed = ('http', parsed.netloc, new_title, '', '', '')
    output_url = parse.urlunparse(new_parsed)
    return output_url


def main():
    args = parse_arguments()

    logging.basicConfig(
        level=getattr(logging, args.log_level.upper()),
        format='%(asctime)s - %(process)s - %(levelname)s - %(message)s'
    )
    logging.info(f'Transforming input file {args.input} with pattern '
                 f'{args.transformation}')

    if args.transformation == 'nepszava':
        transf = nepszava_transformation
    elif args.transformation == '888hu':
        transf = hu888_transformation
Review comment: Are these the only domains we can handle?
Reply: Good question. There are no further patterns to use in the query parameter parts of the URLs. There might be useful patterns in the path parts of the URLs...
    else:
        raise ValueError(f'Unknown transformation pattern: '
                         f'{args.transformation}')

    with open(args.output, 'wt') as out_f, open(args.input, 'rt') as in_f:
        for line in in_f:
            # Strip the trailing newline so it does not end up inside the
            # transformed URL.
            print(transf(line.strip()), file=out_f)


if __name__ == '__main__':
    main()
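Regarding the inline question above (whether these are the only domains we can handle): if further domains are added later, one possible refactoring is to register the transformation functions in a dict keyed by the --transformation value, so only the registry changes when a new pattern is written. This is a sketch, not part of the PR; the TRANSFORMATIONS and get_transformation names are made up here.

# Hypothetical registry of transformation functions, keyed by the
# --transformation argument; only the two patterns from this PR exist.
TRANSFORMATIONS = {
    'nepszava': nepszava_transformation,
    '888hu': hu888_transformation,
}


def get_transformation(name: str):
    # Look up the requested transformation, keeping the same error
    # behaviour as the if/elif chain in main().
    try:
        return TRANSFORMATIONS[name]
    except KeyError:
        raise ValueError(f'Unknown transformation pattern: {name}')

main() would then reduce to transf = get_transformation(args.transformation).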
Review comment: I was wondering: is this systematic in the sense that we don't get the json-type URLs in our data? Would it make sense to keep both the input and the output?
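On the second question (keeping both the input and the output), a minimal sketch of how the write loop in main() could emit the original and the transformed URL as tab-separated columns; the two-column format is an assumption, not something the PR specifies.

# Hypothetical variant of the output loop in main(): keep the original
# URL next to the transformed one, tab-separated.
with open(args.output, 'wt') as out_f, open(args.input, 'rt') as in_f:
    for line in in_f:
        url = line.strip()
        print(url, transf(url), sep='\t', file=out_f)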
Reply: Hmmm...
In nepszava we have the following patterns:
These three are duplicates of each other, but not every article exists in all three forms (some do).
I just realized that if we want to keep the documents from webaratás, then we should filter out all three variants. :(
A fourth pattern is:
http://nepszava.hu/articles/article.php?id=238077
These have 6-digit numbers, while the rest have 7-digit numbers, so they don't seem to match.
The Common Crawl has no URLs with nepszava.hu/json.
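If all duplicate variants of a nepszava article have to be filtered together, one possible approach is to reduce every variant to a canonical key and keep only one URL per key. The sketch below assumes the 7-digit article id appears in each duplicate variant, which the discussion above does not fully confirm; the names ARTICLE_ID, nepszava_key and deduplicate are hypothetical.

import re

# Assumed canonical key: the 7-digit article id shared by the duplicate
# nepszava.hu URL variants (not part of a longer digit run).
ARTICLE_ID = re.compile(r'(?<!\d)\d{7}(?!\d)')


def nepszava_key(url: str) -> str:
    match = ARTICLE_ID.search(url)
    # Fall back to the full URL if no 7-digit id is found.
    return match.group(0) if match else url


def deduplicate(urls):
    # Yield only the first URL seen for each article id.
    seen = set()
    for url in urls:
        key = nepszava_key(url)
        if key not in seen:
            seen.add(key)
            yield url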