This project implements a pipeline to acquire, clean, and structure the Kannada news dataset.
Elasticsearch is used to store the extracted URLs and the text data. Separate indices are used for different purposes. Generally,

- the `id` field indicates the unique ID of the document
- the `source` field contains the short name of the origin newspaper

For details about the indices, check `config/sys_config.yml`.
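For orientation, here is a minimal sketch of indexing one such document with the Python `elasticsearch` client. The index name `seed_urls` and all field values are illustrative assumptions; the actual index names live in `config/sys_config.yml`.

```python
# Minimal sketch: index one URL document. Index name "seed_urls" and
# the field values below are assumptions for illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

doc = {
    "id": "a3f9c2d1",    # unique ID of the document
    "source": "pv",      # short name of the origin newspaper (assumed value)
    "url": "https://example.com/kannada/article-330223.html",  # hypothetical
}
es.index(index="seed_urls", id=doc["id"], body=doc)
```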
The Elasticsearch container is configured to store its data in the `ES_SAMPLE_DATA/` directory. The `elasticsearch` Docker image, version `7.6.1`, is used. More details: https://hub.docker.com/_/elasticsearch/
Start the Elasticsearch container:

```sh
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -v <LOCAL DIR FULL PATH>:/usr/share/elasticsearch/data -e "discovery.type=single-node" elasticsearch:7.6.1
```
Note: Make sure `<LOCAL DIR FULL PATH>` exists before starting the container.
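Once the container is up, a quick reachability check (a minimal sketch using the `requests` library, assuming the default port mapping above):

```python
# Sanity check: confirm Elasticsearch answers on localhost:9200.
import requests

resp = requests.get("http://localhost:9200")
resp.raise_for_status()
print(resp.json()["version"]["number"])  # expected: 7.6.1
```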
The `src/util/` folder includes various cleanup scripts to be run on the website dump before running the extractors.

- To fix directories whose names end with `.html` and contain an `index.html` inside, use `src/util/move_html_directory_to_file.py` (see the sketch after this list).
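The sketch below illustrates the kind of fix described above; it is an assumption based on the script's name and description, not the script's actual contents:

```python
# Assumed behavior: a directory named "foo.html" that contains an
# "index.html" is replaced by a regular file "foo.html" holding that
# index.html's content.
import os
import shutil

def flatten_html_dir(path: str) -> None:
    index_file = os.path.join(path, "index.html")
    if path.endswith(".html") and os.path.isdir(path) and os.path.isfile(index_file):
        tmp = path + ".tmp"
        shutil.move(index_file, tmp)   # rescue index.html out of the directory
        shutil.rmtree(path)            # remove the now-conflicting directory
        shutil.move(tmp, path)         # rename the rescued file to foo.html
```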
The link extractor loads the website dump from the local disk, then extracts and cleans all the valid HTML URLs. The extracted links are indexed as well:

```sh
python3 src/link_extractor_runner.py
```
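As a rough illustration of the extraction step (an assumed sketch using `BeautifulSoup` with a simplistic validity check; the real logic lives in `src/link_extractor_runner.py`):

```python
# Sketch: collect candidate article links from one dumped page.
from bs4 import BeautifulSoup

def extract_html_links(page_source: str) -> list:
    soup = BeautifulSoup(page_source, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.endswith(".html"):  # simplistic check, for illustration only
            links.append(href)
    return links
```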
The article extractor loads the valid HTML pages from the local disk and extracts the article information. An article document includes the article text, publish date, title, description, and keywords. The articles are saved to the configured storage.

This component first filters the seed-url index down to the URLs whose HTML is available and whose article has not yet been extracted; only such URLs are considered for extraction.

```sh
python3 src/article_extractor.py
```
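A sketch of that filter, with a hypothetical helper name and a hypothetical on-disk layout:

```python
import os

# Hypothetical filter mirroring the description above: a seed URL
# qualifies only if its HTML dump exists on disk and no article with
# the same ID is already present in the article index.
def should_extract(url_doc: dict, extracted_ids: set, html_root: str) -> bool:
    html_path = os.path.join(html_root, url_doc["id"] + ".html")  # assumed layout
    return os.path.isfile(html_path) and url_doc["id"] not in extracted_ids
```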
To fetch the data from the article index and save it as a JSON Lines (JL) file on the local system:

```sh
python3 src/get_index_dump.py
```
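A sketch of such a dump using the `scan` helper of the Python `elasticsearch` client (the index name `articles_v2` and the output filename are assumptions):

```python
# Sketch: stream every document of an index into a JSON Lines file.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])
with open("articles.jl", "w", encoding="utf-8") as out:
    for hit in scan(es, index="articles_v2", query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"], ensure_ascii=False) + "\n")
```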
- [X] Index the URLs (along with the origin-page URL for reference)
- [X] Fix common issues in article extraction
- [ ] Handle spaces in the HTML file paths during extraction. Ex: `cricket/rishabh-pant-surpasses-ms dhoni-creates-another-record/330223.html`
- [X] Avoid small texts and duplicate text snippets
- [X] Filter out the documents where `text_len < 100` for both websites
- [X] Fix the HTML parsers
- [X] Re-run the extractor on both websites and create new article indices (index suffix `_v2`)
- [ ] Crawl the uncrawled URLs and save the HTML
- [X] Handle HTML files not ending with `.html` (e.g. `.cms`)