This project implements a pipeline to acquire, clean, and structure the Kannada news dataset.
Elasticsearch is used to store the extracted URLs and the text data. Separate indices are used for different purposes. Generally,

- the `id` field indicates the unique ID of the document
- the `source` field contains the short name of the origin newspaper

For details about the indices, check `config/sys_config.yml`.
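For orientation, here is a minimal sketch of indexing one such document with the Python `elasticsearch` client. The index name `seed_urls` and all field values are illustrative assumptions; the actual index names live in `config/sys_config.yml`.

```python
# Minimal sketch: index one URL document. Index name "seed_urls" and
# the field values below are assumptions for illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

doc = {
    "id": "a3f9c2d1",    # unique ID of the document
    "source": "pv",      # short name of the origin newspaper (assumed value)
    "url": "https://example.com/kannada/article-330223.html",  # hypothetical
}
es.index(index="seed_urls", id=doc["id"], body=doc)
```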
The Elasticsearch container is configured to store its data in the `ES_SAMPLE_DATA/` directory. The `elasticsearch` Docker image, version `7.6.1`, is used. More details: https://hub.docker.com/_/elasticsearch/
Start the Elasticsearch container:

```sh
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -v <LOCAL DIR FULL PATH>:/usr/share/elasticsearch/data -e "discovery.type=single-node" elasticsearch:7.6.1
```
Note: Make sure `<LOCAL DIR FULL PATH>` exists before starting the container.
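Once the container is up, a quick reachability check (a minimal sketch using the `requests` library, assuming the default port mapping above):

```python
# Sanity check: confirm Elasticsearch answers on localhost:9200.
import requests

resp = requests.get("http://localhost:9200")
resp.raise_for_status()
print(resp.json()["version"]["number"])  # expected: 7.6.1
```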
The `src/util/` folder includes various cleanup scripts to be run on the website dump before running the extractors.

- To fix directories whose names end with `.html` and contain an `index.html` inside, use `src/util/move_html_directory_to_file.py` (see the sketch after this list).
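The sketch below illustrates the kind of fix described above; it is an assumption based on the script's name and description, not the script's actual contents:

```python
# Assumed behavior: a directory named "foo.html" that contains an
# "index.html" is replaced by a regular file "foo.html" holding that
# index.html's content.
import os
import shutil

def flatten_html_dir(path: str) -> None:
    index_file = os.path.join(path, "index.html")
    if path.endswith(".html") and os.path.isdir(path) and os.path.isfile(index_file):
        tmp = path + ".tmp"
        shutil.move(index_file, tmp)   # rescue index.html out of the directory
        shutil.rmtree(path)            # remove the now-conflicting directory
        shutil.move(tmp, path)         # rename the rescued file to foo.html
```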
The link extractor loads the website dump from the local disk, then extracts and cleans all the valid HTML URLs. The extracted links are indexed as well:

```sh
python3 src/link_extractor_runner.py
```
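As a rough illustration of the extraction step (an assumed sketch using `BeautifulSoup` with a simplistic validity check; the real logic lives in `src/link_extractor_runner.py`):

```python
# Sketch: collect candidate article links from one dumped page.
from bs4 import BeautifulSoup

def extract_html_links(page_source: str) -> list:
    soup = BeautifulSoup(page_source, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.endswith(".html"):  # simplistic check, for illustration only
            links.append(href)
    return links
```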
The article extractor loads the valid HTML pages from the local disk and extracts the article information. An article document includes the article text, publish date, title, description, and keywords. The articles are saved to the configured storage.

This component first filters the seed-url index down to the URLs whose HTML is available and whose article has not yet been extracted; only such URLs are considered for extraction.

```sh
python3 src/article_extractor.py
```
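A sketch of that filter, with a hypothetical helper name and a hypothetical on-disk layout:

```python
import os

# Hypothetical filter mirroring the description above: a seed URL
# qualifies only if its HTML dump exists on disk and no article with
# the same ID is already present in the article index.
def should_extract(url_doc: dict, extracted_ids: set, html_root: str) -> bool:
    html_path = os.path.join(html_root, url_doc["id"] + ".html")  # assumed layout
    return os.path.isfile(html_path) and url_doc["id"] not in extracted_ids
```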
To fetch the data from the article index and save it as a JSON Lines (JL) file on the local system:

```sh
python3 src/get_index_dump.py
```
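A sketch of such a dump using the `scan` helper of the Python `elasticsearch` client (the index name `articles_v2` and the output filename are assumptions):

```python
# Sketch: stream every document of an index into a JSON Lines file.
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])
with open("articles.jl", "w", encoding="utf-8") as out:
    for hit in scan(es, index="articles_v2", query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"], ensure_ascii=False) + "\n")
```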
- [X] Index the URLs (along with the origin-page URL for reference)
- [X] Fix common issues in article extraction
- [ ] Handle spaces in the HTML file paths during extraction. Ex: `cricket/rishabh-pant-surpasses-ms dhoni-creates-another-record/330223.html`
- [X] Avoid small texts and duplicate text snippets
- [X] Filter out the documents where `text_len < 100` for both websites
- [X] Fix the HTML parsers
- [X] Re-run the extractor on both websites and create new article indices (index suffix `_v2`)
- [ ] Crawl the uncrawled URLs and save the HTML
- [X] Handle HTML files not ending with `.html` (e.g. `.cms`)