CNN Scraper

This script scrapes CNN article pages, counts the word frequencies in each article, and builds a data matrix from those counts. Several similarity functions are then applied to the matrix to measure how similar the articles are.
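To illustrate the data matrix, here is a minimal sketch using only the standard library. The function names `word_frequencies` and `build_matrix` are hypothetical and the regex tokenizer is an assumption; the actual script may tokenize differently (for example with nltk).

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase alphabetic tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def build_matrix(articles):
    """Return (vocabulary, rows), with one row of word counts per article."""
    counts = [word_frequencies(text) for text in articles]
    # The columns are the sorted union of all words seen in any article.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for a missing word, so absent words become zeros.
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


if __name__ == "__main__":
    vocabulary, rows = build_matrix(["the cat sat", "the dog sat on the mat"])
    print(vocabulary)
    for row in rows:
        print(row)
```

Each row has the same length, so any two articles can be compared column by column.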

How to Run

Written in Python 3.6, but Python 2.7+ is also needed :)

Run "python3 scrapper.py"
Use the Data.csv in the Similarity Analyzer folder
Run "python3 Parallel.py" (this step needs Python 2.7, so make sure both versions are installed)

Requirements:

"article_list", which contains the list of article URLs and can be obtained by running the crawler "article_url"
beautifulsoup4 (4.5.1)
lxml
nltk
SciPy

Output:

A data file called data.csv is saved. It contains the word frequencies for each article. Output files of Euclidean, Jaccard, and Cosine distances are also generated to analyze the similarity of the articles.
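For reference, the three distances can be sketched in plain Python as below. This is an illustration of the formulas, not the project's own code: the function names are hypothetical, and the Jaccard variant here is an assumption that compares word presence (nonzero counts) rather than the counts themselves. SciPy, listed in the requirements, offers `euclidean`, `cosine`, and `jaccard` in `scipy.spatial.distance`.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def jaccard_distance(a, b):
    """One minus (words in both articles / words in either article)."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 0.0 if either == 0 else 1.0 - both / either


if __name__ == "__main__":
    # Two example rows of word counts over the same six-word vocabulary.
    first = [1, 0, 0, 0, 1, 1]
    second = [0, 1, 1, 1, 1, 2]
    print(euclidean_distance(first, second))
    print(cosine_distance(first, second))
    print(jaccard_distance(first, second))
```

A distance of 0 means the two rows are identical under that measure; larger values mean the articles are less alike.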

mohammedjasam/CNN-Scrapper
