HassankSalim/DocCluster

DocCluster

Clustering scraped TechCrunch articles with Spark

Data Collecting Part

  1. singletechcrunchpaper.py

    • Python script that scrapes a single TechCrunch page / article and writes it to a MongoDB instance hosted on mLab.
  2. techcrunch.py

    • Finds the URLs of all the latest posts and passes each one to singletechcrunchpaper.py.
  3. scrapyTechCrunch.sh

    • Script for the crontab job; runs exactly once every day.
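
The URL-discovery step in techcrunch.py could look roughly like the sketch below. It is not the repository's actual code: it uses only the standard library, and the `post-title-link` class name, the `extract_post_urls` helper, and the `PostLinkParser` class are assumptions made for illustration. The real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PostLinkParser(HTMLParser):
    """Collects href values from <a> tags that carry a given CSS class."""

    def __init__(self, base_url, link_class):
        super().__init__()
        self.base_url = base_url
        self.link_class = link_class
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None (valueless attributes), so guard with "or".
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.link_class in classes:
            url = urljoin(self.base_url, href)
            # Keep first-seen order and drop duplicates.
            if url not in self.urls:
                self.urls.append(url)


def extract_post_urls(html, base_url="https://techcrunch.com/",
                      link_class="post-title-link"):
    """Return absolute post URLs found in a listing page's HTML.

    link_class is a hypothetical class name, not taken from the repository.
    """
    parser = PostLinkParser(base_url, link_class)
    parser.feed(html)
    return parser.urls
```

Each returned URL would then be handed to the single-article scraper, one call per post.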

Data Read Part

  1. SparkMongoConnector.scala
    • Scala singleton object that connects to MongoDB and performs basic operations on the data.

Technology Used

  1. Python libs

  2. DB Used

  3. DB Connector

  4. Data Processing

  5. OS scheduling

Files needed to run

  1. An application.conf file containing the MongoDB username, password, and link.
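
As a rough illustration of how those three values fit together, the sketch below assembles a MongoDB connection URI from a username, a password, and a link. It assumes the "link" is of the form `host:port/database`; the `build_mongo_uri` helper and the example host are hypothetical and not taken from the repository. Credentials are percent-escaped because characters such as `@` or `:` would otherwise break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, link):
    """Build a mongodb:// URI.

    link is assumed to look like "host:port/database" (an assumption,
    the repository does not document its exact format).
    """
    return "mongodb://{}:{}@{}".format(
        quote_plus(username),  # escape reserved characters in credentials
        quote_plus(password),
        link,
    )


# Hypothetical values, for illustration only.
uri = build_mongo_uri("user", "p@ss:word", "example.mlab.com:12345/doccluster")
print(uri)
```

Keeping the credentials in a separate file such as application.conf, rather than in the source, also keeps them out of version control.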
