Scrape major news websites in China and extract news (url, title, source... etc)

EMUNES/all_news_titles

all_news_titles

Built on BeautifulSoup, requests, Selenium, and SQLite. You can add as many websites as you want.

I wrote this to collect news titles for topic modeling. I can't do the topic modeling yet, so I am uploading this for future use, for when I am able to finish the project. I'm not a professional at Python scraping, but I have tried to keep the code simple and easy to extend, and I will keep refactoring and refining it.

usage

Run run.py in the folder to collect all the news information into your database, sqlite.db.
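
Once run.py has finished, the resulting sqlite.db can be inspected with Python's built-in sqlite3 module. The sketch below does not assume anything about the schema this project creates: it asks SQLite which tables exist and then peeks at a few rows. The helper names `list_tables` and `peek` are made up for this example and are not part of the repository.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables stored in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def peek(conn, table, limit=5):
    """Return up to `limit` rows from `table` as plain dictionaries."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # lets us read columns by name
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
    return [dict(row) for row in cursor.fetchall()]
```

To use it, open the file with `conn = sqlite3.connect("sqlite.db")`, call `list_tables(conn)` to see what the scraper created, and pass one of those names to `peek(conn, name)`.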

configuration

  • websites.py under the spider folder is where you add any news website you want to scrape. Remember to also add your new classes in scraping --> spider.py
  • You can set how many web pages to scrape in scraping --> utils --> handler.py
  • See scraping --> utils --> requester.py for proxy settings. The proxy pool's main folder should be extracted directly under the scraping folder. I recommend this one: https://github.com/Python3WebSpider/ProxyPool
  • Set Selenium to headless mode in scraping --> utils --> requester.py --> jsHtmlLoader
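
For readers unfamiliar with the two settings kept in requester.py, here is a rough sketch of what proxy use with requests and headless mode with Selenium generally look like. It is not the code in this repository: every function name below is hypothetical, Chrome is assumed as the browser, and the `http://localhost:5555/random` endpoint is an assumption about how a locally running ProxyPool serves addresses, so check that project's README before relying on it.

```python
def build_proxies(address):
    """Turn an 'ip:port' string into the mapping that requests expects."""
    proxy_url = f"http://{address}"
    return {"http": proxy_url, "https": proxy_url}


def random_proxy(pool_url="http://localhost:5555/random"):
    """Ask a proxy pool for one 'ip:port' address.

    ASSUMPTION: the pool answers with a bare 'ip:port' string at this URL.
    """
    import requests  # third-party; imported here so build_proxies works without it

    return requests.get(pool_url, timeout=5).text.strip()


def fetch_html(url, proxy_address=None, timeout=10):
    """Fetch a static page with requests, optionally through a proxy."""
    import requests

    proxies = build_proxies(proxy_address) if proxy_address else None
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text


def load_js_html(url, headless=True):
    """Render a JavaScript-driven page with Selenium and return its HTML."""
    from selenium import webdriver  # third-party; Chrome is an assumption

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A static page could then be fetched with `fetch_html(url, proxy_address=random_proxy())`, while pages that build their content with JavaScript would go through `load_js_html(url)`.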

If you find this interesting, pull requests are very welcome, as I will probably be busy with deep learning work in the near future.

A Chinese translation will follow another day.
