uva-bi-sdad/rss_scraper_copy

dspg22_rss-scraper

Description

  • A toolbox of scripts that retrieve RSS data and news article text by keyword and time range
  • rss-get.py retrieves RSS feeds for the given keywords and time range and saves the results to an output directory
  • news-get.py takes a CSV, or a directory of CSVs, with a url column and appends a new column containing the extracted article text
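As a rough illustration of the news-get.py behavior described above, the following is a minimal sketch, not the project's actual code. The function name `append_text_column`, the `text` column name, and the pluggable `extract_text` callable are all hypothetical; only the `url` column comes from the description.

```python
import csv
from typing import Callable


def append_text_column(
    in_path: str,
    out_path: str,
    extract_text: Callable[[str], str],  # hypothetical: maps a URL to article text
    url_column: str = "url",
    text_column: str = "text",  # assumed name for the appended column
) -> int:
    """Copy a CSV, adding a column filled by extract_text(url). Returns the row count."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        if reader.fieldnames is None or url_column not in reader.fieldnames:
            raise ValueError(f"input CSV has no '{url_column}' column")
        fieldnames = list(reader.fieldnames)
        if text_column not in fieldnames:
            fieldnames.append(text_column)
        rows = []
        for row in reader:
            row[text_column] = extract_text(row[url_column])
            rows.append(row)

    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Passing the extractor in as a callable keeps the CSV handling independent of whichever scraping library is used, which also makes it easy to try out with a stub.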

rss-get

To use

usage: rss-get.py [-h] -k KEYWORDS [KEYWORDS ...] [-s SOURCES [SOURCES ...]] [-o OUTDIR] [-sd START_DATE] [-ed END_DATE]
                  [-v | --verbose | --no-verbose]

DSPG pull from RSS sources

options:
  -h, --help            show this help message and exit
  -k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
                        Keywords to parse
  -s SOURCES [SOURCES ...], --sources SOURCES [SOURCES ...]
                        Sources (['google', 'bing']) to parse from. If None, parses all possible sources
  -o OUTDIR, --outdir OUTDIR
  -sd START_DATE, --start_date START_DATE
                        Start date of the search in Y-m-d format. If none and end date provided, or if provided date further than
                        now, raises error
  -ed END_DATE, --end_date END_DATE
                        End date of the search in Y-m-d format. If none provided, uses the current time
  -v, --verbose, --no-verbose
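
The date rules in the help text above can be sketched as follows. This is an illustration of the documented behavior only, not the script's implementation; the function name `resolve_date_range` and the `now` parameter are hypothetical, and returning no start date when neither date is given is an assumption.

```python
from datetime import datetime
from typing import Optional, Tuple

DATE_FORMAT = "%Y-%m-%d"  # the "Y-m-d format" named in the help text


def resolve_date_range(
    start_date: Optional[str],
    end_date: Optional[str],
    now: Optional[datetime] = None,  # hypothetical hook so the check is testable
) -> Tuple[Optional[datetime], datetime]:
    """Apply the documented rules and return (start, end)."""
    now = now or datetime.now()
    # An end date without a start date is documented as an error.
    if start_date is None and end_date is not None:
        raise ValueError("an end date requires a start date")
    start = datetime.strptime(start_date, DATE_FORMAT) if start_date else None
    # A start date later than the current time is documented as an error.
    if start is not None and start > now:
        raise ValueError("start date is in the future")
    # A missing end date falls back to the current time.
    end = datetime.strptime(end_date, DATE_FORMAT) if end_date else now
    return start, end
```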

Examples

Run a search on the keywords apples, oranges, and bananas, and export the CSVs to the output directory "output":

python rss-get.py -k apples oranges bananas -o output

Run a search on the keyword apples for everything after March 3rd, 2022, in verbose mode and without saving (note that Bing does not support time-based searches):

python rss-get.py -k apples -sd 2022-03-03 -v

Run a search on the keywords apples, bananas, and oranges for everything after March 14th, 2021, using only Google:

python rss-get.py -k apples bananas oranges -sd 2021-03-14 -s google
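
To make the difference between the two sources concrete, here is a minimal sketch of how keyword and date arguments might be turned into RSS query URLs. The URL patterns and the `after:`/`before:` search operators are assumptions about the public Google News and Bing News endpoints, not taken from this repository, and both function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def google_news_rss_url(
    keyword: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> str:
    """Build a Google News RSS search URL (assumed endpoint and operators)."""
    query = keyword
    if start_date:
        query += f" after:{start_date}"  # assumed date operator
    if end_date:
        query += f" before:{end_date}"  # assumed date operator
    return "https://news.google.com/rss/search?" + urlencode({"q": query})


def bing_news_rss_url(keyword: str) -> str:
    """Build a Bing News RSS URL (assumed endpoint); no date filter is applied."""
    return "https://www.bing.com/news/search?" + urlencode({"q": keyword, "format": "rss"})


print(google_news_rss_url("apples", start_date="2022-03-03"))
print(bing_news_rss_url("apples"))
```

The Bing helper takes no date arguments, mirroring the note above that Bing does not support time-based searches.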

Sources check list

  • Completion is either C (Complete) or V (Visited)

Completion | Source | Type | Keyword | Snippet | Notes
C | Google | Search Engine | Y | Y |
C | Bing | Search Engine | Y | Y |
V | Baidu | Search Engine | N | N | I found a website https://www.baidu.com/search/rss.html that seems to describe the existence of RSS functionality. However, using the keyword search field kept returning the same news in non-RSS format.
V | Yahoo News | News Channel | N | N | Does not seem to allow keyword searches. A Yahoo search for news automatically redirects to search.yahoo.com. You can manipulate the URL to get news page sources, but it does not seem that they will be converted into RSS format for us.
V | Yandex | Search Engine | N | N | robots.txt disallows /company/*.rss, /company/search. Returns results in Russian? Searching the sitemaps turned up an RSS source at https://zen.yandex.ru/. However, subscribing to any of the feeds requires signing in.
V | Ask | Search Engine | N | Y |
C | ABC News | General Media | Y | N | Does not have a usable RSS feed. Web scraping with keyword search is possible, but most of the content is videos without transcripts. I think we do NOT need to dig any deeper.

news-get

Challenges

We have implemented the text extractor with newspaper, but the following sites cannot be reached or scraped.

Site | URL | Description
Barron's | https://www.barrons.com | Access denied
Wall Street Journal | https://www.wsj.com | Access denied
Forbes | https://www.forbes.com | Access denied
Bloomberg | https://www.bloomberg.com | Not-a-robot test
Reuters | https://www.reuters.com | Account required to view full text
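
Since some sites block extraction, one option is to record each failure per URL instead of aborting a whole run. The sketch below is illustrative only and is not the project's code: `safe_extract` and `newspaper_fetch` are hypothetical names, and `newspaper_fetch` assumes the third-party newspaper package with its `Article` class is installed.

```python
from typing import Callable, Dict, Optional


def safe_extract(url: str, fetch: Callable[[str], str]) -> Dict[str, Optional[str]]:
    """Run fetch(url) and return the text, or the reason it could not be obtained."""
    try:
        text = fetch(url)
    except Exception as exc:  # e.g. access denied, timeout, parse failure
        return {"url": url, "text": None, "error": f"{type(exc).__name__}: {exc}"}
    if not text or not text.strip():
        # Paywalled or bot-protected pages may download but yield no article body.
        return {"url": url, "text": None, "error": "empty text"}
    return {"url": url, "text": text, "error": None}


def newspaper_fetch(url: str) -> str:
    """Fetch article text with the newspaper package (imported lazily)."""
    from newspaper import Article  # third-party; assumed to be installed

    article = Article(url)
    article.download()
    article.parse()
    return article.text
```

Calling `safe_extract(url, newspaper_fetch)` for each row would then leave blocked sites with an empty text field and a recorded error rather than stopping the run.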

Notes

Acknowledgement

This project was built as part of the 2022 Data Science for the Public Good (DSPG) internship program.
