dspg22_rss-scraper

Description

A toolbox of scripts that retrieve RSS feed data, and the text of the linked news articles, by keyword and time range.
rss-get.py retrieves RSS feeds for the given keywords and time range and saves the results to an output directory.
news-get.py takes a csv, or a directory of csvs, with a url column and appends a new column containing the text extracted from each URL.

rss-get

To use

usage: rss-get.py [-h] -k KEYWORDS [KEYWORDS ...] [-s SOURCES [SOURCES ...]] [-o OUTDIR] [-sd START_DATE] [-ed END_DATE]
                  [-v | --verbose | --no-verbose]

DSPG pull from RSS sources

options:
  -h, --help            show this help message and exit
  -k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
                        Keywords to parse
  -s SOURCES [SOURCES ...], --sources SOURCES [SOURCES ...]
                        Sources (['google', 'bing']) to parse from. If None, parses all possible sources
  -o OUTDIR, --outdir OUTDIR
  -sd START_DATE, --start_date START_DATE
                        Start date of the search in Y-m-d format. If none and end date provided, or if provided date further than
                        now, raises error
  -ed END_DATE, --end_date END_DATE
                        End date of the search in Y-m-d format. If none provided, uses the current time
  -v, --verbose, --no-verbose
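
The date rules described in the help text above can be sketched as follows. This is a minimal illustration, not the code in rss-get.py: the function name `resolve_date_range` and the `now` parameter are hypothetical, and it reads "further than now" as "later than the current time".

```python
from datetime import datetime
from typing import Optional, Tuple

DATE_FORMAT = "%Y-%m-%d"  # the "Y-m-d" format named in the help text


def resolve_date_range(
    start_date: Optional[str],
    end_date: Optional[str],
    now: Optional[datetime] = None,
) -> Tuple[Optional[datetime], datetime]:
    """Validate -sd/-ed style inputs (hypothetical helper, not from the repo)."""
    now = now or datetime.now()

    # An end date without a start date is rejected.
    if start_date is None and end_date is not None:
        raise ValueError("an end date requires a start date")

    start = datetime.strptime(start_date, DATE_FORMAT) if start_date else None

    # A start date later than the current time is rejected.
    if start is not None and start > now:
        raise ValueError("start date is later than the current time")

    # A missing end date falls back to the current time.
    end = datetime.strptime(end_date, DATE_FORMAT) if end_date else now
    return start, end
```

Passing a fixed `now` makes the behavior easy to check in a test without depending on the system clock.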

Examples

Run a search on the keywords apples, oranges, bananas, and export the csvs to the output directory "output":

python rss-get.py -k apples oranges bananas -o output

Run a search on the keyword apples for everything after March 3rd, 2022, in verbose mode and without specifying an output directory (note that Bing does not support time-based searches):

python rss-get.py -k apples -sd 2022-03-03 -v

Run a search on the keywords apples, bananas, and oranges for everything after March 14th, 2021, using only Google:

python rss-get.py -k apples bananas oranges -sd 2021-03-14 -s google

Sources check list

In the Completion column, C means Complete and V means Visited.

Completion	Source	Type	Keyword	Snippet	Notes
C	Google	Search Engine	Y	Y	RSS keyword search works with https://news.google.com/rss/search?q={0} (the summary_detail.value column may contain a one-sentence description of the article)
C	Bing	Search Engine	Y	Y	RSS keyword search works with https://www.bing.com/news/search?q={0}&format=rss (the summary_detail.value column may contain a 2-3 sentence description of the article)
V	Baidu	Search Engine	N	N	The page https://www.baidu.com/search/rss.html seems to describe RSS support. However, the keyword search field kept returning the same news in a non-RSS format
V	Yahoo News	News Channel	N	N	Does not seem to allow keyword searches. A Yahoo search restricted to news redirects to search.yahoo.com. The URL can be manipulated to reach news result pages, but they do not appear to be available in RSS format
V	Yandex	Search Engine	N	N	robots.txt disallows /company/*.rss and /company/search. Results appear to be returned in Russian. The sitemaps point to an RSS source at https://zen.yandex.ru/, but subscribing to any of the feeds requires signing in
V	Ask	Search Engine	N	Y	Found https://www.ask.com/rss, but so far no way to add a keyword. Looked through https://www.ask.com/sitemap_index.xml and https://www.ask.com/robots.txt but did not find anything RSS-related. The metadescription column contains 2 sentences
C	ABC News	General Media	Y	N	Does not have usable rss feed. Web scraping with keyword search is possible, but most of the content is videos without transcript. I think we do NOT need to dig any deeper.
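
The two URL templates listed for Google and Bing can be filled in per keyword as sketched below. This is an illustration under assumptions rather than the logic in rss-get.py: the names `FEED_TEMPLATES` and `build_feed_urls` are hypothetical, and URL-encoding the keyword with `quote_plus` is a choice made here. Fetching and parsing the feeds is left out.

```python
from typing import Dict, List, Optional
from urllib.parse import quote_plus

# Templates copied from the table above; {0} is the keyword placeholder.
FEED_TEMPLATES: Dict[str, str] = {
    "google": "https://news.google.com/rss/search?q={0}",
    "bing": "https://www.bing.com/news/search?q={0}&format=rss",
}


def build_feed_urls(keyword: str, sources: Optional[List[str]] = None) -> Dict[str, str]:
    """Return one feed URL per source (hypothetical helper, not from the repo).

    With no sources given, all known sources are used, mirroring the
    documented behavior of the -s option.
    """
    names = sources if sources else list(FEED_TEMPLATES)
    encoded = quote_plus(keyword)  # e.g. spaces become "+"
    return {name: FEED_TEMPLATES[name].format(encoded) for name in names}
```

For example, `build_feed_urls("apples", ["google"])` yields a single entry pointing at `https://news.google.com/rss/search?q=apples`.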

news-get

Challenges

We have implemented the text extractor with the newspaper library, but the following sites cannot be reached or scraped.

Site	URL	Description
Barron's	(https://www.barrons.com)	Access denied
Wall Street Journal	(https://www.wsj.com)	Access denied
Forbes	(https://www.forbes.com)	Access denied
Bloomberg	(https://www.bloomberg.com)	Not-a-robot test
Reuters	(https://www.reuters.com)	Account required to view full text
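
One way to append an article-text column while tolerating sites like those above is sketched here. It is not the code in news-get.py or text_getter.py: the function names and the `text` column name are assumptions, and recording an empty string on failure is a choice made for this sketch. The `url` column name comes from the description above, and `Article`, `download()`, `parse()`, and `.text` are the newspaper library's documented calls.

```python
import csv
from typing import Callable, Dict, List


def extract_with_newspaper(url: str) -> str:
    """Fetch one article's text with the newspaper library."""
    from newspaper import Article  # third-party; imported lazily

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def add_text_column(
    rows: List[Dict[str, str]],
    extractor: Callable[[str], str],
    url_column: str = "url",
    text_column: str = "text",  # assumed name for the appended column
) -> List[Dict[str, str]]:
    """Return copies of rows with the extracted text appended."""
    enriched = []
    for row in rows:
        new_row = dict(row)
        try:
            new_row[text_column] = extractor(row[url_column])
        except Exception:
            # Blocked or paywalled sites end up with an empty text cell.
            new_row[text_column] = ""
        enriched.append(new_row)
    return enriched


def process_csv(in_path: str, out_path: str,
                extractor: Callable[[str], str] = extract_with_newspaper) -> None:
    """Read a csv with a url column and write it back with the text column."""
    with open(in_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    enriched = add_text_column(rows, extractor)
    if not enriched:
        return
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(enriched[0].keys()))
        writer.writeheader()
        writer.writerows(enriched)
```

Passing the extractor as a parameter keeps the csv handling testable without network access, since a stub can stand in for the real download.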

Notes

https://www.overleaf.com/4259256516zdhfhfqcshjj

Acknowledgement

This project was built as part of the 2022 Data Science for the Public Good (DSPG) internship program.

