- A toolbox of scripts to retrieve RSS data and news article text by keyword and time range
- rss-get.py retrieves RSS feeds for the given keywords and time range and saves them to an output directory
- news-get.py takes a CSV, or a directory of CSVs, with a url column and appends a new column containing the article text
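The news-get.py behaviour described above (read a CSV with a url column, append a text column) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script's actual code: the function name `append_text_column`, the output column name `text`, and the `fetch_text` callable are all hypothetical. Passing the extractor in as a callable keeps the sketch runnable offline with a stub.

```python
import csv
import io
from typing import Callable, TextIO


def append_text_column(
    src: TextIO,
    dst: TextIO,
    fetch_text: Callable[[str], str],
    url_column: str = "url",
    text_column: str = "text",  # hypothetical column name, not taken from news-get.py
) -> int:
    """Copy a CSV from src to dst, adding text_column filled by fetch_text(url).

    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    if reader.fieldnames is None or url_column not in reader.fieldnames:
        raise ValueError(f"input CSV has no '{url_column}' column")

    writer = csv.DictWriter(dst, fieldnames=[*reader.fieldnames, text_column])
    writer.writeheader()

    count = 0
    for row in reader:
        try:
            row[text_column] = fetch_text(row[url_column])
        except Exception:
            # Leave the cell empty when a page cannot be fetched or parsed.
            row[text_column] = ""
        writer.writerow(row)
        count += 1
    return count
```

For example, calling it with `io.StringIO` objects and `fetch_text=lambda url: "stub"` writes the original columns plus a `text` column holding `stub` for every row.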
usage: rss-get.py [-h] -k KEYWORDS [KEYWORDS ...] [-s SOURCES [SOURCES ...]] [-o OUTDIR] [-sd START_DATE] [-ed END_DATE]
[-v | --verbose | --no-verbose]
DSPG pull from RSS sources
options:
-h, --help show this help message and exit
-k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
Keywords to parse
-s SOURCES [SOURCES ...], --sources SOURCES [SOURCES ...]
Sources (['google', 'bing']) to parse from. If None, parses all possible sources
-o OUTDIR, --outdir OUTDIR
-sd START_DATE, --start_date START_DATE
Start date of the search in Y-m-d format. If none and end date provided, or if provided date further than
now, raises error
-ed END_DATE, --end_date END_DATE
End date of the search in Y-m-d format. If none provided, uses the current time
-v, --verbose, --no-verbose
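The interface in the help text above could be declared with `argparse` along these lines. This is a sketch reconstructed from the help output only, not the script's source: `build_parser` and `resolve_date_range` are hypothetical names, and the date checks are one reading of the rules stated for `-sd` and `-ed`. `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse
from datetime import datetime
from typing import Optional, Tuple

DATE_FORMAT = "%Y-%m-%d"  # the "Y-m-d" format named in the help text


def build_parser() -> argparse.ArgumentParser:
    """Declare the options listed in the rss-get.py help output."""
    parser = argparse.ArgumentParser(
        prog="rss-get.py", description="DSPG pull from RSS sources"
    )
    parser.add_argument("-k", "--keywords", nargs="+", required=True,
                        help="Keywords to parse")
    parser.add_argument("-s", "--sources", nargs="+", default=None,
                        help="Sources to parse from; all sources if omitted")
    parser.add_argument("-o", "--outdir", default=None)
    parser.add_argument("-sd", "--start_date", default=None,
                        help="Start date of the search in Y-m-d format")
    parser.add_argument("-ed", "--end_date", default=None,
                        help="End date of the search in Y-m-d format")
    # BooleanOptionalAction generates the matching --no-verbose flag.
    parser.add_argument("-v", "--verbose",
                        action=argparse.BooleanOptionalAction, default=False)
    return parser


def resolve_date_range(
    start: Optional[str], end: Optional[str], now: Optional[datetime] = None
) -> Tuple[Optional[datetime], datetime]:
    """Apply the date rules from the help text (assumed interpretation)."""
    now = now or datetime.now()
    # A missing end date falls back to the current time.
    end_dt = datetime.strptime(end, DATE_FORMAT) if end else now
    if start is None:
        if end is not None:
            raise ValueError("an end date was given without a start date")
        return None, end_dt
    start_dt = datetime.strptime(start, DATE_FORMAT)
    if start_dt > now:
        raise ValueError("start date is later than the current time")
    return start_dt, end_dt
```

Taking `now` as a parameter keeps the date rules testable without depending on the wall clock.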
Run a search on the keywords apples, oranges, bananas, and export the CSVs to the output directory "output":
python rss-get.py -k apples oranges bananas -o output
Run a search on the keyword apples for everything after March 3rd, 2022, in verbose mode and without saving (note that Bing does not support time-based searches):
python rss-get.py -k apples -sd 2022-03-03 -v
Run a search on the keywords apples, bananas, oranges for everything after March 14th, 2021, using only Google:
python rss-get.py -k apples bananas oranges -sd 2021-03-14 -s google
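To illustrate why a start date affects Google results but not Bing results, here is one way the feed URLs might be built. This is a sketch under assumptions: the README does not say which endpoints rss-get.py actually requests, so the two base URLs, the `after:`/`before:` query operators, and the function name `build_feed_url` are illustrative choices rather than the script's implementation.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoints; not confirmed as the ones rss-get.py uses.
GOOGLE_NEWS_RSS = "https://news.google.com/rss/search"
BING_NEWS_RSS = "https://www.bing.com/news/search"


def build_feed_url(
    source: str,
    keyword: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> str:
    """Return an RSS search URL for one keyword on one source."""
    if source == "google":
        # Date limits are expressed as search operators inside the query.
        terms = [keyword]
        if start_date:
            terms.append(f"after:{start_date}")
        if end_date:
            terms.append(f"before:{end_date}")
        return f"{GOOGLE_NEWS_RSS}?{urlencode({'q': ' '.join(terms)})}"
    if source == "bing":
        # No time-based search here, so the dates are ignored.
        return f"{BING_NEWS_RSS}?{urlencode({'q': keyword, 'format': 'rss'})}"
    raise ValueError(f"unknown source: {source}")
```

With this approach, a caller that needs a date window on Bing results would have to filter the returned entries by their publication date afterwards.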
- Completion is either C (Complete) or V (Visited)
Completion | Source | Type | Keyword | Snippet | Notes |
---|---|---|---|---|---|
C | Google | Search Engine | Y | Y | |
C | Bing | Search Engine | Y | Y | |
V | Baidu | Search Engine | N | N | |
V | Yahoo News | News Channel | N | N | |
V | Yandex | Search Engine | N | N | |
V | Ask | Search Engine | N | Y | |
C | ABC News | General Media | Y | N | |
We have implemented the text extractor with newspaper, but the following sites cannot be reached or scraped:
Site | URL | Description |
---|---|---|
Barron's | https://www.barrons.com | Access denied |
Wall Street Journal | https://www.wsj.com | Access denied |
Forbes | https://www.forbes.com | Access denied |
Bloomberg | https://www.bloomberg.com | Not-a-robot test |
Reuters | https://www.reuters.com | Account required to view full text |
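One way to avoid wasted requests is to check each URL against the sites in the table above before attempting extraction. The sketch below is not part of the toolbox: `BLOCKED_DOMAINS` and `blocked_reason` are hypothetical names, and the domain list simply mirrors the table.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains and reasons copied from the table above.
BLOCKED_DOMAINS = {
    "barrons.com": "Access denied",
    "wsj.com": "Access denied",
    "forbes.com": "Access denied",
    "bloomberg.com": "Not-a-robot test",
    "reuters.com": "Account required to view full text",
}


def blocked_reason(url: str) -> Optional[str]:
    """Return why a URL is known to be unscrapable, or None if it is not listed."""
    host = (urlparse(url).hostname or "").lower()
    for domain, reason in BLOCKED_DOMAINS.items():
        # Match the bare domain and any subdomain such as www.
        if host == domain or host.endswith("." + domain):
            return reason
    return None
```

Rows whose URL returns a reason could be skipped or have the reason recorded in place of the article text.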
This project was built as part of the 2022 Data Science for the Public Good (DSPG) internship program