Due to copyright, we are unable to provide the unprocessed scraped data used in this analysis. To ensure reproducability without violating copyright, we provide the word frequencies for each news article, the coreNLP output, all analyzed names with an article identifier, as well as any other associated data used in the analyses such as quotes and citations. We provide all of this data on github. We also provide data descriptions in our github README, under the header "Quick data folder overview".
This manuscript was written using Manubot [@doi:10.1371/journal.pcbi.1007128] and is available on github: manuscript repository link. All code and metadata is also available on github, full analysis repository link, under a BSD 3-Clause License. The code to generate all main and supplemental figures are available as R markdown documents within our main analysis github, in the following subfolder: notebooks. We provide a docker image that can re-run the analysis pipeline using intermediate, pre-processed data and produce all the main and supplemental figures. To re-run the entire pipeline (including scraping), the docker image contains all necessary packages and code. Scraping code is available on our main analysis github, in the following subfolder: scraper. The shell scripts to re-run the entire analysis are provided in the README file in the github repository.
We would like to thank Jeffrey Perkel for asking thoughtful questions that spurred this line of research, and providing feedback and insight into the news-gathering process during the course of this project.