Elhadad
The goal was to predict whether an article is newsworthy. To do that, the authors built two positive datasets:
- Articles from a diverse range of biomedical and health journals, keeping those considered newsworthy because they were covered by Reuters. (1,431 rows)
- Articles from JAMA journals, keeping those for which the journal editors issued a press release. (1,007 rows)
Columns for *_article_info.csv:
- PubMed ID,
- citation title,
- journal,
- authors,
- affiliation (or missing),
- abstract,
- MeSH terms: Medical Subject Headings (MeSH) is a comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it serves as a thesaurus that facilitates searching.
To add column names to the DataFrame created by pd.read_csv, assign them directly: reuters.columns = ['PubMed_ID', 'citation_title', 'journal', 'authors', 'affiliation', 'abstract', 'MeSH_terms']
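A minimal sketch of loading one of these files and naming its columns. The CSV content here is a made-up two-row stand-in (the real files are much larger), and `header=None` assumes the files ship without a header row, which is why the column names have to be assigned by hand:

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for all_reuters_article_info.csv,
# with the same seven columns and no header row.
raw = io.StringIO(
    "12345,Example trial,JAMA,Smith J,Columbia University,Abstract text,Humans\n"
    "67890,Another study,BMJ,Doe A,,Another abstract,Neoplasms\n"
)

# header=None: the file has no header line, so pandas must not eat row 1.
reuters = pd.read_csv(raw, header=None)
reuters.columns = ['PubMed_ID', 'citation_title', 'journal', 'authors',
                   'affiliation', 'abstract', 'MeSH_terms']
print(reuters.shape)  # (2, 7)
```

Passing `names=[...]` to `pd.read_csv` directly would achieve the same thing in one step.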
These positive datasets are the files jama_article_info.csv and all_reuters_article_info.csv.
Columns for the matched_articles .csv files:
- PubMed ID [the PMID of the corresponding newsworthy article to which the row was matched],
- PubMed ID,
- citation title,
- journal,
- authors,
- affiliation (or missing),
- abstract,
- MeSH terms
There are 2 other datasets, which the authors call negative instances. These contain articles with some similarities to the positive ones but that were not considered newsworthy:
- Negative instances matched to articles from the Reuters set. (28,359 rows)
- Negative instances matched to articles from the JAMA set. (10,026 rows)
These are the files all_reuters_matched_articles_filtered.csv and jama_pmids.txt_matched_articles_filtered.csv. For each positive article in (1) or (2), the negative instances contain X articles published in the same journal in the same year that received no coverage in the Reuters corpus or, for JAMA, no press release. They also used several filtering heuristics for this "matched sampling"; no further explanation of how the negative instances were matched is given.
The columns are the same as for (1) and (2), except that the first column holds the PubMed ID of the corresponding newsworthy article.
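Because the first column of each negative row points back at a positive article's PMID, the two tables can be joined on that key. A hedged sketch with hypothetical toy frames (the column name `matched_to` is an assumption for the unnamed first column):

```python
import pandas as pd

# Hypothetical toy frames: positives keyed by PubMed_ID, and matched
# negatives whose first column (matched_to here) points back at that ID.
positives = pd.DataFrame({
    'PubMed_ID': [111, 222],
    'citation_title': ['Covered trial', 'Press-released study'],
})
negatives = pd.DataFrame({
    'matched_to': [111, 111, 222],
    'PubMed_ID': [311, 312, 322],
    'citation_title': ['Uncovered A', 'Uncovered B', 'Uncovered C'],
})

# Attach each negative instance to its newsworthy counterpart.
pairs = negatives.merge(positives, left_on='matched_to',
                        right_on='PubMed_ID', suffixes=('_neg', '_pos'))
print(len(pairs))  # 3: one row per negative instance
```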
They built citation features: journal name, institution of the first author, and words (uni- and bi-grams) extracted from titles, abstracts, and MeSH terms, for a total of 14,614 features. They then used an L2-regularized regression to predict newsworthiness.
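The pipeline above can be sketched with scikit-learn. The corpus and labels here are made up, and treating the "L2-regularized regression" as logistic regression is an assumption (the note does not name the exact model); the paper's actual feature set also includes journal and institution features not shown here:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus of title/abstract text; 1 = newsworthy.
texts = [
    "new drug cuts heart attack risk",
    "trial shows vaccine prevents infection",
    "protein folding in yeast cells",
    "enzyme kinetics of a bacterial pathway",
]
labels = [1, 1, 0, 0]

# Uni- and bi-gram bag-of-words features, as described above.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

# penalty='l2' gives the L2-regularized model; C controls regularization strength.
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, labels)
print(model.predict(X))
```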