Merge pull request #69 from phosphofructo/master

Adding in data dictionaries
Data4Democracy · Jan 8, 2018 · 1f34529
2 parents 353597d + c328b0c
commit 1f34529
Showing 11 changed files with 12,168 additions and 18 deletions.
diff --git a/R/datawrangling/atc_codes_clean.csv b/R/datawrangling/atc_codes_clean.csv
diff --git a/R/datawrangling/atc_codes_cleanup.rmd b/R/datawrangling/atc_codes_cleanup.rmd
@@ -0,0 +1,156 @@
+---
+title: "atc_file_cleanup"
+author: "Darya Akimova"
+date: "December 23, 2017"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+#### Goal: Clean up the atc-codes dataset that is available on data.world
+
+The data is available at: https://data.world/data4democracy/drug-spending
+It was collected from: http://www.genome.jp/kegg-bin/get_htext?br08303.keg
+
+However, the data was poorly structured and not user-friendly in this format. This cleanup is part of the work to address Issue #52, which covers creating a data dictionary for all existing datasets.
+
+Packages:
+
+```{r packages}
+library(data.world)
+library(tidyverse)
+library(stringr)
+```
+
+I've already set up my R session to use my data.world token using:
+
+> data.world::set_config(save_config(auth_token = "YOUR API TOKEN"))
+
+Running this file yourself won't work unless you have set up the data.world package with your own API token.
+
+Data: 
+
+```{r}
+drug.spend.ds <- "https://data.world/data4democracy/drug-spending"
+atc.codes <- data.world::query(
+  data.world::qry_sql("SELECT * FROM `atc-codes`"),
+  dataset = drug.spend.ds 
+)
+dim(atc.codes)
+glimpse(atc.codes)
+```
+
+The KEGG website has a series of dropdown menus, which is probably why the data has an odd staggered format. The information for the columns in the data.world dictionary is not accurate. 
+
+For reference on what ATC codes are:
+
+https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System
+
+https://www.whocc.no/atc_ddd_index/
+
+In its current state, the columns are (Wikipedia definitions):
+
+1. First level - main anatomical group
+2. Second level - therapeutic subgroup
+3. Third level - therapeutic/pharmacological subgroup
+4. Fourth level - chemical/therapeutic/pharmacological subgroup
+5. Fifth level - the chemical substance (the column includes the full drug ID and one of the names for it)
+6. A KEGG ID, if the drug is in the KEGG database
+
+Basically, this is a funnel-style grouping of drugs: it starts with the broadest category at the first level and becomes more specific with each tier.
+
+First, fix the column names, reorder the columns, and drop the rows that have no fifth-level entry:
+```{r col_cleanup}
+colnames(atc.codes) <- c("level1", "level2", "level3", "level4", "level5", "kegg")
+atc.codes.c <- atc.codes %>% 
+  select(level5, kegg, everything()) %>% 
+  filter(!(is.na(level5))) %>% 
+  arrange(level5) %>% 
+  mutate(level1 = str_to_lower(level1))
+glimpse(atc.codes.c)
+```
+
+Both the level5 and the kegg columns may contain IDs that can be useful for connecting drugs to therapeutic uses. An appropriate method for storing this data still needs to be worked out.
+
+Technically, the level1-level4 columns contain potentially redundant information, but they may be useful for grouping in the future.
+
+```{r}
+atc.codes.c <- atc.codes.c %>% 
+  mutate(level5_code = str_sub(level5, 1, 7)) %>%
+  mutate(level5 = str_sub(level5, 9, -1)) %>% 
+  select(level5_code, everything())
+glimpse(atc.codes.c)
+length(unique(atc.codes.c$level5_code))
+nrow(atc.codes.c)
+# fewer unique codes than rows is a sign that some trimming needs to be done or that drugs are used for multiple indications
+atc.codes.c <- atc.codes.c %>% 
+  mutate(level1_code = recode(
+    level1,
+    "alimentary tract and metabolism" = "A",
+    "blood and blood forming organs" = "B",
+    "cardiovascular system" = "C",
+    "dermatologicals" = "D",
+    "genito urinary system and sex hormones" = "G",
+    "systemic hormonal preparations, excl. sex hormones and insulins" = "H",
+    "antiinfectives for systemic use" = "J",
+    "antineoplastic and immunomodulating agents" = "L",
+    "musculo-skeletal system" = "M",
+    "nervous system" = "N",
+    "antiparasitic products, insecticides and repellents" = "P",
+    "respiratory system" = "R",
+    "sensory organs" = "S",
+    "various" = "V"
+    )
+  ) %>% 
+  select(level5_code:kegg, level1_code, everything())
+glimpse(atc.codes.c)
+unique(atc.codes.c$level1_code)
+# everything is recoded
+```
+
+Now to split off the second level through the fourth level codes and to convert the remaining strings to lowercase:
+
+```{r}
+atc.codes.c <- atc.codes.c %>% 
+  mutate(level2_code = str_sub(level2, 1, 3),
+         level3_code = str_sub(level3, 1, 4),
+         level4_code = str_sub(level4, 1, 5))
+glimpse(atc.codes.c)
+atc.codes.c <- atc.codes.c %>% 
+  mutate(level2 = str_to_lower(str_sub(level2, 5, -1)),
+         level3 = str_to_lower(str_sub(level3, 6, -1)),
+         level4 = str_to_lower(str_sub(level4, 7, -1))) %>% 
+  select(level5_code:level1, level2_code, level2, level3_code, level3, level4_code, level4)
+glimpse(atc.codes.c)
+```
+
+I'm happy with the structure for now. A bit of sanity checking before uploading the cleaned version to data.world:
+
+```{r}
+sapply(atc.codes.c, function(x) length(unique(x)))
+```
+
+Starting at the third level, there are more codes than descriptions. What's going on?
+
+```{r}
+mult.level3 <- atc.codes.c %>% 
+  select(starts_with("level3")) %>% 
+  unique() %>% 
+  group_by(level3) %>% 
+  summarize(tot = n()) %>% 
+  filter(tot > 1)
+atc.codes.c %>% 
+  filter(level3 %in% mult.level3$level3) %>%
+  select(level1:level3) %>% 
+  unique()
+```
+
+Some category descriptions are reused under different codes at certain levels, which should explain why there are more codes than descriptions.
+
+The level5 and kegg columns will be left as is for now. They will be cleaned up later, when the time comes to match columns, because the cleanup process will depend on the matching process.
+
+```{r}
+# write_csv(atc.codes.c, "atc_codes_clean.csv")
+```
+
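In `atc_codes_cleanup.rmd`, comparing `length(unique(atc.codes.c$level5_code))` with `nrow(atc.codes.c)` shows only whether duplicated fifth-level codes exist, not which ones. A minimal sketch of listing them, assuming toy rows in the same `<7-character code> <name>` layout as the raw `level5` column. The data frame `toy` and its `kegg` values are hypothetical placeholders, not values from the dataset:

```r
library(dplyr)
library(stringr)

# Hypothetical rows in the raw "<7-character code> <name>" layout.
# The kegg values are placeholders, not real KEGG IDs.
toy <- tibble(
  level5 = c("A01AA01 Sodium fluoride",
             "A01AA01 Sodium fluoride",
             "A01AB03 Chlorhexidine"),
  kegg = c("placeholder_1", "placeholder_2", "placeholder_3")
)

# Same split as in the Rmd: characters 1-7 are the code, 9 onward the name
toy_split <- toy %>%
  mutate(level5_code = str_sub(level5, 1, 7),
         level5 = str_sub(level5, 9, -1))

# Codes that occur on more than one row
dup_codes <- toy_split %>%
  count(level5_code, name = "n_rows") %>%
  filter(n_rows > 1)

# All rows behind those codes, for manual inspection
toy_split %>%
  filter(level5_code %in% dup_codes$level5_code)
```

Inspecting the rows behind each repeated code may help decide whether they are true duplicates to trim or distinct entries that share a code.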
diff --git a/datadictionaries/FDA_NDC_Product.md b/datadictionaries/FDA_NDC_Product.md
@@ -0,0 +1,59 @@
+# FDA's National Drug Code Directory
+
+## Data files ([available at data.world](https://data.world/data4democracy/drug-spending))
+* CSV: `FDA_NDC_Product.csv`
+
+## Link(s) to code used for scraping, tidying, etc, if applicable:
+
+* `NA`
+
+## Data types
+* **string**: a sequence of characters
+* **integer**: whole numbers
+
+
+## Field listing
+|Name                       |Type     |Description|
+|---------------------------|---------|-----------|
+|productid                  |string   |ID unique to the product, starts with the 8 or 9 digit productndc identifier |
+|productndc                 |string   |Unique 8 or 9 digit, 2-segment number in the forms 4-4, 5-3, or 5-4, that is a universal product identifier in the United States. The first segment represents the labeler and the second segment represents the product. Dosage is embedded in the second segment, so the same drug may have multiple NDC codes if it is available at a number of doses. The third segment needed for a full NDC code, which represents the package code, is not included (see link under other resources) |
+|producttypename            |string   |Product type (prescription, over the counter, vaccine, etc) |
+|proprietaryname            |string   |Brand name or trade name of the drug under which it's marketed, chosen by the manufacturer |
+|proprietarynamesuffix      |string   |Add-on to proprietary name, may have info about route of administration or dose |
+|nonproprietaryname         |string   |Active ingredient(s) and/or nonproprietary name(s), likely similar to substancename column. May contain multiple compounds separated by `,` (or possibly another symbol), if the drug has multiple active ingredients |
+|dosageformname             |string   |What form the drug is administered in (capsule, tablet, injection, etc) |
+|routename                  |string   |What route the drug is administered by (oral, intravenous, topical, etc) |
+|startmarketingdate         |integer  |When the drug was first marketed (column needs to be transformed into date format) |
+|endmarketingdate           |integer  |The expiration date of the last lot distributed. Actively marketed drugs will not have a marketing end date (column needs to be transformed into date format) |
+|marketingcategoryname      |string   |What FDA application type the drug is marketed under (see link under other resources) |
+|applicationnumber          |string   |FDA application number |
+|labelername                |string   |Manufacturer, repackager, or distributor |
+|substancename              |string   |Active ingredient(s) and/or nonproprietary name(s), likely similar to nonproprietaryname column. May contain multiple compounds separated by `;` (or possibly another symbol), if the drug has multiple active ingredients |
+|active_numerator_strength  |string   |Amount of active ingredient in the drug. May contain several values separated by `;` (or possibly another symbol), if the drug has multiple active ingredients |
+|active_ingred_unit         |string   |The units that the amount of active ingredient is measured in. May contain several unit values, separated by `;` (or possibly another symbol), if the drug has multiple active ingredients |
+|pharm_class                |string   |Pharmacological class - information on the drug's mechanism of action and/or drug target. Basically, info on how the drug does what it does |
+|deaschedule                |string   |The Drug Enforcement Administration (DEA) schedule applicable to the drug as reported by the labeler |
+
+
+## Important notes
+This dataset was created as part of The Drug Listing Act of 1972, which requires drug manufacturers/distributors to provide a full list of currently marketed drugs. The information is submitted by the labeler (manufacturer, repackager, or distributor) to the FDA. It seems that inclusion in the NDC directory does not mean that the drug is FDA approved. The dataset includes information about active ingredients in a drug and their dosing, who produces the drug, and the pharmacological mechanism by which it acts.
+
+#### Excluded from this dataset:
+* Animal drugs
+* Blood products
+* Human drugs not in their final marketed form (for example, standalone active ingredients that are not themselves marketed)
+* Drugs for further processing 
+* Drugs manufactured exclusively for a private label distributor (not commercially available)
+* Drugs that are marketed exclusively as part of a kit or combination or some other part of a multi-level packaged product
+
+
+#### Dataset source:
+* Organization: Food and Drug Administration
+* URL to dataset and more info: https://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
+
+#### Other Resources related to this dataset:
+* Explanation of NDC Code info: https://www.drugs.com/ndc.html
+* More information on FDA labeling: https://www.fda.gov/ForIndustry/DataStandards/StructuredProductLabeling/default.htm
+* FDA Marketing Category Name acronym explanation: https://www.fda.gov/Drugs/DevelopmentApprovalProcess/FormsSubmissionRequirements/ElectronicSubmissions/DataStandardsManualmonographs/ucm071826.htm
+* FDA Listing and Registration instructions (more info on columns in the dataset): https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/DrugRegistrationandListing/ucm078801.htm
+* DEA website (deaschedule column info): https://www.deadiversion.usdoj.gov/schedules/index.html
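The field listing in `FDA_NDC_Product.md` notes that `startmarketingdate` and `endmarketingdate` still need to be converted to a date format, and that several columns pack multiple values into one string. A minimal base R sketch of both steps, assuming the integers are laid out as YYYYMMDD (the dictionary does not confirm this) and using made-up values. The helper `ndc_int_to_date` is hypothetical:

```r
# Hypothetical helper: convert an integer such as 20170315 to a Date.
# Assumes a YYYYMMDD layout; NA stays NA (no marketing end date recorded).
ndc_int_to_date <- function(x) {
  as.Date(as.character(x), format = "%Y%m%d")
}

# Made-up example values
start_dates <- ndc_int_to_date(c(20170315L, 19990101L))
end_dates   <- ndc_int_to_date(c(NA, 20201231L))

# Split a multi-ingredient field on ";" and trim surrounding whitespace
substances <- "INGREDIENT A; INGREDIENT B"
split_substances <- trimws(strsplit(substances, ";", fixed = TRUE)[[1]])
```

If the separator turns out to differ between columns (the dictionary mentions `,` for `nonproprietaryname`), the split character would need to be adjusted per column.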
diff --git a/datadictionaries/Pharma_Lobby_and_lobbying_keyed.md b/datadictionaries/Pharma_Lobby_and_lobbying_keyed.md
@@ -0,0 +1,55 @@
+# Pharma Lobby and Lobbying Keyed
+
+## Data files ([available at data.world](https://data.world/data4democracy/drug-spending))
+* CSV: `Pharma_Lobby.csv`
+* CSV (modified): `lobbying_keyed.csv`
+
+## Link(s) to code used for scraping, tidying, etc, if applicable:
+
+* `NA`
+
+## Data types
+* **string**: a sequence of characters
+* **integer**: whole numbers
+* **money**: fractional numbers representing currency
+* **date**: string formatted YYYY
+
+## Field listing
+#### `Pharma_Lobby.csv`
+|Name      |Type    |Description|
+|----------|--------|-----------|
+|column_a  |integer |Row number |
+|client    |string  |Client/Parent. Person/Entity that employed or retained lobbying services |
+|sub       |string  |Subsidiary/Affiliate. Any entity other than the client that contributed over $5,000 toward lobbying activities and actively participated in the lobbying process. For most rows, sub is the same as client (this value was missing in the original dataset, but the original uploader seems to have copied over the client name if the sub was missing) |
+|total     |money   |Total amount of money spent on lobbying that year |
+|year      |date    |Year for which information is contained in that row |
+|catcode   |string  |Code that denotes industry and/or sector (see links) |
+
+#### `lobbying_keyed.csv`
+|Name         |Type    |Description|
+|-------------|--------|-----------|
+|column_a     |integer |Row number (different from column_a in `Pharma_Lobby.csv`) |
+|client       |string  |Client/Parent. Person/Entity that employed or retained lobbying services |
+|x            |integer |column_a value from `Pharma_Lobby.csv` |
+|sub          |string  |Subsidiary/Affiliate. Any entity other than the client that contributed over $5,000 toward lobbying activities and actively participated in the lobbying process. For most rows, sub is the same as client (this value was missing in the original dataset, but the original uploader seems to have copied over the client name if the sub was missing) |
+|total        |money   |Total amount of money spent on lobbying that year |
+|year         |date    |Year for which information is contained in that row |
+|catcode      |string  |Code that denotes industry and/or sector (see links) |
+|company_key  |integer |Unique value representing the company (likely for client company) |
+
+
+## Important notes
+These datasets contain information on yearly lobbying spending by pharmaceutical groups. `lobbying_keyed.csv` was an attempt to clean the `Pharma_Lobby.csv` file, but the two files contain essentially the same information.
+
+#### To do to clean up this data:
+* Remove column_a in both files, remove x in the lobbying_keyed file
+* Replace the sub value with `NA` if the Subsidiary/Affiliate was missing in the original dataset (instead of inserting client value)
+* Convert the client and sub columns to all lowercase 
+* Update with new lobbying information
+
+
+#### Useful links:
+* Dataset source: https://www.opensecrets.org/lobby/indusclient.php?id=h04&year=2016
+* Definitions of lobbying terms: https://lobbyingdisclosure.house.gov/amended_lda_guide.html
+* Catcode reference: https://www.opensecrets.org/downloads/crp/CRP_Categories.txt
+* OpenSecrets.org Resource center: https://www.opensecrets.org/resources/dollarocracy/
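The first three cleanup items listed for the lobbying files can be sketched in a few lines of dplyr. This is only an illustration under stated assumptions: the rows in `toy` are made up, and treating `sub == client` as "the sub was missing in the original dataset" is a proxy inferred from the field description, not something the data confirms:

```r
library(dplyr)

# Made-up rows with the columns of lobbying_keyed.csv;
# none of these values come from the dataset
toy <- tibble(
  column_a = 1:3,
  client = c("Acme Pharma", "Acme Pharma", "Beta Labs"),
  x = c(10L, 11L, 12L),
  sub = c("Acme Pharma", "Acme Biotech", "Beta Labs"),
  total = c(100000, 50000, 20000),
  year = c(2016L, 2016L, 2015L),
  catcode = c("H4300", "H4300", "H4300"),
  company_key = c(1L, 1L, 2L)
)

clean <- toy %>%
  # Drop the row-number columns; any_of() tolerates Pharma_Lobby.csv lacking x
  select(-any_of(c("column_a", "x"))) %>%
  mutate(
    client = tolower(client),
    sub = tolower(sub),
    # Assumption: a sub identical to the client was copied in by the uploader
    sub = if_else(sub == client, NA_character_, sub)
  )
```

Lowercasing before the comparison means that differences in capitalization alone do not hide a copied client name.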
diff --git a/datadictionaries/atc_codes_clean.md b/datadictionaries/atc_codes_clean.md
@@ -0,0 +1,45 @@
+# ATC Codes (Clean) (Anatomical Therapeutic Chemical Classification System)
+
+## Data files ([available at data.world](https://data.world/data4democracy/drug-spending))
+
+* Raw KEG file: `br08303.keg`
+* Tidied CSV (intermediate): `atc-codes.csv`
+* CSV: `atc_codes_clean.csv`
+
+## Link(s) to code used for scraping, tidying, etc.:
+
+* original source: http://www.genome.jp/kegg-bin/get_htext?br08303.keg
+* `atc_codes_cleanup.rmd`
+
+## Data types
+* **string**: a sequence of characters
+
+## Field listing
+|Name         |Type    |Description|
+|-------------|--------|-----------|
+|level5_code  |string  |Fifth level full classification code (should be unique to each drug)|
+|level5       |string  |Generic drug name/active ingredient|
+|kegg         |string  |KEGG drug name for some compounds and KEGG Drug ID|
+|level1_code  |string  |First level classification code|
+|level1       |string  |First level ATC classification - Main anatomical group|
+|level2_code  |string  |Second level classification code|
+|level2       |string  |Second level ATC classification - Therapeutic subgroup|
+|level3_code  |string  |Third level classification code|
+|level3       |string  |Third level ATC classification - Therapeutic/pharmacological subgroup|
+|level4_code  |string  |Fourth level classification code|
+|level4       |string  |Fourth level ATC classification - Chemical/therapeutic/pharmacological subgroup|
+
+
+## Important notes
+
+### General Information:
+
+This dataset contains Anatomical Therapeutic Chemical (ATC) Classification System codes for therapeutic compounds. This code system is a method by which the World Health Organization (WHO) classifies the active ingredients of drugs based on where in the body they act and their method of action, drug target, and/or chemical properties. It is not a therapeutic classification per se, and it may not provide information on what disease a drug is used for. However, this dataset may be useful in combination with other data sources for linking drugs to therapeutic uses.
+
+The classification system is a funnel type of classification, with the broadest category assigned at the First Level (where in the body the drug acts) and each subsequent tier becoming more specific. The Fifth Level code should be specific to each chemical compound, although some compounds may have multiple classifications.
+
+### Sources to better understand the ATC classification system:
+
+* Wikipedia: https://en.wikipedia.org/wiki/Anatomical_Therapeutic_Chemical_Classification_System
+* WHO portal: http://apps.who.int/medicinedocs/en/d/Js4876e/6.2.html
+* WHO Collaborating Centre: https://www.whocc.no/atc_ddd_index/
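Because each ATC level extends the code of the level above it, the level 1 to level 4 codes can be read off a fifth-level code as prefixes of 1, 3, 4, and 5 characters, the same widths used in `atc_codes_cleanup.rmd`. A minimal base R sketch; the helper `atc_prefixes` is hypothetical and is not part of the cleanup script:

```r
# Hypothetical helper: derive the level 1-4 codes from fifth-level ATC codes
atc_prefixes <- function(code) {
  # Fifth-level codes are expected to be 7 characters long
  stopifnot(is.character(code), all(nchar(code) == 7))
  data.frame(
    level1_code = substr(code, 1, 1),
    level2_code = substr(code, 1, 3),
    level3_code = substr(code, 1, 4),
    level4_code = substr(code, 1, 5),
    level5_code = code,
    stringsAsFactors = FALSE
  )
}

atc_prefixes(c("A10BA02", "N02BE01"))
```

A check like this could be used to confirm that the `level*_code` columns in `atc_codes_clean.csv` are consistent with `level5_code`.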
diff --git a/datadictionaries/drug_list.md b/datadictionaries/drug_list.md
@@ -0,0 +1,34 @@
+# Drug List (FDA Approved drugs by date and therapeutic indication)
+
+## Data files ([available at data.world](https://data.world/data4democracy/drug-spending))
+
+* JSON: `drug_list.json`
+
+## Link(s) to code used for scraping, tidying, etc, if applicable:
+
+* `NA`
+
+## Data types
+* **string**: a sequence of characters
+
+
+## Field listing
+|Name                |Type    |Description|
+|--------------------|--------|-----------|
+|name                |string  |Drug name, often formatted "brand (generic/nonproprietary)." |
+|approval_status     |string  |Date approved (month and year). Most rows start with "Approved", although not all. |
+|company             |string  |Drug maker/distributor |
+|specific_treatment  |string  |Therapeutic area of drug. Most descriptions mention a specific disease. |
+
+
+## Important notes
+
+Data source: http://www.centerwatch.com/drug-information/fda-approved-drugs/therapeutic-areas
+
+Dataset contains information on FDA-approved drugs, date approved, company that produces/distributes the drug, and therapeutic use of drug.
+
+### To do to improve this dataset:
+* Could be useful to re-scrape the data to get the broad categories for the drugs as seen on the website (Cardiology/Vascular Diseases, Dental and Oral Health, Dermatology, etc)
+* Split the name column into two columns: brand name and the generic/nonproprietary name
+* Clean up the approval_status column (maybe split into two) and turn date approved into its own separate column
+* Need some kind of text analysis to turn the specific_treatment into usable categories. For example, as the data is right now, HIV drugs are described in a number of ways, including (but not limited to): "AIDS, HIV Infection", "AIDS/HIV infection", "HIV", "HIV infection", etc. and they can all be broadly categorized as "HIV" or "HIV/AIDS"
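Two of the to-do items for `drug_list.json` can be sketched with base R regular expressions: splitting `name` into brand and generic parts, and collapsing the HIV-related descriptions into one category. This is a rough starting point, not a full text analysis. The `name` values and the "Asthma" entry are made up; the four HIV strings are the ones quoted in the to-do list:

```r
# Made-up names in the "brand (generic/nonproprietary)" layout
name <- c("Brandname (genericname)", "Nameonly")

# Brand: everything before a trailing parenthesized part
has_generic <- grepl("\\(.*\\)\\s*$", name)
brand <- trimws(sub("\\s*\\(.*\\)\\s*$", "", name))

# Generic: the text inside the trailing parentheses, NA when absent
generic <- ifelse(has_generic,
                  sub("^.*\\((.*)\\)\\s*$", "\\1", name),
                  NA_character_)

# Collapse any description mentioning HIV or AIDS into one label
treatment <- c("AIDS, HIV Infection", "AIDS/HIV infection",
               "HIV", "HIV infection", "Asthma")
category <- ifelse(grepl("\\b(HIV|AIDS)\\b", treatment, perl = TRUE),
                   "HIV/AIDS", treatment)
```

The keyword match is case-sensitive and uses word boundaries, so it would miss lowercase spellings; other therapeutic areas would each need their own pattern or a lookup table.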