---
title: "Mapping DrugBank Compounds with UniChem"
output:
  html_document:
    theme: cosmo
    highlight: pygments
---
For more info or as a citation, please see:
+ Himmelstein DS, Brueggeman LA, Baranzini SE (2015) **Repurposing drugs on a heterogeneous network**. *ThinkLab*. [doi:10.15363/thinklab.4](https://dx.doi.org/10.15363/thinklab.4)
Specifically, this notebook corresponds to a discussion on [unifying compound vocabularies using UniChem](//thinklab.com/discussion/unifying-drug-vocabularies/40).
First, we used [UniChem to match DrugBank compounds](//nbviewer.ipython.org/url/git.dhimmel.com/drugbank/unichem-map.ipynb) to external resources. We used the UniChem connectivity search, which allows fuzzy matching.
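To illustrate what matching at the connectivity level means, here is a minimal sketch in base R. It assumes that connectivity is captured by the first 14-character block of a standard InChIKey, so compounds that differ only in stereochemistry or isotopes share that block. The helper `connectivity_block` and the keys in `toy.keys` are made up for illustration; the actual matching was performed by the UniChem service, not by this code.

```r
# Hypothetical helper: the first 14 characters of an InChIKey
# hash the connectivity (skeleton) layer of the structure.
connectivity_block <- function(inchikey) {
  substr(inchikey, 1, 14)
}

# Made-up keys for illustration only (not real compounds).
toy.keys <- c(
  racemate   = 'AAAAAAAAAAAAAA-UHFFFAOYSA-N',
  enantiomer = 'AAAAAAAAAAAAAA-JTQLQIEISA-N',
  unrelated  = 'BBBBBBBBBBBBBB-UHFFFAOYSA-N'
)

# Exact matching treats the racemate and the enantiomer as different.
exact.match <- toy.keys[['racemate']] == toy.keys[['enantiomer']]

# Connectivity-level matching treats them as the same skeleton.
fuzzy.match <- connectivity_block(toy.keys[['racemate']]) ==
  connectivity_block(toy.keys[['enantiomer']])

# The unrelated compound does not match under either scheme.
unrelated.match <- connectivity_block(toy.keys[['racemate']]) ==
  connectivity_block(toy.keys[['unrelated']])
```

Under this assumption, a single query compound can legitimately match several entries in the same source, which is why the match counts later in this notebook can exceed one.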
```{r, message=FALSE}
library(dplyr)
library(ggplot2)
library(DT)
library(reshape2)
```
## Read all DrugBank approved small molecules with annotated structures.
```{r}
drugbank.df <- file.path('data', 'drugbank.tsv') %>%
  read.delim(stringsAsFactors=FALSE, na.strings='') %>%
  dplyr::filter(type == 'small molecule') %>%
  dplyr::filter(grepl('approved', groups)) %>%
  dplyr::filter(! is.na(inchikey))

count.df <- file.path('data', 'mapping-counts.tsv') %>%
  read.delim(stringsAsFactors=FALSE, check.names=FALSE)

count.df <- drugbank.df %>%
  dplyr::rename(drugbank_name = name) %>%
  dplyr::left_join(count.df)
```
## The external sources that [we want to map to](//thinklab.com/discussion/unifying-drug-vocabularies/40)
```{r}
sources <- c('chembl', 'drugbank', 'fdasrs', 'pubchem', 'lincs')
```
## Percent of approved small molecules in DrugBank matched to each external resource
```{r}
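# Percent of compounds with at least one match in each source.
# across() replaces the deprecated summarise_each(funs(...)) (dplyr >= 1.0).
count.df %>%
  dplyr::select(one_of(sources)) %>%
  dplyr::summarise(dplyr::across(dplyr::everything(), ~ mean(. > 0) * 100)) %>%
  knitr::kable()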
```
A small number of DrugBank approved small molecules (`r sum(count.df$drugbank == 0)`) do not map to DrugBank. This appears to occur because these compounds do not contain structural information in the DrugBank database.
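As a small worked example of the calculation above, the sketch below applies the same `mean(. > 0) * 100` idea to a made-up table of match counts, using base R only. The data frame `toy.counts` and its values are invented for illustration; only the column names mirror two of the sources used here.

```r
# Invented match counts for four hypothetical compounds.
toy.counts <- data.frame(
  drugbank = c(1, 1, 0, 2),
  chembl   = c(1, 0, 0, 3)
)

# A compound counts as matched when it has at least one hit, so the mean
# of the logical vector (count > 0) is the matched fraction.
percent.matched <- sapply(toy.counts, function(x) mean(x > 0) * 100)

# Number of compounds with no match in a given source,
# analogous to sum(count.df$drugbank == 0).
n.unmatched <- sum(toy.counts$drugbank == 0)
```

Here three of the four toy compounds have a `drugbank` match and two of the four have a `chembl` match, so `percent.matched` holds 75 and 50, and `n.unmatched` is 1.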
## The number of compounds matching each approved small molecule in DrugBank
```{r}
count.df %>%
  dplyr::select(drugbank_id, drugbank_name, one_of(sources)) %>%
  DT::datatable()
```
## The distribution of compounds mapped per approved small molecule in DrugBank
```{r, message=FALSE, fig.height=3, fig.width=9}
# One histogram panel per source, with unit-width bins centered on integers.
# boundary replaces the deprecated origin argument of geom_histogram().
count.df %>%
  dplyr::select(one_of(sources)) %>%
  reshape2::melt(variable.name = 'source', value.name = 'count') %>%
  ggplot(aes(count)) + theme_bw() +
  geom_histogram(binwidth = 1, boundary = -0.5, alpha = 0.6, col='black') +
  facet_wrap(~ source, scales='free_x', nrow=1) +
  xlab('Matches per DrugBank compound') + ylab('Count')
```
## Read the mapping file
```{r}
mapping.df <- file.path('data', 'mapping.tsv.gz') %>%
  read.delim(stringsAsFactors=FALSE)
```
## Small molecules in DrugBank that matched multiple DrugBank IDs
```{r}
mapping.df %>%
  dplyr::filter(source_name == 'drugbank') %>%
  dplyr::inner_join(drugbank.df %>% dplyr::select(drugbank_id, type, groups)) %>%
  dplyr::group_by(drugbank_id, drugbank_name) %>%
  dplyr::summarise(
    n_matches = n(),
    matches = paste(src_compound_id, collapse = '|')
  ) %>%
  dplyr::filter(n_matches > 1) %>%
  DT::datatable()
```
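The grouping step above can be sketched on a toy table in base R. The data frame `toy.mapping` and the IDs in it are invented for illustration; only the column names `drugbank_id` and `src_compound_id` are taken from the code above.

```r
# Invented mapping rows: one row per (query compound, matched compound) pair.
toy.mapping <- data.frame(
  drugbank_id     = c('DB_A', 'DB_A', 'DB_B'),
  src_compound_id = c('DB_A', 'DB_C', 'DB_B'),
  stringsAsFactors = FALSE
)

# Count matches per query compound.
n.matches <- tapply(toy.mapping$src_compound_id, toy.mapping$drugbank_id, length)

# Join the matched IDs for each query compound with '|'.
matches <- tapply(toy.mapping$src_compound_id, toy.mapping$drugbank_id,
                  paste, collapse = '|')

# Keep only the query compounds that matched more than one DrugBank ID.
multi <- matches[n.matches > 1]
```

In this toy table only `DB_A` is retained, because it matches both itself and `DB_C`, while `DB_B` matches only itself.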