The problem: you have a molecule of interest (a protein, perhaps) and you want to collect drug-like molecules that target it. ChEMBL (https://www.ebi.ac.uk/chembl/) is a curated database of drug-like molecules. You go there, look up your molecule of interest, and now you have to manually download 7,000 drugs.
This script queries the ChEMBL database and retrieves, in SMILES string format, all small molecules that correspond to a set of ChEMBL IDs. The input is a text file containing the ChEMBL IDs that you want to use as targets. The outputs are:
- a CSV file, named data.csv, containing the data retrieved from the ChEMBL API in tabular format, with the following columns:
ChEMBL ID, Name, Max Phase, Molecular Formula, Molecular Weight, and SMILES.
Each row in the CSV file represents one molecule, with a value for each column taken from the retrieved data.
- a text file, named stats.txt, containing a brief statistical summary of the retrieved data in tab-separated format, with the following columns:
ChEMBL ID, Output File, and Number of Retrieved SMILES.
The ChEMBL ID column contains the ChEMBL ID for which data was retrieved, the Output File column contains the name of the CSV file where the data was saved (data.csv in this case), and the Number of Retrieved SMILES column contains the total number of retrieved SMILES (i.e., the total number of rows) in the data.csv file. The values in each row are separated by tabs ('\t').
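As an illustration of the two output formats described above, here is a minimal sketch that writes them with Python's standard csv module. The function name write_outputs, the list-of-dicts input, and the single-target summary row are assumptions made for this example. They are not taken from scrape.py, which may organise this differently.

```python
import csv

# Column names as listed in the description of data.csv and stats.txt.
DATA_COLUMNS = [
    "ChEMBL ID", "Name", "Max Phase",
    "Molecular Formula", "Molecular Weight", "SMILES",
]
STATS_COLUMNS = ["ChEMBL ID", "Output File", "Number of Retrieved SMILES"]


def write_outputs(target_id, molecules, data_path="data.csv", stats_path="stats.txt"):
    """Write molecule rows to a CSV file and a one-row summary to a TSV file.

    `molecules` is a hypothetical list of dicts keyed by DATA_COLUMNS.
    """
    # One row per molecule, comma-separated, with a header row.
    with open(data_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=DATA_COLUMNS)
        writer.writeheader()
        writer.writerows(molecules)

    # Tab-separated summary: target ID, output file name, number of rows.
    with open(stats_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(STATS_COLUMNS)
        writer.writerow([target_id, data_path, len(molecules)])
```

Using csv.DictWriter keeps the column order fixed and quotes any values that happen to contain commas.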
To run the script, you need:
- Python installed on your system and added to your system's PATH environment variable.
- the scrape.py file in the current directory.
- a text file called input.txt in the same directory as the scrape.py file. Alternatively, you can specify the full path to the input.txt file on the command line. The input.txt file contains the ChEMBL IDs (targets) that you wish to retrieve molecules for.
Correct formatting for the input.txt file: one ChEMBL ID per line, with no extra characters or spaces anywhere in the file. For example:
CHEMBL12345
CHEMBL67890
CHEMBL54321
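The format rules above can be checked before any request is sent. The following is a minimal sketch of such a check. The function name read_chembl_ids, the regular expression, and the choice to tolerate blank lines are assumptions made for illustration, not the behaviour of scrape.py.

```python
import re

# A ChEMBL ID is the prefix "CHEMBL" followed by digits, e.g. CHEMBL12345.
CHEMBL_ID_PATTERN = re.compile(r"CHEMBL\d+")


def read_chembl_ids(path):
    """Read one ChEMBL ID per line and reject lines with extra characters."""
    ids = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            value = line.rstrip("\r\n")
            if value == "":
                continue  # assumption: blank lines are ignored
            if not CHEMBL_ID_PATTERN.fullmatch(value):
                raise ValueError(
                    f"line {line_number}: {value!r} is not a valid ChEMBL ID"
                )
            ids.append(value)
    return ids
```

Because fullmatch is used, a line with leading or trailing spaces raises an error that names the offending line instead of producing a malformed request later.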
By default, the parameters are:
Molecular weight <=600
RO5 violations = 0
Max phase = 4
For the moment, you can adjust the filters applied to the ChEMBL results (molecular weight, RO5 violations, maximum phase, etc.) by changing the values of the variables in the URL (e.g., max_phase=3 to get molecules that reached phase 3 clinical trials). In a future commit, the parameters will be set by the user in the input.txt file. https://github.com/IoDiakou/Scraping-data-using-the-ChEMBL-database-API/blob/96f7b06b7bcedd72d61cbe552908fe052097863c/scrape.py#L32
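To show how the filter values end up in the query string, here is a minimal sketch that builds a filtered URL from the three default parameters. The endpoint and the parameter names (molecule_properties__full_mwt__lte, molecule_properties__num_ro5_violations) are assumptions based on the filter syntax of the ChEMBL web services. They are not copied from scrape.py, and the sketch leaves out the part of the query that ties molecules to a target, so check the linked line for the URL the script actually uses.

```python
from urllib.parse import urlencode

# Assumed endpoint; the URL used in scrape.py may differ.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data/molecule.json"


def build_query_url(max_phase=4, max_mol_weight=600, ro5_violations=0,
                    base_url=BASE_URL):
    """Build a filtered query URL from the three default parameters."""
    params = {
        "max_phase": max_phase,
        # "__lte" means "less than or equal to" in this filter syntax.
        "molecule_properties__full_mwt__lte": max_mol_weight,
        "molecule_properties__num_ro5_violations": ro5_violations,
    }
    return f"{base_url}?{urlencode(params)}"


# Example: relax the maximum phase to 3 and keep the other defaults.
print(build_query_url(max_phase=3))
```

Keeping the filters in a dictionary means a later version could fill them from input.txt without touching the URL-building code.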
You can run the script from the command line as follows:
python scrape.py input.txt
Alternatively, if the input.txt file is in a different directory than the scrape.py file:
python scrape.py /path/to/input.txt
where /path/to/input.txt should be replaced with the actual full path to the input.txt file on your system.
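Both invocations pass the input file as the single command-line argument. A minimal sketch of how a script could validate that argument is shown below. The function name resolve_input_path and the error messages are hypothetical, and scrape.py may handle its arguments differently.

```python
import sys
from pathlib import Path


def resolve_input_path(argv):
    """Return the input file path given on the command line, or exit with a message."""
    # argv[0] is the script name, so exactly one extra argument is expected.
    if len(argv) != 2:
        raise SystemExit("usage: python scrape.py <path to input.txt>")
    path = Path(argv[1]).expanduser()
    if not path.is_file():
        raise SystemExit(f"input file not found: {path}")
    return path


if __name__ == "__main__":
    print(resolve_input_path(sys.argv))
```

A relative name such as input.txt is resolved against the directory the command is run from, which is why the file needs to be in the current directory unless a full path is given.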
Planned improvements:
- extend error handling
- DRY up repeated parts of the code
- incorporate API throttling
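One simple way to throttle API requests is to enforce a minimum interval between consecutive calls. The sketch below is only an illustration of that idea. The Throttle class and its parameters are hypothetical and are not part of scrape.py, and the interval would need to be chosen to suit the API's usage guidelines.

```python
import time


class Throttle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Usage: call throttle.wait() immediately before each request.
throttle = Throttle(min_interval=0.5)
```

The first call never sleeps, and later calls sleep only for the part of the interval that has not already elapsed.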