Information_for_User.txt

READ ME INFORMATION

The zip file included with this report contains materials that allow users to apply the QSAR models for intrinsic hepatic clearance (Clint) and fraction unbound in plasma (Fup) presented in Dawson et al. 2021 (https://doi.org/10.1021/acs.est.0c06117) to novel chemicals. It also contains the information needed to reproduce the predictions for the Tox21 chemicals reported in the paper and in the correction submitted to the journal in June 2021.

The contents of the zip file include:
1. An R-markdown document named "NovelChemicalFupClintPrediction_052421.Rmd"
2. A Clint subfolder
3. A Fup subfolder 
4. An Example Data subfolder
5. A copy of the Supplemental materials file "S2_Dawson et al. Supporting_Information_Revision_Final_Sharing.xlsx"
6. An HTML file named "NovelChemicalFupClintPrediction_052421.html"


The Rmarkdown file reads in the necessary data (means and standard deviations of the training data) and the model files contained in the Clint and Fup subfolders to make predictions for novel chemicals. The means and standard deviations are used to standardize the descriptors of new chemicals against the training data and to determine whether each chemical falls within the applicability domain of the models. The model files are the fitted model objects themselves. The script passes the standardized descriptors to the Random Forest predict function to generate the Clint and Fup predictions.
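The standardization step described above can be sketched as follows. This is an illustration only, written in Python rather than the R used by the actual script. The descriptor names and numeric values are made up, and the applicability-domain rule shown here (every standardized descriptor within a fixed number of standard deviations) is an assumption; this README does not state the criterion the script applies, so consult the Rmarkdown document for the real check.

```python
# Illustrative sketch only; the authors' workflow is the R script in the .Rmd file.

def standardize(values, means, sds):
    """Z-score each descriptor using the training-set mean and standard deviation."""
    return {name: (values[name] - means[name]) / sds[name] for name in means}


def in_domain(z_scores, max_abs_z=3.0):
    """ASSUMED rule: every standardized descriptor lies within +/- max_abs_z.

    The threshold and the rule itself are hypothetical, not taken from the paper.
    """
    return all(abs(z) <= max_abs_z for z in z_scores.values())


# Made-up training statistics and one made-up new chemical.
train_means = {"desc_a": 2.0, "desc_b": 10.0}
train_sds = {"desc_a": 0.5, "desc_b": 4.0}
new_chemical = {"desc_a": 2.5, "desc_b": 30.0}

z = standardize(new_chemical, train_means, train_sds)
print(z)             # standardized descriptor values
print(in_domain(z))  # whether the chemical passes the assumed domain rule
```

In this made-up example the second descriptor sits five standard deviations from the training mean, so the chemical would be flagged as outside the assumed domain.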

As detailed in the Rmarkdown document, users will need to provide Padel and Opera descriptors for new chemicals. We recommend using the standalone Opera application downloadable from https://github.com/NIEHS/OPERA to produce both sets of descriptors. Users will need chemical identifiers and/or SMILES codes for each of their chemicals for the Opera application to work. 
 
In the ExampleData folder, we have included information for the Tox21 chemicals included in the paper as an example. In the folder are:
1. Output from the USEPA CompTox Dashboard 
2. SMILES codes for the chemicals
3. Padel descriptors output by the Opera application
4. Opera descriptors output by the Opera application

The script uses the copy of the supplemental materials from the paper, "S2_Dawson et al. Supporting_Information_Revision_Final_Sharing.xlsx", to translate Clint bin predictions into median Clint predictions in units of uL/min per 10^6 hepatic cells.
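Translating a predicted bin into a median value amounts to a table lookup. The sketch below is an illustration only, in Python rather than the script's R. The bin labels and median values are placeholders invented for the example and are NOT taken from the supplemental spreadsheet; the real mapping is read from that file by the script.

```python
# Illustrative sketch only. Bin labels and medians are hypothetical placeholders,
# not values from the S2 supplemental spreadsheet.
PLACEHOLDER_BIN_MEDIANS = {
    "bin_1": 1.0,    # placeholder median, uL/min per 10^6 hepatic cells
    "bin_2": 10.0,   # placeholder median
    "bin_3": 100.0,  # placeholder median
}


def bins_to_medians(predicted_bins, bin_medians):
    """Replace each predicted bin label with the median Clint of that bin."""
    unknown = sorted(set(predicted_bins) - set(bin_medians))
    if unknown:
        raise KeyError(f"No median available for bins: {unknown}")
    return [bin_medians[b] for b in predicted_bins]


medians = bins_to_medians(["bin_2", "bin_1", "bin_2"], PLACEHOLDER_BIN_MEDIANS)
print(medians)
```

Raising an error on an unrecognized bin label, rather than silently returning a missing value, makes a mismatch between the model output and the lookup table visible immediately.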

Rmarkdown is a simple markup format that combines documentation and R code in a single file, which is why we used it here. It is typically used with RStudio, the free interactive development environment for R. A complete run of the code can also be "knit" into an HTML, Word, or PDF document. Code in an Rmarkdown file is written in shaded blocks called "chunks"; anything outside a chunk is interpreted as text. To run a chunk, press the green arrow in its upper right corner. R output is shown in the console.

By running the code chunks sequentially or all at once, the included Rmarkdown file will automatically pull in the necessary information and output Clint and Fup predictions for the chemicals in separate files, along with files of chemicals for which Padel and Opera predictions are missing. An HTML file of a complete run of this script is included in the .zip file. 

Please note that for new chemicals, users may need to adapt parts of the script, especially if their Opera and Padel descriptors were not produced by the Opera application. In that case, check that the descriptor column names match those the models expect.
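One way to catch column-name mismatches before predicting is to compare the descriptor names in the new data against the names the model was trained on. The sketch below is an illustration only, in Python rather than the script's R. The function names and the descriptor names are hypothetical; in practice the expected names would come from the training statistics or model files in the Clint and Fup subfolders.

```python
# Illustrative sketch only; all names here are hypothetical.

def check_columns(expected, provided):
    """Return (missing, extra): expected names absent from the data, and
    provided names the model does not use. Matching is case-sensitive."""
    provided_set = set(provided)
    expected_set = set(expected)
    missing = [name for name in expected if name not in provided_set]
    extra = [name for name in provided if name not in expected_set]
    return missing, extra


def align_row(row, expected):
    """Order one chemical's descriptor values to match the expected columns."""
    return [row[name] for name in expected]


expected_names = ["desc_a", "desc_b", "desc_c"]
provided_names = ["desc_b", "Desc_A", "desc_c", "other"]

missing, extra = check_columns(expected_names, provided_names)
print("missing:", missing)
print("extra:", extra)
```

Here "Desc_A" differs from "desc_a" only in capitalization, so it is reported both as an extra column and as a missing expected one; renaming it would resolve the mismatch.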

--Update 022024--

A folder called NewPredictions has been added to facilitate making new model predictions. Consult the README in the NewPredictions folder for more information.

For questions about this script, please contact the corresponding author at tornero-velez.rogelio@epa.gov.