Clarification on how to use FS-MOL in a regression context #57

juliabuhmann · 2023-05-02T09:34:25Z

Thanks a lot for providing FS-Mol. Very valuable to the community!

I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says those are IC50/EC50 values:

ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).

However, the code in here points to the fact that percentage as a unit might also have been used during the creation of the dataset:

FS-Mol/fs_mol/preprocessing/utils/cleaning_utils.py

Line 144 in fa336ae

if df.iloc[0]["standard_units"] == "%":

which is totally fine I guess when only using it to extract activity / non-activity :)
When checking some assays in the train task list (anecdotally), there are indeed assays that use % as the unit, e.g.:

https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/assay_chembl_id%3ACHEMBL3591894.
and the corresponding distribution of the numeric_label (and bool_label):

Not sure to what extent it makes sense to apply a log-transformation to percentage values in the range [0, 100]. However, this is done in the FS-MOL dataset, and the community is also slowly starting to do that (I guess because only IC50 / EC50 values are assumed?) --> https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)

but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well

The community is slowly starting to use FS-MOL in a regression context as well. It would be great if we could get clarification around this IC50 / EC50 versus percentage issue, or maybe have those assays explicitly labeled?
Thanks a lot for looking into that. Greatly appreciated!

megstanley · 2023-05-02T14:08:44Z

Hi,

You are correct, there are % values also used in the dataset, and with a threshold this is fine for the binary classification task. However, the log-transformation is not used on these values -- that transformation is only used on IC50 or EC50 data (please see data cleaning and preprocessing scripts).

juliabuhmann · 2023-05-02T15:12:50Z

Thanks for your quick reply and for looking into that. Yes, I agree that the thresholding makes sense. All good on that side.

My sentence "However, this is done in the FS-MOL dataset" was probably a bit confusing.
What I meant was that the assay dump (the JSON files) has the entry LogRegressionProperty by default, even for the %-assays, which could be misleading.

Just as background: I assumed the data were IC50/EC50 potencies and log-transformed them on my side (XC50 --> pXC50). The resulting distribution for the train set looks weird (naturally, once you know that some of the data comes from %-assays), and the activity label (bool-label) does not make sense at first glance for the compounds coming from the %-assays. With the additional knowledge, the binary activity label does make sense, of course.

(Note that I only plotted the trainset here, this might only be an issue for assays in the trainset.)
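To make the XC50 --> pXC50 step above concrete, here is a minimal sketch. It assumes potencies are given in nM, and the helper name `to_pxc50` is made up for illustration; it is not part of FS-Mol. It shows why percentage readings that are silently treated as nM potencies end up squeezed into a narrow, plausible-looking band.

```python
import math

def to_pxc50(value_nm: float) -> float:
    """Convert a potency in nM to pXC50 = -log10(concentration in mol/L)."""
    if value_nm <= 0:
        raise ValueError("potency must be positive to take a logarithm")
    # 1 nM = 1e-9 M, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)

# A genuine IC50 of 100 nM corresponds to a pIC50 of 7.
genuine = to_pxc50(100.0)

# A 50 % reading, wrongly treated as 50 nM, lands at roughly 7.3.
misread = to_pxc50(50.0)

# Every percentage in [1, 100] maps into [7, 9], whatever it really means.
band = (to_pxc50(100.0), to_pxc50(1.0))
```

Because the mis-transformed values fall in a range that looks like ordinary potencies, nothing in the numbers themselves signals that the unit was wrong.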

I understand that this is all good in the context of classification, but in the context of regression it might be important to label the %-assays. Otherwise, downstream metrics could be computed incorrectly or wrong models could be built (by assuming the same underlying activity unit for all assays).
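One possible way to do such labeling on the user side is sketched below. It assumes you have the raw activity rows available as dictionaries carrying the ChEMBL fields `assay_chembl_id` and `standard_units`; the helper names and the toy rows are hypothetical and not part of FS-Mol.

```python
from collections import defaultdict

def units_per_assay(records):
    """Map each assay ID to the set of units reported for its measurements."""
    units = defaultdict(set)
    for record in records:
        units[record["assay_chembl_id"]].add(record["standard_units"])
    return dict(units)

def percent_assays(records):
    """Return the sorted IDs of assays whose measurements are all in '%'."""
    per_assay = units_per_assay(records)
    return sorted(assay for assay, units in per_assay.items() if units == {"%"})

# Toy rows for illustration only; IDs and units are not taken from ChEMBL.
records = [
    {"assay_chembl_id": "ASSAY_A", "standard_units": "%"},
    {"assay_chembl_id": "ASSAY_A", "standard_units": "%"},
    {"assay_chembl_id": "ASSAY_B", "standard_units": "nM"},
]

flagged = percent_assays(records)
```

The flagged IDs could then be excluded from regression benchmarks or evaluated separately, so that potency and percentage targets are never pooled under one metric.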
