Clarification on how to use FS-MOL in a regression context #57

Open
juliabuhmann opened this issue May 2, 2023 · 2 comments

Comments

@juliabuhmann

Thanks a lot for providing FS-mol. Very valuable to the community!

I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says these are IC50/EC50 values:

ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).

However, the code here suggests that percentage may also have been used as a unit when the dataset was created:

if df.iloc[0]["standard_units"] == "%":
which is totally fine, I guess, when it is only used to extract activity / non-activity :)
A spot check of some assays in the train task list shows that there are indeed assays that use % as the unit, e.g.:

I am not sure to what extent it makes sense to apply a log-transformation to percentage values in the range [0, 100]. However, this is done in the FS-MOL dataset, and the community is slowly starting to do the same (I guess because only IC50 / EC50 values are assumed?) --> https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)

but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well

The community is slowly starting to use FS-MOL in a regression context as well. It would be great to get clarification on this IC50 / EC50 versus percentage issue, or perhaps to have those assays explicitly labeled.
Thanks a lot for looking into that. Greatly appreciated!

@megstanley
Contributor

Hi,

You are correct: % values are also used in the dataset, and with a threshold this is fine for the binary classification task. However, the log-transformation is not used on these values -- that transformation is only used on IC50 or EC50 data (please see the data cleaning and preprocessing scripts).
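
A minimal sketch of the unit-dependent handling described above, not the actual FS-Mol preprocessing code. The field name `standard_units` comes from the snippet quoted earlier in this thread; `standard_value`, the `nM` unit, the function name, and the `kind` tags are assumptions made for illustration.

```python
import math
from typing import Tuple


def regression_target(record: dict) -> Tuple[float, str]:
    """Return (target, kind) for one measurement record.

    Hypothetical helper: `standard_value` and the kind tags are assumptions.
    """
    value = float(record["standard_value"])
    if record["standard_units"] == "%":
        # Percentage readouts are kept on their original scale.
        return value, "percent"
    # Concentration-type readouts (IC50 / EC50) are log-transformed.
    return math.log10(value), "log10_concentration"


# Illustrative records only; the values are made up.
records = [
    {"standard_units": "nM", "standard_value": 100.0},
    {"standard_units": "%", "standard_value": 85.0},
]
targets = [regression_target(r) for r in records]
```

Returning the kind next to the value keeps the unit visible to whatever consumes the target, so the two scales are less likely to be pooled by accident.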

@juliabuhmann
Author

Thanks for your quick reply and for looking into that. Yes, I agree that the thresholding makes sense. All good on that side.

My sentence "However, this is done in the FS-MOL dataset" was probably a bit confusing.
What I meant was that the assay dump (the JSON files) has the entry LogRegressionProperty by default, even for the %-assays, which could be misleading.

Just as background: I assumed the data were IC50/EC50 potencies and log-transformed them on my side (XC50 --> pXC50). The resulting distribution for the train set looks weird (naturally, once you know that some of the data comes from %-assays), and the activity label (the bool label) does not make sense at first glance for the compounds coming from the %-assays (with that additional knowledge, the binary activity label does make sense, of course).

[image: distribution of the log-transformed values for the train set]
(Note that I only plotted the train set here, so this might only be an issue for assays in the train set.)
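
To illustrate why the pooled values can look odd, here is a small worked sketch. It assumes concentrations are given in nM (an assumption; the unit is not stated in this thread), so that pXC50 = 9 - log10(XC50 in nM). Under that assumption, any percentage between 1 and 100 that is passed through the same formula ends up between 7 and 9, so it reads like a fairly potent compound regardless of what was actually measured. The function name is hypothetical.

```python
import math


def pxc50_from_nm(value_nm: float) -> float:
    """pXC50 = -log10(XC50 in mol/L) = 9 - log10(XC50 in nM)."""
    return 9.0 - math.log10(value_nm)


# A genuine potency: 1000 nM (1 uM) gives a pXC50 of 6.
true_potency = pxc50_from_nm(1000.0)

# Percentage readouts misread as nM: 1 % maps to 9 and 100 % maps to 7.
percent_as_potency = [pxc50_from_nm(p) for p in (1.0, 50.0, 100.0)]
```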

I understand that this is all fine in the context of classification, but in the context of regression it might be important to label the %-assays. Otherwise, downstream metrics could be computed incorrectly or the wrong models built (by assuming the same underlying activity unit for all assays).
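
As a sketch of how such a label could be used downstream: given a mapping from assay ID to unit, tasks can be partitioned so that regression metrics are only computed over assays that share a unit. The mapping, the placeholder assay IDs, and the function name are all hypothetical; FS-Mol does not necessarily expose the unit in this form.

```python
from typing import Dict, List, Tuple


def split_tasks_by_unit(task_units: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Partition assay IDs into (concentration assays, percentage assays)."""
    concentration_tasks: List[str] = []
    percent_tasks: List[str] = []
    for assay_id, unit in sorted(task_units.items()):
        if unit == "%":
            percent_tasks.append(assay_id)
        else:
            concentration_tasks.append(assay_id)
    return concentration_tasks, percent_tasks


# Placeholder IDs and units, not real dataset entries.
task_units = {"ASSAY_A": "nM", "ASSAY_B": "%", "ASSAY_C": "nM"}
concentration_tasks, percent_tasks = split_tasks_by_unit(task_units)
```

The percentage assays can still be used for classification; the split only keeps them out of any metric that assumes a log-concentration target.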
