Thanks a lot for providing FS-Mol. Very valuable to the community!
I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says those are IC50/EC50 values:
ChEMBL contains the results of many experiments, termed “assays”, each having a unique experiment ID. We retained only those measurements referring to small molecule activity (IC50 or EC50).
However, the code here (FS-Mol/fs_mol/preprocessing/utils/cleaning_utils.py, line 144 in fa336ae) suggests that percentage may also have been used as a unit when the dataset was created, which I guess is totally fine when the values are only used to extract activity / non-activity :)
When checking some assays in the train task list (anecdotally), there are indeed assays that use % as the unit, e.g.:
but we include the regression task (for the actual numeric activity target IC50 or EC50) in our evaluation as well
The community is slowly starting to use FS-Mol in a regression context as well. It would be great to get clarification on this IC50 / EC50 versus percentage issue, or perhaps to have those assays explicitly labeled.
Thanks a lot for looking into that. Greatly appreciated!
You are correct: % values are also used in the dataset, and with a threshold this is fine for the binary classification task. However, the log-transformation is not applied to these values -- it is only applied to IC50 or EC50 data (please see the data cleaning and preprocessing scripts).
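To make the distinction concrete, here is a minimal sketch of a unit-aware transform: concentration values are converted to pXC50, while percentage values are passed through unchanged. This is not the FS-Mol preprocessing code; the function name `to_regression_target`, the unit table, and the returned labels are assumptions made for illustration only.

```python
import math

# Conversion factors to nanomolar. The set of supported units is an
# illustrative assumption, not taken from the FS-Mol cleaning scripts.
TO_NANOMOLAR = {"nM": 1.0, "uM": 1e3, "mM": 1e6}


def to_regression_target(value, unit):
    """Return (target, kind) for one measurement.

    Concentrations become pXC50 = 9 - log10(value in nM), which is the
    same as -log10(value in mol/L). Percentages are returned untouched,
    because a log-transform of a 0-100 readout is not a potency.
    """
    if unit == "%":
        return value, "percent"
    if unit in TO_NANOMOLAR:
        if value <= 0:
            raise ValueError("concentration must be positive to take a log")
        nanomolar = value * TO_NANOMOLAR[unit]
        return 9.0 - math.log10(nanomolar), "pXC50"
    raise ValueError(f"unsupported unit: {unit!r}")


# Usage: 1000 nM and 1 uM are the same concentration, so both give pXC50 = 6.
print(to_regression_target(1000.0, "nM"))
print(to_regression_target(1.0, "uM"))
print(to_regression_target(75.0, "%"))
```

Returning the kind alongside the value keeps the unit visible downstream, so a regression pipeline can skip or separately handle the percentage readouts.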
Thanks for your quick reply and for looking into that. Yes, I agree that the thresholding makes sense. All good on that side.
My sentence "However, this is done in the FS-MOL dataset" was probably a bit confusing.
What I meant was that the assay dump (the JSON files) has the entry LogRegressionProperty by default, even for the %-assays, which could be misleading.
Just as background: I assumed IC50/EC50 potency data and log-transformed it on my side (XC50 --> pXC50). The resulting distribution for the train set looks weird (naturally so, once you know that some data comes from %-assays), and the activity label (bool label) does not make sense at first glance for the compounds coming from the %-assays (the binary activity label does make sense with the additional knowledge, of course).
(Note that I only plotted the train set here; this might only be an issue for assays in the train set.)
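A small worked example shows why the distribution looks odd when a percentage readout is treated as if it were a concentration in nM. The helper `naive_pxc50` is a hypothetical name, and the nM assumption is exactly the mistaken one described above, not something the dataset states.

```python
import math


def naive_pxc50(value_assumed_nm):
    """pXC50 computed as if the input were a concentration in nM."""
    return 9.0 - math.log10(value_assumed_nm)


# Percentage readouts (0 is skipped because log10(0) is undefined).
percent_values = [1.0, 10.0, 50.0, 100.0]
for percent in percent_values:
    print(f"{percent:6.1f} % -> naive pXC50 {naive_pxc50(percent):.2f}")

# Because log10(100) = 2, every percentage in (0, 100] lands at pXC50 >= 7.
lowest = min(naive_pxc50(p) for p in percent_values)
print("lowest naive pXC50:", lowest)
```

Under that mistaken assumption every %-assay compound appears to be a sub-100 nM binder, regardless of its binary activity label, which matches the mismatch between the label and the transformed value described above.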
I understand that this is all good in the context of classification, but in the context of regression it might be important to label the %-assays; otherwise downstream metrics could be computed incorrectly or wrong models could be built (by assuming the same underlying activity unit for all assays).
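One way such labeling could be used downstream is sketched below: tasks are split into regression-ready, percentage, and unknown groups before any regression metric is computed. The mapping `unit_by_assay` is hypothetical (the assay dump itself does not carry the unit, so it would have to come from an external source such as ChEMBL), and the assay IDs are placeholders rather than real FS-Mol tasks.

```python
# Units accepted as concentrations; an illustrative assumption.
CONCENTRATION_UNITS = {"nM", "uM", "mM"}


def split_tasks_by_unit(task_ids, unit_by_assay):
    """Partition task IDs by whether their unit supports pXC50 regression."""
    groups = {"regression_ready": [], "percent": [], "unknown": []}
    for task_id in task_ids:
        unit = unit_by_assay.get(task_id)
        if unit in CONCENTRATION_UNITS:
            groups["regression_ready"].append(task_id)
        elif unit == "%":
            groups["percent"].append(task_id)
        else:
            # Missing or unrecognized unit: do not assume it is IC50/EC50.
            groups["unknown"].append(task_id)
    return groups


# Placeholder IDs and units, for illustration only.
example_units = {"ASSAY_A": "nM", "ASSAY_B": "%", "ASSAY_C": "uM"}
example_tasks = ["ASSAY_A", "ASSAY_B", "ASSAY_C", "ASSAY_D"]
print(split_tasks_by_unit(example_tasks, example_units))
```

Keeping an explicit "unknown" group avoids silently treating unlabeled assays as potency data.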
Thanks a lot for providing FS-Mol. Very valuable to the community!
I am a bit confused about the nature of the numerical_values in the FS-Mol dataset. The paper says those are IC50/EC50 values:
However, the code here suggests that percentage may also have been used as a unit when the dataset was created:
FS-Mol/fs_mol/preprocessing/utils/cleaning_utils.py
Line 144 in fa336ae
When checking some assays in the train task list (anecdotally), there are indeed assays that use % as the unit, e.g.:
Not sure to what extent it makes sense to apply a log-transformation to percentage values in the range [0, 100]. However, this is done in the FS-MOL dataset, and the community is also slowly starting to do that (I guess because only IC50 / EC50 values are assumed?) --> https://github.com/Wenlin-Chen/ADKF-IFT/blob/c96919d553313b267240dc1409ae65160c629aab/fs_mol/data/dkt.py#L111 (the corresponding paper: https://arxiv.org/pdf/2205.02708.pdf)