Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XlogP node generates different values #5

Open
webbres opened this issue Jun 23, 2022 · 9 comments
Open

XlogP node generates different values #5

webbres opened this issue Jun 23, 2022 · 9 comments

Comments

@webbres
Copy link
Contributor

webbres commented Jun 23, 2022

image

And why is it calling a copied version of the class named SmartXLogPDescriptor instead of calling the CDK implementation?

@johnmay
Copy link
Member

johnmay commented Jun 29, 2022

The cdk one now uses smarts as well, I will check back in see if there was a regression but TBH it should really just use JPLogP. Again could be something subtle like the aromatic bond status if coming from molfile or missing expl H which reading the CDK needs to be present.

yeah Xlogp sucks in my opinion

@webbres
Copy link
Contributor Author

webbres commented Jun 29, 2022

👍 - I was going to expose the new descriptors but at the moment I'm working my way through the 'breaking' changes by switching out to 2.7.

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

More investigation needed.

The LogP on PubChem is 0.72, PC also reports XLogP3 as 0.8 so shows the implementation still isn't perfect but 0.47 is closer :-).

For ref: ALogP - 0.7122999999999997
For ref: JPLogP - 0.47

The JP is probably over fitted on XLogP, IRRC he left out CDK ALogP because it was broken at the time.

There is a test case for this though that notes:

        IAtomContainer mol = sp.parseSmiles("O=C(O)c1[nH0]cccc1"); // xlogp training set molecule no688

It was ignored as [nH0] used to be rejected.

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

There were no changes between cdk/cdk@f8f22b0 and cdk/cdk@142caf1 in the XLogP descriptor so it is something external. 2 year gap... so need a better approach to find the issue.

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

OK tracked it down to the AminoAcid smarts check

atom[0] 0.00 => -0.40
atom[1] -0.40 => -0.43
atom[2] -0.43 => -0.34
atom[3] -0.34 => -0.17
atom[4] -0.17 => -0.66
atom[5] -0.66 => -0.54
atom[6] -0.54 => -0.20
atom[7] -0.20 => 0.14
atom[8] 0.14 => 0.47
atom[9] 0.47 => 0.47
atom[10] 0.47 => 0.47
atom[11] 0.47 => 0.47
atom[12] 0.47 => 0.47
atom[13] 0.47 => 0.47
 PAIR CHECK => 0.47
 alpha amino acid  => -1.69
 paba  => -1.69
 salicylFlag  => -1.69
 OccO  => -1.69

Hmm I'm not sure that molecule should be an amino acid :-)

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

OK found it and can see how it happened. I'll can post a PR but not super happy as in my world this pattern is NOT an amino acid, indeed checking the XLogP paper it shows it should NOT be -1.69. Also shows the implementation is still not quite right and TBH so much of the CDK descriptors are partially finished I wouldn't trust them at all.

For reference:
Screenshot 2022-06-30 at 12 37 28

How the regression happened. We updated the query atom container creation "QueryAtomContainer.createBasic" to include aromaticity status on the atoms (cdk/cdk@b154fea#diff-6ef2498d7f5198cfda274535ea47ff7b916aa4c6f1a9aa5b50e9e26b3915cde5R50). Previously it would create queries like this "[#6]:[#6]" - since an aromatic bond implies it's atoms here are aromatic it's cleaner to write this as "c:c". This is all fine except the case where the caller then tries to mess around with the pattern after the fact. In XLogP it would convert NCC(=O)O to [#7]-[#6]-[#6](=[#8])-[#8] then loop through and change the NC bond to allow aromatic [#7]-,:[#6]-[#6](=[#8])-[#8].

Here was where it was dicking around with the query:

if ((bondAtom0.getAtomicNumber() == IElement.C && bondAtom1.getAtomicNumber() == IElement.N)
          || (bondAtom0.getAtomicNumber() == IElement.N && bondAtom1.getAtomicNumber() == IElement.C)
          && bond.getOrder() == IBond.Order.SINGLE) {
      aminoAcid.removeBond(bondAtom0, bondAtom1);
      QueryBond qbond = new QueryBond(bondAtom0, bondAtom1, Expr.Type.SINGLE_OR_AROMATIC);
      aminoAcid.addBond(qbond);
      break;
  }

Then later on I can and removed all this special query manipulation and replaced it with a SMARTS pattern. Since this was now a no-op (can't have aromatic bond between aliphatic atoms) I made it:

N-C-C(=O)-[O;X2H1+0,X1H0-]

A more accurate pattern to what it needed would have been before the basicQueryContainer change would be:

[#7]-,:[#6]-C(=O)-[O;X2H1+0,X1H0-]

Indeed changing it to that reverts the value to -1.69 but as I highlight above this is wrong anyways.

@johnmay
Copy link
Member

johnmay commented Jun 30, 2022

Ref molecule is picolinic acid 689 not 688 (as per test case). What a mess :-)

@webbres
Copy link
Contributor Author

webbres commented Jul 26, 2022

Thanks for the info John.

If there's already a going to be a some loss in functionality with the upgrade of the KNIME nodes to CDK v2 (ambit/sketcher) maybe there's not all that much utility in bringing some of the nodes with better replacements forward.

Could just bin the XLogP node and create a JPLogP (or just a molecular descriptors node that allows the selection of JPLogP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants