XlogP node generates different values #5

webbres · 2022-06-23T22:26:02Z

And why is it calling a copied version of the class named SmartXLogPDescriptor instead of calling the CDK implementation?

johnmay · 2022-06-29T09:29:35Z

The cdk one now uses smarts as well, I will check back in see if there was a regression but TBH it should really just use JPLogP. Again could be something subtle like the aromatic bond status if coming from molfile or missing expl H which reading the CDK needs to be present.

yeah Xlogp sucks in my opinion

webbres · 2022-06-29T11:11:21Z

👍 - I was going to expose the new descriptors but at the moment I'm working my way through the 'breaking' changes by switching out to 2.7.

johnmay · 2022-06-30T09:44:27Z

More investigation needed.

The LogP on PubChem is 0.72, PC also reports XLogP3 as 0.8 so shows the implementation still isn't perfect but 0.47 is closer :-).

For ref: ALogP - 0.7122999999999997
For ref: JPLogP - 0.47

The JP is probably over fitted on XLogP, IRRC he left out CDK ALogP because it was broken at the time.

There is a test case for this though that notes:

        IAtomContainer mol = sp.parseSmiles("O=C(O)c1[nH0]cccc1"); // xlogp training set molecule no688

It was ignored as [nH0] used to be rejected.

johnmay · 2022-06-30T10:09:49Z

cdk/cdk@1d71596 before cdk/cdk@30738c7 - 0.47
cdk/cdk@142caf1 before cdk/cdk@bf386e2 - 0.47
... (somewhere here) ...
cdk/cdk@f8f22b0 - -1.69
cdk/cdk@76d03d7 - -1.69
9779b93 (before cdk/cdk@76d03d7) - -1.69

Regression somewhere here:
https://github.com/cdk/cdk/compare/f8f22b01a2534d915abdc39f84c0eac4accfe14d..142caf10b66f4fc29adda1bb5a72f54e2593a129

johnmay · 2022-06-30T11:10:54Z

There were no changes between cdk/cdk@f8f22b0 and cdk/cdk@142caf1 in the XLogP descriptor so it is something external. 2 year gap... so need a better approach to find the issue.

johnmay · 2022-06-30T11:18:20Z

OK tracked it down to the AminoAcid smarts check

atom[0] 0.00 => -0.40
atom[1] -0.40 => -0.43
atom[2] -0.43 => -0.34
atom[3] -0.34 => -0.17
atom[4] -0.17 => -0.66
atom[5] -0.66 => -0.54
atom[6] -0.54 => -0.20
atom[7] -0.20 => 0.14
atom[8] 0.14 => 0.47
atom[9] 0.47 => 0.47
atom[10] 0.47 => 0.47
atom[11] 0.47 => 0.47
atom[12] 0.47 => 0.47
atom[13] 0.47 => 0.47
 PAIR CHECK => 0.47
 alpha amino acid  => -1.69
 paba  => -1.69
 salicylFlag  => -1.69
 OccO  => -1.69

Hmm I'm not sure that molecule should be an amino acid :-)

johnmay · 2022-06-30T11:50:13Z

OK found it and can see how it happened. I'll can post a PR but not super happy as in my world this pattern is NOT an amino acid, indeed checking the XLogP paper it shows it should NOT be -1.69. Also shows the implementation is still not quite right and TBH so much of the CDK descriptors are partially finished I wouldn't trust them at all.

For reference:

How the regression happened. We updated the query atom container creation "QueryAtomContainer.createBasic" to include aromaticity status on the atoms (cdk/cdk@b154fea#diff-6ef2498d7f5198cfda274535ea47ff7b916aa4c6f1a9aa5b50e9e26b3915cde5R50). Previously it would create queries like this "[#6]:[#6]" - since an aromatic bond implies it's atoms here are aromatic it's cleaner to write this as "c:c". This is all fine except the case where the caller then tries to mess around with the pattern after the fact. In XLogP it would convert NCC(=O)O to [#7]-[#6]-[#6](=[#8])-[#8] then loop through and change the NC bond to allow aromatic [#7]-,:[#6]-[#6](=[#8])-[#8].

Here was where it was dicking around with the query:

if ((bondAtom0.getAtomicNumber() == IElement.C && bondAtom1.getAtomicNumber() == IElement.N)
          || (bondAtom0.getAtomicNumber() == IElement.N && bondAtom1.getAtomicNumber() == IElement.C)
          && bond.getOrder() == IBond.Order.SINGLE) {
      aminoAcid.removeBond(bondAtom0, bondAtom1);
      QueryBond qbond = new QueryBond(bondAtom0, bondAtom1, Expr.Type.SINGLE_OR_AROMATIC);
      aminoAcid.addBond(qbond);
      break;
  }

Then later on I can and removed all this special query manipulation and replaced it with a SMARTS pattern. Since this was now a no-op (can't have aromatic bond between aliphatic atoms) I made it:

N-C-C(=O)-[O;X2H1+0,X1H0-]

A more accurate pattern to what it needed would have been before the basicQueryContainer change would be:

[#7]-,:[#6]-C(=O)-[O;X2H1+0,X1H0-]

Indeed changing it to that reverts the value to -1.69 but as I highlight above this is wrong anyways.

johnmay · 2022-06-30T13:16:15Z

Ref molecule is picolinic acid 689 not 688 (as per test case). What a mess :-)

webbres · 2022-07-26T19:15:57Z

Thanks for the info John.

If there's already a going to be a some loss in functionality with the upgrade of the KNIME nodes to CDK v2 (ambit/sketcher) maybe there's not all that much utility in bringing some of the nodes with better replacements forward.

Could just bin the XLogP node and create a JPLogP (or just a molecular descriptors node that allows the selection of JPLogP

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XlogP node generates different values #5

XlogP node generates different values #5

webbres commented Jun 23, 2022

johnmay commented Jun 29, 2022

webbres commented Jun 29, 2022

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022 •

edited

Loading

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022 •

edited

Loading

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022

webbres commented Jul 26, 2022

XlogP node generates different values #5

XlogP node generates different values #5

Comments

webbres commented Jun 23, 2022

johnmay commented Jun 29, 2022

webbres commented Jun 29, 2022

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022 • edited Loading

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022 • edited Loading

johnmay commented Jun 30, 2022

johnmay commented Jun 30, 2022

webbres commented Jul 26, 2022

johnmay commented Jun 30, 2022 •

edited

Loading

johnmay commented Jun 30, 2022 •

edited

Loading