Faster MCS Calculation #6
Comments
Hi @kexul - have you tried matchAtoms on molecules without hydrogens, and then using the resulting MCS to build a prematch to pass as a keyword argument to another call to matchAtoms using the complete molecules? I think that could work. Note that this approach may not give you the same MCS that you get by calling matchAtoms() on the full molecules. If you can share examples of molecules that are slow to match, I may be able to advise on ways to speed things up.
Hi @jmichel80 - Yes! I've used this approach for a long time. But it often leads to the prematch not being found in the RDKit search results, and then Sire is used to find the full MCS (adding hydrogens to the MCS). I'm not sure what algorithm Sire uses, but it's painfully slow.
I'll try to provide some examples but it may take some time. 😢
Thanks, @kexul, I'll look into this. Given that we use the same tools behind the scenes it should be easy enough to implement (or test) the approach that you suggest. Before we switched to using RDKit for MCS (we used to just use the Sire functionality) we had custom code to do something similar to what you suggest. The issue we found was, as @jmichel80 says, that the MCS might differ from the one found by the full molecule search. In order to reliably get the mappings that we wanted we ended up doing both approaches and taking the larger mapping, which is obviously very wasteful.
@jmichel80: The prematch approach won't work since there's no way to use it as a seed for the MCS. Unfortunately we just use it retrospectively, i.e. checking that the MCS generated by RDKit does contain the prematch that the user wants. (If there are multiple MCS with the same size, then we return the one that matches the prematch, if it exists.) It could well be that the algorithm passed through a state containing the prematch during its search, but we won't know this, since we only get the final result. If nothing matches the prematch, then we fall back on Sire, which does check the prematch at each stage of the MCS algorithm, i.e. as it grows in size.
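The retrospective check described above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the actual BioSimSpace code: mappings are assumed to be dicts of atom index to atom index, and `contains_prematch` and `select_mapping` are hypothetical helper names.

```python
from typing import Dict, List, Optional

# Assumption: a mapping is a dict from atom index in molecule 0
# to atom index in molecule 1.
Mapping = Dict[int, int]


def contains_prematch(mapping: Mapping, prematch: Mapping) -> bool:
    """Return True if every prematch pair is present in the mapping."""
    return all(mapping.get(i) == j for i, j in prematch.items())


def select_mapping(candidates: List[Mapping], prematch: Mapping) -> Optional[Mapping]:
    """Pick the first candidate mapping that contains the prematch.

    Returns None when no candidate matches, which is the point where a
    fallback search that enforces the prematch while growing is needed.
    """
    for mapping in candidates:
        if contains_prematch(mapping, prematch):
            return mapping
    return None
```

Because the check only sees finished candidates, a prematch that was satisfied at an intermediate stage of the search is invisible to it, which is why the fallback is triggered so often.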
@kexul, by how much do you have to increase the timeout argument to reach completion for your examples?
Sorry, it is taking much longer than I thought to find a proper example. I'll need some time to dig through my old files...
I see the RDKit MCS code has an |
The current matchAtoms() uses the entire molecule to get the MCS. A simple idea to accelerate it is to get the MCS for the heavy atoms only and then match up the hydrogens hanging off the MCS.
See this implementation for example:
https://github.com/OpenFreeEnergy/Lomap/blob/84e1bb0d20ceeed835166fe7218dd21f04248030/lomap/mcs.py#L1170
I didn't quantitatively test the speedup (no benchmark found), but it performed very well for large molecules in my project (the current matchAtoms() often times out).
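To illustrate the second step of this idea, here is a minimal sketch in plain Python. It is not taken from the linked Lomap code or from BioSimSpace: `extend_with_hydrogens` is a hypothetical helper, and the inputs are assumed to be a heavy-atom mapping plus, for each molecule, a dict from heavy-atom index to the indices of its attached hydrogens. Hydrogens are paired in listed order, so geometry and stereochemistry are ignored.

```python
from typing import Dict, List


def extend_with_hydrogens(
    heavy_mapping: Dict[int, int],
    hydrogens0: Dict[int, List[int]],
    hydrogens1: Dict[int, List[int]],
) -> Dict[int, int]:
    """Extend a heavy-atom mapping with the hydrogens bonded to mapped atoms.

    heavy_mapping: heavy-atom index in molecule 0 -> index in molecule 1.
    hydrogens0/1:  heavy-atom index -> indices of its attached hydrogens.
    """
    # Work on a copy so the caller's heavy-atom mapping is left unchanged.
    mapping = dict(heavy_mapping)
    for atom0, atom1 in heavy_mapping.items():
        h0 = hydrogens0.get(atom0, [])
        h1 = hydrogens1.get(atom1, [])
        # zip stops at the shorter list, so surplus hydrogens stay unmapped.
        for i, j in zip(h0, h1):
            mapping[i] = j
    return mapping
```

The heavy-atom MCS itself would still come from the existing search; the saving is that the search runs on far fewer atoms, and the hydrogen step above is linear in the size of the mapping.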