Reactome Release 82 #176

Closed
16 tasks done
ukemi opened this issue Jun 6, 2022 · 25 comments

@ukemi

ukemi commented Jun 6, 2022

  • 1. ONGOING- Tickets on GOC side to improve import process
    - [ ] - Replace transports_or_maintains_localization_of relations with has_primary_input: Change all 'transports_or_maintains_localization_of' relations to 'has_primary_input' #200. I believe we also still need to update the ShEx: Update ShEX rules for transporter activity go-shapes#273
    - [X] - Align Reactome annotation of SLC9B2 functions with GO MF terms #173
    - [X] - Reexamine the filtering for drugs: revisit our curation practice and the GO-CAM conversion process. I suspect that we want to be able to assign broader BP terms to pathways and more specific BP terms to reactions contained in those pathways, but we don't always want to do this, so we will need to discuss with Dustin how to do this without confusing the GO-CAM conversion process. R-HSA-974878, R-HSA-5620971, Still say 'NO' to drugs #182
    - [X] - Update users.yaml go-site#1902
    - [X] - generate stats for fraction of GO-CAMs that is complete and the reasons for incompleteness (like the Venn diagram Figure 4(?) in the Good et al. paper)
  • 2. 15/09/2022- BioPax available for test load into Noctua-dev
  • 3. 11/10/2022- Load BioPAX into noctua-dev
  • 4. 11/10/2022- Run ShEx QC and logical error QC on GO-CAMs in noctua-dev.
  • 5. DATE- Fix logical and QC errors at Reactome.
  • 6. 2022-07(18-29)- merge final changes that have been added to the GOC conversion/import code. Can occur after 8.
  • 7. 2022-07-22- Regenerate BioPax--- won't happen this round
  • 8. 2022-07-(25-29)- Rerun BioPax load into Noctua dev---won't happen this round
  • 9. 2022-08-01- Rerun Shex and logical checks- should be clean. If not, return to 4 or punt to next release.-- won't happen this round
  • 10. 2022-08-03- Fix any issues with the GOC conversion code and rerun load and checks. Must be done before 13.
  • 11. 2022-08-05 - Reactome Data freeze - target date to complete 1 - 9?
  • 12. 2022-08-24 - Reactome final slice - absolute deadline for completing 1 - 9.
  • 13. 2022-09-14 - Reactome Release
  • 14. 2022-09-22- Load Reactome Release BioPax into Noctua prod
  • 15. 2022-09-22/3- GO-CAM model release
  • 16. 2022-09-23/4 - spot-check released models for correctness
@ukemi
Author

ukemi commented Jun 6, 2022

@deustp01 and @dustine32 we need to add dates to the task list above. All of the things under 5 should get turned into tickets. If we can do incremental testing, we can check them off as we go.

@dustine32
Collaborator

Do we have the release 82 BioPAX Homo_sapiens.owl file available now?

@deustp01
Collaborator

deustp01 commented Sep 9, 2022

> Do we have the release 82 BioPAX Homo_sapiens.owl file available now?

Not yet. It should be available next week, maybe as early as 9/12. Note that getting a draft version of the BioPAX, to generate a draft set of GO-CAMs that can be checked while the Reactome source material is still available for updating (as outlined in the bulleted list at the top of this ticket), is still something for the future. We hope to have that available in time for a coordinated release of Reactome version 83, both as Reactome pathway pages and as GO-CAMs, but that is a work in progress and depends on some machine upgrades and scripting that are still to be done.

@dustine32
Collaborator

@deustp01 Thanks! I can just wait until that's available and then push the new models to noctua-dev for testing.

@ukemi
Author

ukemi commented Sep 9, 2022

Thanks guys. So Wednesday we will need to talk about a modification of our SOP for the new loads since the BioPax isn't available until the release.

@deustp01
Collaborator

deustp01 commented Sep 9, 2022

> new loads since the BioPax isn't available until the release.

In the future, maybe as soon as the next release, we should have the BioPAX in advance, as planned, so I hope we are looking at a delay in our plans, not a change.

Chris Mungall asked on the pathways2GO call about the intersection of Java versions and the generation of BioPAX files, and raised the possibility that he / Dustin may already have code to get around the current Reactome problem.

@dustine32
Collaborator

@ukemi @deustp01 ShEx and OWL consistency checks have been run. 5 models failed ShEx, all are logically consistent. Full main_report.txt is below along with explanations.txt for the failures:
main_report.txt
explanations.txt

You can find the five ShEx failing models in main_report.txt by sorting the shex_valid column. Here are model links for your convenience:
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9670095
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9708296
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1474151
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4615885
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2514859
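
As an alternative to sorting by hand, the failing rows can be filtered with a short script. This is only a sketch: it assumes main_report.txt is tab-separated with a header row, and apart from `shex_valid` the column names (`model_id`, `owl_consistent`) and the `true`/`false` value spelling are guesses that may not match the real report.

```python
import csv
import io


def failing_models(report_text, column="shex_valid"):
    """Return the rows whose `column` value is anything other than 'true'."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [row for row in reader if row[column].strip().lower() != "true"]


# Hypothetical excerpt; the real main_report.txt layout may differ.
sample = (
    "model_id\tshex_valid\towl_consistent\n"
    "gomodel:R-HSA-9670095\tfalse\ttrue\n"
    "gomodel:R-HSA-70171\ttrue\ttrue\n"
)

for row in failing_models(sample):
    print(row["model_id"])
```

To run it against the real file, read the file contents into a string and pass that in place of `sample`.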

@ukemi
Author

ukemi commented Oct 12, 2022

I've gone through these and here is what I think. I should probably still tweak the models and not save to see if I am right. @dustine32 There are a couple cases below where I don't see why some Reacto entities aren't passing.

http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9670095
Annotation to an obsolete term. This should be corrected?
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9708296
Endoribonuclease activity input and output of REACTO_R-HSA-9708818 and REACTO_R-HSA-9708815 are not typed as chemical entities or complexes. Why?
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1474151
oxidoreductase activity has input REACTO_R-HSA-9693721. This should be CHEBI:17804. Why isn't it?
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4615885
RANBP2 SUMOylates CDCA8 (Borealin) and PIAS3 SUMOylates AURKB (Aurora-B). This is a black box reaction with two catalysts. I suspect this is what is causing the Shex failure. Both are mapped to SUMO transferase activity. MFs should only have one enabler.
http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2514859
glycine N-acyltransferase activity has an input of obo:go/shapes/ProteinContainingComplex, obo:go/shapes/InformationBiomacromolecule, but this reaction also has an input of Acyl-CoA. @vanaukenk, we might want to modify the Shex here and make the target proteins a primary input. These kinds of enzymes will have molecules other than the target protein/complex being modified.
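
The "one enabler per MF" rule mentioned for R-HSA-4615885 could be checked before ShEx ever runs. The sketch below assumes a model has already been flattened into (activity, relation, entity) triples; that representation and the `activity-1` / `activity-2` identifiers are invented for illustration and are not how the conversion code stores models.

```python
from collections import defaultdict


def activities_with_multiple_enablers(edges):
    """Map each activity that has more than one enabled_by target to its enablers."""
    enablers = defaultdict(set)
    for activity, relation, entity in edges:
        if relation == "enabled_by":
            enablers[activity].add(entity)
    return {a: sorted(e) for a, e in enablers.items() if len(e) > 1}


# Hypothetical triples modelled on the black box reaction with two catalysts.
edges = [
    ("activity-1", "enabled_by", "RANBP2"),
    ("activity-1", "enabled_by", "PIAS3"),
    ("activity-1", "has_input", "CDCA8"),
    ("activity-2", "enabled_by", "AURKB"),
]

print(activities_with_multiple_enablers(edges))
```

Any activity reported here would need to be split or re-curated at Reactome rather than patched in the GO-CAM.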

@dustine32
Collaborator

dustine32 commented Oct 15, 2022

Thank you @ukemi for testing and the feedback! Sorry for the delay in responding. This was quite the learning experience and my responses below went through several iterations of "ohhhh!" and then having to rewrite what's going on.

So, for our failing models:

  • http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9670095
    • Right, the term should be replaced, but GO:1900051 doesn't have the term replaced by (IAO:0100001) annotation in the GO ontology, which is currently required by the conversion code. There are two "consider" terms. To fix, should the conversion code be looking at these "consider" annotations, or should the term get updated in the GO with the usual term replaced by?
    • PD comment - none of the above. If the GO-CAM generator finds an obsolete GO term, that is a Reactome curation / QA error. We update our local copy of GO before every release and then check every GO term used in the released material to ensure that it is still valid. If that check fails or a curator outwits it, that is a Reactome error, and the pathway should be returned to Reactome to be fixed before GO-CAM processing - GO-CAM should not try to guess what we should have done. In the future dream pipeline we are building, this fix can easily be accommodated within the week that we think will be available for GO-CAM/Reactome feedback QA and patching. In this particular case (R-HSA-9670095), the obsolete GO term has been replaced by a valid one in the ver 82 release, if that's of any use in figuring out how this error crept in.
  • http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-9708296
    • The input R-HSA-9708815 and output R-HSA-9708818 are both sets. There may be something missing in getting ShEx to recognize these. In reacto.owl, these classes do not have any subClassOf relationship to CHEBI or GO terms that ShEx could use to infer they are either a ChemicalEntity or ProteinContainingComplex. I wonder if there are other models that use sets w/o issue.
    • PD comment - these are sets of four specific tRNAs (input) that are processed in the reaction to yield modified tRNAs. The sets have no crossRef to, e.g., a high-level ChEBI tRNA term, but each of the individual tRNAs in the set has an RNA_Central crossRef. Could the use of an RNA_Central ID for a crossRef be a problem? I will need to dig some more to see if we have used RNA_Central IDs elsewhere in Reactome - all the patches I remember making for Ben used high-level ChEBI terms to avoid use of RNA_Central. And the issue is not absence of a crossRef at the level of the set, because here is another set, R-ALL-964831, that is a participant in a pathway and GO-CAM that does not cause this complaint, as far as I can tell.
  • http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1474151
    • Yes, I see in the BioPAX that input PTHP has CHEBI:17804 xref'd to its entityReference (in the External Reference Information section). However, the conversion code currently doesn't look at xrefs from entityReference elements on a SmallMolecule object and instead just uses its Reactome ID. Same with output sepiapterin and likely every other small molecule in Reactome GO-CAMs. We can open a ticket to change this behavior to always fetch the CHEBI if that is desired.
    • For the enabled_by (I think this is the real ShEx violation), sepiapterin synthase (R-HSA-9693721) is in the BioPAX as a PhysicalEntity. See its entry at Reactome and notice it does not have a CHEBI cross reference. As a result, in reacto.owl, R-HSA-9693721 only has subClassOf continuant, which is not specific enough to be inferred as either InformationBiomacromolecule or ProteinContainingComplex.
    • PD comment Agreed - the sepiapterin synthase (R-HSA-9693721) genome encoded entity has neither a UniProt reference link nor a crossReference to ChEBI:36080 "protein". In contrast, for example, the [MHDB decarboxylase (R-HSA-2167848)](https://reactome.org/content/detail/R-HSA-2167848) genome encoded entity does have a ChEBI:36080 crossReference and its pathway Ubiquinol biosynthesis (R-HSA-2142789) yields a GO-CAM with no ShEx error. Now patched in Reactome for the ver 83 release. Bottom line: a Reactome curation mistake that occurred after the previous clean-up of physical entities with no acceptable link to UniProt or ChEBI. Again, something easy to flag and fix during the hypothetical future one-week clean-up period.
  • http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-4615885
    • Should this single activity be broken into two activities to accommodate both enablers? Or just keep the single activity and select the best enabler, however "best" is defined?
    • PD comment Agreed - this is incorrect curation practice at Reactome - most cases like this were fixed in the clean-up for the original set of GO-CAMs but this one escaped. Will try again.
  • http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-2514859
    • enabler "unknown NAT" R-HSA-2565924 is also missing a CHEBI cross ref in Reactome, causing it to only get a subClassOf continuant in reacto.owl.
    • PD comment Another protein like sepiapterin synthase that lacks a UniProt reference entity and so should get ChEBI:36080 as a cross-reference. Unlike the sepiapterin case this one is old, so it should have been fixed in the original GO-CAM cleanup. Oops. Anyway, now done.
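
Several of the failures above come down to an entity with neither a UniProt reference nor a ChEBI cross-reference, which leaves it as a bare subClassOf continuant in reacto.owl. A pre-export check for that condition might look like the sketch below. The dictionary layout and the `uniprot` / `chebi` keys are assumptions made for illustration, not the Reactome or BioPAX schema, and the check does not cover the separate case of sets such as R-HSA-9708815.

```python
def lacks_typing_xref(entity):
    """True when an entity has neither a UniProt reference nor a ChEBI cross-reference."""
    return not entity.get("uniprot") and not entity.get("chebi")


# Hypothetical records; the two IDs and the ChEBI:36080 xref are taken from the comments above.
entities = [
    {"id": "R-HSA-9693721", "uniprot": None, "chebi": None},
    {"id": "R-HSA-2167848", "uniprot": None, "chebi": "CHEBI:36080"},
]

flagged = [e["id"] for e in entities if lacks_typing_xref(e)]
print(flagged)
```

Entities flagged this way are the ones that would need a ChEBI:36080 "protein" cross-reference, or a UniProt link, added on the Reactome side.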

@ukemi
Author

ukemi commented Oct 17, 2022

@dustine32 thanks for the follow up.

  • We shouldn't automatically replace terms with 'consider' terms. This should be done by a curator at Reactome as part of their normal QC. Right @deustp01 ? - yes, see comment above on R-HSA-9670095
  • These are RNAs, but I'm not sure the build has any way to know that. This is an interesting conundrum that we will have to discuss. I assume that this isn't occurring anywhere else, or we would see other failures. @deustp01 do you know if there is anywhere else that Reactome has RNAs either in sets or as standalone molecules as inputs or outputs? If so and those are passing we can see what makes them pass.
  • @deustp01, what do you think? It makes sense to me to go straight to ChEBI?
  • Split into two reactions at Reactome?
  • Again, if we went straight to ChEBI I think this would work. We could also say 'use Reactome first and ChEBI second'.
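
The policy agreed for obsolete terms (no automatic substitution, hand the term back to a Reactome curator) can be sketched as a report rather than a rewrite. Everything about the data layout here is assumed: the `obsolete`, `replaced_by` and `consider` keys are illustrative, and the `GO:PLACEHOLDER_*` IDs stand in for terms that are not named in this thread.

```python
def obsolete_terms_for_review(term_ids, ontology):
    """List obsolete terms with their replacement hints; nothing is substituted."""
    report = []
    for term_id in term_ids:
        meta = ontology.get(term_id, {})
        if meta.get("obsolete"):
            report.append({
                "term": term_id,
                "replaced_by": meta.get("replaced_by"),
                "consider": list(meta.get("consider", [])),
            })
    return report


# Hypothetical ontology excerpt; placeholder IDs are not real GO terms.
ontology = {
    "GO:1900051": {
        "obsolete": True,
        "replaced_by": None,
        "consider": ["GO:PLACEHOLDER_A", "GO:PLACEHOLDER_B"],
    },
    "GO:PLACEHOLDER_C": {"obsolete": False},
}

for entry in obsolete_terms_for_review(["GO:1900051", "GO:PLACEHOLDER_C"], ontology):
    print(entry["term"], "-> return to Reactome for curation")
```

The "consider" terms are carried along only as hints for the curator; the function never picks one.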

@ukemi
Author

ukemi commented Oct 17, 2022

@dustine32 I just met with @deustp01 and he reminded me of some things we had done with other models. He will comment here, but in the meantime don't work on my suggestion to use ChEBI just yet.

@ukemi
Author

ukemi commented Oct 18, 2022

@dustine32 and @deustp01 Having a night to think this over, I think we should adopt a strategy to use the ChEBI identifiers. Peter, after our conversation yesterday the realization struck me that when I use the model copy functionality all of the Reactome chemicals get copied over to my models. See http://noctua.geneontology.org/editor/graph/gomodel:633b013300001469?
This means that I need to go in and clean up all my models where I've used model copy. A less than ideal situation.

@deustp01
Collaborator

@dustine32 and @ukemi Still thinking here about the correct role for ChEBI identifiers in these models - needs more discussion.

On the five models that failed ShEX and OWL, it looks like all have straightforward, known errors in the Reactome annotation / BioPAX export that are easy enough to fix on the Reactome side so that, if such turned up at QA time in a future fast Reactome-to-GO-CAM export, we could make the fixes within the seven-day window that should be available. Here, I have fixed four already; the fifth requires strong-arming a curator. Detailed comments for each are interpolated into Dustin's comment on 10/14.

@dustine32
Collaborator

dustine32 commented Oct 20, 2022

OK, great! Thanks @deustp01 @ukemi for summing up the actions. If we decide to switch to extracting CHEBI IDs we can work off of that new code change ticket.

@ukemi
Author

ukemi commented Oct 26, 2022

@deustp01, should we move forward? @dustine32, during a weeds call last week, we decided that we should go ahead and try to import the ChEBI identifiers that are xref'd in Reactome. The main reason for this is consistency. Right now if I copy a model to make the mouse version, the Reactome entities come into my model. If I then make the model production, the Reactome entities end up on the Alliance view. This wouldn't be terrible if we could link them back to Reactome from there, but the group decided that it would be better if they linked to ChEBI because the ChEBI entities are the ones that are used by curators when they make de novo models. I will open a new ticket for this task. Meanwhile, I think the next thing to do is to sanity check the models that are on dev? and determine whether or not we can push the new load to production. Is everyone good with that?
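
A rough sketch of the "use the xref'd ChEBI identifier, otherwise keep the Reactome ID" rule for small molecules follows. It assumes the xrefs on an entityReference have already been extracted as `DB:accession` strings, which is a simplification of the BioPAX structure, and `R-ALL-0000000` is a placeholder rather than a real Reactome ID. Coupling the ChEBI ID to a GO CC term is not modelled.

```python
def preferred_small_molecule_id(reactome_id, entity_reference_xrefs):
    """Return the first ChEBI xref if one exists, else fall back to the Reactome ID."""
    for xref in entity_reference_xrefs:
        db, _, accession = xref.partition(":")
        if db.strip().upper() == "CHEBI" and accession.strip():
            return "CHEBI:" + accession.strip()
    return reactome_id


# Placeholder Reactome ID; CHEBI:17804 is the xref mentioned for PTHP earlier in the thread.
print(preferred_small_molecule_id("R-ALL-0000000", ["CHEBI:17804"]))
print(preferred_small_molecule_id("R-ALL-0000000", []))
```

The fallback keeps entities without a ChEBI xref resolvable, so they can still be flagged for curation instead of being dropped.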

@deustp01
Collaborator

> Is everyone good with that?

I'm good with both (replace Reactome IDs for localized small molecules with ChEBI ID coupled if possible to GO CC term) and doing sanity checks on latest models before push to production.

@dustine32
Collaborator

@ukemi Yep, good with both as well. Let me know when/if you are good with the models in noctua-dev and I can make a PR with the new models to noctua-models/master. This PR will then get merged/loaded to Noctua prod during the next Noctua maintenance outage (most likely 2022-11-10).

@ukemi
Author

ukemi commented Oct 28, 2022

Models to check:

  • R-HSA-5661270 - A nice linear pathway
  • R-HSA-70171 - Because it's my favorite
  • R-HSA-75893 - An updated TNF signaling pathway
  • R-HSA-112311 - Search for contraband chemicals (drugs)

@ukemi
Author

ukemi commented Oct 28, 2022

Notes for future work in release 83:

  1. For links between pathways, it would be nice to infer a more specific relation than causally upstream of. In most cases, these are functions that are immediately upstream of and maybe even a directly-provides_input_for as long as that is valid for functions between processes. Something to discuss.
  2. We should implement the has_small_molecule_regulator for the next release; glycolysis is a good pathway to look at.
  3. Don't forget that we want to convert chemicals to ChEBI.
  4. TNF is an unsatisfying pathway because it is all molecular events. However, we did successfully filter the drugs away. Is there any way we can resolve the Greek letters? @deustp01 I notice that sometimes they are spelled out and sometimes they are Greek. Is there a rule for Reactome curation?

@ukemi
Author

ukemi commented Oct 28, 2022

OK @dustine32, I checked the above four models as well as the ones that failed the Shex and didn't spot any glaring issues with the import. I made a couple of notes above just so that I can remember them. There are now only two tasks left in this ticket, both in your court:

  1. Go ahead with the release to production.
  2. Generate the stats for @deustp01. I think he needs them for accounting purposes. Much like pathway boundaries, I have included that in this ticket, but clearly the release can technically go through without it. So if it is a thorny task, we can go ahead and call this release done and move that ticket, but I think it should still be a priority to get those stats. @deustp01 please feel free to comment.

@dustine32
Collaborator

@ukemi Great! I just found code specifically for the Venn diagram numbers in garage/Manuscript.java that I can revive.

@ukemi
Author

ukemi commented Nov 8, 2022

@dustine32 I'm wondering where we stand on this release since @deustp01 might want to report on the status at tomorrow's PI meeting.

@dustine32
Collaborator

@ukemi @deustp01 I figured out the dang stats!!

First, here's the Venn diagram based on the latest GO-CAM load of Reactome 82 going into Noctua prod today:
[Image: Venn diagram for the GO-CAM load of Reactome 82]
Note this Venn diagram is created by a free web tool at https://bioinformatics.psb.ugent.be/webtools/Venn/.

The commands to generate the input for this diagram tool are now incorporated into the Reactome -> GO-CAM conversion pipeline (along with the ShEx checks).
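
For readers who want to reproduce the region counts without the web tool, the arithmetic behind a Venn diagram is small. This is not the pipeline's code: the set names `reason_a` / `reason_b` and the model IDs `m1` to `m3` are made up, and the real categories are whatever the pipeline emits.

```python
from itertools import combinations


def venn_regions(sets):
    """Map each combination of set names to the IDs found in exactly those sets."""
    names = sorted(sets)
    regions = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            inside = set.intersection(*(sets[n] for n in combo))
            outside = set().union(*(sets[n] for n in names if n not in combo))
            members = inside - outside
            if members:
                regions[combo] = members
    return regions


# Hypothetical categories and model IDs, for illustration only.
sets = {"reason_a": {"m1", "m2"}, "reason_b": {"m2", "m3"}}

for combo, members in venn_regions(sets).items():
    print(" & ".join(combo), len(members))
```

Each ID lands in exactly one region, so the region sizes sum to the number of distinct IDs across all sets.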

@deustp01
Collaborator

Great! I will align with the published diagram.

@ukemi
Author

ukemi commented Nov 11, 2022

Release 82 has been moved to the 'Done' column. Onward to 83.
