Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add author and tools details in RO-Crate #18820

Open
wants to merge 17 commits into
base: dev
Choose a base branch
from

Conversation

Marie59
Copy link

@Marie59 Marie59 commented Sep 16, 2024

Here we go again !
I am launching this PR on a bit of work I did to ameliorate the RO-Crate plugin.

I updated the _add_worfklow function. I added the information on the author of the workflow:

  • Name
  • Orcid
    The update checks if the workflow attribute 'creator_metadata' exist. If it exists, it writes into the ro-crate the name of the creator (if there is no name it leaves a "") same for the idenifier (by default i put the ORCID maybe it's not a good Idea and if there is no Orcid it leaves a ""). If there is no 'creator_metadata' nothing is written on the creator.

I created a function _add_tools that checks the different info in workflow.steps to add some tools details:

  • Id of the tool
  • tool's name (label if there is one or id if there is no label)
  • tools' version
  • tool's description: the update checks if the tool has an annotations attribute if it exist it is written as the description if it does not the description remains "")
  • url to the the toolshed (not sure this part is relevant as I did not yet solve how to have the exact url to the tool in the toolshed and I just let the global url to the toolshed)

I am not sure if I wrote that the correct way corresponding to the run crate profile (this is a bit obscure to me)

@pauldg What do you think of this ?
Thanks !!

@github-actions github-actions bot added the area/database Galaxy's database or data access layer label Sep 16, 2024
@github-actions github-actions bot added this to the 24.2 milestone Sep 16, 2024
@Marie59 Marie59 changed the title Ro crate2 Add author and tools details in RO-Crate Sep 17, 2024
Copy link
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. Any chance you could also add or extend tests ? 3dbe289#diff-9f3ad661ca1701467f48508ea79951efe90aa29a6d02dec3f2998d831e755bb4 contains the existing examples.

lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Show resolved Hide resolved
@Marie59
Copy link
Author

Marie59 commented Sep 18, 2024

Thanks a lot for the review and help @mvdbeek !

@stain
Copy link

stain commented Sep 19, 2024

@OliverWoolland could you review this? I think this solves many of the issues you had about the Workflow Run Crate from Galaxy

@Marie59
Copy link
Author

Marie59 commented Sep 19, 2024

TO DO in a 2nd time (maybe in a later PR but I'm just putting that here to keep it in mind):

  • Add xrefs of tools
  • Add citations of tools
  • Add edam ontologies of tools

@Marie59 Marie59 reopened this Oct 22, 2024
"name": tool_name,
"version": tool_version,
"description": tool_description,
"url": "https://toolshed.g2.bx.psu.edu", # URL if relevant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not always true. Tools could come from multiple toolsheds.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I was wondering if this part was relevant or if I should remove it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a url for each tool would be great - but if we can't get this due to the toolbox limitations, it should be removed.


# Add tools used in the workflow
self._add_tools(crate)
self._add_steps(crate)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workflows can have workflow inputs. Should we add them here as well? At least the types of the workflow inputs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice to have, yes.

This can even be done per tool (though more fiddly). There is already some capturing of the inputs and outputs as as formalParameters, but it seems only to happen if data files are included in the crate. There seem to be some helper functions further down the file for this purpose:

def _add_step_parameter_fp(self, step: WorkflowInvocationStep, crate: ROCrate):

@elichad
Copy link
Contributor

elichad commented Nov 12, 2024

Quick update from the RO-Crate side (with apologies from the long delay) - the metadata that's being added looks fine to me as I read the code, but I would like to look at an example RO-Crate generated using this PR to make sure the JSON-LD output looks like it should. I'm waiting for @pauldg or @OliverWoolland to provide me with such a crate, whoever gets there first

@mvdbeek mvdbeek self-requested a review November 12, 2024 15:34
@OliverWoolland
Copy link

Hi all, I've been trying to test these changes on a local 24.1.3 instance I have (base commit 71f5048)

I've not been able to get the revisions to run for me, with errors being thrown relating to the use of the toolbox.

I think I've tracked the problem down to the origin of this PR being usegalaxy-eu rather than galaxyproject. The two forks seem to be handling the toolbox a little differently.

I'd be curious if others have found the same? Or if there is an obvious solution?

Apologies if I have misunderstood anything!

@martenson
Copy link
Member

martenson commented Nov 15, 2024

@OliverWoolland what kind of errors are you seeing? There should only be minor differences on .eu compared to the upstream (release_24.1...usegalaxy-eu:galaxy:release_24.1_europe)
Plus the tests are running fine in CI, and those are run against the merge of this to galaxyproject/galaxy afaik.

@OliverWoolland
Copy link

OliverWoolland commented Nov 15, 2024

Thanks @martenson! This is the error I see

[2024-11-15 10:43:15,368: ERROR/main] Task galaxy.prepare_invocation_download[2d38756c-b893-4a20-8ca7-80e243608cc4] raised unexpected: AttributeError("'GalaxyManagerApplication' object has no attribute '_toolbox'")
Traceback (most recent call last):
  File "/galaxy/server/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/celery/__init__.py", line 184, in wrapper
    rval = app.magic_partial(func)(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/.venv/lib/python3.12/site-packages/lagom/wrapping.py", line 28, in _bound_func
    return inner_func(*bound_args, **bound_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/.venv/lib/python3.12/site-packages/lagom/wrapping.py", line 45, in _error_handling_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/celery/tasks.py", line 392, in prepare_invocation_download
    model_store_manager.prepare_invocation_download(request)
  File "/galaxy/server/lib/galaxy/managers/model_stores.py", line 156, in prepare_invocation_download
    with model.store.get_export_store_factory(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/model/store/__init__.py", line 2481, in __exit__
    self._finalize()
  File "/galaxy/server/lib/galaxy/model/store/__init__.py", line 2850, in _finalize
    ro_crate = self._init_crate()
               ^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/model/store/__init__.py", line 2518, in _init_crate
    invocation_crate_builder = WorkflowRunCrateProfileBuilder(self)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/model/store/ro_crate_utils.py", line 49, in __init__
    self.toolbox = self.model_store.app.toolbox
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/galaxy/server/lib/galaxy/app.py", line 373, in toolbox
    return self._toolbox
           ^^^^^^^^^^^^^
AttributeError: 'GalaxyManagerApplication' object has no attribute '_toolbox'. Did you mean: 'toolbox'?

@davelopez
Copy link
Contributor

I'm not sure if that is the issue, but note that these changes are on the dev branch not release_24.1...

@OliverWoolland
Copy link

Fair enough! I'll try and cross-check myself

@mvdbeek mvdbeek modified the milestones: 24.2, 25.0 Nov 15, 2024
@elichad
Copy link
Contributor

elichad commented Nov 21, 2024

I don't have a local development Galaxy set up already, so I cloned Galaxy dev, checked out this PR, then installed Galaxy from scratch. When I created a simple workflow, invoked it, and tried to export as RO-Crate, I got the same error as @OliverWoolland above.

[2024-11-21 13:14:58,757: ERROR/main] Task galaxy.prepare_invocation_download[a4bab0c4-8c68-422a-bff2-e2a51b5c44a6] raised unexpected: AttributeError("'GalaxyManagerApplication' object has no attribute '_toolbox'")
Traceback (most recent call last):
  File "/home/eli/galaxy/galaxy/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 453, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/.venv/lib/python3.12/site-packages/celery/app/trace.py", line 736, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/lib/galaxy/celery/__init__.py", line 184, in wrapper
    rval = app.magic_partial(func)(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/.venv/lib/python3.12/site-packages/lagom/wrapping.py", line 28, in _bound_func
    return inner_func(*bound_args, **bound_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/.venv/lib/python3.12/site-packages/lagom/wrapping.py", line 45, in _error_handling_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/lib/galaxy/celery/tasks.py", line 390, in prepare_invocation_download
    model_store_manager.prepare_invocation_download(request)
  File "/home/eli/galaxy/galaxy/lib/galaxy/managers/model_stores.py", line 156, in prepare_invocation_download
    with model.store.get_export_store_factory(
  File "/home/eli/galaxy/galaxy/lib/galaxy/model/store/__init__.py", line 2481, in __exit__
    self._finalize()
  File "/home/eli/galaxy/galaxy/lib/galaxy/model/store/__init__.py", line 2850, in _finalize
    ro_crate = self._init_crate()
               ^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/lib/galaxy/model/store/__init__.py", line 2518, in _init_crate
    invocation_crate_builder = WorkflowRunCrateProfileBuilder(self)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/lib/galaxy/model/store/ro_crate_utils.py", line 48, in __init__
    self.toolbox = self.model_store.app.toolbox
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eli/galaxy/galaxy/lib/galaxy/app.py", line 397, in toolbox
    return self._toolbox
           ^^^^^^^^^^^^^
AttributeError: 'GalaxyManagerApplication' object has no attribute '_toolbox'. Did you mean: 'toolbox'?

I checked out galaxyproject:dev and the export works as expected.

Finally, I merged Marie59:ro-crate2 into galaxyproject:dev on a local branch and the error appeared again.

So the error appears to be somewhere in this PR, but I don't have the expertise required to understand or fix it.

It's visible in CI too - https://github.com/galaxyproject/galaxy/actions/runs/11610465172/job/32407941424?pr=18820 has some 500 errors which I think come from this problem.

Returns:
dict: A dictionary containing citations, xrefs, and EDAM operations for the tool.
"""
tool = self.toolbox.get_tool(tool_id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work, you can't use the toolbox in this context. We can't load up the toolbox in celery workers, that would take too much time and memory. The best solution is to either get that data from a yet to be written toolshed endpoint or load up the tool on demand, neither of these are easy to do or fit into the context of this PR. Maybe you could focus on the other enhancements in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a shame that the toolbox can't be used in this way. But the most essential metadata about the tools is already captured without using the toolbox, so I think it's ok to remove this function and consider this as an enhancement which could be added in future.

It does mean we would have a greater need for the url on line 369, though, as without the xrefs there's no alternative links to find the tool (and the rest of its metadata). Still not absolutely mandatory to have url, but a definite best practice.

(I wondered if we could retrieve the citation/xref/EDAM metadata from something like job_attrs.txt which is already part of the crate, but it's not included there. Nor is the url.)

@davelopez
Copy link
Contributor

davelopez commented Nov 21, 2024

I had a closer look and this is what I found:

The export process happens in the context of a Celery task - a background task that runs out of the main galaxy process. The "galaxy app" in this context is a GalaxyManagerApplication. This is a "reduced" version of the UniverseApplication that strips out some functionality and services not required outside the main Galaxy process).

Here we can see that the toolbox is set up only in the UniverseApplication (full-fledged Galaxy app):

self._configure_toolbox()

Moving this call (along with possibly other toolbox-related calls) to the parent class GalaxyManagerApplication will probably fix it but I don't know if the toolbox was explicitly left out of the GalaxyManagerApplication for strong reasons (like performance, etc.) or if it can be safely set up at that level to make it available in the Celery context.

Maybe someone else from the @galaxyproject/wg-backend can answer this question?

Edit: lol, Marius was answering at the same time here #18820 (comment) so, that would be the answer.

@Marie59
Copy link
Author

Marie59 commented Nov 22, 2024

Hello all,
Thanks for all your feedback and reviews ! So if I understand correctly, for now, I leave this PR on the side 😢 (hopefully will be useful one day)
I can then open a PR for the things added that does not rely on the toolbox.
Does this sounds good ? Did I got everything right ?

Thanks again for all your help !!

@mvdbeek
Copy link
Member

mvdbeek commented Nov 22, 2024

Yes, or drop the things which require tool access from this PR.

@elichad
Copy link
Contributor

elichad commented Nov 22, 2024

@Marie59 please don't be discouraged, this is good work on the whole and I'm happy to see the RO-Crate outputs being improved :)

As I see it, only a small amount of lines need to be removed from this PR to take out the toolbox-dependent bits (namely the _get_tool_metadata function and the few lines that call it or its output), so that would be the easiest thing to do.

If you decide to make a new PR instead, so that the toolbox code is still visible here if someone wants to find it again, that's fine as well. In that case, please tag me in the new one so I can keep track of it!

@Marie59
Copy link
Author

Marie59 commented Nov 23, 2024

@Marie59 please don't be discouraged, this is good work on the whole and I'm happy to see the RO-Crate outputs being improved :)

As I see it, only a small amount of lines need to be removed from this PR to take out the toolbox-dependent bits (namely the _get_tool_metadata function and the few lines that call it or its output), so that would be the easiest thing to do.

If you decide to make a new PR instead, so that the toolbox code is still visible here if someone wants to find it again, that's fine as well. In that case, please tag me in the new one so I can keep track of it!

Not discouraged no worries ! Thanks for the encouragements ! I'll remove what need to be removed here (and keep a back up).

@Marie59
Copy link
Author

Marie59 commented Nov 25, 2024

Hello all !
So, normally I removed what needed to be removed. Locally I succeed to generate a Ro-crate with my galaxy test instance. I tried to add some tests (they are working too).
Let me know if something is not good and if I can improve anything.

@mvdbeek
Copy link
Member

mvdbeek commented Nov 25, 2024

Can you resolve the conflict in test/unit/data/model/test_model_store.py ?

@Marie59
Copy link
Author

Marie59 commented Nov 27, 2024

I don't understand the last errors I get ...

@nsoranzo

This comment was marked as resolved.

lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
@Marie59
Copy link
Author

Marie59 commented Nov 27, 2024

@nsoranzo thank you for the help ! I was a bit lost !!

Copy link
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great, thanks @Marie59

test/unit/data/model/test_model_store.py Outdated Show resolved Hide resolved
Copy link
Contributor

@elichad elichad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the output JSON-LD and I've made some suggestions based on the Provenance Run Crate profile - Galaxy RO-Crates don't strictly conform to this, but it offers some best practices that are applicable here.

From a Workflow Run Crate profile perspective (the less-strict one we actually intend to conform to), the metadata added in this PR looks good.

crate,
creator_data.get(
"url", ""
), # Use URL as identifier if available, otherwise empty string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also try to use identifier here, as that is an option in the UI too when setting an organization as a creator. For example an ROR identifier could be used. (though admittedly it's more likely for an average user to just input their institute's URL in the url field)

Copy link
Author

@Marie59 Marie59 Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this point I was not sure what to do I used identifier for the ORCID of the creator. Should I replace it by ROR or is something completely different you were suggesting ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the section above (L228-L251), if the creator is a Person, then we assume the identifier field is an ORCID and that's all good.

In this section, if the creator is an Organization, the identifier field won't represent an ORCID - instead it would (probably) be an ROR identifier. However, it's less likely to be used by users, as ROR isn't as widely known yet.
So for the entity id on this line I would suggest: if the identifier exists, use that, otherwise use the url, or if neither exist, use an empty string.

And in the properties you can include identifier as well as url if the identifier exists in the creator_data.

"name": tool_name,
"version": tool_version,
"description": tool_description,
"url": "https://toolshed.g2.bx.psu.edu", # URL if relevant
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a url for each tool would be great - but if we can't get this due to the toolbox limitations, it should be removed.

lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Show resolved Hide resolved
lib/galaxy/model/store/ro_crate_utils.py Outdated Show resolved Hide resolved

# Add tools used in the workflow
self._add_tools(crate)
self._add_steps(crate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice to have, yes.

This can even be done per tool (though more fiddly). There is already some capturing of the inputs and outputs as as formalParameters, but it seems only to happen if data files are included in the crate. There seem to be some helper functions further down the file for this purpose:

def _add_step_parameter_fp(self, step: WorkflowInvocationStep, crate: ROCrate):

@Marie59
Copy link
Author

Marie59 commented Dec 23, 2024

I took a look at the output JSON-LD and I've made some suggestions based on the Provenance Run Crate profile - Galaxy RO-Crates don't strictly conform to this, but it offers some best practices that are applicable here.

From a Workflow Run Crate profile perspective (the less-strict one we actually intend to conform to), the metadata added in this PR looks good.

Thanks a lot for the review ! I tried to add as much as I could your suggestions !

I don't understand the last comment though on formalParameters what should I do ?

Is everything else good ?

Copy link
Contributor

@elichad elichad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting my earlier comments about formalParameters, I think we can leave it out of this PR. It's not essential and would take a bit of fiddly work to implement, so it can be looked at another time.

My other unresolved comment here is an optional improvement, not blocking.

As such, I'm happy for this PR to be merged now if the core devs approve :) thanks again for your work on this @Marie59 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backend area/database Galaxy's database or data access layer kind/enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants