Start trying to formalize dataset collection semantics.
jmchilton committed Jan 8, 2025
1 parent b2ea1c3 commit 4466732
Showing 2 changed files with 160 additions and 0 deletions.
15 changes: 15 additions & 0 deletions client/src/components/Workflow/Editor/modules/terminals.test.ts
@@ -175,6 +175,21 @@ describe("canAccept", () => {
        expect(dataIn.canAccept(collectionOut).canAccept).toBe(true);
        expect(dataIn.mapOver).toEqual(NULL_COLLECTION_TYPE_DESCRIPTION);
    });
    it("accepts paired data -> data connection", () => {
        const collectionOut = terminals["paired input"]!["output"] as OutputCollectionTerminal;
        const dataIn = terminals["simple data"]!["input"] as InputTerminal;
        expect(dataIn.mapOver).toBe(NULL_COLLECTION_TYPE_DESCRIPTION);
        expect(dataIn.canAccept(collectionOut).canAccept).toBe(true);
        dataIn.connect(collectionOut);
        expect(dataIn.mapOver).toEqual({ collectionType: "paired", isCollection: true, rank: 1 });
        expect(dataIn.canAccept(collectionOut).canAccept).toBe(false);
        expect(dataIn.canAccept(collectionOut).reason).toBe(
            "Input already filled with another connection, delete it before connecting another output."
        );
        dataIn.disconnect(collectionOut);
        expect(dataIn.canAccept(collectionOut).canAccept).toBe(true);
        expect(dataIn.mapOver).toEqual(NULL_COLLECTION_TYPE_DESCRIPTION);
    });
    it("accepts mapped over data output on mapped over data input", () => {
        const collectionOut = terminals["list input"]!["output"] as OutputCollectionTerminal;
        const dataIn = terminals["multiple simple data"]!["input1"] as InputTerminal;
145 changes: 145 additions & 0 deletions lib/galaxy/model/dataset_collections/types/collection_semantics.yml
@@ -0,0 +1,145 @@
- doc: |
    # Collection Semantics

    This document describes the semantics of working with Galaxy Dataset Collections.
    In particular, it describes how they operate within Galaxy tools and workflows.

    If a tool consumes a simple dataset parameter and produces a simple dataset output,
    then any collection type may be "mapped over" the data input to that tool. The result
    is that the tool is applied to each element of the collection, and "implicit collections"
    are created from the outputs produced by those operations. Those implicit collections
    have the same element identifiers, in the same order, as the input collection that was
    mapped over. Each element of the implicit collections corresponds to its own job, so
    Galaxy naturally and intuitively parallelizes jobs without extra work from the user
    and without any special knowledge on the part of the tool.
- example:
    label: BASIC_MAPPING_PAIRED
    assumptions:
      - "f, r are datasets"
      - "tool = (i: dataset) => {o: dataset}"
      - "C = CollectionInstance<paired,{forward=f, reverse=r}>"
    then:
      - "tool(i=map_over(C)) ~> {o: collection<paired,{forward=tool(i=f)[o], reverse=tool(i=r)[o]}>}"
    tests:
      - tool_runtime: "test_tool_execute.py::test_map_over_collection"

- example:
    label: BASIC_MAPPING_LIST
    assumptions:
      - "d1,...,dn are 'dataset's"
      - "tool = (i: dataset) => {o: dataset}"
      - "C = CollectionInstance<list,{i1=d1, ..., in=dn}>"
    then:
      - "tool(i=map_over(C)) ~> {o: collection<list,{i1=tool(i=d1)[o], ..., in=tool(i=dn)[o]}>}"
    tests:
      - tool_runtime: "test_tool_execute::test_map_over_list_collection"
      - wf_editor: "accepts collection data -> data connection"

- doc: |
    The above description of mapping over inputs works naturally and as expected for
    nested collections.
- example:
    label: NESTED_LIST_MAPPING
    tests:
      - tool_runtime: test_map_over_nested_collections
      - wf_editor: "accepts list:list data -> data connection"
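# Illustrative sketch (not part of this commit): in this file's own notation, the nested
# case could plausibly be written as below; the identifier "outer" is a made-up element name.
#   assumptions:
#     - "d1,...,dn are 'dataset's"
#     - "tool = (i: dataset) => {o: dataset}"
#     - "C = CollectionInstance<list:list,{outer={i1=d1, ..., in=dn}}>"
#   then:
#     - "tool(i=map_over(C)) ~> {o: collection<list:list,{outer={i1=tool(i=d1)[o], ..., in=tool(i=dn)[o]}}>}"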

- doc: |
    For tools with multiple data inputs, an individual dataset may be supplied for
    an input that is not mapped over. That dataset then serves as the input for
    each of the tool executions produced by the mapping.
- example:
    label: BASIC_MAPPING_INCLUDING_SINGLE_DATASET
    assumptions:
      - "d1,...,dn are 'dataset's"
      - "dother is a dataset"
      - "tool = (i: dataset, i2: dataset) => {o: dataset}"
      - "C = CollectionInstance<list,{i1=d1, ..., in=dn}>"
    then:
      - "tool(i=map_over(C),i2=dother) ~> {o: collection<list,{i1=tool(i=d1, i2=dother)[o], ..., in=tool(i=dn, i2=dother)[o]}>}"

- doc: |
    If a tool consumes two input datasets and produces one output dataset, two collections
    with identical structure (the same element identifiers in the same order) can be mapped
    over the respective inputs. The result is an implicit collection with the same structure
    as the inputs, where each output corresponds to the tool being executed with the two
    inputs at that position in the input collections.

    The default behavior is that the collections are linked and mapping over the inputs acts
    as a flat map or dot product - no extra dimensionality is added to the resulting
    collections.

    From a user perspective, this means that if you start with a collection and apply a
    series of map-over operations on tools, the results will all continue to match and work
    together naturally - again without extra work by the user and without extra knowledge
    by the tool author.
- example:
    label: BASIC_MAPPING_TWO_INPUTS_WITH_IDENTICAL_STRUCTURE
    assumptions:
      - "d11,...,d1n are 'dataset's"
      - "d21,...,d2n are 'dataset's"
      - "tool = (i: dataset, i2: dataset) => {o: dataset}"
      - "C1 = CollectionInstance<list,{i1=d11, ..., in=d1n}>"
      - "C2 = CollectionInstance<list,{i1=d21, ..., in=d2n}>"
    then:
      - "tool(i=map_over(C1), i2=map_over(C2)) ~> {o: collection<list,{i1=tool(i=d11, i2=d21)[o], ..., in=tool(i=d1n, i2=d2n)[o]}>}"
    tests:
      - tool_runtime: test_tools.py::test_map_over_two_collections
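# Illustrative note (not part of this commit): because the inputs are linked (a dot
# product), C1 and C2 of length n yield an implicit collection of length n, not an
# n x n cross product of every possible pairing.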

- doc: |
    Not all tool executions result in implicit collections and mapping over
    inputs. Tool inputs of ``type`` ``data_collection`` can consume collections
    directly and do not necessarily result in mapping over. Tools that consume
    collections and output datasets effectively reduce the dimensionality of the
    Galaxy data structure. At runtime this is often referred to as a "reduction"
    in the code.
- example:
    label: COLLECTION_INPUT_PAIRED
    assumptions:
      - "r, f are datasets"
      - "tool = (i: collection<paired>) => {o: dataset}"
      - "C = CollectionInstance<paired,{forward=f, reverse=r}>"
    then:
      - "tool(i=C) -> {o: dataset}"
    tests:
      - tool_runtime: framework tests for collection_paired_test.xml
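# Illustrative note (not part of this commit): an input like this is declared in the
# tool XML roughly as <param name="i" type="data_collection" collection_type="paired" />.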

- doc: |
    In addition to explicit collection inputs, tool inputs of ``type`` ``data``
    with ``multiple="true"`` can consume lists directly. This is likewise a
    "reduction".
- example:
    label: LIST_REDUCTION
    assumptions:
      - "d1,...,dn are 'dataset's"
      - "tool = (i: dataset<multiple=true>) => {o: dataset}"
      - "C = CollectionInstance<list,{i1=d1, ..., in=dn}>"
    then:
      - "tool(i=C) == tool(i=[d1,...,dn])"
    tests:
      - tool_runtime: test_tools.py::test_reduce_collections
      - wf_editor: "treats multi data input as list input"
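# Illustrative note (not part of this commit): an input like this is declared in the
# tool XML roughly as <param name="i" type="data" multiple="true" />.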

- doc: |
    Paired collections cannot be reduced this way. ``paired`` is not meant to
    represent a list/array/vector data structure; it is more like a tuple.
- example:
    assumptions:
      - "r, f are datasets"
      - "tool = (i: dataset<multiple=true>) => {o: dataset}"
      - "C = CollectionInstance<paired,{forward=f, reverse=r}>"
    then:
      - "tool(i=C) is invalid"
    tests:
      - wf_editor: "rejects paired input on multi-data input"
