Section Directives Expansion and Built-ins #29

Open
ptth222 opened this issue Jun 22, 2023 · 2 comments

@ptth222
Collaborator

ptth222 commented Jun 22, 2023

This is new. To solve the issue of going from a string like "OBI:0500020:time series design:comment" to a dictionary, I came up with the idea of "built-ins". We can change the name if "built-in" isn't good. Essentially, they are functions available in the module (hence built-in) that take a single value in and return some output. For now I have created 2 built-ins, dumb_parse_ontology_annotation and to_dict, which expect string values formatted in a certain way and return a dictionary. Here is an example of using one in a directive:

"sample%annotation": {
  "no_id_needed": {
    "value_type": "section",
    "built-in": "dumb_parse_ontology_annotation(^annotation)"
  }
}

Here a directive would call this nested directive with a sample record, supply that record's "annotation" field to the dumb_parse_ontology_annotation function, and return the resulting dictionary. The syntax is simply "function_name(input_value)", where input_value can be a field name or a literal.
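To make the "function_name(input_value)" syntax concrete, here is a minimal Python sketch of how such a call string could be dispatched. It is not the code in the module: the `BUILT_INS` registry, the `run_built_in` helper, the treatment of a leading `^` as a plain field lookup on the supplied record, and the dictionary keys returned by this stand-in `dumb_parse_ontology_annotation` are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for the real built-in; the output keys are assumptions.
def dumb_parse_ontology_annotation(text):
    # Split "source:accession:value:comment" into its four parts.
    source, accession, value, comment = text.split(":", 3)
    return {
        "source": source,
        "accession": accession,
        "value": value,
        "comment": comment,
    }

# Hypothetical registry mapping built-in names to functions.
BUILT_INS = {"dumb_parse_ontology_annotation": dumb_parse_ontology_annotation}

# Matches "function_name(input_value)".
CALL_PATTERN = re.compile(r"^(\w+)\((.*)\)$")

def run_built_in(call, record):
    """Parse a call string and run the named built-in on one input value."""
    match = CALL_PATTERN.match(call.strip())
    if match is None:
        raise ValueError(f"not a built-in call: {call!r}")
    name, argument = match.group(1), match.group(2).strip()
    if len(argument) >= 2 and argument[0] == argument[-1] == '"':
        # A value in double quotes is taken as a literal.
        value = argument[1:-1]
    else:
        # Anything else is taken as a field name (assumption: "^" is just stripped).
        value = record[argument.lstrip("^")]
    return BUILT_INS[name](value)

record = {"annotation": "OBI:0500020:time series design:comment"}
print(run_built_in("dumb_parse_ontology_annotation(^annotation)", record))
```

Keeping the functions in a single registry means an unknown name fails with a KeyError instead of silently falling through to something else.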

My primary motivation was to solve the ISA annotation issue, but I went on and expanded the "section" type directive to make it available for more than just nested directives. Basically, the "section" type directive can operate more like the "str" type directive now. Instead of only expecting a "code" attribute to run, you can use "built-in" to run the built-in function on records that you select just as you would for the "str" directive. With the "for_each" attribute it will run on every record and return a list of the values returned by the built-in.

The built-ins give a generalized way to decode string formats, or to map other predefined data structures to new ones, when the mapping is too complicated to express with directives alone.

Any issues or comments?

@hunter-moseley
Member

hunter-moseley commented Jun 23, 2023 via email

@ptth222
Collaborator Author

ptth222 commented Jun 26, 2023

You may want to read this on GitHub so the examples are easier to see #29.

We met and discussed some aspects of this issue. One was to rename "built-in" to "execute", which is reasonable to me. We also discussed some syntax normalization and a possible expansion of nested directives. Specifically, we discussed allowing arguments to be passed to nested directives and making the syntax more like a function call. We also discussed letting "execute" call either directives or built-in functions. I think some of our discussion was hindered by not having a clear view of the current state of directives, so I am going to first describe how things are, with examples, so I can be sure we are on the same page.

In the current release version of MESSES there are 3 directive types: str, matrix, and section. The type indicates the return type of the directive, so str returns a single string, matrix returns a list of dictionaries, and section is a catch-all that can return anything and is basically just a way for the user to inject Python code. Each directive is a single dictionary, but directives are expected to live inside a table. For example:

"SUBJECT": {
  "SUBJECT_SPECIES": {
    "required": "True",
    "fields": [
      "species"
    ],
    "id": "SUBJECT_SPECIES",
    "table": "entity",
    "test": "type=subject",
    "value_type": "str"
  },
  "SUBJECT_TYPE": {
    "required": "True",
    "fields": [
      "species_type"
    ],
    "id": "SUBJECT_TYPE",
    "table": "entity",
    "test": "type=subject",
    "value_type": "str"
  },
  "TAXONOMY_ID": {
    "required": "True",
    "fields": [
      "taxonomy_id"
    ],
    "id": "TAXONOMY_ID",
    "table": "entity",
    "test": "type=subject",
    "value_type": "str"
  }
}

"SUBJECT_SPECIES" is a directive, but it must sit inside the "SUBJECT" table. This creates a "SUBJECT" entity with properties for "SUBJECT_SPECIES", "SUBJECT_TYPE", etc. To get an entity that is simply "SUBJECT" with a string value, you have to use the section type directive. The section type is unique not only because it is a way to execute Python code, but also because its own name is ignored and its value is assigned directly to the directive table name. For example:

"SUBJECT": {
  "no_id_needed": {
    "code": "'asdf'",
    "value_type": "section"
  }
}

This directive would add a "SUBJECT" entity to the final output whose value is "asdf": {"SUBJECT": "asdf"}. One feature I have added but not mentioned in any issue is the "section_str" and "section_matrix" types, which let you get the section type output for the str and matrix type directives.
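A minimal sketch of that section behavior, assuming the "code" string is evaluated as a single Python expression. Whether MESSES actually uses `eval`, `exec`, or something else is not stated here, and `run_section_table` is a hypothetical helper name.

```python
def run_section_table(table_name, table):
    """Evaluate a section directive and attach its value to the table name.

    Assumes the table holds exactly one directive, whose own name is ignored.
    """
    (directive,) = table.values()
    if directive["value_type"] != "section":
        raise ValueError("expected a section type directive")
    # Assumption: the user-supplied code is a Python expression.
    value = eval(directive["code"])
    return {table_name: value}

table = {"no_id_needed": {"code": "'asdf'", "value_type": "section"}}
print(run_section_table("SUBJECT", table))
```

Because the code is evaluated as Python, the value can be of any type, which is why section is the catch-all.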

I don't think I need to explain every keyword and behavior of the str and matrix type directives in detail, but I need to highlight some things. In general, those directives first go through the records of the indicated table and grab the relevant ones. The str directive then builds a string value from the indicated fields of those records. If the "for_each" attribute is true it uses every relevant record; otherwise it uses only the first record in the list. The matrix directive always builds a dictionary for each record in the list and returns the list of dictionaries. The keystone attributes for these directives are "fields" for the str type and "headers" for the matrix type. "fields" is just a list of either literal values (surrounded by double quotes) or field names that the records are expected to have. "headers" is similar, but since it is building a dictionary, each list element must be a pair of values (each either a literal or a field) separated by an "=" sign. There is an example str directive above already; here is a matrix:

"MS_METABOLITE_DATA": {
  "Units": {
    "required": "True",
    "fields": [
      "intensity%type"
    ],
    "id": "Units",
    "table": "measurement",
    "value_type": "str"
  },
  "Data": {
    "required": "True",
    "collate": "assignment",
    "headers": [
      "\"Metabolite\"=assignment",
      "entity.id=intensity"
    ],
    "id": "Data",
    "sort_by": [
      "assignment"
    ],
    "sort_order": "ascending",
    "table": "measurement",
    "value_type": "matrix",
    "values_to_str": "True"
  }
}
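The str and matrix behavior described above can be sketched in Python as follows. This is a simplified illustration, not the MESSES implementation: the helper names are hypothetical, "test" is assumed to be a single "field=value" comparison, "for_each" output is assumed to be concatenated with no delimiter, and attributes such as "collate", "sort_by", and "values_to_str" are left out.

```python
def is_true(value):
    # Boolean attributes appear as strings such as "True" in the directives.
    return str(value).lower() == "true"

def select_records(table, test=None):
    """Return the records that pass a simple "field=value" test (or all of them)."""
    records = list(table.values())
    if test is None:
        return records
    field, expected = test.split("=", 1)
    return [record for record in records if record.get(field) == expected]

def resolve(token, record):
    """A token in double quotes is a literal; anything else is a field name."""
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return token[1:-1]
    return record[token]

def run_str(directive, table):
    records = select_records(table, directive.get("test"))
    if not is_true(directive.get("for_each", "False")):
        records = records[:1]  # only the first relevant record is used
    return "".join(
        str(resolve(token, record))
        for record in records
        for token in directive["fields"]
    )

def run_matrix(directive, table):
    rows = []
    for record in select_records(table, directive.get("test")):
        row = {}
        for header in directive["headers"]:
            # Each header is a "key=value" pair of literals or fields.
            key, value = header.split("=", 1)
            row[resolve(key, record)] = resolve(value, record)
        rows.append(row)
    return rows

measurement = {
    "m1": {"assignment": "glucose", "entity.id": "s1", "intensity": "10"},
    "m2": {"assignment": "lactate", "entity.id": "s1", "intensity": "20"},
}
headers = ['"Metabolite"=assignment', "entity.id=intensity"]
print(run_matrix({"headers": headers}, measurement))
```

Note how "entity.id=intensity" uses a field on both sides, so the dictionary key itself comes from the record.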

I wanted to go over these so we are both clear on where we are starting from. Nested directives were born out of the need to have nested dictionaries in the matrix output for ISA. The Workbench output never goes beyond a single dictionary, but ISA has many layers of nesting. We need to be able to say "for this header, build a dictionary or list of dictionaries". We already have a fairly robust way for the user to tell us how to build dictionaries and lists of dictionaries with the directives, so it is only natural that we let a header point to another directive for what its value should be. Although nested directives were originally only necessary for headers in the matrix directive, they also make sense in some other places, such as the fields in a str directive, so I have allowed them in the other places where I think they make sense. I have not simply allowed them everywhere because I don't think it's necessary, and in some places they would never be used, but if you think differently let me know.

The idea of the built-in functions came from the need to be able to parse strings into more complicated arbitrary structures. Specifically, a string into a dictionary, but the idea can easily be expanded to turn any input type into any output type. I wanted this to be a keyword inside a directive, rather than something that can appear in several locations like nested directives, for a few reasons. One is that I didn't want to have to differentiate between the two or deal with duplicate name issues. If you allowed something like "field_name=function_or_directive()", you could have a function with the same name as a directive and would need extra code to deal with that. If you use some naming convention to deal with it, then it just adds extra syntax to everything, and users can misspell it. I prefer to have each value be 1 of 3 options: literal, record field, or directive. The main reason, though, is that putting the built-in function inside a directive allows you to get more functionality and versatility out of it through all of the other directive fields. For instance, a default value.

Since the built-in functions can return any value type the functionality has to go under the section type directive, and that's why I put it there and only there. "execute" is only for this directive type.

Now to some of the specifics from our discussion. I want to first address letting the "execute" keyword run a directive and not just be limited to functions. This seems completely pointless to me. You already had to call a directive to get to the "execute", so why wouldn't you just call the directive you wanted in the first place? Why would you need to launder it through another directive first? Let's show an example:

{
  "directive1": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive1%entity()"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  },

  "directive1%entity": {
    "no_id_needed": {
      "execute": "dumb_parse_ontology_annotation(^.entity.id)",
      "value_type": "section"
    }
  },


  "directive2": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive2%entity()"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  },

  "directive2%entity": {
    "no_id_needed": {
      "execute": "directive2%entity%entity()",
      "value_type": "section"
    }
  },

  "directive2%entity%entity": {
    "no_id_needed": {
      "override": "asdf",
      "value_type": "section_str"
    }
  },


  "directive3": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive2%entity%entity()"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  }
}

The first set of directives is my intention for "execute": you simply call the parsing function through the directive to parse the field into the value you want. The second set shows "execute" calling another directive to set the field value to "asdf" (it's simple, but I'm just trying to show what I see as needless laundering). The third is the same as the second, but it calls the directive directly instead of laundering it through another directive using "execute". "directive2" and "directive3" have the same output, but "directive3" only uses 2 directives instead of 3.
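To show the equivalence of the second and third sets, here is a small Python sketch of a resolver in which "execute" is allowed to call a directive. This is the hypothetical behavior under discussion, not what the current code does. `call_nested` and `resolve_header_value` are made-up helper names, and "override" is simply taken to return its literal value.

```python
import re

# Matches a nested call such as "directive2%entity()".
CALL = re.compile(r"^([\w%]+)\((.*)\)$")

DIRECTIVES = {
    "directive2%entity": {
        "no_id_needed": {
            "execute": "directive2%entity%entity()",
            "value_type": "section",
        }
    },
    "directive2%entity%entity": {
        "no_id_needed": {"override": "asdf", "value_type": "section_str"}
    },
}

def call_nested(name, calling_record):
    """Run a nested directive table that holds a single directive."""
    (directive,) = DIRECTIVES[name].values()
    if "override" in directive:
        return directive["override"]
    if "execute" in directive:
        # Hypothetical: "execute" forwards the call to another directive.
        target = CALL.match(directive["execute"]).group(1)
        return call_nested(target, calling_record)
    raise ValueError(f"directive {name!r} has nothing to run")

def resolve_header_value(value, record):
    """A header value is either a nested directive call or a record field."""
    match = CALL.match(value)
    if match:
        return call_nested(match.group(1), record)
    return record[value]

record = {"assignment": "glucose", "entity.id": "s1"}
laundered = resolve_header_value("directive2%entity()", record)
direct = resolve_header_value("directive2%entity%entity()", record)
print(laundered, direct)
```

Both calls end in the same "override" directive, so the extra hop through "execute" adds a lookup without changing the result.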

I need a more concrete reason why we want to complicate "execute". You would end up with the naming issues I already described as one of the reasons for putting built-in functions inside a directive in the first place.

Now let's discuss argument passing for nested directives. This still seems mostly needless to me because we have exposed the calling record's information with the call, and I don't think it's that much of a burden to simply copy a directive and change the values, but I can see that it does add some utility. What I want to clarify is what I see as 2 different types of arguments. One type is what you mentioned, which is overwriting directive key values. The other type would be arbitrary arguments you can replace anywhere. I will illustrate with an example.

  "directive1": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive1%entity(override=param1)"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  },

  "directive1%entity": {
    "no_id_needed": {
      "override": "asdf",
      "value_type": "section_str"
    }
  },


  "directive2": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive2%entity(param1)"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  },

  "directive2%entity": {
    "no_id_needed": {
      "override": "PARAM1",
      "value_type": "section_str"
    }
  },


  "directive3": {
    "Data": {
      "headers": [
        "\"Metabolite\"=assignment",
        "entity.id=directive3%entity(override=param1, param2)"
      ],
      "table": "measurement",
      "value_type": "matrix"
    }
  },

  "directive3%entity": {
    "no_id_needed": {
      "override": "PARAM2",
      "value_type": "section_str"
    }
  }

The first set shows overwriting the key "override" with the value "param1". The second set shows filling in values with arbitrary parameters and the last set shows a collision between the two. Were you considering both types of these arguments? The key overwriting style is the easiest to deal with and implement, but the other style is more flexible. Doing both can cause the collision I highlighted in the example, but I would say just pick one to be dominant and warn the user if they do this.
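A rough Python sketch of how both argument styles and the collision could be handled. Everything here is an assumption for discussion: the helper names, the rule that the n-th argument fills the "PARAMn" placeholder, and the choice to make key overwriting the dominant style.

```python
import warnings

def parse_arguments(argument_text):
    """Split "override=param1, param2" into key overwrites and placeholder fills."""
    overwrites, placeholders = {}, {}
    for position, raw in enumerate(argument_text.split(","), start=1):
        argument = raw.strip()
        if not argument:
            continue
        if "=" in argument:
            # Style 1: overwrite a directive key.
            key, value = (part.strip() for part in argument.split("=", 1))
            overwrites[key] = value
        else:
            # Style 2: the n-th argument fills the PARAMn placeholder.
            placeholders[f"PARAM{position}"] = argument
    return overwrites, placeholders

def apply_arguments(directive, argument_text):
    """Return a copy of the directive with the call arguments applied."""
    overwrites, placeholders = parse_arguments(argument_text)
    result = {}
    for key, value in directive.items():
        if isinstance(value, str):
            for name, replacement in placeholders.items():
                if name not in value:
                    continue
                if key in overwrites:
                    # Collision: the key is both overwritten and filled in.
                    warnings.warn(
                        f"{key!r} is overwritten, so {name} is not filled in"
                    )
                else:
                    value = value.replace(name, replacement)
        result[key] = value
    result.update(overwrites)  # key overwriting is dominant
    return result

directive = {"override": "PARAM2", "value_type": "section_str"}
print(apply_arguments(directive, "override=param1, param2"))
```

With this rule the "directive3" call ends up with "override" set to "param1" and a warning about the unused "PARAM2". Making placeholders dominant instead would only require swapping the two branches.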

Summary:

Do you see what I mean about "execute" not needing to call directives? Is there something I am not seeing?

What do you think about the argument passing to directives?

Should we allow nested directives for every field value (even boolean value ones)?

Instead of a special character in the directive name to denote that it should be skipped as it is only meant to be a nested directive, would it be better to instead have an attribute such as "skip" to indicate it? (We can still recommend as good practice to name directives similar to attributes with '%'.)
