Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain formatting of Display name for annotations and table column names #1330

Merged
merged 16 commits into from
Jan 29, 2024

Conversation

mialy-defelice
Copy link
Contributor

@mialy-defelice mialy-defelice commented Nov 20, 2023

Summary

Retain display name formatting when submitting manifests, for column labels on tables, and and adding annotations to linked files.

Note: the testing procedure has been updated to conform to new parameters designed and discussed in comments below

This is to address FDS-1199.

To test

  1. In a testing project, create a new dataset folder, make note of the folder name, and get record its Synapse ID.
  2. Make sure the dataset is in a Fileview and record the Fileview Synapse ID.
  3. Download the data files from this synapse test project.
    Put these files and put into your new test dataset folder.
  4. Download the the Bulk RNA seq manifest from the same dataset folder, and save locally to submit via schematic
  5. Open the manifest in and editor.
  6. Remove the UUID/Id and entitityId columns prior to inital submission or it will be pointing to the wrong entityId columns
  7. Update the Filepaths in the Filename column to point to the new path the files are located in on synapse
  8. Submit test manifest to this dataset as appropriate to fully test (see below) via API and CLI. IMPORTANTIn between tests make sure to delete any generated table, and annotations (on the data files)to ensure the table column names and annotation keys are updated as expected.

Example CLI command:

To config.yml:

  • Update filview as needed,
  • Update data model path to the example.model.jsonld
schematic model -c config.yml submit -mp path/to/synapse_storage_manifest_bulkrna-seqassay.csv -d dataset_id -mrt table_and_file --table_column_names display_label --annotation_keys display_label

Example API input:

schema_url: https://raw.githubusercontent.com/Sage-Bionetworks/schematic/develop/tests/data/example.model.jsonld
dataset_id: synID
manifest_record_type: table_and_file
restrict_rules: false
asset_view: synID
table_column_names: display_label
annotation_keys: display_label

Check outputs

  • Look at the annotations on the files in the test dataset.
    • Should see sampleID recorded as an annotation. If compared to the manifest, you will see that the -'s have been strippped along with the lowercase first letter.
    • The manifest file should have a column named sample-I-D
  • Look at the table bulkrna-seqassay_synapse_storage_manifest_table in your project. Should see it has a column named sampleID.

These changes are significant bc normally, names given would be SampleID.

Delete annotations and table, then submit again with other options as described in the comment below this one.

@mialy-defelice mialy-defelice marked this pull request as draft November 20, 2023 23:50
@mialy-defelice
Copy link
Contributor Author

mialy-defelice commented Nov 22, 2023

@AmyHeiser @GiaJordan @linglp

In the schematic strategy meeting we talked about making 3 different choices from a single parameter.

For example:

  • label:
    • use_schema_label
    • retain_dl_formatting
    • use_display_label.

But this won't work bc of how they are called and used.
We will also get the parameter use_display_name_as_label in the future.

Instead maybe make 3 parameters with explicit choices to cover all the schema label and display name options that will be available:

  • data_model_labels: (would be added in PR#1324 instead of this one) -- would define the label attached to each node as well as the label for the JSONLD -- in current version of PR, this is controlled by the bool use_display_name_as_label. if JSONLD is used to generate a database...Only change if needed, will need to be consistently applied across all endpoints or will result in error.
    • display_label: use the display name as a label, if it is valid (contains no blacklisted characters) otherwise will default to schema_label. Depends on users being very intentional with this selection, will throw a warning if the schema_label is used instead of the display_name as the label. Will occur the most for valid values.
    • class_label: default, use standard class or property label.
  • annotation_keys: -- Defines the keys for uploading annotations. -- configuration in current PR under control of the boolean retain_dl_formatting...only used during submission, to upload the contents as annotations.
    • display_label: use the display name formatting as the annotation key. Will strip blacklisted characters (including spaces) when present.
    • class_label: default, use class or property label and strip blacklisted characters (including spaces).
  • table_column_names: -- Defines how to name the column names for tables. -- configuration in current PR under control of the boolean use_schema_label and retain_dl_formatting, but lacking some of the flexibility provided below, (retain_dl_formatting would be set once per submit, so would not be able to separate that type of formatting from annotations and table, the would be explicitly tied, the options below allow them to be separated)...only used during submission, to upload the manifest as a table.
    • display_name: use raw display name as the column name, no modifications)
    • display_label: use the display name formatting as the column name while striping blacklisted characters (including spaces) when present. (would be used to keep the naming the same between the annotation_keys and table_column_names.
    • class_label: default, use class or property label and strip blacklisted characters (including spaces).

@linglp
Copy link
Contributor

linglp commented Nov 23, 2023

@mialy-defelice
I tried the instructions that you mentioned above, but I got an error:

Traceback (most recent call last):
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/bin/schematic", line 6, in <module>
    sys.exit(main())
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/decorators.py", line 45, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/models/commands.py", line 128, in submit_manifest
    manifest_id = metadata_model.submit_metadata_manifest(
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/models/metadata.py", line 395, in submit_metadata_manifest
    manifest_id = syn_store.associateMetadataWithFiles(
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 1720, in associateMetadataWithFiles
    manifest_synapse_file_id = self.upload_manifest_as_table(
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 1530, in upload_manifest_as_table
    manifest = self.add_annotations_to_entities_files(se, schemaGenerator, manifest, manifest_record_type, datasetId, hideBlanks, manifest_synapse_table_id, retain_dl_formatting)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 1481, in add_annotations_to_entities_files
    self._add_annotations(se, schemaGenerator, row, entityId, hideBlanks, retain_dl_formatting)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 1401, in _add_annotations
    annos = self.format_row_annotations(se, schemaGenerator, row, entityId, hideBlanks, retain_dl_formatting)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 266, in wrapper
    raise ex
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 259, in wrapper
    return method(*args, **kwargs)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/store/synapse.py", line 1172, in format_row_annotations
    annos = self.syn.get_annotations(entityId)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/client.py", line 1530, in get_annotations
    return from_synapse_annotations(self._getRawAnnotations(entity, version))
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/client.py", line 1510, in _getRawAnnotations
    return self.restGET(uri)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/client.py", line 4035, in restGET
    response = self._rest_call('get', uri, None, endpoint, headers, retryPolicy, requests_session, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/client.py", line 4019, in _rest_call
    self._handle_synapse_http_error(response)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/client.py", line 3989, in _handle_synapse_http_error
    exceptions._raise_for_status(response, verbose=self.debug)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/synapseclient/core/exceptions.py", line 160, in _raise_for_status
    raise SynapseHTTPError(message, response=response)
synapseclient.core.exceptions.SynapseHTTPError: 400 Client Error: 
nan is not a valid Synapse ID.

I think this happened because when merging data frames to entity IDs in this line:

manifest = manifest.merge(file_df, how = 'left', on='Filename', suffixes=['_x',None]).drop('entityId_x',axis=1)

Even though file_df has entityId column that is not empty, after merging, manifest["entityId"] is still showing NaN..
I eventually had to add a line below after manifest.merge :

manifest['entityId'] = file_df['entityId'].values

After the code runs successfully, I could see sampleID column like below:

Screen Shot 2023-11-22 at 6 59 33 PM

Also annotations of file looks like this:
Screen Shot 2023-11-22 at 7 28 46 PM

Without the retain_dl_formatting, just by using the default (use_schema_label is True by default), I could see SampleID column:
Screen Shot 2023-11-22 at 7 02 11 PM

With annotations like this:
Screen Shot 2023-11-22 at 7 27 04 PM
**Is this the expected behavior? When use_schema_label = True, should the annotations of file looks the same as when use_display_label = True? **

If I use use_display_label, I could see sample-I-D column:
Screen Shot 2023-11-22 at 7 16 21 PM

With annotations like this:
Screen Shot 2023-11-22 at 7 21 38 PM

From my perspective, one difficult thing is to understand how setting retain_dl_formatting parameter impacts key of annotations and columns. I think I understand it better now after playing around the code. But just like Mialy mentioned, setting keys and formatting column names should probably be controlled by different parameter so that it is easier to understand the expected behaviors. Also, I think for table_column_names, if the expected behavior is to preserve raw input as column headers, maybe the option should be called "raw" or something else since users typically expect "display_name" strips blacklisted characters.

Potential TODos:

  1. Add test foradd_annotations_to_entities_files function. This should probably be added in a different PR.

@mialy-defelice
Copy link
Contributor Author

@linglp
So I don't think users expect use_display_name to strip blacklisted characters, so maybe for the future labeling of the other parameters we should call them display_name_label to distinguish them.

**Is this the expected behavior? When use_schema_label = True, should the annotations of file looks the same as when use_display_label = True? **

Yes, use_display_label and use_schema_label control the table columns not the annotations.

Annotations do not currently have any parameters controlling formatting until the introduction of retain_dl_formatting.

@mialy-defelice
Copy link
Contributor Author

@linglp did you delete the UUID and entityId column? I should have added something to delete those columns

@mialy-defelice
Copy link
Contributor Author

@linglp I can confirm, I got the error you reported above when I forgot to delete the id and entityId column in the manifest when doing an initial upload in a new test folder. This happens bc the entityIds are pointing to the wrong entities.

@mialy-defelice mialy-defelice marked this pull request as ready for review December 7, 2023 21:49
@mialy-defelice
Copy link
Contributor Author

mialy-defelice commented Dec 7, 2023

@linglp can you test this PR again?
Here is a draft SchemaHub doc as well.

@linglp
Copy link
Contributor

linglp commented Dec 13, 2023

confirm that the API part is working:

display_name would use the raw attribute display name as the column name. display_label, would strip all blacklisted characters (including spaces) from the display name, while retaining the rest of the display name formatting (best used when display name is already in camelcase), class_label (default) converts the display name to upper camelcase and strips blacklisted characters.
Case 1:

  • annotation_keys = class_label and table_column_names = class_label
  • Expected behavior: class_label (default) converts the display name to upper camelcase and strips blacklisted characters.
  • Actual behavior: I could see black listed characters got removed in sample-I-D, and the new column name is in a proper upper camel case format
Screen Shot 2023-12-13 at 1 49 05 PM

Case 2:

  • I updated the column File Format in original resource and changed it to fileformat
  • annotation_keys = class_label and table_column_names = display_label
  • Expected behavior: display_label, would strip all blacklisted characters (including spaces) from the display name, while retaining the rest of the display name formatting (best used when display name is already in camelcase)
  • Actual behavior: I could see file format gets preserved as 'file format'
Screen Shot 2023-12-13 at 2 30 00 PM

Case 3:

  • I updated the column File Format in original resource to file Format
  • annotation_keys = class_label and table_column_names = display_name
  • Expected behavior: display_label, would strip all blacklisted characters (including spaces) from the display name, while retaining the rest of the display name formatting (best used when display name is already in camelcase)
  • Actual behavior: I could see column names in the CSV were preserved in original format
Screen Shot 2023-12-13 at 2 22 30 PM

Copy link

sonarcloud bot commented Dec 15, 2023

Quality Gate Failed Quality Gate failed

Failed conditions

E Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

idea Catch issues before they fail your Quality Gate with our IDE extension SonarLint SonarLint

@linglp
Copy link
Contributor

linglp commented Jan 9, 2024

@mialy-defelice Maybe not related to this PR, but I think that the submission of manifests through CLI is not working. I tried running: schematic model -c config.yml submit -mp /Users/lpeng/Downloads/sub/synapse_storage_manifest_bulkrna-seqassay_new.csv -d syn52854553 -mrt table_and_file --table_column_names display_label --annotation_keys display_label, but I got an error:

Traceback (most recent call last):
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/bin/schematic", line 6, in <module>
    sys.exit(main())
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/lpeng/Library/Caches/pypoetry/virtualenvs/schematicpy-9OomxyhV-py3.10/lib/python3.10/site-packages/click/decorators.py", line 45, in new_func
    return f(get_current_context().obj, *args, **kwargs)
  File "/Users/lpeng/Documents/schematic-git2/schematic/schematic/models/commands.py", line 152, in submit_manifest
    manifest_id = metadata_model.submit_metadata_manifest(
TypeError: MetadataModel.submit_metadata_manifest() missing 1 required positional argument: 'access_token

I think that's because access_token is now a required parameter for function submit_metadata_manifest. But in the develop branch, access_token is optional. Maybe after you resolve conflicts, the errors would go away. I will try again after you pull the latest develop.

@mialy-defelice
Copy link
Contributor Author

@linglp yes we did the fix in Dev a bit ago...bc it was accidentally made required at some point while this PR was in development. Lets hold off on testing this anyway since it will go in post refactor, so I'll need to merge that in ...sigh.

@mialy-defelice
Copy link
Contributor Author

@linglp this PR is ready to be tested again.

Copy link

sonarcloud bot commented Jan 25, 2024

Quality Gate Failed Quality Gate failed

Failed conditions

E Reliability Rating on New Code (required ≥ A)

See analysis details on SonarCloud

idea Catch issues before they fail your Quality Gate with our IDE extension SonarLint SonarLint

@linglp
Copy link
Contributor

linglp commented Jan 29, 2024

Thanks @mialy-defelice . I am taking a look now.
Tests that I ran:

  1. schematic model -c config.yml submit -mp /Users/lpeng/Downloads/sub/synapse_storage_manifest_bulkrna-seqassay_new.csv -d syn52854553 -mrt table_and_file --table_column_names display_label --annotation_keys display_label

Saw expected behavior of "sampleID" in annotation:
Screen Shot 2024-01-29 at 1 11 05 PM

and also the manifest table has the expected table names as expected.
Screen Shot 2024-01-29 at 1 24 07 PM

  1. schematic model -c config.yml submit -mp /Users/lpeng/Downloads/sub/synapse_storage_manifest_bulkrna-seqassay_new.csv -d syn52854553 -mrt table_and_file --table_column_names display_name --annotation_keys display_label
    Saw expected behavior of "sample-I-D" in the table:
Screen Shot 2024-01-29 at 1 28 48 PM
  1. schematic model -c config.yml submit -mp /Users/lpeng/Downloads/sub/synapse_storage_manifest_bulkrna-seqassay_new.csv -d syn52854553 -mrt table_and_file --table_column_names class_label --annotation_keys class_label
    Saw class label in the table:
Screen Shot 2024-01-29 at 1 33 37 PM

can see annotation keys are using class labels too:
Screen Shot 2024-01-29 at 1 35 26 PM

  1. Through APIs:
  • annotation_key: class_label
  • table_column_names: display_label

saw display label in table:
Screen Shot 2024-01-29 at 1 43 06 PM

saw class label in annotations:
Screen Shot 2024-01-29 at 1 44 06 PM

I also ran test_api.py to make sure that all tests are working

Copy link
Contributor

@linglp linglp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mialy-defelice thanks for the hard work. I think it looks good.

@mialy-defelice
Copy link
Contributor Author

Thank you @linglp!!

@mialy-defelice mialy-defelice merged commit 01208bd into develop Jan 29, 2024
3 of 4 checks passed
@mialy-defelice mialy-defelice deleted the develop-retain-dl-formatting-FDS-1199 branch January 29, 2024 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants