Evaluate population of GSA with all CarbBank entries #58
Comments
@ReneRanzinger Some rows contain multiple values separated by line breaks. The affected columns (with data that can be ingested by GSA) are AM, BS, MT, PA, PM. For example, the following entry starting at line 98 has two different biological sources listed: A more human-readable version of the above multi-line fields:
This is an example of glycans that were purified from both an expression system and naturally occurring human urine. Also, the multi-line entries in one column do not appear to correspond to those in another column. Regarding format, some observations will need to be split into separate rows because they list multiple biological sources or glycosylation sites. I think the BS column should be split into new columns per your suggestion, but only after we address the multi-line issue. Let me know your thoughts.
@ReneRanzinger Ignore rows with multiple lines in the following fields: AM, BS, MT, PA, PM. Some of these fields will be imported as keywords into GSA. Let's exclude the multi-line cases for now, but eventually we could replace the line breaks with a pipe or semicolon if they all apply to the same GSA record.
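The exclusion rule above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code that was actually used for the export; the sample CSV and the helper names `has_multiline` and `filter_single_line` are made up for the example.

```python
import csv
import io

# Columns named in the comment above; a row is skipped if any of them
# contains a line break.
MULTILINE_FIELDS = ("AM", "BS", "MT", "PA", "PM")


def has_multiline(row, fields=MULTILINE_FIELDS):
    """Return True if any listed field holds more than one line of text."""
    for field in fields:
        value = (row.get(field) or "").strip()
        if "\n" in value or "\r" in value:
            return True
    return False


def filter_single_line(rows, fields=MULTILINE_FIELDS):
    """Keep only rows whose listed fields are all single-line."""
    return [row for row in rows if not has_multiline(row, fields)]


# Tiny made-up sample, NOT real CarbBank data. The second record has a
# quoted BS value that spans two lines.
SAMPLE = 'ID,AM,BS\n1,NMR,"human urine"\n2,NMR,"source one\nsource two"\n'

rows = list(csv.DictReader(io.StringIO(SAMPLE, newline="")))
kept = filter_single_line(rows)
print([row["ID"] for row in kept])
```

Fields that are missing from a row are treated as empty, so the same function works on files that only carry a subset of the five columns.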
I updated the CarbBank file: https://github.com/ReneRanzinger/org.glycomedb.export.glygen/blob/main/export/carbbank.csv. Rows with multi-line entries in BS, MT, PA, or PM are filtered out, which leaves 35,869 of 49,897 records. The AM column is the experimental method. I linearized it and separated the different methods with "|". We allow multiple methods; you just have to split the value on "|". I also split the BS field: I left the original column as is, but parsed the individual components into separate columns, which I appended after GlyTouCan:
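For anyone consuming the linearized AM column, splitting on "|" might look like the sketch below. The method names in the example call are placeholders chosen for illustration, not values taken from the file, and `split_methods` is a hypothetical helper.

```python
def split_methods(am_value, sep="|"):
    """Split a linearized AM value into a list of method names.

    Surrounding whitespace is trimmed and empty parts are dropped, so a
    trailing separator does not produce an empty method.
    """
    return [part.strip() for part in (am_value or "").split(sep) if part.strip()]


# Placeholder method names, for illustration only.
print(split_methods("method A | method B|"))
```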
Thanks @ReneRanzinger. If you have time, could you also replace the line breaks with pipes for the fields AN and DB?
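A rough sketch of what replacing line breaks with pipes could look like for AN and DB, assuming each row is a plain dict of column name to string. The helper names `linearize` and `linearize_fields` are hypothetical, and this is not the exporter's actual code.

```python
import re

# One or more consecutive line breaks (with any surrounding whitespace)
# count as a single separator.
_LINE_BREAKS = re.compile(r"\s*(?:\r\n|\r|\n)\s*")


def linearize(value, sep="|"):
    """Join the lines of a multi-line value with `sep`, dropping blank lines."""
    parts = [part for part in _LINE_BREAKS.split((value or "").strip()) if part]
    return sep.join(parts)


def linearize_fields(row, fields=("AN", "DB")):
    """Return a copy of `row` with only the listed fields linearized."""
    return {
        key: linearize(value) if key in fields else value
        for key, value in row.items()
    }


# Made-up values, for illustration only.
print(linearize_fields({"AN": "x\ny", "DB": "d1\r\nd2", "BS": "keep\nme"}))
```

One caveat of this approach: if a value already contains a literal "|", the result becomes ambiguous, so it may be worth checking for that before choosing the separator.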
@kmartinez834 Ok, done.
@rykahsay Here are the CarbBank fields and the corresponding GSA fields. There will only be one tax_id per entry, so map using the most specific term. Order of terms from broadest to most specific:
Note: If a field is not listed above, disregard it for now.
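The "most specific term" rule from this comment could be sketched as below. The column names passed in are placeholders, since the actual ordered list is not reproduced here; the caller is assumed to supply the real columns in order from broadest to most specific. `most_specific_term` is a hypothetical helper.

```python
def most_specific_term(row, columns_broad_to_specific):
    """Return (column, value) for the most specific non-empty column.

    `columns_broad_to_specific` lists column names from broadest to most
    specific, so the search runs from the end of the list. Returns None
    when every listed column is empty or missing.
    """
    for column in reversed(columns_broad_to_specific):
        value = (row.get(column) or "").strip()
        if value:
            return column, value
    return None


# Placeholder column names and values, for illustration only.
ORDER = ["broad", "middle", "specific"]
print(most_specific_term({"broad": "A", "middle": "B", "specific": ""}, ORDER))
```

The returned term is still free text; resolving it to a single tax_id is a separate lookup step.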
Evaluate the possibility for populating GSA with CarbBank data. A CSV with CarbBank data can be found here. The column abbreviations are explained here.
The first step would be to extract the biological data from the BS column. Several kinds of information are encoded in the column:
There are no dictionary IDs for the values; all are free text. We will need to extract them and map them to the corresponding dictionaries (NCBI Taxonomy, Disease Ontology, UBERON, etc.).
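As a starting point for the mapping step, a normalized exact-match lookup might look like the sketch below. The lookup table here is a hand-written two-entry example (9606 is the NCBI Taxonomy ID for Homo sapiens); a real run would presumably load labels and synonyms from the dictionaries themselves. The function names are hypothetical, and anything unmatched is returned separately for manual review rather than guessed.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join((text or "").lower().split())


def build_lookup(pairs):
    """Build a normalized label -> ID table from (label, id) pairs."""
    return {normalize(label): identifier for label, identifier in pairs}


def map_terms(values, lookup):
    """Split free-text values into mapped {value: id} and unmapped [value]."""
    mapped, unmapped = {}, []
    for value in values:
        key = normalize(value)
        if key in lookup:
            mapped[value] = lookup[key]
        else:
            unmapped.append(value)
    return mapped, unmapped


# Hand-written example table; not a real dictionary export.
lookup = build_lookup([("Homo sapiens", "9606"), ("human", "9606")])
mapped, unmapped = map_terms(["  Homo  Sapiens", "HUMAN", "unknown organism"], lookup)
print(mapped)
print(unmapped)
```

Exact matching after normalization will miss misspellings and abbreviations, so the unmapped list is likely to be where most of the curation effort goes.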
Have a look and let me know how to proceed. If you want me to change the format (add or remove the columns) or split the columns up, let me know.