Evaluate population of GSA with all CarbBank entries #58

Open
ReneRanzinger opened this issue Sep 2, 2022 · 6 comments

@ReneRanzinger
Member

Evaluate the possibility of populating GSA with CarbBank data. A CSV with CarbBank data can be found here. The column abbreviations are explained here.

The first step would be to extract the biological data from the BS column. Several kinds of information are encoded in that column:

  • CN - species common name
  • GS - species scientific name
  • OT - organ type
  • LS - life stage
  • disease - Disease
  • ...

The values have no dictionary IDs; all of them are free text. We will need to extract them and map them to the corresponding dictionaries (NCBI Taxonomy, Disease Ontology, UBERON, etc.).
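As a rough illustration of the extraction step, here is a minimal sketch that splits one BS line into its tagged parts. It assumes every component is written as `(TAG) value` and that components are separated by `, (`, as in the example rows in this thread; the helper name `parse_bs` is hypothetical, and values that themselves contain `, (` would need extra handling.

```python
import re

# Matches one "(TAG) value" pair; the value runs until the next ", (" or the end.
# Assumption: tags are always written in parentheses in front of their value.
BS_PAIR = re.compile(r"\(([^)]+)\)\s*(.*?)(?=,\s*\(|$)")


def parse_bs(line: str) -> dict[str, str]:
    """Split one BS line such as '(CN) human, (OT) urine' into {tag: value}."""
    pairs = BS_PAIR.findall(line.strip())
    return {tag.strip(): value.strip() for tag, value in pairs}


print(parse_bs(" (CN) Chinese hamster, (OT) CHO cells"))
# {'CN': 'Chinese hamster', 'OT': 'CHO cells'}
```

The resulting free-text values would still have to be mapped to dictionary IDs in a separate step.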

Have a look and let me know how to proceed. If you want me to change the format (add or remove the columns) or split the columns up, let me know.

@kmartinez834

@ReneRanzinger There are rows that contain multiple values, separated by line breaks. The affected columns (with data that can be ingested by GSA) are AM, BS, MT, PA, PM.

For example, the following entry starting at line 98 has two different biological sources listed:
"16519","","",""," Takeuchi M; Takasaki S; Miyazaki H; Kato T; Hoshi S; Kochibe N;
Kobata A",""," (CN) Chinese hamster, (OT) CHO cells
(CN) human, (OT) urine"," J Biol Chem (1988) 263: 3657-3663"," 04-01-1992",""," N-linked glycoprotein
recombinant glycoprotein",""," 1-2 Neup5Ac per molecule, urinary HuEPO has .alpha.2.fwdawr.3 and
.alpha.2.fwdawr.6 linkages, rHuEPO has .alpha.2.fwdawr.3 linkages",""," EPO, erythropoietin, human
EPO, erythropoietin, human, recombinant"," Kleen A"," AN1/1', 1' has fucose"," CBank:21607",""," Comparative study of the asparagine-linked sugar chains of human
erythropoietins purified from urine and the culture medium of recombinant
Chinese hamster ovary cells","","9769","G02718AK"

More human readable version of the above multi-line fields:

| BS | MT | PM |
|---|---|---|
| (CN) Chinese hamster, (OT) CHO cells | N-linked glycoprotein | EPO, erythropoietin, human |
| (CN) human, (OT) urine | recombinant glycoprotein | EPO, erythropoietin, human, recombinant |

This is an example of glycans that were purified from both an expression system and naturally occurring human urine. Also, the lines of a multi-line entry in one column do not appear to correspond to the lines in another column.

Regarding format, some observations will need to be split into separate rows because they have multiple biological sources or glycosylation sites. I think the BS column should be split into new columns per your suggestion, but only after we address the multi-line issue. Let me know your thoughts.
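To find the affected rows, one option is to let a CSV parser handle the quoted line breaks and then check each field for embedded newlines. The sketch below uses Python's standard `csv` module; the header names (`row_id`, `BS`, `MT`) and the second sample row are made up for illustration, and `multiline_fields` is a hypothetical helper.

```python
import csv
import io

# Fields named in this thread as affected by multi-line values.
MULTILINE_FIELDS = ["AM", "BS", "MT", "PA", "PM"]


def multiline_fields(row: dict, fields=MULTILINE_FIELDS) -> list[str]:
    """Return the names of the given fields whose value spans several lines."""
    return [f for f in fields if len((row.get(f) or "").strip().splitlines()) > 1]


# Tiny stand-in for the real file; the header names are assumptions.
sample = '''row_id,BS,MT
"16519"," (CN) Chinese hamster, (OT) CHO cells
(CN) human, (OT) urine"," N-linked glycoprotein
recombinant glycoprotein"
"1"," (CN) human"," O-linked glycoprotein"
'''

for row in csv.DictReader(io.StringIO(sample)):
    print(row["row_id"], multiline_fields(row))
```

Rows for which the returned list is non-empty are the ones that would need to be split or excluded.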

@kmartinez834

@ReneRanzinger Ignore rows with multi-line values in the following fields: AM, BS, MT, PA, PM.

Some of these fields will be imported as keywords into GSA. Let's exclude the multi-line cases for now, but eventually we could replace the line breaks with a pipe or semicolon if they all apply to the same GSA record.
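For the later option of replacing line breaks with a delimiter, a minimal sketch could look like the following. The helper name `join_lines` is hypothetical, and it assumes the values never contain the chosen separator themselves.

```python
def join_lines(value: str, sep: str = "|") -> str:
    """Collapse a multi-line cell into one line, dropping blank lines."""
    parts = [part.strip() for part in value.splitlines()]
    return sep.join(part for part in parts if part)


print(join_lines(" N-linked glycoprotein\nrecombinant glycoprotein"))
# N-linked glycoprotein|recombinant glycoprotein
```

Passing `sep=";"` would give the semicolon variant instead.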

@ReneRanzinger
Member Author

I updated the CarbBank file: https://github.com/ReneRanzinger/org.glycomedb.export.glygen/blob/main/export/carbbank.csv.

Rows with multi-line entries in BS, MT, PA, PM are filtered out. That leaves 35,869 of 49,897 records. The AM column is the experimental method. I linearized it and separated the different methods with "|". We allow multiple methods; you just have to split the value on "|".
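On the consuming side, reading the linearized AM column comes down to a split on "|". This is only a sketch: `split_methods` is a hypothetical helper and the method names in the example are placeholders, not real CarbBank values.

```python
def split_methods(am: str) -> list[str]:
    """Split a pipe-separated AM value into individual method names."""
    return [method.strip() for method in am.split("|") if method.strip()]


# Placeholder values, not taken from the CarbBank file.
print(split_methods("method A| method B"))
# ['method A', 'method B']
```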

I also split the BS field. I left the original column as is, but parsed the individual components into separate columns, which I appended after GlyTouCan:

  • BS-CN
  • BS-OT
  • BS-disease
  • BS-LS
  • BS-GS
  • BS-GT
  • BS-C
  • BS-*
  • BS-cell line
  • BS-K
  • BS-domain
  • BS-BS
  • BS-F
  • BS-O

@kmartinez834

Thanks @ReneRanzinger

If you have time, could you also replace the line breaks with pipes for the fields AN and DB?

@ReneRanzinger
Member Author

@kmartinez834 Ok, done.

@kmartinez834

kmartinez834 commented Sep 23, 2022

@rykahsay here is the mapping from each CarbBank field to the corresponding GSA field.

There will only be one tax_id per entry, so map using the most specific term available. Order of terms from broadest to most specific:
domain, kingdom, class, family, common name, species
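One way to apply that ordering is to walk the taxonomy columns from most specific to broadest and take the first non-empty value. The sketch below assumes the split columns are named as in this thread (BS-domain, BS-K, BS-C, BS-F, BS-CN, BS-GS); `most_specific_term` is a hypothetical helper, and looking up the actual tax_id from the returned term is left out.

```python
# Columns ordered from broadest to most specific:
# domain, kingdom, class, family, common name, species.
SPECIFICITY_ORDER = ["BS-domain", "BS-K", "BS-C", "BS-F", "BS-CN", "BS-GS"]


def most_specific_term(row: dict):
    """Return (column, value) for the most specific non-empty term, or None."""
    for column in reversed(SPECIFICITY_ORDER):
        value = (row.get(column) or "").strip()
        if value:
            return column, value
    return None


print(most_specific_term({"BS-K": "Animalia", "BS-CN": "human", "BS-GS": ""}))
# ('BS-CN', 'human')
```

Returning the column name together with the value keeps track of which kind of term was used for the lookup.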

| carbbank field | gsa field | processing notes |
|---|---|---|
| CC | database_source | URL is https://www.genome.jp/entry/carbbank+%s |
| AM | experimental_method | some entries have multiple, separated by pipes |
| AN | keywords | some entries have multiple, separated by pipes |
| AU | publication | Author, use for mapping |
| CT | publication | Citation info, use for mapping |
| DB | xrefs | some entries have multiple, separated by pipes |
| MT | keywords | |
| PA | site | protein attachment site |
| PM | glycoprotein | Protein name, use for mapping |
| ST | evidence_type | Synthetic entry if the term "synthetic" is in this field |
| TI | publication | Paper title, use for mapping |
| GlycomeDB ID | xrefs | |
| GlyTouCan Acc | xrefs | |
| BS-CN | tax_name | common name |
| BS-OT | tissue | |
| BS-disease | disease | |
| BS-GS | tax_name | species |
| BS-GT | strain, serotype | field includes both strain and serotype |
| BS-C | tax_name | class |
| BS-K | tax_name | kingdom |
| BS-domain | tax_name | domain |
| BS-F | tax_name | family |
| BS-cell line | cell_type | mapping specific to carbbank file: carbbank_cell_lines.csv |

Note: If a field is not listed above, disregard it for now.
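Two of the processing notes can be sketched directly: filling the `%s` in the database_source URL with the CC value, and flagging synthetic entries from the ST field. The function names are hypothetical, the CC value in the example is a placeholder, and matching "synthetic" case-insensitively is an assumption rather than something stated in this thread.

```python
CARBBANK_URL = "https://www.genome.jp/entry/carbbank+%s"


def database_source_url(cc: str) -> str:
    """Fill the URL template from the mapping with a CC value."""
    return CARBBANK_URL % cc.strip()


def is_synthetic(st: str) -> bool:
    """True if the ST field mentions 'synthetic' (case-insensitive by assumption)."""
    return "synthetic" in st.lower()


# "12345" is a placeholder, not a real CarbBank identifier.
print(database_source_url("12345"))
print(is_synthetic("Synthetic compound"))
```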
