Generation type

Rene Ranzinger edited this page Jan 4, 2023 · 5 revisions

The generation type used in the file configuration defines how a file is processed and how duplicated information is handled.

glygen_protein_data

The file is expected to be a CSV file with protein-centric information. The metadata information in the file configuration has to be provided as:

  • protein - mandatory
  • gene - mandatory
  • glycan - optional
  • disease - optional
  • anatomy - optional
  • species - mandatory

If any mandatory column is missing, the program reports an error and stops. If multiple rows are present for the same protein, all metadata (e.g. glycan, disease, anatomy) is added to the same protein entry, which creates a single collection object in the CFDE data structure. Duplicated values for glycan, disease or anatomy are ignored.
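The row merging described above can be sketched in Python. This is only an illustration, not the program's actual implementation: the function name `collect_proteins` and the literal column names are hypothetical (the real column names come from the file configuration), and keeping the gene and species of the first row seen for a protein is an assumption.

```python
import csv
import io

# Hypothetical column names; the real ones are set in the file configuration.
MANDATORY = ("protein", "gene", "species")
OPTIONAL = ("glycan", "disease", "anatomy")


def collect_proteins(csv_text):
    """Group rows by protein and merge optional metadata without duplicates."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in MANDATORY if c not in (reader.fieldnames or [])]
    if missing:
        # Mirrors the documented behaviour: a missing mandatory column is fatal.
        raise ValueError(f"missing mandatory column(s): {', '.join(missing)}")

    proteins = {}
    for row in reader:
        # Assumption: gene and species are taken from the first row of a protein.
        entry = proteins.setdefault(
            row["protein"],
            {"gene": row["gene"], "species": row["species"],
             **{c: [] for c in OPTIONAL}},
        )
        for column in OPTIONAL:
            value = (row.get(column) or "").strip()
            # Skip empty cells and values already recorded for this protein.
            if value and value not in entry[column]:
                entry[column].append(value)
    return proteins
```

Each key of the returned dictionary stands for one protein entry, so two rows for the same protein that both list the same glycan yield a single glycan value.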

glygen_protein_no_gene_data

The file is expected to be a CSV file with protein-centric information. However, unlike glygen_protein_data, the proteins are expected to have no gene information (e.g., viruses). The metadata information in the file configuration has to be provided as:

  • protein - mandatory
  • gene - not expected
  • glycan - optional
  • disease - optional
  • anatomy - optional
  • species - mandatory

If any mandatory column is missing, the program reports an error and stops. If multiple rows are present for the same protein, all metadata (e.g. glycan, disease, anatomy) is added to the same protein entry, which creates a single collection object in the CFDE data structure. Duplicated values for glycan, disease or anatomy are ignored.

glygen_glycan_data

The file is expected to be a CSV file with glycan-centric information. The metadata information in the file configuration has to be provided as:

  • protein - optional (if provided gene information has to be provided as well)
  • gene - optional (if provided protein information has to be provided as well)
  • glycan - mandatory
  • disease - optional
  • anatomy - optional
  • species - optional

If the glycan column is missing, the program reports an error and stops. If multiple rows are present for the same glycan, all metadata (e.g. protein, gene, disease, species, anatomy) is added to the same glycan entry, which creates a single collection object in the CFDE data structure. Duplicated values for protein, species, disease or anatomy are ignored.
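The column rules for this generation type (glycan is mandatory, protein and gene are only valid as a pair) can be expressed as a small check. The sketch below is illustrative only: `check_glycan_columns` is a hypothetical helper, and it assumes the configuration has already been reduced to the set of metadata types it provides.

```python
def check_glycan_columns(columns):
    """Validate the metadata types configured for a glycan-centric file."""
    columns = set(columns)
    if "glycan" not in columns:
        # The glycan column is the only mandatory one for this type.
        raise ValueError("glycan column is mandatory")
    # protein and gene are optional, but one may not appear without the other.
    if ("protein" in columns) != ("gene" in columns):
        raise ValueError("protein and gene must be provided together")
    return True
```

Under these assumptions, a configuration with only glycan and disease passes, while one with glycan and protein but no gene is rejected.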

glygen_protein_glycan_mix_data

The file is expected to be a CSV file that can contain both protein-centric and glycan-centric information. The distinction between these two cases is made for each row, based on whether a protein ID is present. The metadata information in the file configuration has to be provided as:

  • protein - mandatory
  • gene - mandatory
  • glycan - mandatory
  • disease - optional
  • anatomy - optional
  • species - mandatory

If any mandatory column is missing, the program reports an error and stops.

For rows with a protein ID: if multiple rows are present for the same protein, all metadata (e.g. glycan, disease, anatomy) is added to the same protein entry, which creates a single collection object in the CFDE data structure. Duplicated values for glycan, disease or anatomy are ignored.

For rows without a protein ID: if multiple rows are present for the same glycan, all metadata (e.g. protein, gene, disease, species, anatomy) is added to the same glycan entry, which creates a single collection object in the CFDE data structure. Duplicated values for species, disease or anatomy are ignored.
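The row-level distinction can be pictured as a routing step that runs before any merging. The following is a minimal sketch, assuming rows are already parsed into dictionaries with the hypothetical keys `protein` and `glycan`, and that an empty or whitespace-only protein cell counts as "no protein ID"; `route_rows` is not a function of the actual program.

```python
def route_rows(rows):
    """Split rows into protein-centric and glycan-centric groups."""
    by_protein, by_glycan = {}, {}
    for row in rows:
        protein = (row.get("protein") or "").strip()
        if protein:
            # Row has a protein ID: it belongs to that protein's entry.
            by_protein.setdefault(protein, []).append(row)
        else:
            # No protein ID: the row is grouped under its glycan instead.
            by_glycan.setdefault(row["glycan"], []).append(row)
    return by_protein, by_glycan
```

Each group can then be merged and de-duplicated according to the rules for its kind of entry.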