ISA Model Extension - DataSet #484

muehlhaus · 2022-12-14T16:37:30Z

The community requires a way to describe result file content as part of the ISA model. Here, we want to discuss possible solutions. (Discussion related to issue #475)

Currently, the ISA model is very strong in describing the path from biological source to a measurement result file. From this point on the model relies on the specification of the result file format for machine tractability, which is perfect if such a file format is established. However, we often face the situation that such a file format is not established, or we want to point into such a format describing a specific processing path.
• Possibility to point into a file (e.g. data frame, XML, JSON, or image coordinate space)
• Using the classical ISA process description (e.g. ISA Tab) to not change the user experience

Brilator · 2022-12-14T16:46:07Z

Examples of result file types that would need to be described with a generalized "data dictionary":

MIAPPE's Trait definition file (TDF) - (example below sourced from INRAE URGI)

Trait Name	Full Name	Description	Protocol	Unit
MIPO:0000025 MFLW_D20deg	Male flowering days to anthesis D20deg	Anthesis time	Thermal time between emergence and to anthesis - Computation	D20deg: days equivalent time at 20 °C
MIPO:0000007 ear_height	Ear insertion height (cm)	Plant height from base to the insertion of the top (uppermost) ear.	EH - Measurement	cm
MIPO:0000018 GY_Adj_tha	Grain yield	Grain yield per unit area, either field weight basis, dry weight basis or adjusted.	Adjusted GY - Computation	t/ha
MIPO:0000012 KW	Thousand kernel weight (g)	Grain weight of mature kernels.	GW DW - Measurement	g/1000grain
MIPO:0000017 GN	Grain number	Grain number	GN - Computation	grain/m-2
MIPO:0000024 FFLW_D20deg	Female flowering days to silking D20deg	Silking time	Thermal time between emergence and silking – Computation	D20deg: days equivalent time at 20 °C
MIPO:0000026 ASI_D20deg	Anthesis to silking interval D20deg	Anthesis silking interval	Anthesis silking interval in thermal time – Computation	D20deg: days equivalent time at 20 °C
MIPO:0000014 tassel_height	Tassel length (cm)	Tassel size.
MIPO:0000006 plant_height	Plant height (cm)	Plant height from the base to the top part (in reproductive stages to the top of the tassel).	PH - Measurement	cm

MetaboLights's MAF https://www.ebi.ac.uk/metabolights/guides/MAF/Title - (example below sourced from MTBLS2154)

metabolite_identification	mass_to_charge	retention_time	taxid	species	ICS1_1	ICS1_2	ICS1_3	Sca6_1	Sca6_2	Sca6_3
unknown	61.9880836	1.10663333	NCBITAXON/3641	Theobroma cacao	6023.23745	6290.05185	5881.19619	3273.45553	3868.78162	3808.81421
unknown	96.9595434	6.83985833	NCBITAXON/3641	Theobroma cacao	241.255549	355.024716	343.59143	383.767354	337.659757	421.683717
unknown	96.9595436	7.95281667	NCBITAXON/3641	Theobroma cacao	388.990881	523.950795	485.254267	600.302418	445.523309	321.548359
unknown	96.9597404	4.92275	NCBITAXON/3641	Theobroma cacao	541.570892	769.703525	799.981072	880.628663	792.917161	593.48679

@muehlhaus @Brilator adding examples for TDF and MAF files

muehlhaus · 2022-12-15T15:45:08Z

Author

The first idea that came up in DataPLANT is to extend the ISA “data file” object with a kind of data dictionary including the following fields:

      - Identifier : the identifier within the file (for a data frame: the column name), allowing regex patterns or XPath (e.g. to point into an XML file)
      - TargetFile : name of the file we point into (already covered by the data file object in ISA)
      - Attribute : triple of ontology term, termId, source URI (describing the content using an ontology)
      - Unit : triple of ontology term, termId, source URI
      - ObjectType : data type
      - [optional] Label and Comment

We envision an additional file (table) called “ISA dataset” that accompanies one or more result files (e.g. data matrices).
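To make the proposed fields concrete, here is a minimal Python sketch of one data dictionary entry. The class names, the file name and the ontology identifiers are placeholders invented for illustration; they are not part of ISA or of the proposal itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntologyAnnotation:
    """Triple of ontology term, term ID and source URI."""
    term: str
    term_id: str
    source_uri: str


@dataclass(frozen=True)
class DataDictionaryEntry:
    """One row of the proposed 'ISA dataset' table (hypothetical shape)."""
    identifier: str                 # column name, regex or XPath pointing into the file
    target_file: str                # file the identifier points into
    attribute: OntologyAnnotation   # what the values describe
    unit: OntologyAnnotation        # unit of the values
    object_type: str                # data type of the values
    label: Optional[str] = None     # optional
    comment: Optional[str] = None   # optional


# Placeholder values only: the file name and the EX: identifiers are made up.
entry = DataDictionaryEntry(
    identifier="retention_time",
    target_file="m_example_maf.tsv",
    attribute=OntologyAnnotation("retention time", "EX:0000001", "http://example.org/ex"),
    unit=OntologyAnnotation("minute", "EX:0000002", "http://example.org/ex"),
    object_type="float",
    label="Retention time",
)
print(entry.identifier, "->", entry.attribute.term, f"[{entry.unit.term}]")
```

Keeping Label and Comment as the only optional fields mirrors the list above; whether Unit should also be optional (for unitless values) is left open here.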

Your thoughts are highly appreciated.

3 replies

proccaserra Dec 19, 2022
Maintainer

Defining the structure for the data dictionary

Label	Data File	Data Type	Term Source	cardinality	pattern	missing values
Trait Name	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
Full Name	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
Description	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
Protocol	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
Unit	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
Scale	Trait Definition File	string	PO	1	/regex/	enum["NaN","NA",""]
database_identifier	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
chemical_formula	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
smiles	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
inchi	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
metabolite_identification	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
mass_to_charge	Metabolights Assignment File	float		1		enum["NaN","NA",""]
fragmentation	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
modifications	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
charge	Metabolights Assignment File	float		1		enum["NaN","NA",""]
retention_time	Metabolights Assignment File	float		1		enum["NaN","NA",""]
taxid	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
species	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
database	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
database_version	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
reliability	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
uri	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
search_engine	Metabolights Assignment File	string		1	/regex/	enum["NaN","NA",""]
search_engine_score	Metabolights Assignment File	float		1		enum["NaN","NA",""]
smallmolecule_abundance_sub	Metabolights Assignment File	float		1		enum["NaN","NA",""]
smallmolecule_abundance_stdev_sub	Metabolights Assignment File	float		1		enum["NaN","NA",""]
smallmolecule_abundance_std_error_sub	Metabolights Assignment File	float		1		enum["NaN","NA",""]
sample identifier	Metabolights Assignment File	string		1..n	/regex/	enum["NaN","NA",""]

Brilator Dec 20, 2022

Hey @proccaserra, I don't understand the cardinality column; or more precisely the sample identifier => 1..n. To my understanding the data dictionary is supposed to describe a specific "result file" (e.g. the MAF or a gene X sample count table from RNASeq Mapping). In this result file every column (if tabular) should be unique.

proccaserra Dec 21, 2022
Maintainer

@Brilator the cardinality is meant to specify how many times a field is expected to be found in a target file. But you are right in the sense that the string 'sample identifier' does not occur explicitly in the file itself. I was trying to capture that fact by using a combination of 'cardinality' and 'regex', but this is not correct. So it seems another mechanism is needed to indicate fields which 1/ follow a pattern and 2/ can occur multiple times. Typically, this would be an array of sample/bioassay identifiers or an array of contrasts.
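One possible reading of "follows a pattern and can occur multiple times" is sketched below in Python: count the headers of the target file that match a pattern, then compare the count against a cardinality such as 1 or 1..n. The helper names and the sample pattern are assumptions made for this example (the headers are taken from the MAF excerpt earlier in the thread); this is not an agreed mechanism.

```python
import re


def count_matches(headers, pattern):
    """Number of headers that match the pattern in full."""
    rx = re.compile(pattern)
    return sum(1 for header in headers if rx.fullmatch(header))


def satisfies_cardinality(count, cardinality):
    """Check a count against '1' (exactly one) or a range such as '1..n'."""
    if ".." in cardinality:
        low, high = cardinality.split("..")
        if count < int(low):
            return False
        return high == "n" or count <= int(high)
    return count == int(cardinality)


headers = [
    "metabolite_identification", "mass_to_charge", "retention_time",
    "taxid", "species",
    "ICS1_1", "ICS1_2", "ICS1_3", "Sca6_1", "Sca6_2", "Sca6_3",
]

# Assumed pattern for the sample columns of this particular file.
sample_pattern = r"(ICS1|Sca6)_[1-3]"

n_samples = count_matches(headers, sample_pattern)
print(n_samples, satisfies_cardinality(n_samples, "1..n"))
print(satisfies_cardinality(count_matches(headers, "taxid"), "1"))
```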

proccaserra · 2022-12-19T16:56:15Z

Maintainer

@muehlhaus, is this what you had in mind? Please see the second example, where 2 files are referenced (the MIAPPE TDF and the MetaboLights MAF).
So this 'data matrix header dictionary' would define each of the fields.
As discussed last week, we'd need specific metadata to specify the 'data cube orientation', something that MAGE-OM defined as the biodata cube order (see https://genomebiology.biomedcentral.com/articles/10.1186/gb-2002-3-9-research0046, figure 2).

This brings up the issue of complex header structures that take into account 2 dimensions: bioassay and quantitation type(s) reported for each 'bioassay'.

3 replies

muehlhaus Dec 20, 2022
Author

Hi @proccaserra, this is exactly what I intended. The data dictionary defines each of the fields (headers).

'Data cube orientation' is a valid concern. A possible solution would be to include the orientation as an additional field in the dictionary (row/column major) or we require column major by definition (maybe less favorable).

Regarding the pattern field, I would recommend using only the Identifier field and allowing patterns (regex extensions) in it.
This would allow us to write something like:
//sample_([1-9]|10)$
to match to multiple columns. The advantage would be that we could better use the ISA Process graph by including the resulting sample name and file name. Sample names could then be regex matched and mapped to the column identifier.... sample_1 ...sample_2 (different rows in previous isa tab)

This would also make the description of multiple dimensions possible... => e.g. same identifier different source
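A small Python sketch of this idea follows: a pattern in the Identifier field selects several columns at once, and ISA sample names are then mapped to the columns they match. The column and sample names are invented, and the pattern is used without the leading slashes shown above, which is an assumption about the intended regex.

```python
import re

# Pattern from the Identifier field (leading slashes dropped, see note above).
identifier = re.compile(r"sample_([1-9]|10)$")

# Hypothetical column headers of a result file.
columns = ["gene_id", "sample_1", "sample_2", "sample_10", "sample_11", "pval"]

# Columns described by this single dictionary entry.
matched_columns = [c for c in columns if identifier.match(c)]

# Hypothetical sample names taken from rows of the ISA process table.
isa_sample_names = ["sample_1", "sample_2", "sample_3"]

# Map each ISA sample to its column, if the file contains one.
sample_to_column = {
    name: name for name in isa_sample_names if name in matched_columns
}

print(matched_columns)
print(sample_to_column)
```

Here the sample name and the column header are identical; a real mapping might need a capture group or a naming convention to relate the two.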

proccaserra Dec 21, 2022
Maintainer

@muehlhaus I now got what you meant by 'Identifier'. I got wrong-footed and understood it as a molecular entity identifier or an ontology term identifier.

Note that if users were to use for example Frictionless.io, the data files referenced from an ISA document would be somehow self describing.

Let's continue this discussion and provide more examples. thank you all!

muehlhaus Dec 21, 2022
Author

@proccaserra I think that data dictionary would be useful to link an ISA process with a part of a file even when the file is in a nicely specified format.

Let's assume that columns 1 and 2 of our data matrix hold quantification values (of any kind). I want to know that column 1 was obtained at 25°C while column 2 was obtained at 45°C. This information is stored in the ISA graph, but currently the connection to the columns is not part of the model.

Using frictionless.io to solve this would require a link/definition from frictionless.io to ISA and would just postpone the issue (at least to my understanding). It would also require loading information from both the ISA and the frictionless descriptions.
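The temperature example can be sketched in a few lines of Python. Everything here is hypothetical (file name, column names, dictionary layout); the point is only that a pointer from a sample into a column is enough to recover the experimental condition for that column.

```python
# Hypothetical excerpt of what the ISA graph already records per sample.
isa_parameters = {
    "sample_1": {"temperature": "25 °C"},
    "sample_2": {"temperature": "45 °C"},
}

# The missing link: which column of which file holds each sample's values.
column_pointers = {
    ("matrix.tsv", "column_1"): "sample_1",
    ("matrix.tsv", "column_2"): "sample_2",
}


def conditions_for(file_name, column):
    """Look up the recorded conditions for one column of a result file."""
    sample = column_pointers[(file_name, column)]
    return isa_parameters[sample]


print(conditions_for("matrix.tsv", "column_2"))
```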

kappe-c · 2023-01-09T16:50:37Z

Hello. My (computer scientist) pov.

The following is more or less how I (and a few others) have envisioned the "data dictionary" (not necessarily complying with ISA, actually kind of orthogonal to it).

Identifier	TargetFile	wasGeneratedBy	Attribute(Term)	ObjectType	Label	Comment
target_id	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	gene identifier [NCIT:C48664]	String	Gene ID	Gene identifier, reference to the FASTA
pval	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	p-value [NCIT:C44185]	Decimal	P-Value	Pvalue, has a term
qval	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	q-value [NCIT:C64217]	Decimal	Q-Value	qVal, has a term
test_stat	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	test statistic [OBCS:0000013]	Decimal	Test Statistic	test statistic, has a term
rss	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	residual [STATO:0000234]	Decimal	Residual sum of squares	residual sum of squares, has a term
degrees_free	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	number of degrees of freedom [STATO:0000069]	Integer	Number Of Degrees Of Freedom	degrees of freedom, has a term
mean_obs	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	mean [NCIT:C53319]	Decimal	Mean	mean of observations, needs references
var_obs	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	variance [NCIT:C48918]	Decimal	Variance	variance of observations, needs references
tech_var	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	variance [NCIT:C48918]	Decimal	Technical variance	technical variance? Has a term
sigma_sq	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	variance [NCIT:C48918]	Decimal	Variance	variance of observations, needs references
smooth_sigma_sq	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	smoothed variance [User specific]	Decimal	Smoothed Variance	smoothed variance
final_sigma_sq	runs/kallisto_sleuth/sleuth_dge.csv	workflows/kallisto_sleuth.R	corrected variance [User specific]	Decimal	Corrected Variance	adjusted variance

The following (imagined as an additional sheet in "isa.dataset.xlsx") is less of an extension but more a reuse of ISA to document which physical samples contributed (via some software) to which data files. (The column names are rather ad-hoc, excuse me, I am not much of an ISA expert yet.)

Data File Path	Software	Data File Path 2	Sample Name
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	target_id
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	degrees_free
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	mean_obs
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	var_obs
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	tech_var
assays/Talinum_RNASeq_minimal/dataset/DB_097_CAMMD_CAGATC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq
assays/Talinum_RNASeq_minimal/dataset/DB_099_CAMMD_CTTGTA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq
assays/Talinum_RNASeq_minimal/dataset/DB_103_CAMMD_AGTCAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq
assays/Talinum_RNASeq_minimal/dataset/DB_161_reC3MD_GTCCGC_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq
assays/Talinum_RNASeq_minimal/dataset/DB_163_reC3MD_GTGAAA_L001_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq
assays/Talinum_RNASeq_minimal/dataset/DB_165_re-C3MD_GTGAAA_L002_R1_001.fastq.gz	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq

Lastly, the following pic demonstrates (roughly …) my intended use case: being able to grasp individual "variables" inside data files.

kappe-c · 2023-01-09T16:56:29Z

Oh, to clarify that empty column. There is the concept of "derived" variables, that – in my mock-up – go in different sheets depending on their "level of dependence". E.g.
L1-Derivations

Data File Path	Source Name	Software	Data File Path 2	Sample Name
runs/kallisto_sleuth/sleuth_dge.csv	mean_obs	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq
runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq

and L2-Derivations

Data File Path	Source Name	Software	Data File Path 2	Sample Name
runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	rss
runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	rss
runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	final_sigma_sq
runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	final_sigma_sq
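The "level of dependence" could in principle be computed from the derivation rows instead of being assigned by hand. The Python sketch below assumes that a variable with no recorded inputs has level 0, that a derived variable sits one level above its deepest input, and that the derivations contain no cycles; the function name is made up for this example.

```python
# Derived variable -> input variables, taken from the two sheets above.
derivations = {
    "smooth_sigma_sq": ["mean_obs", "sigma_sq"],
    "rss": ["sigma_sq", "smooth_sigma_sq"],
    "final_sigma_sq": ["sigma_sq", "smooth_sigma_sq"],
}


def derivation_level(variable, derivations):
    """0 for variables without recorded inputs, else 1 + deepest input level."""
    inputs = derivations.get(variable)
    if not inputs:
        return 0
    return 1 + max(derivation_level(name, derivations) for name in inputs)


levels = {name: derivation_level(name, derivations) for name in derivations}
print(levels)
```

With these rules, smooth_sigma_sq lands on level 1 and both rss and final_sigma_sq on level 2, matching the L1 and L2 sheets.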

Ndubu12 · 2023-01-09T17:40:51Z

Nice work. Let me look it up.

HLWeil · 2023-01-27T12:20:52Z

Hey, I also wanted to give my input from the perspective of a tool developer consuming both ISA-Json and ISA-Tab (or rather ISA-XLSX in our case). IMO, extending the description and the process graph so that they point into the data file, rather than stopping at the file itself, is really important for making actual computational use of the data described in ISA.

Process graph pointers

Firstly, I want to touch on the pointers in the process graph, which can then be used to associate sources, samples and e.g. raw data files with specific columns in a derived data file.

Naming of the headers / Integration into ISA-Tab

As was stated above, these process graph pointers might be part of a dataset file. But it would definitely also be necessary to have them as part of assay and study files. This holds not only for computational assays, where e.g. a tool workflow is described using the assay file, but also for the most general case, where the measurement of the assay results in a data file in which different measured samples are merged into a single raw data file and are only distinguishable through their distinct headers.

So these header pairs (file and pointer into file) need to be already included in the assay and study tab files. I think it might be beneficial to give these headers more specific names, leaving less room for interpretation. So building on the example from @kappe-c (link):

Raw Data File Path	Raw Data Identifier	Software	Derived Data File Path	Derived Data Identifier
runs/kallisto_sleuth/sleuth_dge.csv	mean_obs	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq
runs/kallisto_sleuth/sleuth_dge.csv	sigma_sq	sleuth	runs/kallisto_sleuth/sleuth_dge.csv	smooth_sigma_sq

Having Data File Path and Data File Path 2, and also using Source and Sample as headers, is kind of symptomatic of us at DataPLANT using XLSX tables to store the tabular ISA metadata, where no duplicate headers are allowed. But from an ISA-Tab perspective, where object property columns follow the object column, having the column pairs adjacent to each other would be enough to associate them.
The headers I wrote here are meant more as an example of the general naming pattern. Data File Name and Data Pointer would be another possibility.
Are there any good reasons to use Source Name/Sample Name?

Integration into ISA-Json

This is relatively straightforward, I think. The data object currently contains the field name of type string. One could just add a new field identifier or pointer, or replace the field name with two new fields with the same names as in ISA-Tab to be consistent. So the objects in the example above could look like the following:

{
  "filePath": "runs/kallisto_sleuth/sleuth_dge.csv",
  "identifier": "mean_obs",
  "type": "Raw Data File"
}

and

{
  "filePath": "runs/kallisto_sleuth/sleuth_dge.csv",
  "identifier": "smooth_sigma_sq",
  "type": "Derived Data File"
}

The name Data of the object class is still fitting, as it is agnostic as to whether it describes a data file or a data chunk within a file. Only the possible values of the data type field might need to be adjusted. E.g. in the schema we could have

"type": {
    "type": "string",
    "enum": [
        "Raw Data File",
        "Derived Data File",
        "Image File",
        "Raw Data",
        "Derived Data",
        "Image"
    ]
}

or just

"type": {
    "type": "string",
    "enum": [
        "Raw Data",
        "Derived Data",
        "Image"
    ]
}
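As a rough illustration of how a consuming tool might check such data objects, here is a Python sketch that validates the three fields discussed above against the shorter enum. The function name and the exact rules (identifier optional, filePath required) are assumptions for this example and not part of any ISA schema.

```python
# Values from the shorter enum proposed above.
ALLOWED_TYPES = {"Raw Data", "Derived Data", "Image"}


def validate_data_object(obj):
    """Return a list of problems; an empty list means the object looks fine."""
    problems = []
    file_path = obj.get("filePath")
    if not isinstance(file_path, str) or not file_path:
        problems.append("filePath must be a non-empty string")
    if "identifier" in obj and not isinstance(obj["identifier"], str):
        problems.append("identifier must be a string when present")
    if obj.get("type") not in ALLOWED_TYPES:
        problems.append(f"type must be one of {sorted(ALLOWED_TYPES)}")
    return problems


pointer_into_file = {
    "filePath": "runs/kallisto_sleuth/sleuth_dge.csv",
    "identifier": "mean_obs",
    "type": "Raw Data",
}
print(validate_data_object(pointer_into_file))
print(validate_data_object({"filePath": "x.csv", "type": "Raw Data File"}))
```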

Pointer descriptors / Dataset

All of you touched the reasons for and the modelling details of these pointer descriptors nicely. I just want to add a few thoughts about the implementation.

Integration into ISA-Tab

I'm not quite sure what your final thoughts about the cardinality and other specific columns were, but I will build my points here on the minimalistic version @kappe-c proposed (link).
In principle, having the data dictionary in its own file (dataset file) solves the problem of where to put the specific information. The Data File Name and Data Identifier columns together precisely specify the piece of data both in this file and the assay/study files, allowing the connection of the information in both sources.
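The connection between the two sources can be sketched in Python as a lookup on the pair (file name, identifier). The dictionary entries and the process row below reuse values from the tables earlier in this thread; the data layout and the describe function are hypothetical.

```python
FILE = "runs/kallisto_sleuth/sleuth_dge.csv"

# Data dictionary (dataset file), keyed by (file name, identifier).
data_dictionary = {
    (FILE, "mean_obs"): {"attribute": "mean [NCIT:C53319]", "objectType": "Decimal"},
    (FILE, "sigma_sq"): {"attribute": "variance [NCIT:C48918]", "objectType": "Decimal"},
    (FILE, "smooth_sigma_sq"): {
        "attribute": "smoothed variance [User specific]",
        "objectType": "Decimal",
    },
}

# One row of the assay/study table with process graph pointers.
process_row = {
    "Raw Data File Path": FILE,
    "Raw Data Identifier": "mean_obs",
    "Software": "sleuth",
    "Derived Data File Path": FILE,
    "Derived Data Identifier": "smooth_sigma_sq",
}


def describe(row):
    """Attach the dictionary descriptions to the input and output of a row."""
    source = data_dictionary[(row["Raw Data File Path"], row["Raw Data Identifier"])]
    target = data_dictionary[
        (row["Derived Data File Path"], row["Derived Data Identifier"])
    ]
    return source["attribute"], row["Software"], target["attribute"]


print(describe(process_row))
```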
One tricky detail that I see and that needs to be solved is where to register the dataset file. As I see it now, all ISA-related files currently need to be registered in the investigation file, which works as the associating element between all its parts. As dataset files might appear both in studies and assays, there need to be fields in both of them. My first intuition was to just add a file name field to both, like below for study (link):

STUDY

Label	Datatype	Description
Study Identifier	String	A unique identifier, either a temporary identifier supplied by users or one generated by a repository or other database. For example, it could be an identifier complying with the LSID specification.
Study Title	String	A concise phrase used to encapsulate the purpose and goal of the study.
Study Description	String	A textual description of the study, with components such as objective or goals.
Study Submission Date	String formatted as ISO8601 date	The date on which the study is submitted to an archive.
Study Public Release Date	String formatted as ISO8601 date	The date on which the study SHOULD be released publicly.
Study File Name	String formatted as file name or URI	A field to specify the name of the Study Table file corresponding the definition of that Study. There can be only one file per cell.
Study Dataset File Name	String formatted as file name or URI	A field to specify the names of the Dataset files containing metadata about data items.

Unfortunately, this breaks the STUDY section, as all fields are of cardinality 1, but the Dataset File Name would need to be of cardinality n. A similar problem arises for the STUDY ASSAYS section.
Alternatively, it could be its own section, situated either at the top level like the Ontology Source Reference section (https://isa-specs.readthedocs.io/en/latest/isatab.html#ontology-source-reference-section) or as a child of the investigation and/or study section. In this case, one would have to think about which fields/headers to include, as a section with only the paths would look kind of empty (but perfectly fine for me). On the other hand, if we included the 5 - 10 fields specified in the dataset file, what would we need the file for?

DATASET

Label	Datatype	Description
File Name	String formatted as file name or URI	A field to specify the names of the Dataset files containing metadata about data items.

Input on this would be much appreciated.

Integration into ISA-Json

Again, the integration into ISA-Json is IMO more straightforward. My preferred solution would be to just add the additional fields directly to the data object. The two headers specifying the data piece of interest (File Name and Identifier) would already be part of the object as specified above, so it would just get a few more fields. E.g. going with the headers proposed above:

{
  "filePath": "runs/kallisto_sleuth/sleuth_dge.csv",
  "pointer": "mean_obs",
  "type": "Raw Data File",
  "wasGeneratedBy" : "workflows/kallisto_sleuth.R",
  "attribute" : {
    "annotationValue" : "Arithmetic Mean",
    "termSource" : "NCIT",
    "termAccession" : "http://purl.obolibrary.org/obo/NCIT_C53319"
  },
  "objectType" : "Decimal",
  "label" : "Mean"
}

proccaserra · 2023-01-29T19:26:42Z

Maintainer

@HLWeil @kappe-c, thank you for the input and explanation.

@HLWeil : one clarification please. Should your Dataset be understood as the same thing as @kappe-c 's Data dictionary ?

I am still unclear about the following points:

1. Add a link to a 'data dictionary file', associated with the ISA.Raw Data File.
2. Reference the 'data dictionary' from the Study Assay section instead of the Study section.
3. Use a reference to a CWL file to declare processes describing 'ISA.Data Transformations'. This would entail that the ISA Assay Table would not have 'Derived Data File' anymore. This last point was discussed with @muehlhaus during the BH2022.
4. If everyone is fine with 2, then this group should document the implementation pattern in ISA.

@terazus what are your thoughts ?

1 reply

HLWeil Feb 3, 2023

Thanks for your answer, @proccaserra.
In principle, the Dataset is the same as the Data dictionary. In this answer I will use the term data dictionary interchangeably with dataset. (I will put the detailed explanation at the bottom as it might be more distracting than helpful.)

Regarding your other points:

1. I didn't quite get this point. You mean as an additional field in the json, namely dataDictionaryFileName?
2. As implied in my post above (under Pointer descriptors / Dataset -> Integration into ISA-Tab -> text between the two tables), putting the data dictionary file name into the STUDY ASSAYS section still raises the problem of different cardinalities. Each column in this section contains values associated with a specific assay. So to assign multiple data dictionary file names to one assay, you would need to put them in a single cell, e.g. separated by semicolons:

STUDY ASSAYS	-
Study Assay Measurement Type	-
Study Assay Measurement Type Term Accession Number	-
Study Assay Measurement Type Term Source REF	-
Study Assay Technology Type	-
Study Assay Technology Type Term Accession Number	-
Study Assay Technology Type Term Source REF	-
Study Assay Technology Platform	-
Study Assay File Name	a_MyAssay.txt
Study Assay Data Dictionary File Names	d_DataSet1.txt;d_DataSet2.txt;d_DataSet3.txt;d_DataSetN.txt

Not sure if this would be a problem though, as it basically would be identical to the principle applied in the `Study Person Roles`
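Parsing such a multi-valued cell is simple, as the following Python sketch shows. The function name is made up, and trimming whitespace and dropping empty parts are assumptions about how a tolerant parser might behave.

```python
def parse_file_names(cell):
    """Split a ';'-separated cell into file names, ignoring empty parts."""
    return [part.strip() for part in cell.split(";") if part.strip()]


cell = "d_DataSet1.txt;d_DataSet2.txt;d_DataSet3.txt;d_DataSetN.txt"
print(parse_file_names(cell))
```

The weak point of this approach is file names or URIs that themselves contain a semicolon, which a real rule would have to forbid or escape.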

--

Besides making CWL referenceable, I think it would also be good to still have the ability to describe Software Assays, where you would get Derived Data as output.

Explanation for Dataset/Data dictionary:

Initially, the Dataset file consisted of two parts. The first was the data dictionary, which is basically the static description of the entries in a data file. In my post, I called this Pointer descriptors. The second part described how these pointers were connected to previous data or samples in the process graph. We noticed that this is what ISA already does best, so we dropped this part and aimed to include the pointers into the data files directly in the ISA assay and ISA study files (in my post, under the header Process graph pointers).
This then leaves the Dataset being basically equal to the data dictionary. Correct me if I forgot or distorted something, @kappe-c @muehlhaus.

kappe-c · 2023-05-31T09:01:13Z

Hi guys, I just wanted to follow up on this.
Is work being done in this direction? Is more input needed?

5 replies

proccaserra Jun 30, 2023
Maintainer

hi @kappe-c @muehlhaus @HLWeil ,

We've been giving more thoughts about the proposal and we came to the realisation that the specification outlined is outside the scope of ISA. It is complementary though.
The main use case is to provide a data dictionary detailing the headers of a matrix/dataframe/tab-delimited file, but even if we reference these 'data dictionary files' from the Investigation Study Assays section using Study Assay Data Dictionary File Names, we'd have issues devising a parsing rule.

We all understand and agree on the need to describe computational process and results. I don't think we have a generic solution yet.

I agree with @HLWeil about another solution, i.e. adding an entirely new section "DATA ANALYSIS", which would cover the computational workflow executions and describe the input files, the output files and their header definitions (if tabular) or other descriptions (if graphical output), and rules for combining fields. But this sounds very much like a new workflow language.

kappe-c Jul 5, 2023

Thank you for the info @proccaserra ! I guess we will then think about how to concretely advance this topic outside of ISA for now.

ptth222 Feb 21, 2024

@proccaserra I agree that this seems out of the scope of ISA. If I understand correctly what is wanted I'm not sure it could realistically be achieved or would even be a good idea. The way I am understanding this thread is that people want a way to describe how the data is arranged inside the various data files. This would vary greatly depending on the type of file, so I don't really see how a generalized solution could be done except on a file type basis. For instance, you could indicate a format for the file, such as "tabular" and then have a system for describing tabular files, but that gets complicated quickly. Maybe adding some minimal attributes to data files such as "format" and "specification" could be helpful for programmatically identifying well defined file types. For internal use, the comments on data files could be used to implement some of this already.

HLWeil Feb 21, 2024

Hey @ptth222, I agree that this is a difficult endeavour, but a necessary one. Without explicit selection of the file fragments, there is no way to automatically associate final computational results with the experimental factors they are related to.
For standard file formats (like tabular), this has already been specified by the W3C: https://www.w3.org/TR/annotation-model/#fragment-selector
Here's also our current specification on this for reference: nfdi4plants/ARC-specification#93

ptth222 Feb 22, 2024

@HLWeil What you just linked is more reasonable and is quite different from everything else above. I did not know that most of the selector stuff was already worked out. Note that how the selector stuff works is exactly what I was talking about and is done on a file type basis.

I have to disagree that file fragments are the only way to do this, though. The other repository formats I have worked with create their own data structures that you have to conform yours to, so the data ends up directly linked. For instance, if ISA created a new measurement node and you had to extract all the data from the files into these nodes, you would get the associations without file fragments. Again, this is what some other places do (Metabolomics Workbench, for example).

Specifically, what you have done with the Data Format and Data Selector Format looks very good.
Reproducing the example here:

Input [Sample Name]	Output [Data]	Data Format	Data Selector Format
input1	result.csv#col=1	text/csv	https://datatracker.ietf.org/doc/html/rfc7111
input2	result.csv#col=2	text/csv	https://datatracker.ietf.org/doc/html/rfc7111
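
To illustrate how a consumer might resolve these `Output [Data]` values, here is a minimal Python sketch. It assumes only the single-index `row=N` and `col=N` forms of RFC 7111 (ranges, `cell=`, and multiple selections are left out). The helper names `split_fragment` and `select_fragment` are hypothetical, and the CSV content is made up for illustration. Indices are counted from 1, and a header line is counted as row 1, which is my reading of the RFC.

```python
import csv
import io


def split_fragment(ref):
    """Split 'result.csv#col=1' into ('result.csv', 'col', 1)."""
    path, _, fragment = ref.partition("#")
    kind, _, index = fragment.partition("=")
    if kind not in ("row", "col") or not index.isdigit() or int(index) < 1:
        raise ValueError(f"unsupported selector: {fragment!r}")
    return path, kind, int(index)


def select_fragment(csv_text, kind, index):
    """Return the selected row or column as a list of strings (1-based index)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if kind == "row":
        return rows[index - 1]
    return [row[index - 1] for row in rows]


# Made-up file content standing in for result.csv.
RESULT_CSV = "input1,input2\n0.5,0.7\n0.6,0.9\n"

path, kind, index = split_fragment("result.csv#col=2")
column = select_fragment(RESULT_CSV, kind, index)
print(path, column)
```

Keeping the selector parsing separate from the file access means the same split could be reused for other selector formats, each with its own resolver.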

Trying to work out how this would translate to ISA-JSON is interesting. Currently, the JSON doesn't associate entities (source, sample, material) with data files directly, and the selector stuff is a property of the linked entity, not the data file. I think the format needs to go on the data file. Connecting the entities to the file directly is more difficult. I'm thinking maybe a "measurements" list and the objects in the list would have 2 possible forms, either something like a characteristic (to handle cases of a single measurement), or a data file with additional "selector" and "selector format" properties. I think the "selector" has to be a property on its own since data files are given unique ids in the JSON and references within the JSON use those ids. "Selector" might have to be a separate column too to save parsing the file name value for it.

Something like:

{
    "name": "sample1",
    "measurements": [
        {
            "@id": "#data/some_file.csv",
            "selector": "row=2",
            "selectorFormat": "https://datatracker.ietf.org/doc/html/rfc7111"
        },
        {
            "@id": "#measurement/something_unique",
            "value": "100",
            "category": {"@id": "#characteristic_category/protein weight"},
            "unit": {"@id": "#unit/mg"}
        }
    ]
}

This is a little bit different from some of the norms of the ISA-JSON format, but it solves 3 different problems:

Linking entities to data files directly.
Specifying measurements directly without a file.
Specifying file fragments to link an entity to its measurements in a file.
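
To make the three cases concrete, here is a minimal Python sketch of how a reader could tell the entries of such a "measurements" list apart. The property names follow the example in this comment and are not part of the current ISA-JSON schema; `classify_measurement` is a hypothetical helper.

```python
import json

SAMPLE_JSON = """
{
  "name": "sample1",
  "measurements": [
    {"@id": "#data/some_file.csv",
     "selector": "row=2",
     "selectorFormat": "https://datatracker.ietf.org/doc/html/rfc7111"},
    {"@id": "#measurement/something_unique",
     "value": "100",
     "category": {"@id": "#characteristic_category/protein weight"},
     "unit": {"@id": "#unit/mg"}}
  ]
}
"""


def classify_measurement(entry):
    """Return 'file_fragment', 'file', or 'direct' for one measurements entry."""
    if "value" in entry:
        return "direct"            # characteristic-like measurement, no file
    if "selector" in entry:
        if "selectorFormat" not in entry:
            raise ValueError("selector given without selectorFormat")
        return "file_fragment"     # entity linked to part of a data file
    return "file"                  # entity linked to a whole data file


sample = json.loads(SAMPLE_JSON)
for entry in sample["measurements"]:
    print(entry["@id"], "->", classify_measurement(entry))
```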

The only other issue I see with file fragments is that they introduce a new source of error. You can accidentally indicate the wrong row for an entity, for example. Assuming the file also had a way to link the entities, such as a key column, it can then be unclear whether the indicated row is wrong or the key column is wrong. I wouldn't say this is enough to scrap the idea though.
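
When a file does have a key column, the disagreement can at least be detected. Below is a minimal Python sketch, assuming a CSV whose first column holds the entity name and `row=N` selectors counted from 1 with the header as row 1. The file content and the `find_mismatches` helper are made up for illustration. It only reports that the selector and the key column disagree; it cannot tell which of the two is wrong.

```python
import csv
import io


def find_mismatches(csv_text, links, key_col=0):
    """Return (entity, row_number, key_found) for each link whose row key differs.

    links maps an entity name to the 1-based row number from its 'row=N' selector.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    mismatches = []
    for entity, row_number in links.items():
        if not 1 <= row_number <= len(rows):
            mismatches.append((entity, row_number, None))  # row does not exist
            continue
        key = rows[row_number - 1][key_col]
        if key != entity:
            mismatches.append((entity, row_number, key))
    return mismatches


# Made-up data: header in row 1, sample1 in row 2, sample2 in row 3.
DATA = "sample,protein_weight_mg\nsample1,100\nsample2,120\n"
links = {"sample1": 2, "sample2": 2}   # sample2 should point at row 3

print(find_mismatches(DATA, links))
```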
