dealing with repeated arms sequencing runs #23

Open
kmexter opened this issue Sep 27, 2024 · 4 comments

@kmexter
Contributor

kmexter commented Sep 27, 2024

The ARMS samples processed by Genoscope will be ingested into GH (as soon as the logsheets are created, filled in, and harvested).
For some ARMS samples, the sequencing was done twice on the same sample (as far as I can tell it was actually the same sample, not a replicate sample).
You can see this in the ARMS overview Google Sheet https://docs.google.com/spreadsheets/d/1j3yuY5lmoPMo91w6e3kkJ6pmp1X6FVGUtLealuKJ3wE/edit?gid=855411053#gid=855411053: rows 547+587 and rows 539+584 are each the same sample (they have the same sample ID and the same SampReplicate in col P) but a first and a second sequencing run (see column N). The fact that in col 1 these are indicated as being "genoscope batch xx" means that they were processed as part of EMO BON, so these events and samples should be added to the EMO BON ARMS logsheets and get into the EMO BON GH.

So I have 2 questions

  • were these really sequencing runs on the same sample replicate?
  • if yes, then we need to find a way to indicate that one replicate was used to create 2 sequences. That means we have to associate one sample with 2 sequences. My suggestion is to NOT make any changes to the mat_samp_id (which is what we did for the arms-mbon project and was a shitty decision, but it is a fait accompli now) but to deal with it in the ttl files by allowing a one-to-many relationship
  • if no, can someone tell me what sample replicate number these should be then?

Adding @cymon and @cpavloud as this affects you/you are decision makers on this topic

@cpavloud

FYI, sequencing has been done twice for other samples too, not only for ARMS.

Allowing one-to-many relationships probably is the way to go. This is also what ENA does (one sample accession number can be linked to multiple run accession numbers).

By the way, line 547 has MaterialSample-ID "ARMS_Svalbard_S1C_20200806_20220803_MF500_DMSO" while line 587 has a different one "ARMS_Svalbard_S1C_20200806_20220803_MF500"

@kmexter
Contributor Author

kmexter commented Oct 1, 2024

ok, for the ARMS-only data (i.e. the ARMS-MBON GH space) we deal with this by adding _r1 etc. to the ARMS material sample ID for repeated sequencing runs, and _s1 etc. for sample replicates .... yes I know, shitty convention but it is what it is now :-| .
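
A minimal sketch of how the suffix convention described above could be read back out of an ID. The function name `split_arms_id` is hypothetical, and the sketch assumes the suffixes are lowercase, sit at the very end of the ID, and may appear in either order; none of that is confirmed in this thread.

```python
import re

# Assumption: "_r<N>" marks a repeated sequencing run and "_s<N>" marks a
# sample replicate, both appended to the end of the material sample ID.
_SUFFIX = re.compile(r"_(r|s)(\d+)$")


def split_arms_id(sample_id):
    """Return (base_id, sample_replicate, sequencing_run).

    Parts that are not present in the ID come back as None.
    """
    found = {}
    base = sample_id
    while True:
        match = _SUFFIX.search(base)
        if match is None:
            break
        kind, number = match.group(1), int(match.group(2))
        # Keep the first value seen for each kind of suffix.
        found.setdefault(kind, number)
        base = base[: match.start()]
    return base, found.get("s"), found.get("r")
```

For example, `split_arms_id("ARMS_Svalbard_S1C_20200806_20220803_MF500_r2")` would give back the unsuffixed ID together with run number 2 and no replicate number (the `_r2` suffix on that ID is made up for illustration).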

But for the EMO BON ARMS (and other) data we need to deal with this. We can have a one-to-many relationship, that is fine. Questions:

  1. How should we indicate this in the governance files for the omics data? These would be the files in e.g. https://github.com/emo-bon/sequencing-data/tree/main/shipment/batch-001. The file https://github.com/emo-bon/sequencing-data/blob/main/shipment/batch-001/ena-accession-numbers-batch-001.csv currently holds only the project, sample, biosample, and umbrella ENA accession numbers, and the Genoscope ref_code, but the idea was to add the run accession numbers to it as well? But then how do we deal with there being 2 run accessions for some samples, and how do we say which is the first and which is the second?
  2. So perhaps we need a separate file for run accession numbers? In there we can have e.g. source_material_id, ref_code, run_accession_number, comment, and where there is more than one run accession number for a sample, we repeat the row with a different run_accession_number and comment?
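
To make option 2 concrete, here is a rough sketch of what such a file could look like and how the one-to-many relationship falls out of it when read. The column names are the ones proposed above; every ref_code and run accession value is a placeholder, and whether repeated runs share a ref_code is an assumption, not something established in this thread.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents: all ref_code and run accession values are
# placeholders. Repeated runs simply repeat the row with a different
# run_accession_number and comment.
RUNS_CSV = """source_material_id,ref_code,run_accession_number,comment
ARMS_Svalbard_S1C_20200806_20220803_MF500,REF_PLACEHOLDER_A,RUN_PLACEHOLDER_1,first sequencing run
ARMS_Svalbard_S1C_20200806_20220803_MF500,REF_PLACEHOLDER_A,RUN_PLACEHOLDER_2,second sequencing run
SAMPLE_PLACEHOLDER_B,REF_PLACEHOLDER_B,RUN_PLACEHOLDER_3,
"""


def runs_by_sample(csv_text):
    """Group (run_accession_number, comment) pairs under their sample ID."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["source_material_id"]].append(
            (row["run_accession_number"], row["comment"])
        )
    return dict(grouped)


runs = runs_by_sample(RUNS_CSV)
# Samples that were sequenced more than once.
repeated = {sample: r for sample, r in runs.items() if len(r) > 1}
```

With this layout the existing accession file stays one row per sample, and the first/second ordering lives in the comment column (a dedicated run-order column would be an alternative, but that is not something proposed here).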

@kmexter
Contributor Author

kmexter commented Oct 1, 2024

> FYI, sequencing has been done twice for other samples too, not only for ARMS.
>
> Allowing one-to-many relationships probably is the way to go. This is also what ENA does (one sample accession number can be linked to multiple run accession numbers).
>
> By the way, line 547 has MaterialSample-ID "ARMS_Svalbard_S1C_20200806_20220803_MF500_DMSO" while line 587 has a different one "ARMS_Svalbard_S1C_20200806_20220803_MF500"

That should not be! We agreed to use DMSO or ETOH for all samples in year 1 only, so DMSO has to go.
I will go over this with @JustinePa; we will decide whether this "DMSO" needs to be removed, and from where.
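
A small sketch of how IDs that differ only by a preservative suffix could be flagged in a list of MaterialSample-IDs. The helper names are hypothetical, and it assumes the suffixes are written exactly `_DMSO` and `_ETOH` at the end of the ID.

```python
from collections import defaultdict

# Assumption: these are the only two preservative suffixes in use and they
# always appear at the very end of the ID.
PRESERVATIVE_SUFFIXES = ("_DMSO", "_ETOH")


def strip_preservative(sample_id):
    """Return the ID without a trailing preservative suffix, if it has one."""
    for suffix in PRESERVATIVE_SUFFIXES:
        if sample_id.endswith(suffix):
            return sample_id[: -len(suffix)]
    return sample_id


def find_suffix_mismatches(sample_ids):
    """Map each base ID to its variants when more than one spelling exists."""
    variants = defaultdict(set)
    for sample_id in sample_ids:
        variants[strip_preservative(sample_id)].add(sample_id)
    return {base: sorted(ids) for base, ids in variants.items() if len(ids) > 1}
```

Run on the two IDs quoted above, this would report them as two spellings of the same base ID, which is the kind of pair that needs a decision.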

@kmexter
Contributor Author

kmexter commented Oct 22, 2024

and @laurianvm and I will go over this when we add the ARMS to the data model
