Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty counts files #69

Open
renberg opened this issue Sep 22, 2022 · 4 comments
Open

Empty counts files #69

renberg opened this issue Sep 22, 2022 · 4 comments

Comments

@renberg
Copy link

renberg commented Sep 22, 2022

I got the pipeline to work with a previous set of sequencing data but now it is finishing with the following error message, and leaving empty count files in the outputs folder. All that I've changed is the data directory, and an updated experiment.csv file to account for the new data. Any idea where the breakdown might be occurring?

Thanks!

executor >  slurm (36)
[e3/8f8bd3] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[f4/661ab1] process > raw_counts (6)           [100%] 6 of 6 ✔
[bd/e76751] process > filter_counts (6)        [100%] 6 of 6 ✔
[01/6d0d95] process > final_counts (6)         [100%] 6 of 6 ✔
[81/009046] process > dna_rna_merge_counts (3) [100%] 3 of 3 ✔
[50/f5d719] process > dna_rna_merge (3)        [100%] 3 of 3 ✔
[5d/ccf914] process > calc_correlations (2)    [  0%] 0 of 3
[61/5f2759] process > make_master_tables (2)   [ 66%] 2 of 3
Error executing process > 'calc_correlations (1)'

Caused by:
  Process `calc_correlations (1)` terminated with an error exit status (1)

Command executed:

  Rscript /home/~~~~~/MPRAflow/src/plot_perInsertCounts_correlation.R 36CRS_1 NA 10 36CRS_1_1_counts.tsv 1

Command exit status:
  1

Command output:
                    File Replicate Condition
  1 36CRS_1_1_counts.tsv         1   36CRS_1
  [1] "hist"
  [1] name      dna_count rna_count ratio     log2      n_obs_bc 
  <0 rows> (or 0-length row.names)
  logical(0)

Command error:
  
  Attaching package: ‘dplyr’
  
  The following objects are masked from ‘package:stats’:
  
      filter, lag
  
  The following objects are masked from ‘package:base’:
  
      intersect, setdiff, setequal, union
  
  Error in hist.default(as.numeric(data1$n_obs_bc), breaks = 100, xlim = c(0,  : 
    character(0)
  Calls: hist -> hist.default
  In addition: Warning messages:
  1: In min(x) : no non-missing arguments to min; returning Inf
  2: In max(x) : no non-missing arguments to max; returning -Inf
  Execution halted

Work dir:
  /home/~~~~~/MPRAflow/work/45/d569c5d4c05016cf996063983a9986

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
@visze
Copy link
Collaborator

visze commented Sep 27, 2022

I suggest that the count file (36CRS_1_1_counts.tsv) is empty. You can find it here: /home/~~~~~/MPRAflow/work/45/d569c5d4c05016cf996063983a9986

So there is an error happening upstream. So you should check all output files of the previous processes to see where the pipeline creates empty counts. My suggestion is that the assignment did not work (process dna_rna_merge). E.g. when the assignment pickle file contains no associations or the wrong associations (e.g. different names that are not in the design file)

@visze
Copy link
Collaborator

visze commented Oct 11, 2022

Any news here? Could you solve it?

@renberg
Copy link
Author

renberg commented Oct 12, 2022

Hi, yes and no.

I figured out that my barcode reads had extra bases, and I think that was causing it to not be able to match them to barcodes properly. So I trimmed those and ran it again. Now it fails at this point:

executor >  slurm (36)
[93/066772] process > create_BAM (make idx)    [100%] 6 of 6 ✔
[a0/22f4d8] process > raw_counts (6)           [100%] 6 of 6 ✔
[e5/f97b70] process > filter_counts (6)        [100%] 6 of 6 ✔
[1a/495a96] process > final_counts (6)         [100%] 6 of 6 ✔
[81/a4c2a0] process > dna_rna_merge_counts (2) [100%] 3 of 3 ✔
[20/7485c2] process > dna_rna_merge (1)        [100%] 3 of 3 ✔
[82/55e4be] process > calc_correlations (3)    [100%] 1 of 1, failed: 1
[9e/f430f4] process > make_master_tables (3)   [200%] 2 of 1 ✔
Error executing process > 'calc_correlations (1)'

Caused by:
  Missing output file(s) `*_correlation.txt` expected by process `calc_correlations (1)`

Command executed:

  Rscript /home/~/MPRAflow/src/plot_perInsertCounts_correlation.R 36CRS_2 NA 10 36CRS_2_1_counts.tsv 1

Command error:
  
  Attaching package: ‘dplyr’
  
  The following objects are masked from ‘package:stats’:
  
      filter, lag
  
  The following objects are masked from ‘package:base’:
  
      intersect, setdiff, setequal, union

Work dir:
  /home/~/MPRAflow/work/02/01a83548c52f2f6f08770bd1df7500

In a previous post where you aided me I got this error with a different data set and you said it could be ignored and was a result of my not having replicates to calculate correlations with. That is the case with this dataset as well. However, unlike that previous data, the outputs generated by this point don't make any sense. There are large groups of barcodes with identical count results, and in the case of one set of data every RNA count is exactly 25,000 except one that is 50,000 and one that is 100,000. Looking at raw DNA counts showed small whole numbers that were grouped (ie: a handful of barcodes with 4, a handful with 16, a handful with 2) instead of the more continuous distribution we expected.

It is unclear if something is still going wrong in the processing, or if something is weird with our sequencing data itself. Any insights? We are really excited to use this software for some larger scale projects, but so far we haven't gotten it to work. Also happy to correspond by email if you're willing to take the conversation off of your issues page. Thanks in advance for your continued assistance.

@visze
Copy link
Collaborator

visze commented Oct 13, 2022

ok. the first part is because of missing replicates as you already figured out by yourself. But the missing replicates will make the second part very dificult to find out.

This is a really strange behaviour. Maybe you have PCR artefacts. Did you use UMIs? Than you should check the duplication rate.

I am just guessing but I think this large very even numbers like 50,000 or 10,000 are the issue of some counting artefact. Im not sure what exactly causes the issue but e.g. it can be possible that the command uniq -c has an upper limit (e.g. 1000) it will just count the BC 1000 times. And the summing up all barcodes for one oligo you will get BCs per oligo x maximum count.

I really do not know if is an issue wih the assignment (e.g. if most barcodes are assigned to specific oligos) or the with your count data (e.g. PCR artefacts), or your experiment design in general (having elements in taht have an enourmous activity compared to the other.
I think this is a point where a standardized workflow cannot help anymore and you have to go deep into the data, check step by step what happes and if the data reflects your expectations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants