
FIRE on targeted seq data: fiber-locations-shuffled.bed.gz is created empty #7

Strausyatina opened this issue Dec 29, 2023 · 7 comments

@Strausyatina

Hi Mitchell!
We've tried to run FIRE on targeted seq data, and the pipeline fails with "polars.exceptions.NoDataError: empty CSV", because fiber-locations-shuffled.bed.gz is created empty.

A BED file containing the complement of the targeted regions was used as the exclusion set in filtered_and_shuffled_fiber_locations_chromosome.

What could be wrong with our usage of FIRE? Is it suitable for this kind of task?

Config yaml:

ref: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa
ref_name: hg38
n_chunks: 1 # split bam file across x chunks
max_t: 4 # use X threads per chunk
manifest: config/config_targeted_project.tbl # table with samples to process

keep_chromosomes: chr4 # only keep chrs matching this regex.
keep_chromosomes: chr7
keep_chromosomes: chr20
## Force a read coverage instead of calculating it genome wide from the bam file.
## This can be useful if only a subset of the genome has reads.
#force_coverage: 50

## regions to not use when identifying null regions that should not have RE; below are the defaults used automatically for hg38.
excludes:
 - workflow/annotations/hg38.fa.sorted.bed
#- workflow/annotations/hg38.gap.bed.gz
#- workflow/annotations/SDs.merged.hg38.bed.gz

## you can optionally specify a model that is not the default.
# model: models/my-custom-model.dat

##
## only used if training a new model
##
# train: True
# dhs: workflow/annotations/GM12878_DHS.bed.gz # regions of suspected regulatory elements

Example of error log:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, time=100440, gpus=0
Select jobs to execute...
[Thu Dec 28 16:04:53 2023]
rule fdr_table:
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    jobid: 0
    reason: Forced execution
    wildcards: sm=bc2031
    threads: 8
    resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, tmpdir=/tmp, time=100440, gpus=0
        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
        
Activating conda environment: ../../../../../../../home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
[INFO][Time elapsed (ms) 1068]: Reading FIRE file: results/bc2031/fiber-calls/FIRE.bed.gz
/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py:486: DeprecationWarning: `the argument comment_char` for `read_csv` is deprecated. It has been renamed to `comment_prefix`.
  fire = pl.read_csv(
[INFO][Time elapsed (ms) 1082]: Reading genome file: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
[INFO][Time elapsed (ms) 1085]: Reading fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz
[INFO][Time elapsed (ms) 1095]: Reading shuffled fiber locations file: results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz
Traceback (most recent call last):
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 539, in <module>
    defopt.run(main, show_types=True, version="0.0.1")
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/defopt.py", line 356, in run
    return call()
           ^^^^^^
  File "/home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py", line 517, in main
    shuffled_locations = pl.read_csv(
                         ^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/io/csv/functions.py", line 369, in read_csv
    df = pl.DataFrame._read_csv(
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nshaikhutdinov/.local/lib/python3.11/site-packages/polars/dataframe/frame.py", line 784, in _read_csv
    self._df = PyDataFrame.read_csv(
               ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.NoDataError: empty CSV
[Thu Dec 28 16:04:54 2023]
Error in rule fdr_table:
    jobid: 0
    input: results/bc2031/fiber-calls/FIRE.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz, results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz, /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai
    output: results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
    conda-env: /home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
    shell:
        
        python /home/nshaikhutdinov/.cache/snakemake/snakemake/source-cache/runtime-cache/tmpiwuex449/file/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/../scripts/fire-null-distribution.py -v 1 results/bc2031/fiber-calls/FIRE.bed.gz results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa.fai -s results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz -o results/bc2031/FDR-peaks/FIRE.score.to.FDR.tbl
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Index(['bc2029', 'bc2031', 'bc2025', 'bc2027', 'bc2026', 'bc2032', 'bc2030',
       'bc2028'],
      dtype='object', name='sample')

Exclusion bed file:

chr1    1   248956422
chr10   1   133797422
chr11   1   135086622
chr12   1   133275309
chr13   1   114364328
chr14   1   107043718
chr15   1   101991189
chr16   1   90338345
chr17   1   83257441
chr18   1   80373285
chr19   1   58617616
chr2    1   242193529
chr20   1   4680670
chr20   4690391 64444167
chr21   1   46709983
chr22   1   50818468
chr3    1   198295559
chr4    1   3072454
chr4    3077294 190214555
chr5    1   181538259
chr6    1   170805979
chr7    1   140917955
chr7    140927420   159345973
chr8    1   145138636
chr9    1   138394717
chrM    1   16569
chrX    1   156040895
chrY    1   57227415
@mrvollger
Member

keep_chromosomes cannot be specified multiple times like this; it has to be a single (regex) string. I expect only data on chr20 is being kept with this config:

keep_chromosomes: chr4 # only keep chrs matching this regex.
keep_chromosomes: chr7
keep_chromosomes: chr20
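
If a single regex is needed, something along these lines should cover all three (a sketch, not from the original thread; depending on how the regex is applied, the $ anchors may be needed so that chr4 does not also match alt contigs):

keep_chromosomes: chr4$|chr7$|chr20$ # one regex string matching all three chromosomes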

Given how you set up your excludes, I think you can drop keep_chr... entirely.

This might fix the issue, but with targeted data the coverage calculations and filters might be removing everything. For example, I am not sure whether you are getting even coverage between targets; if not, this could make for some strict coverage filters. You can look at the files with the filters applied under the coverage results to see whether they are eliminating all your data.
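
For instance, a quick sanity check along these lines (paths taken from the error log above) shows whether anything survives the filters:

zcat results/bc2031/coverage/filtered-for-coverage/fiber-locations.bed.gz | wc -l
zcat results/bc2031/coverage/filtered-for-coverage/fiber-locations-shuffled.bed.gz | wc -l

If either count is 0, the filters are removing all of your fibers.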

@mrvollger
Member

mrvollger commented Dec 29, 2023

I should note, more generally, that I did not build this with targeted data in mind. That said, @StephanieBohaczuk might have some advice on making a config file, since I think she has done this successfully before?

@ywang285

@mrvollger May I ask how I should specify multiple chromosomes, like chr4, chr7, and chr20, with keep_chromosomes? Thanks!

@StephanieBohaczuk
Contributor

@Strausyatina @mrvollger For targeted data, I usually first filter the bam for the location(s) I'm interested in with samtools view -b <in.bam> "<chromosome coordinates>" > <out.bam> (a command sketch follows the config below), and then I run it through FIRE. I would also recommend using the min_per_acc_peak option, which does peak calling based on % actuation rather than the FDR approach designed for whole-genome probabilities. You may also need to play with the minimum coverage and coverage_within_n_sd settings, depending on the coverage over your targeted regions. Here's an example of a config file that I used for data with targets on multiple chromosomes:

ref: /mmfs1/gscratch/stergachislab/assemblies/hg38.analysisSet.fa
ref_name: hg38
n_chunks: 2 # split bam file across x chunks
max_t: 4 # use X threads per chunk
manifest: config/PS00118_hg38config.tbl # table with samples to process

#keep_chromosomes:

## Force a read coverage instead of calculating it genome wide from the bam file.
## This can be useful if only a subset of the genome has reads.
#force_coverage: 10
min_coverage: 15
coverage_within_n_sd: 100


## regions to not use when identifying null regions that should not have RE; below are the defaults used automatically for hg38.
#excludes:
#- workflow/annotations/hg38.blacklist.ENCFF356LFX.bed.gz
#- workflow/annotations/hg38.gap.bed.gz
#- workflow/annotations/SDs.merged.hg38.bed.gz

## you can optionally specify a model that is not the default.
# model: models/my-custom-model.dat

##
## only used if training a new model
##
# train: True
# dhs: workflow/annotations/GM12878_DHS.bed.gz # regions of suspected regulatory elements

# Filter peaks on a % accessible threshold instead of FDR. Removes FDR filtering.
min_per_acc_peak: 0.20 # enforce 20% of reads in peak are accessible
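
As a concrete sketch of the samtools pre-filtering step described above (the coordinates are illustrative only, inferred from the exclusion BED posted earlier in this thread; substitute your own targets):

# the input bam must be coordinate-sorted and indexed for region queries
samtools index in.bam
samtools view -b in.bam chr4:3072454-3077294 chr7:140917955-140927420 chr20:4680670-4690391 > targets.bam
samtools index targets.bam

The filtered targets.bam then goes into the FIRE manifest as usual.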

P.S. Sorry for the late reply. This went to a strange section of my inbox and I just saw it.

@NurislamSheih
Collaborator

@StephanieBohaczuk Thank you very much for your reply!

Are you using the latest FIRE release, version 0.0.3? I noticed that the previous release, 0.0.2, did not have the options for min_coverage, coverage_within_n_sd, and min_per_acc_peak in the config.yaml file.

We are having difficulty achieving enough targeted-seq coverage on the PacBio platform, but we have over 600X coverage with ONT data, so I'm trying to run FIRE on the ONT data.

I tried the following configurations:

ref: /home/nshaikhutdinov/working_directory/genome_hg38/hg38.fa
ref_name: hg38
n_chunks: 2 # split bam file across x chunks
max_t: 4 # use X threads per chunk
manifest: config/config_all_runs_bc20.tbl # table with samples to process

#keep_chromosomes: chr20 # only keep chrs matching this regex.

## Force a read coverage instead of calculating it genome wide from the bam file.
## This can be useful if only a subset of the genome has reads.
# force_coverage: 30
min_coverage: 15
coverage_within_n_sd: 100

## regions to not use when identifying null regions that should not have RE; below are the defaults used automatically for hg38.
excludes:
- workflow/annotations/hg38.fa.sorted.bed.gz
#- workflow/annotations/hg38.gap.bed.gz
#- workflow/annotations/SDs.merged.hg38.bed.gz

## you can optionally specify a model that is not the default.
# model: models/my-custom-model.dat

##
## only used if training a new model
##
# train: True
# dhs: workflow/annotations/GM12878_DHS.bed.gz # regions of suspected regulatory elements

# Filter peaks on a % accessible threshold instead of FDR. Removes FDR filtering.
min_per_acc_peak: 0.20 # enforce 20% of reads in peak are accessible

It finds fibers across the genome and estimates the median coverage, but it still breaks down in the coverage rule. It feels like the script keeps counting coverage in the regions that I specified as excluded in the config file (see the config above).

[Mon Jan 22 17:09:37 2024]
localrule coverage:
    input: results/all_runs_bc20/coverage/all_runs_bc20.median.chromosome.coverage.bed
    output: results/all_runs_bc20/coverage/all_runs_bc20.median.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.minimum.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.maximum.coverage.txt
    jobid: 2
    reason: Missing output files: results/all_runs_bc20/coverage/all_runs_bc20.median.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.minimum.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.maximum.coverage.txt; Input files updated by another job: results/all_runs_bc20/coverage/all_runs_bc20.median.chromosome.coverage.bed
    wildcards: sm=all_runs_bc20
    resources: mem_mb=204800, mem_mib=195313, disk_mb=4096, disk_mib=3907, tmpdir=/tmp, time=100440, gpus=0

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: ../../../../../../../home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
python /net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/.snakemake/scripts/tmpttpf5k4m.cov.py
Activating conda environment: ../../../../../../../home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_
      chr     start        end  coverage     weight
0    chr1    596666  248874536       1.0  248277870
1   chr10    212206  133604650       1.0  133392444
3   chr11    146477  135065781       1.0  134919304
5   chr12    207935  133237234       1.0  133029299
7   chr13  18173155  114354328       1.0   96181173
9   chr14  18525079  106871298       1.0   88346219
18  chr15  19775575  101966778       1.0   82191203
20  chr16    145790   90175367       1.0   90029577
22  chr17    414699   82690917       1.0   82276218
28  chr18     41804   80038054       1.0   79996250
29  chr19    102879   58564743       1.0   58461864
40   chr2    216919  241783883       1.0  241566964
41  chr20    166109   64323470       1.0   64157361
42  chr21   5222994   46674683       1.0   41451689
43  chr22  10712117   50765139       1.0   40053022
48   chr3     10000  197682234       1.0  197672234
52   chr4    119052  190119703       1.0  190000651
56   chr5     19554  181362024       1.0  181342470
59   chr6    166156  170595345       1.0  170429189
62   chr7    195340  159188755       1.0  158993415
66   chr8    311917  144999162       1.0  144687245
67   chr9    196037  138108618       1.0  137912581
1.0
Traceback (most recent call last):
  File "/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/.snakemake/scripts/tmpttpf5k4m.cov.py", line 49, in <module>
    raise ValueError(
ValueError: Median coverage is 1.0! Did you use the correct reference, or is data missing from most of your genome. If so consider the keep_chromosomes parameter in config.yaml
[Mon Jan 22 17:09:38 2024]
Error in rule coverage:
    jobid: 2
    input: results/all_runs_bc20/coverage/all_runs_bc20.median.chromosome.coverage.bed
    output: results/all_runs_bc20/coverage/all_runs_bc20.median.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.minimum.coverage.txt, results/all_runs_bc20/coverage/all_runs_bc20.maximum.coverage.txt
    conda-env: /home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_

RuleException:
CalledProcessError in file /net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/coverages.smk, line 40:
Command 'source /net/module/sw/fibertools-rs/0.3.2/bin/activate '/home/nshaikhutdinov/FIRE/env/72529d38651d38b3fc44b5aae6fe7a22_'; set -euo pipefail;  python /net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/.snakemake/scripts/tmpttpf5k4m.cov.py' returned non-zero exit status 1.
  File "/net/seq/pacbio/fiberseq_processing/fiberseq/fire_analysis_v0.0.2/fiberseq-fire/workflow/rules/coverages.smk", line 40, in __rule_coverage
  File "/net/module/sw/FIRE/0.0.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run
/home/nshaikhutdinov/FIRE/env/374939a2f8b3af63c7d016754875a76c_/lib/python3.10/site-packages/xgboost/compat.py:36: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import MultiIndex, Int64Index
[Mon Jan 22 17:09:46 2024]
Finished job 34.
27 of 212 steps (13%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

@StephanieBohaczuk
Contributor

Hi @NurislamSheih.

Are you using the latest FIRE release, version 0.0.3? I noticed that the previous release, 0.0.2, did not have the options for min_coverage, coverage_within_n_sd, and min_per_acc_peak in the config.yaml file.

I've been using 0.0.2, but there were a few small updates between 0.0.2 and 0.0.3 that were not tagged as versions. Here's the exact version:

(variantcallers) [bohaczuk@n3304 fiberseq-fire]$ git log -1
commit b06f95195bb75a3014bf9458df2a5e7fad8c1c72 (HEAD -> main, origin/main, origin/HEAD)
Author: Mitchell R. Vollger <[email protected]>
Date:   Tue Oct 31 16:45:06 2023 -0700

    Add a % acctuation threshold

It finds fibers across the genome and estimates the median coverage, but it still breaks down in the coverage rule. It feels like the script keeps counting coverage in the regions that I specified as excluded in the config file (see the config above).

Sorry, but I've never tried using the exclude regions; I thought that was just for training, but @mrvollger would be able to clarify. Instead, I typically filter my bam for the regions of interest with samtools (see my comment above for more details) and then run the filtered bam through the pipeline.

@mrvollger
Member

Exclude is only used when making the FDR null regions, not when calculating the initial coverage profiles. (Did FIRE actually call anything for you? I.e., is the FIRE bam perhaps empty?)
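
One quick way to check, using the call file from the error log earlier in this thread (the grep assumes the file may carry # header lines; adjust if it does not):

zcat results/bc2031/fiber-calls/FIRE.bed.gz | grep -v '^#' | wc -l

If that prints 0, FIRE made no calls and everything downstream will be empty.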

I was not planning on supporting targeted data in FIRE, but after discussing with @StephanieBohaczuk I can see the utility, even though the FDR calculation won't work well.

I'd be happy to review a PR that adds the ability to accept a bed file of target regions to narrow the analysis. If you don't want to tackle this and would like me to do it, I'll need a minimal targeted test case (a small bam and a small genome) that recreates the issues you have been having and that I can run in a few minutes. And then some patience, since I won't be able to get to this right away.
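
For reference, a minimal test case along those lines could be assembled with standard samtools commands (file names and the region are placeholders, not part of FIRE):

# small genome: just one target chromosome
samtools faidx hg38.fa chr20 > chr20.fa
samtools faidx chr20.fa
# small bam: reads over a single target region; coordinates remain valid against chr20.fa
samtools view -b in.bam chr20:4680670-4690391 > test.bam
samtools index test.bam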
