Errors if genome-grist run on Marine metagenomes #241

Open
jeanzzhao opened this issue Nov 27, 2022 · 25 comments

Comments

@jeanzzhao

A few errors occurred during a genome-grist run on marine metagenomes. The excerpts below come from the job's error log:
less ~/assloss/grist/marine21/jobs/grist.j56313129.err

  • SRR9178284, errors in rules samtools_count_wc, bam_to_depth_wc, bam_to_fastq_wc
...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
    jobid: 27756
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
    shell:

        samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
  
Activating conda environment: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
Removing output files of failed job samtools_count_wc since they might be corrupted:
outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
Job failed, going on with independent jobs.
...
Error in rule bam_to_depth_wc:
    jobid: 26167
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
Job failed, going on with independent jobs.
    shell:

        samtools depth -aa outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
...
Error in rule bam_to_fastq_wc:
    jobid: 28547
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
    shell:

        samtools bam2fq outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam | gzip > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
  • Error in rule download_matching_genome_wc
...
downloading genome for ident GCF_000014265.1/Trichodesmium erythraeum IMS101 from NCBI...
[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz

[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz

RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 523, in open
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 561, in error
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
...
  • SRR11922358, error in rule make_mapping_notebook_wc:
...
Error in rule make_mapping_notebook_wc:
    jobid: 90
    output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:

        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb -k genome_grist               -p sample_id SRR11922358 -p render ''               -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb --to html --stdout --no-input             --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR11922358.html

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
  • SRR13449930, error in rule make_mapping_notebook_wc:
...
Error in rule make_mapping_notebook_wc:
    jobid: 167
    output: outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb, outputs.marine21_samples/reports/report-mapping-SRR13449930.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:

        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb -k genome_grist               -p sample_id SRR13449930 -p render ''               -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb --to html --stdout --no-input             --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR13449930.html

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
@carden24

carden24 commented Dec 2, 2022

I am getting the same error: grist cannot download a specific genome. In my case it is GCF_006715245.1.

When I checked the status of the genome on the [NCBI FTP site](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/), the genome was missing!
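To check other accessions the same way, the leading part of the FTP path can be derived from the accession itself. This is a minimal sketch based only on the URL layout visible in the link above; the function name `ncbi_ftp_prefix` is made up for illustration, and the final directory component (accession plus assembly name, e.g. `GCF_006715245.1_ASM671524v1`) cannot be derived this way and still has to be looked up.

```python
import re

NCBI_GENOMES_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def ncbi_ftp_prefix(accession: str) -> str:
    """Return the FTP directory prefix for a GCA_/GCF_ assembly accession.

    Hypothetical helper: it only rebuilds the three-digit path chunks seen
    in the URL above, not the full assembly directory.
    """
    m = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if m is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    db, digits = m.groups()
    # split the nine digits into groups of three: 006715245 -> 006/715/245
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    return f"{NCBI_GENOMES_ROOT}/{db}/{'/'.join(chunks)}/"


print(ncbi_ftp_prefix("GCF_006715245.1"))
```

Browsing the printed directory shows which assembly versions are still published for that accession.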

There I found a message saying that the assembly status is suppressed, so it makes sense that the download fails. My suggestion is to add a line to handle that error. I will try to put together some code.

This is the relevant section in the Snakefile:

# download actual genomes from genbank!
rule download_matching_genome_wc:
    input:
        csvfile = ancient(f'{GENBANK_CACHE}/{{ident}}.info.csv')
    output:
        genome = f"{GENBANK_CACHE}/{{ident}}_genomic.fna.gz"
    run:
        rows = list(load_csv(input.csvfile))
        assert len(rows) == 1
        row = rows[0]
        ident = row['ident']
        assert wildcards.ident.startswith(ident)
        url = row['genome_url']
        name = row['display_name']

        print(f"downloading genome for ident {ident}/{name} from NCBI...",
              file=sys.stderr)
        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)

@carden24

carden24 commented Dec 3, 2022

My solution:
Change the Snakefile, starting at line 1062, from:

        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)

to:

        with open(output.genome, 'wb') as outfp:
            try:
                with urllib.request.urlopen(url) as response:
                    content = response.read()
                    outfp.write(content)
                    print(f"...wrote {len(content)} bytes to {output.genome}",
                          file=sys.stderr)
            # catch only download failures (HTTPError is a subclass of URLError)
            except urllib.error.URLError:
                print(f"Genome not found for {ident}/{name}, skipping it",
                      file=sys.stderr)
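One side effect of the change above is that the output file is opened before the download, so a failed download leaves an empty `.fna.gz` behind. A variation that fetches first and only then writes could look like the sketch below. This is not the genome-grist implementation; `download_genome` and its `opener` parameter are hypothetical names, with `opener` there only so the function can be exercised without network access.

```python
import sys
import urllib.error
import urllib.request


def download_genome(url, dest, ident, opener=urllib.request.urlopen):
    """Fetch `url` completely before creating `dest`.

    Returns True on success and False if the download fails, in which
    case `dest` is never created.
    """
    try:
        with opener(url) as response:
            content = response.read()
    except urllib.error.URLError as exc:  # HTTPError is a subclass
        print(f"Cannot download genome for {ident} from {url}: {exc}",
              file=sys.stderr)
        return False

    # only touch the output path once the content is in hand
    with open(dest, "wb") as outfp:
        outfp.write(content)
    print(f"...wrote {len(content)} bytes to {dest}", file=sys.stderr)
    return True
```

Inside a Snakemake rule, a missing output still counts as a failed job, so the caller would have to raise or keep the identifier out of the workflow altogether rather than silently continue.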

@jeanzzhao
Author

@ctb
Member

ctb commented Dec 4, 2022

...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
jobid: 27756
output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
shell:

    samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

in this one, it's hard to know what the error is because it occurred above the copy/paste - can you try rerunning with the target outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt in place of any of the other targets (gather_reads or summarize_mapping or whatnot)?

Error in rule make_mapping_notebook_wc:
jobid: 90
output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html

I think I fixed this one in #242 which is now released in v0.9.1! So if you pip install -U genome-grist it should run!

I am getting the same error: grist cannot download a specific genome. In my case it is GCF_006715245.1.

Thanks @carden24! I have some ideas here - I don't want to just ignore the missing genomes... more in a bit.

@ctb
Member

ctb commented Dec 4, 2022

Here's one way I'm thinking of supporting "missing" genomes -

#255

I like the idea of requiring that they be added manually (or at least that manual acknowledgement be made).

A different or additional approach would be to suggest downloading them or a replacement manually and making it part of a private database.

@ctb
Member

ctb commented Dec 5, 2022

#255 is maturing. I'd be interested in your thoughts @carden24 @jeanzzhao

@carden24

carden24 commented Dec 5, 2022

I spent some time trying to get the data from other sources. I could not get it from GenBank or GOLD, but it is available from the JGI portal. I assume that this will not be the case for the other genomes, so I am thinking that an alternative is to pick another closely related genome, based maybe on ANI or some other measure of genome similarity.

@ctb
Member

ctb commented Dec 5, 2022

usually the genome has been removed for a good reason. I would probably go use GTDB or NCBI taxonomy to find another genome from the same species.

@carden24

carden24 commented Dec 5, 2022

Yes, totally agree. The criteria for removal from NCBI can vary, and there is no way to know them programmatically.

@ctb
Member

ctb commented Dec 6, 2022

I've just released genome-grist v0.9.2. pip install -U genome-grist should upgrade.

This includes skip_genomes - from the configuration page,

# skip_genomes: identifiers to ignore when they show up in gather output.
# This is useful when the sourmash database contains genomes that are no
# longer present in GenBank because they have been deprecated or suppressed.
#
# Note, in such cases you should try to find a new genome to include in
# a local database!
#
# DEFAULT: []
skip_genomes: []

You can use something like:

skip_genomes:
- GCF_000020205.1

to give it a try.
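Conceptually, a skip list like this partitions the identifiers reported by gather into those to process and those to leave out. The snippet below is only an illustration of that idea, not genome-grist's code; `split_skipped` is a hypothetical name, and exact string matching on the versioned accession is an assumption.

```python
def split_skipped(idents, skip_genomes):
    """Partition identifiers into (kept, skipped), preserving input order."""
    skip = set(skip_genomes)
    kept = [ident for ident in idents if ident not in skip]
    skipped = [ident for ident in idents if ident in skip]
    return kept, skipped


# identifiers as they might appear in gather output
gather_idents = ["GCF_000020205.1", "GCF_000009045.1"]
kept, skipped = split_skipped(gather_idents, ["GCF_000020205.1"])
print("kept:", kept)
print("skipped:", skipped)
```

Reporting the skipped identifiers, rather than dropping them silently, keeps the manual acknowledgement visible in the run output.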

@carden24

carden24 commented Dec 6, 2022

I upgraded grist to 0.9.2 and ran it again, but snakemake is failing because it expects the genome to be downloaded, as required by the rule output. I used the skip_genomes option in the config file and it was read successfully, but the run cannot handle the missing output.

[Tue Dec 6 08:40:45 2022]
rule download_matching_genome_wc:
input: genbank_cache/GCF_006715245.1.info.csv
output: genbank_cache/GCF_006715245.1_genomic.fna.gz
jobid: 144
reason: Missing output files: genbank_cache/GCF_006715245.1_genomic.fna.gz
wildcards: ident=GCF_006715245.1
resources: tmpdir=/tmp

samples: ['Mock_T0_3_S3', 'Mock_T0_2_S2', 'Mock_T0_1_S1']
outdir: grist
base_tempdir: /tmp/tmpf8o7ziq2
['GCF_006715245.1']
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
downloading genome for ident GCF_006715245.1/Bacillus sp. SLBN-3 from NCBI...
Cannot download genome from URL:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/GCF_006715245.1_ASM671524v1_genomic.fna.gz
Is it missing? If so, consider adding 'GCF_006715245.1' to 'skip_genomes' list in config file.
[Tue Dec 6 08:40:45 2022]
Error in rule download_matching_genome_wc:
jobid: 0
input: genbank_cache/GCF_006715245.1.info.csv
output: genbank_cache/GCF_006715245.1_genomic.fna.gz

RuleException:
Exception in line 1077 of /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
Genbank genome not found
File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1077, in __rule_download_matching_genome_wc
File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job download_matching_genome_wc since they might be corrupted:
genbank_cache/GCF_006715245.1_genomic.fna.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-06T084040.101718.snakemake.log

@ctb
Member

ctb commented Dec 6, 2022

hi @carden24 just to confirm, did you add it to skip_genomes in the config file?

skip_genomes:
- GCF_006715245.1

@jeanzzhao
Author

jeanzzhao commented Dec 6, 2022

Should I remove files like GCF_000173095.1.info.csv under genbank_cache/ before starting the re-run?

@carden24

carden24 commented Dec 6, 2022

Yes, in the config file, but I think I needed to clean files before rerunning.
Originally I ran genome-grist, and once it failed because of the missing genome error, I added the skip_genomes option and tried to rerun. I have now removed all the output folders and it works. It seems the IGNORE_IDENT variable is used in earlier steps, and that is why it kept looking for those genomes.
Thanks a lot for the help.

@jeanzzhao
Author

@carden24 could you remind me which files you cleaned? thanks

@carden24

carden24 commented Dec 6, 2022

I removed the genbank_cache folder, the gather one, and the sig one too. Not sure if all of them were required.

@ctb
Member

ctb commented Dec 6, 2022

hmm, that's interesting 😓 it should be downstream of those, although removing them will certainly force recalculation of everything downstream! @jeanzzhao wait a few and I'll see if I can figure out something more precise!

@ctb
Member

ctb commented Dec 7, 2022

Whoops, looks like I messed up the skip_genomes code in #255 - I needed to add it in one more place. Working on a fix in #259. Apologies!

@ctb
Member

ctb commented Dec 7, 2022

Merged #259 and released genome-grist v0.9.3. Please give it a try:

pip install -U genome-grist

@ctb
Member

ctb commented Dec 7, 2022

(you shouldn't need to remove or edit any files to get this to work, @jeanzzhao)

@jeanzzhao
Author

jeanzzhao commented Dec 8, 2022

  • pip install -U genome-grist, v0.9.3., did not remove any previous file, sbatch #58672657, failed

'/home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log'

Building DAG of jobs...
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.

Using shell: /bin/bash
Provided cores: 11
Rules claiming more threads will be scaled down.
Job stats:
job                                 count    min threads    max threads
--------------------------------  -------  -------------  -------------
copy_sample_genomes_to_output_wc       19              1              1
download_matching_genome_wc             8              1              1
make_combined_info_csv_wc              19              1              1
make_gather_notebook_wc                19              1              1
summarize_gather                        1              1              1
total                                  66              1              1

Select jobs to execute...

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472565.1.info.csv
    output: genbank_cache/GCF_000472565.1_genomic.fna.gz
    jobid: 935
    wildcards: ident=GCF_000472565.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000504225.1.info.csv
    output: genbank_cache/GCF_000504225.1_genomic.fna.gz
    jobid: 995
    wildcards: ident=GCF_000504225.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000020205.1.info.csv
    output: genbank_cache/GCF_000020205.1_genomic.fna.gz
    jobid: 1003
    wildcards: ident=GCF_000020205.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472605.1.info.csv
    output: genbank_cache/GCF_000472605.1_genomic.fna.gz
    jobid: 983
    wildcards: ident=GCF_000472605.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000701385.1.info.csv
    output: genbank_cache/GCF_000701385.1_genomic.fna.gz
    jobid: 1737
    wildcards: ident=GCF_000701385.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000173095.1.info.csv
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz
    jobid: 1469
    wildcards: ident=GCF_000173095.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000597705.1.info.csv
    output: genbank_cache/GCF_000597705.1_genomic.fna.gz
    jobid: 2329
    wildcards: ident=GCF_000597705.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000014265.1.info.csv
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz
    jobid: 2919
    wildcards: ident=GCF_000014265.1
    resources: tmpdir=/tmp

Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log

@carden24

carden24 commented Dec 8, 2022

I am getting an error at the make_gather_notebook_wc step. I ran it with a simple sample.

Error in rule make_gather_notebook_wc:
jobid: 1
input: /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb, grist/gather/Mock_T0_3_S3.gather.csv.gz, grist/gather/Mock_T0_3_S3.genomes.info.csv, grist/.kernel.set
output: grist/reports/report-gather-Mock_T0_3_S3.ipynb, grist/reports/report-gather-Mock_T0_3_S3.html
conda-env: /home/mixtures/erick_dev/GC_Test/Test4/.snakemake/conda/3661d3423026d9d473032c65ccc8aec6_
shell:

    papermill /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb grist/reports/report-gather-Mock_T0_3_S3.ipynb -k genome_grist               -p sample_id Mock_T0_3_S3 -p render '' -p outdir grist              --cwd grist/reports/
    python -m nbconvert grist/reports/report-gather-Mock_T0_3_S3.ipynb --to html --stdout --no-input              --ExecutePreprocessor.kernel_name=genome_grist > grist/reports/report-gather-Mock_T0_3_S3.html

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job make_gather_notebook_wc since they might be corrupted:
grist/reports/report-gather-Mock_T0_3_S3.ipynb

This is the folder structure of the grist folder:

grist
├── gather
│   ├── Mock_T0_3_S3.gather.csv.gz
│   ├── Mock_T0_3_S3.gather.out
│   ├── Mock_T0_3_S3.genomes.info.csv
│   ├── Mock_T0_3_S3.known.sig.zip
│   ├── Mock_T0_3_S3.matches.sig.zip
│   ├── Mock_T0_3_S3.prefetch.csv.gz
│   └── Mock_T0_3_S3.unknown.sig.zip
├── genomes
│   ├── GCF_000009045.1_genomic.fna.gz
│   ├── GCF_000009045.1.info.csv
│   ├── GCF_000012905.2_genomic.fna.gz
│   ├── GCF_000012905.2.info.csv
│   ├── GCF_000219605.1_genomic.fna.gz
│   ├── GCF_000219605.1.info.csv
│   ├── GCF_000238915.1_genomic.fna.gz
│   ├── GCF_000238915.1.info.csv
│   ├── GCF_000368145.1_genomic.fna.gz
│   ├── GCF_000368145.1.info.csv
│   ├── GCF_000368685.1_genomic.fna.gz
│   ├── GCF_000368685.1.info.csv
│   ├── GCF_001042485.2_genomic.fna.gz
│   ├── GCF_001042485.2.info.csv
│   ├── GCF_001646745.1_genomic.fna.gz
│   ├── GCF_001646745.1.info.csv
│   ├── GCF_900215245.1_genomic.fna.gz
│   └── GCF_900215245.1.info.csv
├── raw
│   ├── Mock_T0_3_S3_1.fastq.gz
│   └── Mock_T0_3_S3_2.fastq.gz
├── sigs
│   └── Mock_T0_3_S3.trim.sig.zip
└── trim
    ├── Mock_T0_3_S3.trim.fq.gz
    ├── Mock_T0_3_S3.trim.html
    └── Mock_T0_3_S3.trim.json

@ctb
Member

ctb commented Dec 14, 2022

pip install -U genome-grist, v0.9.3., did not remove any previous file, sbatch #58672657, failed

Hi Jean, I took a look at ~assloss/grist/marine44/ and tried running one of your samples as below - so far it's working. I wonder if you "just" need to add more skip_genomes? It's annoying to figure out, I know... I'll seek additional solutions!

samples:
- SRR5915428
outdir: outputs.jean/

sourmash_databases:
- gtdb-rs207.genomic.k31.zip

skip_genomes:
- GCF_000472605.1
- GCF_000504225.1
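Working out which identifiers are failing can be partly automated by scanning the snakemake log. The sketch below assumes the error blocks look like the ones pasted earlier in this thread ("Error in rule download_matching_genome_wc:" followed by an "output:" line under `genbank_cache/`); `failed_download_idents` is a hypothetical helper, not part of genome-grist.

```python
import re

# an error header, then (lazily) the first following "output:" line that
# names a *_genomic.fna.gz file
_FAILED_DOWNLOAD = re.compile(
    r"Error in rule download_matching_genome_wc:.*?"
    r"output: \S*?/(GC[AF]_\d+\.\d+)_genomic\.fna\.gz",
    re.DOTALL,
)


def failed_download_idents(log_text):
    """Return accessions of failed download jobs, unique and in log order."""
    seen = []
    for ident in _FAILED_DOWNLOAD.findall(log_text):
        if ident not in seen:
            seen.append(ident)
    return seen


example_log = """\
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz

Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz
"""

# print a candidate block to review before pasting it into the config
print("skip_genomes:")
for ident in failed_download_idents(example_log):
    print(f"- {ident}")
```

Each accession is worth checking at NCBI before it goes into the config, since a download can also fail for reasons other than a suppressed assembly.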

@jeanzzhao
Author

jeanzzhao commented Dec 14, 2022 via email

@carden24

carden24 commented Apr 5, 2023

I am still having issues with this error. I think there are still some rules that need a check to ignore genomes that cannot be downloaded.

These rules correctly ignore the missing genome specified in the yaml:

download_matching_genome
make_genbank_info_csv
bam_to_depth_wc
minimap_wc
samtools_mpileup_wc
samtools_count_wc
bam_to_fastq_wc

The first rule that raises an error is extract_leftover_reads_wc. I checked its code: it takes the gather csv file as input, but the Python script substract_gather.py does not check for the flagged genomes.

   input:
        csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
        mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),

These other rules also use that csv as input:
make_gather_notebook_wc -> uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> uses papermill and report-mapping.ipynb

A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the Python script, so that it can skip them when loading the list of genomes from the csv.

Line 29:

    with gzip.open(args.gather_csv, "rt") as fp:
        r = csv.DictReader(fp)
        for row in r:
            rows.append(row)
    print(f"...loaded {len(rows)} results total.")

    print("checking input/output pairs:")
    pairs = []
    fail = False
    for row in rows:
        acc = row["name"].split()[0]
        # proposed addition: skip genomes flagged in IGNORE_IDENTS
        if acc in IGNORE_IDENTS:
            print(f"Ignoring {acc}")
            continue

        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        if not os.path.exists(filename):
            print(f"ERROR: input filename {filename} does not exist. Will exit.")
            fail = True
        pairs.append((acc, filename, overlapping, leftover))

I don't know enough about python notebooks to suggest a solution there.
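The idea above of handing the flagged genomes to the script could be sketched with a repeatable command-line option. Everything here is an assumption for illustration: the `--ignore-ident` flag, `build_parser`, and `load_gather_rows` do not exist in genome-grist, and only the `name` column handling is taken from the excerpt.

```python
import argparse
import csv
import gzip


def build_parser():
    """Argument parser with a hypothetical, repeatable --ignore-ident flag."""
    p = argparse.ArgumentParser()
    p.add_argument("gather_csv", help="gzipped gather CSV")
    p.add_argument("--ignore-ident", action="append", default=[],
                   help="accession to skip; may be given several times")
    return p


def load_gather_rows(gather_csv, ignore_idents=()):
    """Load gather rows, dropping those whose accession is flagged."""
    ignore = set(ignore_idents)
    kept = []
    with gzip.open(gather_csv, "rt", newline="") as fp:
        for row in csv.DictReader(fp):
            # the accession is the first whitespace-separated token of 'name'
            acc = row["name"].split()[0]
            if acc in ignore:
                print(f"Ignoring {acc}")
                continue
            kept.append(row)
    return kept
```

A rule could then expand the config list into repeated flags, for example `--ignore-ident GCF_000472605.1 --ignore-ident GCF_000504225.1`, and the script would call `load_gather_rows(args.gather_csv, args.ignore_ident)`.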
