Errors if genome-grist run on Marine metagenomes #241

jeanzzhao · 2022-11-27T16:04:58Z

A few errors occurred during a genome-grist run on marine metagenomes:
less ~/assloss/grist/marine21/jobs/grist.j56313129.err

SRR9178284, errors in rules samtools_count_wc, bam_to_depth_wc, bam_to_fastq_wc:

...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
    jobid: 27756
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
    shell:

        samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
  
Activating conda environment: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
Removing output files of failed job samtools_count_wc since they might be corrupted:
outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
Job failed, going on with independent jobs.
...
Error in rule bam_to_depth_wc:
    jobid: 26167
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
Job failed, going on with independent jobs.
    shell:

        samtools depth -aa outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.depth.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
...
Error in rule bam_to_fastq_wc:
    jobid: 28547
    output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/9048b0b8e113b3e7b4e477f4051b67a7
    shell:

        samtools bam2fq outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam | gzip > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.mapped.fq.gz

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Error in rule download_matching_genome_wc

...
downloading genome for ident GCF_000014265.1/Trichodesmium erythraeum IMS101 from NCBI...
[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz

[Thu Nov 24 07:14:13 2022]
Error in rule download_matching_genome_wc:
    jobid: 0
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz

RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 214, in urlopen
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 523, in open
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 632, in http_response
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 561, in error
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 494, in _call_chain
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/urllib/request.py", line 641, in http_error_default
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
RuleException:
HTTPError in line 1063 of /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
HTTP Error 404: Not Found
  File "/home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1063, in __rule_download_matching_genome_wc
...

SRR11922358, error in rule make_mapping_notebook_wc:

...
Error in rule make_mapping_notebook_wc:
    jobid: 90
    output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:

        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb -k genome_grist               -p sample_id SRR11922358 -p render ''               -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb --to html --stdout --no-input
            --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR11922358.html

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

SRR13449930, error in rule make_mapping_notebook_wc:

...
Error in rule make_mapping_notebook_wc:
    jobid: 167
    output: outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb, outputs.marine21_samples/reports/report-mapping-SRR13449930.html
    conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/98df4d4bacf2028ed0321b61771606f2
    shell:

        papermill /home/zyzhao/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-mapping.ipynb outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb -k genome_grist               -p sample_id SRR13449930 -p render ''               -p outdir outputs.marine21_samples --cwd outputs.marine21_samples/reports/
        python -m nbconvert outputs.marine21_samples/reports/report-mapping-SRR13449930.ipynb --to html --stdout --no-input
            --ExecutePreprocessor.kernel_name=genome_grist > outputs.marine21_samples/reports/report-mapping-SRR13449930.html

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

carden24 · 2022-12-02T23:01:58Z

I am getting the same error: grist cannot download a specific genome. In my case it is GCF_006715245.1.

When I checked the status of the genome on the [NCBI FTP site](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/), the genome was missing!

There I found a message saying that the assembly status is suppressed, so it makes sense that the download fails. My suggestion is to add a line to handle that error. I will try to put some code together.

This is the relevant section in the Snakefile

# download actual genomes from genbank!
rule download_matching_genome_wc:
    input:
        csvfile = ancient(f'{GENBANK_CACHE}/{{ident}}.info.csv')
    output:
        genome = f"{GENBANK_CACHE}/{{ident}}_genomic.fna.gz"
    run:
        rows = list(load_csv(input.csvfile))
        assert len(rows) == 1
        row = rows[0]
        ident = row['ident']
        assert wildcards.ident.startswith(ident)
        url = row['genome_url']
        name = row['display_name']

        print(f"downloading genome for ident {ident}/{name} from NCBI...",
              file=sys.stderr)
        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)

carden24 · 2022-12-03T00:53:37Z

My solution:
Change the Snakefile, starting at line 1062, from:

        with open(output.genome, 'wb') as outfp:
            with urllib.request.urlopen(url) as response:
                content = response.read()
                outfp.write(content)
                print(f"...wrote {len(content)} bytes to {output.genome}",
                      file=sys.stderr)

to:

        with open(output.genome, 'wb') as outfp:
            try:
                with urllib.request.urlopen(url) as response:
                    content = response.read()
                    outfp.write(content)
                    print(f"...wrote {len(content)} bytes to {output.genome}",
                          file=sys.stderr)
            except:
                print(f"Genome not found for {ident}/{name}, skipping it",
                      file=sys.stderr)
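
One caveat with the change above: the output file is opened before the request is made, so a failed download still leaves an empty `.fna.gz` file behind, and catching every exception also hides unrelated problems. A minimal sketch of a narrower variant follows. The helper name `download_genome` and its `opener` parameter are hypothetical (they are not part of genome-grist); the sketch only shows fetching first and writing the file once the download has succeeded.

```python
import sys
import urllib.error
import urllib.request


def download_genome(url, out_path, opener=urllib.request.urlopen):
    """Fetch `url` and write it to `out_path` only if the download succeeds.

    Returns the number of bytes written, or None when the server answers
    with an HTTP error (for example 404 for a suppressed assembly).
    `opener` is injectable so the function can be exercised without network.
    """
    try:
        with opener(url) as response:
            content = response.read()
    except urllib.error.HTTPError as exc:
        # Only HTTP-level failures are handled; other errors still propagate.
        print(f"Cannot download {url}: HTTP {exc.code}", file=sys.stderr)
        return None

    # The output file is created only after the content is in hand.
    with open(out_path, "wb") as outfp:
        outfp.write(content)
    return len(content)
```

Inside a Snakemake rule the declared output would still be missing after a failed download, so the workflow itself would also need to know that the genome is being skipped.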

jeanzzhao · 2022-12-04T15:06:15Z

another genome missing:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/205/GCF_000020205.1_ASM2020v1/

ctb · 2022-12-04T18:19:47Z

...
[Thu Nov 24 07:14:11 2022]
Error in rule samtools_count_wc:
jobid: 27756
output: outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
conda-env: /home/zyzhao/assloss/grist/marine21/.snakemake/conda/50a874ea6fd99a2f81d96884a9de6c9e
shell:
    samtools view -c -F 260 outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.bam > outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

in this one, it's hard to know what the error is because it occurred above the copy/paste - can you try rerunning with the target outputs.marine21_samples/mapping/SRR9178284.x.GCF_902527765.1.count_mapped_reads.txt in place of any of the other targets (gather_reads or summarize_mapping or whatnot)?

Error in rule make_mapping_notebook_wc:
jobid: 90
output: outputs.marine21_samples/reports/report-mapping-SRR11922358.ipynb, outputs.marine21_samples/reports/report-mapping-SRR11922358.html

I think I fixed this one in #242 which is now released in v0.9.1! So if you pip install -U genome-grist it should run!

I am getting the same error, grist cannot download a specific genome. In my case is GCF_006715245.1

Thanks @carden24! I have some ideas here - I don't want to just ignore the missing genomes... more in a bit.

ctb · 2022-12-04T21:59:40Z

Here's one way I'm thinking of supporting "missing" genomes -

#255

I like the idea of requiring that they be added manually (or at least that manual acknowledgement be made).

A different or additional approach would be to suggest downloading them or a replacement manually and making it part of a private database.

ctb · 2022-12-05T15:32:30Z

#255 is maturing. I'd be interested in your thoughts @carden24 @jeanzzhao

carden24 · 2022-12-05T17:38:22Z

I spent some time trying to get the data from other sources. I could not get it from GenBank or GOLD, but it is available from the JGI portal. I assume that this will not be the case for the other genomes, so I am thinking that an alternative is to use another closely related genome, chosen perhaps by ANI or some other measure of genome similarity.

ctb · 2022-12-05T17:40:59Z

usually the genome has been removed for a good reason. I would probably go use GTDB or NCBI taxonomy to find another genome from the same species.
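
A rough sketch of that lookup, assuming you already have a table of (identifier, lineage) pairs, for example derived from a GTDB or NCBI taxonomy file. The function name `same_species_candidates` and the lineage layout (semicolon-separated ranks ending in the species) are assumptions made for illustration, not something genome-grist provides.

```python
def same_species_candidates(target_ident, taxonomy_rows):
    """Return identifiers that share a species-level lineage with target_ident.

    taxonomy_rows: iterable of (ident, lineage) pairs, where lineage is a
    semicolon-separated string whose last entry is the species.
    """
    rows = list(taxonomy_rows)
    lineages = dict(rows)
    target = lineages.get(target_ident)
    if target is None:
        # The removed genome is not in the table, so nothing can be suggested.
        return []
    species = target.split(";")[-1].strip()
    return [
        ident
        for ident, lineage in rows
        if ident != target_ident and lineage.split(";")[-1].strip() == species
    ]
```

Any candidate found this way would still need to be checked by hand before it goes into a local database.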

carden24 · 2022-12-05T19:37:21Z

Yes, totally agree. The criteria for removal from NCBI can vary and there is no way to know programmatically.

ctb · 2022-12-06T14:16:12Z

I've just released genome-grist v0.9.2. pip install -U genome-grist should upgrade.

This includes skip_genomes - from the configuration page,

# skip_genomes: identifiers to ignore when they show up in gather output.
# This is useful when the sourmash database contains genomes that are no
# longer present in GenBank because they have been deprecated or suppressed.
#
# Note, in such cases you should try to find a new genome to include in
# a local database!
#
# DEFAULT: []
skip_genomes: []

You can use something like:

skip_genomes:
- GCF_000020205.1

to give it a try.

carden24 · 2022-12-06T16:44:58Z

I upgraded grist to 0.9.2 and ran it again, but snakemake is failing because it expects the genome to be downloaded, as required by the rule output. I used the skip_genomes option in the config file and it was read successfully, but the run cannot handle the missing output.

`[Tue Dec 6 08:40:45 2022]
rule download_matching_genome_wc:
input: genbank_cache/GCF_006715245.1.info.csv
output: genbank_cache/GCF_006715245.1_genomic.fna.gz
jobid: 144
reason: Missing output files: genbank_cache/GCF_006715245.1_genomic.fna.gz
wildcards: ident=GCF_006715245.1
resources: tmpdir=/tmp

samples: ['Mock_T0_3_S3', 'Mock_T0_2_S2', 'Mock_T0_1_S1']
outdir: grist
base_tempdir: /tmp/tmpf8o7ziq2
['GCF_006715245.1']
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
downloading genome for ident GCF_006715245.1/Bacillus sp. SLBN-3 from NCBI...
Cannot download genome from URL:
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/006/715/245/GCF_006715245.1_ASM671524v1/GCF_006715245.1_ASM671524v1_genomic.fna.gz
Is it missing? If so, consider adding 'GCF_006715245.1' to 'skip_genomes' list in config file.
[Tue Dec 6 08:40:45 2022]
Error in rule download_matching_genome_wc:
jobid: 0
input: genbank_cache/GCF_006715245.1.info.csv
output: genbank_cache/GCF_006715245.1_genomic.fna.gz

RuleException:
Exception in line 1077 of /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile:
Genbank genome not found
File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/Snakefile", line 1077, in __rule_download_matching_genome_wc
File "/home/mixtures/miniconda3/envs/grist/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job download_matching_genome_wc since they might be corrupted:
genbank_cache/GCF_006715245.1_genomic.fna.gz
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-06T084040.101718.snakemake.log`

ctb · 2022-12-06T16:50:43Z

hi @carden24 just to confirm, did you add it to skip_genomes in the config file?

skip_genomes:
- GCF_006715245.1

jeanzzhao · 2022-12-06T18:07:15Z

Should I remove files like GCF_000173095.1.info.csv under genbank_cache/ before starting the re-run?

carden24 · 2022-12-06T18:19:47Z

Yes, I added it to the config file, but I think I needed to clean files before rerunning.
Originally I ran genome-grist and, once it failed because of the missing genome, I added the skip_genomes option and tried to rerun. I have now removed all the output folders and it works. It seems that the IGNORE_IDENT variable is used in earlier steps, which is why it kept looking for those genomes.
Thanks a lot for the help.

jeanzzhao · 2022-12-06T18:23:02Z

@carden24 could you remind me which files you cleaned? thanks

carden24 · 2022-12-06T18:39:56Z

I removed the genbank_cache folder, the gather one, and the sig one too. Not sure if all of them were required.

ctb · 2022-12-06T18:51:27Z

hmm, that's interesting 😓 it should be downstream of those, although removing them will certainly force recalculation of everything downstream! @jeanzzhao wait a few and I'll see if I can figure out something more precise!

ctb · 2022-12-07T14:30:06Z

Whoops, looks like I messed up the skip_genomes code in #255 - I needed to add it in one more place. Working on a fix in #259. Apologies!

ctb · 2022-12-07T14:57:44Z

Merged #259 and released genome-grist v0.9.3. Please give it a try:

pip install -U genome-grist

ctb · 2022-12-07T14:58:07Z

(you shouldn't need to remove or edit any files to get this to work, @jeanzzhao)

jeanzzhao · 2022-12-08T16:50:06Z

pip install -U genome-grist, v0.9.3., did not remove any previous file, sbatch #58672657, failed

'/home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log'

Building DAG of jobs...
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job make_combined_info_csv_wc.
Updating job copy_sample_genomes_to_output_wc.
Updating job copy_sample_genomes_to_output_wc.

Using shell: /bin/bash
Provided cores: 11
Rules claiming more threads will be scaled down.
Job stats:
job                                 count    min threads    max threads
--------------------------------  -------  -------------  -------------
copy_sample_genomes_to_output_wc       19              1              1
download_matching_genome_wc             8              1              1
make_combined_info_csv_wc              19              1              1
make_gather_notebook_wc                19              1              1
summarize_gather                        1              1              1
total                                  66              1              1

Select jobs to execute...

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472565.1.info.csv
    output: genbank_cache/GCF_000472565.1_genomic.fna.gz
    jobid: 935
    wildcards: ident=GCF_000472565.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000504225.1.info.csv
    output: genbank_cache/GCF_000504225.1_genomic.fna.gz
    jobid: 995
    wildcards: ident=GCF_000504225.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000020205.1.info.csv
    output: genbank_cache/GCF_000020205.1_genomic.fna.gz
    jobid: 1003
    wildcards: ident=GCF_000020205.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000472605.1.info.csv
    output: genbank_cache/GCF_000472605.1_genomic.fna.gz
    jobid: 983
    wildcards: ident=GCF_000472605.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000701385.1.info.csv
    output: genbank_cache/GCF_000701385.1_genomic.fna.gz
    jobid: 1737
    wildcards: ident=GCF_000701385.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:13 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000173095.1.info.csv
    output: genbank_cache/GCF_000173095.1_genomic.fna.gz
    jobid: 1469
    wildcards: ident=GCF_000173095.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000597705.1.info.csv
    output: genbank_cache/GCF_000597705.1_genomic.fna.gz
    jobid: 2329
    wildcards: ident=GCF_000597705.1
    resources: tmpdir=/tmp

[Thu Dec  8 08:22:14 2022]
rule download_matching_genome_wc:
    input: genbank_cache/GCF_000014265.1.info.csv
    output: genbank_cache/GCF_000014265.1_genomic.fna.gz
    jobid: 2919
    wildcards: ident=GCF_000014265.1
    resources: tmpdir=/tmp

Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /home/zyzhao/assloss/grist/marine44/.snakemake/log/2022-12-08T082206.462631.snakemake.log

carden24 · 2022-12-08T22:28:07Z

I am getting an error at the make_gather_notebook_wc step. I ran it with a simple sample.

`Error in rule make_gather_notebook_wc:
jobid: 1
input: /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb, grist/gather/Mock_T0_3_S3.gather.csv.gz, grist/gather/Mock_T0_3_S3.genomes.info.csv, grist/.kernel.set
output: grist/reports/report-gather-Mock_T0_3_S3.ipynb, grist/reports/report-gather-Mock_T0_3_S3.html
conda-env: /home/mixtures/erick_dev/GC_Test/Test4/.snakemake/conda/3661d3423026d9d473032c65ccc8aec6_
shell:

    papermill /home/mixtures/miniconda3/envs/grist/lib/python3.9/site-packages/genome_grist/conf/../notebooks/report-gather.ipynb grist/reports/report-gather-Mock_T0_3_S3.ipynb -k genome_grist               -p sample_id Mock_T0_3_S3 -p render '' -p outdir grist              --cwd grist/reports/
    python -m nbconvert grist/reports/report-gather-Mock_T0_3_S3.ipynb --to html --stdout --no-input              --ExecutePreprocessor.kernel_name=genome_grist > grist/reports/report-gather-Mock_T0_3_S3.html

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job make_gather_notebook_wc since they might be corrupted:
grist/reports/report-gather-Mock_T0_3_S3.ipynb
`

This is the folder structure of the grist folder:

grist
├── gather
│   ├── Mock_T0_3_S3.gather.csv.gz
│   ├── Mock_T0_3_S3.gather.out
│   ├── Mock_T0_3_S3.genomes.info.csv
│   ├── Mock_T0_3_S3.known.sig.zip
│   ├── Mock_T0_3_S3.matches.sig.zip
│   ├── Mock_T0_3_S3.prefetch.csv.gz
│   └── Mock_T0_3_S3.unknown.sig.zip
├── genomes
│   ├── GCF_000009045.1_genomic.fna.gz
│   ├── GCF_000009045.1.info.csv
│   ├── GCF_000012905.2_genomic.fna.gz
│   ├── GCF_000012905.2.info.csv
│   ├── GCF_000219605.1_genomic.fna.gz
│   ├── GCF_000219605.1.info.csv
│   ├── GCF_000238915.1_genomic.fna.gz
│   ├── GCF_000238915.1.info.csv
│   ├── GCF_000368145.1_genomic.fna.gz
│   ├── GCF_000368145.1.info.csv
│   ├── GCF_000368685.1_genomic.fna.gz
│   ├── GCF_000368685.1.info.csv
│   ├── GCF_001042485.2_genomic.fna.gz
│   ├── GCF_001042485.2.info.csv
│   ├── GCF_001646745.1_genomic.fna.gz
│   ├── GCF_001646745.1.info.csv
│   ├── GCF_900215245.1_genomic.fna.gz
│   └── GCF_900215245.1.info.csv
├── raw
│   ├── Mock_T0_3_S3_1.fastq.gz
│   └── Mock_T0_3_S3_2.fastq.gz
├── sigs
│   └── Mock_T0_3_S3.trim.sig.zip
└── trim
    ├── Mock_T0_3_S3.trim.fq.gz
    ├── Mock_T0_3_S3.trim.html
    └── Mock_T0_3_S3.trim.json

ctb · 2022-12-14T03:46:12Z

pip install -U genome-grist, v0.9.3., did not remove any previous file, sbatch #58672657, failed

Hi Jean, I took a look at ~assloss/grist/marine44/ and tried running one of your samples as below - so far it's working. I wonder if you "just" need to add more skip_genomes? It's annoying to figure out, I know... I'll seek additional solutions!

samples:
- SRR5915428
outdir: outputs.jean/

sourmash_databases:
- gtdb-rs207.genomic.k31.zip

skip_genomes:
- GCF_000472605.1
- GCF_000504225.1
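
Since working out which identifiers to add is tedious by hand, one option is to scan the snakemake error output for failed `download_matching_genome_wc` jobs and collect the identifiers from their `output:` lines. The sketch below assumes the log layout pasted earlier in this thread; `failed_download_idents` and `as_skip_genomes_yaml` are hypothetical helpers, not part of genome-grist. A failed download can also be a transient network problem, so each identifier is worth checking on NCBI before it goes into `skip_genomes`.

```python
import re

ERROR_HEADER = "Error in rule download_matching_genome_wc:"
# Matches e.g. "output: genbank_cache/GCF_000173095.1_genomic.fna.gz"
OUTPUT_RE = re.compile(r"output:\s*\S*?([A-Z]{3}_\d+\.\d+)_genomic\.fna\.gz")


def failed_download_idents(log_lines):
    """Collect identifiers of genomes whose download job failed, in log order."""
    idents = []
    in_error_block = False
    for line in log_lines:
        if ERROR_HEADER in line:
            in_error_block = True
            continue
        if in_error_block:
            match = OUTPUT_RE.search(line)
            if match:
                if match.group(1) not in idents:
                    idents.append(match.group(1))
                in_error_block = False
            elif not line.strip():
                # A blank line ends the error block without an output line.
                in_error_block = False
    return idents


def as_skip_genomes_yaml(idents):
    """Format identifiers as a skip_genomes block for the config file."""
    return "\n".join(["skip_genomes:"] + [f"- {ident}" for ident in idents])
```

For example, `print(as_skip_genomes_yaml(failed_download_idents(open(path))))`, with `path` pointing at the job's error file, prints a block that can be pasted into the config after review.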

jeanzzhao · 2022-12-14T15:05:57Z

Hi Titus,

- I realized that I did not have `rs207` in the folder when I changed `conf.yml` to `rs207`.
- `curl -L https://osf.io/w4bcm/download -o gtdb-rs207.genomic-reps.k31.sbt.zip`
- Re-ran as sbatch #58926209; it failed after ~9h with a different error, in rule make_combined_info_csv_wc. See this for details: https://hackmd.io/DOWP1qUzTCqdihYOSyp5Zg?view#12922

-Jean


carden24 · 2023-04-05T04:23:52Z

I am still having issues with this error. I think that there are still some rules that need to incorporate a check to ignore genomes that cannot be downloaded.

These rules correctly ignore the missing genome specified in the yaml:

download_matching_genome
make_genbank_info_csv
bam_to_depth_wc
minimap_wc
samtools_mpileup_wc
samtools_count_wc
bam_to_fastq_wc

The first rule that raises an error is extract_leftover_reads_wc. I checked its code: it takes the gather CSV file as input, but the Python script subtract_gather.py does not check for the flagged genomes:

    input:
        csv = f'{outdir}/gather/{{sample}}.gather.csv.gz',
        mapped = Checkpoint_GatherResults(f"{outdir}/mapping/{{sample}}.x.{{ident}}.mapped.fq.gz"),

These other rules also use that CSV as input:
make_gather_notebook_wc -> uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> uses papermill and report-mapping.ipynb

A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the Python script, so that it can skip them when loading the list of genomes from the CSV:

Line 29:

    with gzip.open(args.gather_csv, "rt") as fp:
        r = csv.DictReader(fp)
        for row in r:
            rows.append(row)
    print(f"...loaded {len(rows)} results total.")

    print("checking input/output pairs:")
    pairs = []
    fail = False
    for row in rows:
        acc = row["name"].split()[0]
        # proposed addition: skip genomes listed in IGNORE_IDENTS
        if acc in IGNORE_IDENTS:
            print(f"Ignoring {acc}")
            continue

        filename = f"{outdir}/mapping/{sample_id}.x.{acc}.mapped.fq.gz"
        overlapping = f"{outdir}/mapping/{sample_id}.x.{acc}.overlap.fq.gz"
        leftover = f"{outdir}/mapping/{sample_id}.x.{acc}.leftover.fq.gz"
        if not os.path.exists(filename):
            print(f"ERROR: input filename {filename} does not exist. Will exit.")
            fail = True
        pairs.append((acc, filename, overlapping, leftover))

I don't know enough about python notebooks to suggest a solution there.
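
The filtering step could also live in a small shared loader, so that the script and, potentially, the notebooks read the gather CSV the same way. This is only a sketch: `load_gather_rows` and its `ignore_idents` parameter are hypothetical names, and it assumes the accession is the first whitespace-separated token of the `name` column, as in the snippet above.

```python
import csv
import gzip


def load_gather_rows(gather_csv, ignore_idents=()):
    """Load rows from a gzipped gather CSV, dropping ignored accessions."""
    ignore = set(ignore_idents)
    kept = []
    with gzip.open(gather_csv, "rt", newline="") as fp:
        for row in csv.DictReader(fp):
            acc = row["name"].split()[0]
            if acc in ignore:
                print(f"Ignoring {acc}")
                continue
            kept.append(row)
    return kept
```

A notebook could call the same function with the list passed in as a papermill parameter, although how the notebooks actually load their data is not shown in this thread.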

carden24 mentioned this issue Dec 2, 2022: quickstart tutorial command throws attribute error #237 (Closed)

ctb mentioned this issue Dec 4, 2022: [MRG] add code to support skipping genomes by ident #255 (Merged)

ctb mentioned this issue Dec 7, 2022: [MRG] update ignore_ident code for later use too #259 (Merged)

bcpd mentioned this issue Apr 16, 2023: add check to ignore genome(s) that cannot be up downloaded #277 (Open)
