update module: diamond/blastp #7350

erikrikarddaniel · 2025-01-23T09:17:39Z

Is there an existing module for this?

I have searched for the existing module

Is there an open PR for this?

I have searched for existing PRs

Is there an open issue for this?

I have searched for existing issues

Are you going to work on this?

If I'm planning to work on this module, I added myself to the Assignees to facilitate tracking who is working on the module

I have some issues with the current module, some easy to solve, some that need a discussion since they potentially change the interface of the module.

I need to use the --include-lineage parameter supported by Diamond 2.10 but not the current 2.8. Easy to fix of course.
Diamond supports reading gzipped files, so the unzipping of gzipped input seems unnecessary and a waste of space.
Diamond also supports gzipping output files via --compress 1. Seems like a good feature to support, but is made difficult by the current third parameter being the file suffix which in turn decides the output format. My suggestion is to change the third param to the numerical output format and set the suffix from this instead. If $args contains --compress 1, a .gz can be added to the file suffix (except for .daa). (Overall, I also find the use of a suffix to decide the output format a bit too implicit for my taste. I definitely consider the blast output format a .tsv and would use that if I didn't know about the taxonomy (102) format.)
In my experience, Diamond uses cpus quite efficiently and tends to require a lot of memory, so I'd suggest labelling the process process_high instead of medium.
You're using find to get the path of the input .dmnd but I don't understand why since this is param two to the module.
I think it would be good to include `${meta2.id}" in the default output file name as that's likely (it is in my case at least) the name of the database.

Update: I realize I need to match output channels from this module with the database channel (meta2) in my pipeline before calling the next module, i.e. join the output of this module with a channel that is indexed with database names (meta2.id). As I see this, I can either return both meta objects or add a field (e.g. db) containing meta2.id to the meta object before returning it. An alternative is perhaps to modify meta.id to contain the database name before calling this module.

I need a module like my description above for #metatdenovo, so I've made a local while we discuss. At the moment it looks like the below but hasn't been properly tested yet (i.e. WIP). I'm happy to contribute this to nf-core/modules if and when we agree.

    tag "$meta.id"
    label 'process_high'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/diamond:2.1.10--h5ca1c30_3':
        'biocontainers/diamond:2.1.10--h5ca1c30_3' }

   input:
    tuple val(meta) , path(fasta)
    tuple val(meta2), path(db)
    val outfmt
    val blast_columns

    output:
    tuple val(meta), path('*.blast*'), optional: true, emit: blast
    tuple val(meta), path('*.xml*')  , optional: true, emit: xml 
    tuple val(meta), path('*.txt*')  , optional: true, emit: txt 
    tuple val(meta), path('*.daa')   , optional: true, emit: daa 
    tuple val(meta), path('*.sam*')  , optional: true, emit: sam 
    tuple val(meta), path('*.tsv*')  , optional: true, emit: tsv 
    tuple val(meta), path('*.paf*')  , optional: true, emit: paf 
    path "versions.yml"              , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}.${meta2.id}"
    def columns = blast_columns ? "${blast_columns}" : ''
    switch ( outfmt ) { 
        case 0:   out_ext = "blast"; break
        case 5:   out_ext = "xml";   break
        case 6:   out_ext = "txt";   break
        case 100: out_ext = "daa";   break
        case 101: out_ext = "sam";   break
        case 102: out_ext = "tsv";   break
        case 103: out_ext = "paf";   break
        default:
            outfmt = 6;
            out_ext = 'txt';
            log.warn("Unknown output file format provided (${out_ext}): selecting DIAMOND default of tabular BLAST output (txt)");
            break
    }   
    if ( args =~ /--compress\s+1/ ) out_ext += '.gz'
    """ 
    diamond \\
        blastp \\
        --threads ${task.cpus} \\
        --db $db \\
        --query ${fasta} \\
        --outfmt ${outfmt} ${columns} \\
        ${args} \\
        --out ${prefix}.${out_ext}

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        diamond: \$(diamond --version 2>&1 | tail -n 1 | sed 's/^diamond version //')
    END_VERSIONS
    """

The text was updated successfully, but these errors were encountered:

mberacochea · 2025-01-23T16:13:21Z

Small comment, switch statements are "deprecated". Nextflow vscode docs - The Nextflow language specification does not support switch statements. Use if-else statements instead

vagkaratzas · 2025-01-28T09:22:45Z

I made a PR (#7381) with most of the requested changes that we can work on before merging.
I left out comment number 3. I get that the original code wants to explicitly know the output format of the generated file in order to capture the output properly. Without --outfmt I think that the output is just redirected to the console, so we can't leave it empty as default.
Why can't the .daa output be compressed? This will create a problem with this line I guess
if ( args =~ /--compress\s+1/ ) out_ext += '.gz' and we won't be able to capture the output properly.
If we agree to keep the 3rd argument outfmt, I would stick with the names of the extensions instead of numbers (as it is), so it is more clear to the new users.
I am open to more opinions, to decide the best way to update the module.

erikrikarddaniel · 2025-01-28T09:39:52Z

I made a PR (#7381) with most of the requested changes that we can work on before merging.

Oh, that's kind of you. I meant to do this after we had discussed things!

I left out comment number 3. I get that the original code wants to explicitly know the output format of the generated file in order to capture the output properly. Without --outfmt I think that the output is just redirected to the console, so we can't leave it empty as default.

Sure, the output format needs to be set. See my comment below.

Why can't the .daa output be compressed? This will create a problem with this line I guess if ( args =~ /--compress\s+1/ ) out_ext += '.gz' and we won't be able to capture the output properly.

I'm not sure if it works to provide a --compress 1 if the output format is daa, and I'm even less sure if it would make a difference. In any case, if a user sets this in a personal config and it fails there will be an error message. I think the code logic can stay, i.e. we add a .gz even to a .daa file.

If we agree to keep the 3rd argument outfmt, I would stick with the names of the extensions instead of numbers (as it is), so it is more clear to the new users. I am open to more opinions, to decide the best way to update the module.

This is a matter of taste, I suppose. Personally, I find it much more explicit to look up the output formats supported by the tool and call the module with the number, and let the pipeline calculate a filename than let the filename suffix decide the output format. (As I mention above, the .tsv is confusing IMO as more than one format is a tab separated file. As it is now, I need to check the module code to understand what the different formats mean.) Also, if the user has set --compress 1 in a config, the suffix won't be what out_fmt is set to which I find confusing. I understand this breaks the interface for other pipelines.

erikrikarddaniel · 2025-01-28T09:54:58Z

@vagkaratzas I'm waiting a moment with the review to let you reply. No hurry, I have a local module for the time being.

vagkaratzas · 2025-01-28T10:41:37Z

Ok, I agree with the --compress mode and I made the respective changes in the module. I also added a relevant nf-test to check that this works as intended. Feel free to look/comment at the PR.

For the outfmt, I suggest we get some more input/opinions from others before making any changes. The module is not currently being used by any nf-core pipelines, so I don't think we will break anything ;)

erikrikarddaniel added the update module label Jan 23, 2025

erikrikarddaniel assigned erikrikarddaniel, jfy133 and vagkaratzas Jan 23, 2025

jfy133 removed their assignment Jan 23, 2025

erikrikarddaniel added the question Further information is requested label Jan 25, 2025

vagkaratzas mentioned this issue Jan 28, 2025

Diamond/blastp updates #7381

Merged

17 tasks

vagkaratzas closed this as completed in #7381 Jan 30, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update module: diamond/blastp #7350

update module: diamond/blastp #7350

erikrikarddaniel commented Jan 23, 2025 •

edited

Loading

mberacochea commented Jan 23, 2025

vagkaratzas commented Jan 28, 2025

erikrikarddaniel commented Jan 28, 2025

erikrikarddaniel commented Jan 28, 2025

vagkaratzas commented Jan 28, 2025

update module: diamond/blastp #7350

update module: diamond/blastp #7350

Comments

erikrikarddaniel commented Jan 23, 2025 • edited Loading

Is there an existing module for this?

Is there an open PR for this?

Is there an open issue for this?

Are you going to work on this?

mberacochea commented Jan 23, 2025

vagkaratzas commented Jan 28, 2025

erikrikarddaniel commented Jan 28, 2025

erikrikarddaniel commented Jan 28, 2025

vagkaratzas commented Jan 28, 2025

erikrikarddaniel commented Jan 23, 2025 •

edited

Loading