added sample irida_next sample field option #140

Open
wants to merge 32 commits into
base: dev

Conversation

mattheww95
Collaborator

Added support for the irida_next sample id.

github-actions bot commented Oct 24, 2024

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 899e35b

+| ✅ 229 tests passed       |+
#| ❔  32 tests were ignored |#
!| ❗   4 tests had warnings |!

❗ Test warnings:

  • files_exist - File not found: conf/igenomes_ignored.config
  • nextflow_config - nf-validation has been detected in the pipeline. Please migrate to nf-schema: https://nextflow-io.github.io/nf-schema/latest/migration_guide/
  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • schema_lint - Schema $id should be https://raw.githubusercontent.com/phac-nml/mikrokondo/master/nextflow_schema.json
    Found https://raw.githubusercontent.com/phac-nml/mikrokondo/main/nextflow_schema.json

❔ Tests ignored:

  • files_exist - File is ignored: CODE_OF_CONDUCT.md
  • files_exist - File is ignored: assets/nf-core-mikrokondo_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-mikrokondo_logo_light.png
  • files_exist - File is ignored: docs/images/nf-core-mikrokondo_logo_dark.png
  • files_exist - File is ignored: .github/ISSUE_TEMPLATE/config.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: docs/output.md
  • files_exist - File is ignored: docs/README.md
  • files_exist - File is ignored: docs/usage.md
  • nextflow_config - Config variable ignored: manifest.name
  • nextflow_config - Config variable ignored: manifest.homePage
  • nextflow_config - Config variable ignored: params.max_cpus
  • files_unchanged - File does not exist: CODE_OF_CONDUCT.md
  • files_unchanged - File ignored due to lint config: LICENSE or LICENSE.md or LICENCE or LICENCE.md
  • files_unchanged - File ignored due to lint config: .github/CONTRIBUTING.md
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/bug_report.yml
  • files_unchanged - File does not exist: .github/ISSUE_TEMPLATE/config.yml
  • files_unchanged - File ignored due to lint config: .github/ISSUE_TEMPLATE/feature_request.yml
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File does not exist: assets/nf-core-mikrokondo_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-mikrokondo_logo_light.png
  • files_unchanged - File does not exist: docs/images/nf-core-mikrokondo_logo_dark.png
  • files_unchanged - File does not exist: docs/README.md
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/mikrokondo/mikrokondo/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2024-11-12 20:17:26

@mattheww95
Collaborator Author

If these tests pass, a sample with the name .iridanext_output. should be passed as a sample name to verify it is valid and data passes through.

@mattheww95 mattheww95 marked this pull request as ready for review October 29, 2024 18:42
@kylacochrane left a comment

Great work Matthew 😸
I don’t have any specific comments - this sample_name solution looks solid to me. I tried adding a helper function to simplify the inx_string_suffix extraction logic in updated_samples within main.nf, but it ended up making things more complicated than expected, haha!

@@ -796,4 +796,66 @@ nextflow_pipeline {
}
}

test("Test Stupid Name in Input Sheet") {
tag "from_assemblies_stupidnames"


😆 Great test name

Member

@apetkau left a comment

This looks great Matthew. Thanks so much for your work on including sample names 😄

I have a few suggestions and comments for you (given in-line below).

@@ -47,6 +54,6 @@
"unique": true
}
},
- "required": ["sample"]
+ "required": ["sample_name"]
Member

The key sample should still be required as in IRIDA Next it contains the IRIDA Next identifiers. The key sample_name should be optional.

Collaborator Author

ahh, I see sorry I misunderstood what was being requested

Collaborator Author

fixed in: 6cb6d8c

Additional commits were related to updating test file column names.

Contributor

sample is required, but you don't need to swap the item properties as was done in 6cb6d8c; it was correct before.

Quick Summary
sample is the IRIDA-ID column of the samplesheet so it has to be unique, and is required. This is why we named the meta meta.irida_id (or in your case meta.external_id).

sample_name is an optional column to simply rename file outputs, or to use in results so that the user can interpret them better. This is why @kylacochrane introduced this at the start of the workflow (I did it for all other pipelines too), which is basically:

if (!meta.id) {
    meta.id = meta.irida_id
}

This means that if it is run locally, it'll default to the old "just use the sample column" behaviour and not break anything.
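
As a rough illustration of the fallback described above, here is a minimal Python sketch (not the pipeline's Groovy code). The function name `resolve_id` and the example identifiers are hypothetical; the only behaviour taken from the comment is "use `sample_name` when given, otherwise fall back to `sample`".

```python
from typing import Optional


def resolve_id(sample: str, sample_name: Optional[str] = None) -> str:
    """Return the identifier used to name outputs.

    `sample` is the required, unique samplesheet column;
    `sample_name` is the optional display name.
    """
    # Treat a missing or blank sample_name as "not provided".
    if sample_name is None or not sample_name.strip():
        return sample
    return sample_name


print(resolve_id("INX-0001"))               # INX-0001
print(resolve_id("INX-0001", "isolate_A"))  # isolate_A
```

A samplesheet without a `sample_name` column therefore behaves exactly as before, which is the point made for local (non-IRIDA Next) runs.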

Member

Thanks for the comment Steven. This is correct. Could you swap back sample and sample_name Matthew? But make it so that sample is the one that is required?

Collaborator Author

Can you clarify what you mean by swap back sample and sample_name? Do you mean just within the schema_input.json or for the whole pipeline?

Contributor

Yes exactly! Just for the schema_input.json

Collaborator Author

all is made well in this commit: 71260a9

},
"sample_name": {
"type": "string",
"pattern": "^[^\\.]\\S+$",
Member

I think you should remove any restrictions on sample_name and instead follow a similar pattern as this block of code to replace any restricted characters with underscores _:

https://github.com/phac-nml/snvphylnfc/blob/f1e5fae76af276acf0a8c98174978cb21ca5d7e0/workflows/snvphylnfc.nf#L98-L109

The reason being that restricting sample_name means that mikrokondo will fail to run if sample names don't match the above pattern (and spaces and periods are allowed in sample names in IRIDA Next). Allowing all patterns through but cleaning them up in the workflow code means that mikrokondo will still run for samples with any name.
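
To make the "accept everything, clean up in the workflow" idea concrete, here is a small Python sketch. It is not the linked SNVPhyl code; the allowed character set (letters, digits, `_`, `.`, `-`) is an assumption borrowed from the Groovy pattern quoted later in this review, and `clean_sample_name` is a hypothetical helper name.

```python
import re

# Assumed allow-list: letters, digits, underscore, period and hyphen.
# Anything else is treated as a restricted character.
_RESTRICTED = re.compile(r"[^A-Za-z0-9_.\-]")


def clean_sample_name(name: str) -> str:
    """Replace every restricted character with an underscore."""
    return _RESTRICTED.sub("_", name)


print(clean_sample_name("my sample/1.fastq"))  # my_sample_1.fastq
print(clean_sample_name("isolate-A_01"))       # isolate-A_01
```

Because the name is rewritten rather than rejected, a samplesheet row with spaces or slashes in `sample_name` would still run instead of failing schema validation.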

Collaborator Author

I added this check to verify that the sample name does not start with a period, as there are issues with nf-prov later on when it aggregates files for the provenance reports. That was my intention.

But with your clarification above about sample_name vs sample, I think I can revert this change.

Collaborator Author

fixed here: 71260a9

default_samp_suffix = "_flat_sample.json"
parser = argparse.ArgumentParser("Table Summary")
parser.add_argument("-f", "--file-in", help="Path to the mikrokondo json summary")
parser.add_argument("-s", "--sample-tag", help="Optional suffix and extension to name output samples.", default=default_samp_suffix)
parser.add_argument("-o", "--out-file", help="output name plus the .tsv extension e.g. prefix.tsv")
parser.add_argument("-x", "--inx-id-token", help="A token to insert into the flattened json file names for separation of the irida next sample id.")
Member

I am curious if, rather than inserting the IRIDA Next id into the flattened JSON file names, it can be used to create a folder with contents being the flattened JSON report? That way, you don't have to worry about inserting tokens into a filename and then parsing them out from the string later on. You can then also insert the sample name as part of the flattened report file name.

That is name output files like:

FlattenedReports/IRIDA_NEXT_ID/SAMPLE_NAME.flat_sample.json

Then, you can iterate over all subdirectories in FlattenedReports/, and parse the IRIDA Next identifier from the sub-directory. That is in:

mikrokondo/main.nf

Lines 113 to 119 in c036fb5

def inx_string_suffix = params.report_aggregate.inx_string_insertion
def name_trim = sample.getName()
def trimmed_name = name_trim.substring(0, name_trim.length() - params.report_aggregate.sample_flat_suffix.length())
def output_map = [
"id": trimmed_name,
"sample": trimmed_name,
"external_id": trimmed_name]

Pull out the IRIDA Next id (external_id) from the directory name instead of from the file name.
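
A possible shape for the directory-based lookup suggested above, sketched in Python rather than in the pipeline's Groovy. The function name `collect_flat_reports`, the record keys, and the example identifiers are hypothetical; the layout `FlattenedReports/IRIDA_NEXT_ID/SAMPLE_NAME.flat_sample.json` is the one proposed in this comment, and the suffix is kept as a parameter because the script's current default differs.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def collect_flat_reports(root: Path, suffix: str = ".flat_sample.json") -> List[Dict[str, str]]:
    """Walk root/<external_id>/<sample_name><suffix> and return one record per report."""
    records = []
    for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for report in sorted(sample_dir.iterdir()):
            if report.is_file() and report.name.endswith(suffix):
                records.append({
                    # The sub-directory name carries the external identifier.
                    "external_id": sample_dir.name,
                    # The file name, minus the suffix, carries the sample name.
                    "id": report.name[: -len(suffix)],
                    "path": str(report),
                })
    return records


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "FlattenedReports"
    (root / "INX-0001").mkdir(parents=True)
    (root / "INX-0001" / "isolate_A.flat_sample.json").write_text("{}")
    for record in collect_flat_reports(root):
        print(record["external_id"], record["id"])  # INX-0001 isolate_A
```

With this layout no token has to be inserted into, or parsed back out of, the file name.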

Collaborator Author

good idea!! I think I will do that, that is much cleaner.

Collaborator Author

fixed in: 39c8505

The output directory structure did not change, just the outputs from the script are structured.

@@ -20,7 +22,6 @@ workflow INPUT_CHECK {
meta -> tuple(meta.id[0], meta[0])
Contributor

I wonder if we should use meta.external_id for checking when there are reads that need to be combined, as we created the sample_name column to allow for repeat values. meta.id is used here to find reads to be merged.

    grouped_tuples = reads_in.groupTuple(by: 0).branch {
            it ->
                merge_data: it[1].size() > 1
                format: true
            }
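
The concern can be illustrated with a plain-Python analogue of the `groupTuple` plus `branch` step above. This is only a sketch of the grouping logic, not Nextflow code; `branch_by_group_size` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def branch_by_group_size(
    reads: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Group (key, item) pairs by key, then split groups by size.

    Groups with more than one item go to `merge_data`;
    everything else goes to `format`, mirroring the branch above.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for key, item in reads:
        grouped[key].append(item)
    merge_data = {k: v for k, v in grouped.items() if len(v) > 1}
    fmt = {k: v for k, v in grouped.items() if len(v) <= 1}
    return merge_data, fmt


reads = [("sampleA", "run1.fq"), ("sampleA", "run2.fq"), ("sampleB", "run3.fq")]
merge_data, fmt = branch_by_group_size(reads)
print(merge_data)  # {'sampleA': ['run1.fq', 'run2.fq']}
print(fmt)         # {'sampleB': ['run3.fq']}
```

If the grouping key is a value that is allowed to repeat, two unrelated rows sharing that value end up in `merge_data`, which is why the choice of key matters here.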

Collaborator Author

@mattheww95 Nov 1, 2024

This is a very good question, as the only way (after reverting sample_name to sample) to merge reads now would be if the IRIDANext ID is the same. But going forward reads are only getting merged within IRIDANext now? @apetkau

Collaborator Author

After talking with Aaron, we decided not to let mikrokondo merge reads by default. We added a parameter to allow you to, but this will be a CLI feature, as in IRIDANext it is better to merge reads in the system, where it is an auditable event and not something that may occur accidentally in a pipeline.
This is fixed here: db5f420

Contributor

@sgsutcliffe left a comment

Following up on my comment, there needs to be a renaming of meta.id with meta.external_id when no sample_name is provided because it becomes null and then wants to group everything in the COMBINE_DATA() process. I tried using the map{} we have used in other pipelines but it wasn't working. I can give it more of a try.

What I tried doing was:

    // Track processed IDs
    def processedIDs = [] as Set

    input = Channel.fromSamplesheet("input")
    // and remove non-alphanumeric characters in sample_names (meta.id), whilst also correcting for duplicate sample_names (meta.id)
    .map { meta ->
            if (!meta.id) {
                meta.id = meta.external_id
            } else {
                // Non-alphanumeric characters (excluding _,-,.) will be replaced with "_"
                meta.id = meta.id.replaceAll(/[^A-Za-z0-9_.\-]/, '_')
            }
            // Ensure ID is unique by appending meta.external_id if needed
            while (processedIDs.contains(meta.id)) {
                meta.id = "${meta.id}_${meta.external_id}"
            }
            // Add the ID to the set of processed IDs
            processedIDs << meta.id

            tuple(meta)}.view()

in the input_check subworkflow but it tells me it cannot perform replaceAll because it is an ArrayList type.
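
For reference, the de-duplication step in the attempt above can be sketched in Python as follows. This only mirrors the `processedIDs` loop and makes no claim about how the pipeline finally solved it; `make_unique_ids` and the example values are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def make_unique_ids(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Given (id, external_id) pairs, return ids made unique in order.

    A clashing id has its row's external_id appended until it is unique.
    """
    seen: Set[str] = set()
    result: List[str] = []
    for current_id, external_id in rows:
        new_id = current_id
        # The candidate grows on every pass, so the loop always terminates.
        while new_id in seen:
            new_id = f"{new_id}_{external_id}"
        seen.add(new_id)
        result.append(new_id)
    return result


rows = [("isolate", "INX-1"), ("isolate", "INX-2"), ("other", "INX-3")]
print(make_unique_ids(rows))  # ['isolate', 'isolate_INX-2', 'other']
```

The `replaceAll` error reported above would be consistent with the closure receiving a list rather than a single meta map (other snippets in this review index it as `meta[0]`), though that is an inference from the thread rather than something confirmed here.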

@sgsutcliffe
Contributor

One last comment! I promise, and a suggestion. Could we use meta.irida_id instead of meta.external_id, that way it will be consistent with the other phac-nml nextflow pipelines.

@mattheww95
Collaborator Author

mattheww95 commented Nov 6, 2024

> One last comment! I promise, and a suggestion. Could we use meta.irida_id instead of meta.external_id, that way it will be consistent with the other phac-nml nextflow pipelines.

Just to provide the rationale for the name.

I had used irida_id at first, but I wanted a name that was more generalized so that the purpose of the parameter was better communicated to users that may be using mikrokondo externally from the NML.

meta ->

// Remove any unallowed charactars in the meta.id field
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_')
Contributor

You can remove this. meta.id only needs to be scrubbed of unallowed characters if sample_name is provided in the samplesheet. This relates to my next comment.

Member

@apetkau left a comment

This looks great. Thanks so much for all your work @mattheww95 . A few inline comments.

CHANGELOG.md Outdated

### `Changed`

- Added a `sample_name` field, `sample` still exists but is used for different purposes [PR 140](https://github.com/phac-nml/mikrokondo/pull/140)
Member

This should probably be under Added. Also, maybe state that sample_name is used primarily to incorporate an additional name/identifier when running the pipeline through IRIDA Next.

Collaborator Author

fixed in: 899e35b

CHANGELOG.md Outdated

- RASUSA now used for down sampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

- Sample names (`sample_name` field) can no longer begin with a period. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)
Member

I think you could remove this statement since sample_name was added as a new field in this PR.

Collaborator Author

fixed in: a1c3f3e

CHANGELOG.md Outdated

- Added RASUSA for down sampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125)

- Added a new field to the `schema_input.json` file to allow for sample ID's from external systems such as IRIDA Next: [PR 140](https://github.com/phac-nml/mikrokondo/pull/140)
Member

I think you can remove this statement and just have one statement about adding sample_name.

Collaborator Author

fixed in: a2c56a8

"errorMessage": "Sample name must be provided and cannot contain spaces",
"meta": ["id"]
"meta": ["id"],
"errorMessage": "Sample name to be used in report generation. Invalid characters are replaces with underscores."
Member

Maybe change the error message to state: Default sample identifier used by the pipeline. Also, invalid characters should not be replaced by underscores for sample, so you can remove that statement.

Collaborator Author

fixed in: a2c56a8

We may need to review this though. I was implementing what was discussed in our meeting, so if anything is wrong, apologies!

},
"sample_name": {
"type": "string",
"errorMessage": "Optional. Used to override sample when used in tools like IRIDA-Next. Invalid characters will be replaced with underscores.",
Member

Can you list the valid characters (e.g., valid characters include alphanumeric and . and _. All other characters will be replaced by underscores).

Collaborator Author

fixed in: b1e60dd

meta ->

// Remove any unallowed charactars in the meta.id field
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_')
Member

As discussed in our call, this should be changed to be more similar to the way SNVPhyl handles this: https://github.com/phac-nml/snvphylnfc/blob/f1e5fae76af276acf0a8c98174978cb21ca5d7e0/workflows/snvphylnfc.nf#L98-L103

That is, meta.id should correspond to the sample_name by default, but if that column is empty it should be set to sample. The meta.external_id should instead correspond (when run through IRIDA Next) to the IRIDA Next identifier.

Collaborator Author

fixed in: b1e60dd

// Remove any unallowed charactars in the meta.id field
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_')

if (meta[0].external_id != null) {
Contributor

It's sample_name (i.e. meta.id) that is optional, and contains the ability to have unallowed characters. So the if/else should be:

                if (meta[0].id != null) {
                    // remove any charactars in the external_id that should not be used
                    meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_')
                }else{
                    meta[0].id = meta[0].external_id
                }

Everything is named with meta.id but if not provided use the old-fashioned sample. Basically keep it as is for non-IRIDA users.

Collaborator Author

fixed in: b1e60dd

if (meta[0].external_id != null) {
// remove any charactars in the external_id that should not be used
meta[0].id = meta[0].external_id.replaceAll(/[^A-Za-z0-9_\-]/, '_')
}else{
Contributor

Based on the other comments suggested, I won't need this else clause, as grouping is by meta.id and so will take place whenever meta.id is duplicated, whether the value came from sample or sample_name.

Collaborator Author

fixed in: b1e60dd

4 participants