-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added sample irida_next sample field option #140
base: dev
Are you sure you want to change the base?
Conversation
|
If these tests pass, a sample with the name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Matthew 😸
I don’t have any specific comments - this sample_name
solution looks solid to me. I tried adding a helper function to simplify the inx_string_suffix
extraction logic in updated_samples within main.nf
, but it ended up making things more complicated than expected, haha!
@@ -796,4 +796,66 @@ nextflow_pipeline { | |||
} | |||
} | |||
|
|||
test("Test Stupid Name in Input Sheet") { | |||
tag "from_assemblies_stupidnames" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆 Great test name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great Matthew. Thanks so much for your work on including sample names 😄
I have a few suggestions and comments for you (given in-line below).
assets/schema_input.json
Outdated
@@ -47,6 +54,6 @@ | |||
"unique": true | |||
} | |||
}, | |||
"required": ["sample"] | |||
"required": ["sample_name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The key sample
should still be required as in IRIDA Next it contains the IRIDA Next identifiers. The key sample_name
should be optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, I see sorry I misunderstood what was being requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: 6cb6d8c
Additional commits were related to updating test file column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample
is required, but you don't need to swap the item properties as in 6cb6d8c it was correct before.
Quick Summary
sample
is the IRIDA-ID column of the samplesheet so it has to be unique, and is required. This is why we named the meta
meta.irida_id
(or in your case meta.external_id
).
sample_name
is an optional column to simply rename file-outputs, or use in results so that the user can interperate them better. This is why @kylacochrane introduced this at the start of the workflow (I did it for all other pipelines too) which is basically:
if (!meta.id) {
meta.id = meta.irida_id
This means it if it is run locally it'll default to the old "just use sample column" and not break anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment Steven. This is correct. Could you swap back sample
and sample_name
Matthew? But make it so that sample
is the one that is required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what you mean by swap back sample
and sample_name
? Do you mean just within the schema_input.json
or for the whole pipeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly! Just for the schema_input.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all is made well in this commit: 71260a9
assets/schema_input.json
Outdated
}, | ||
"sample_name": { | ||
"type": "string", | ||
"pattern": "^[^\\.]\\S+$", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should remove any restrictions on sample_name
and instead follow a similar pattern as this block of code to replace any restricted characters with underscores _
:
The reason being that restricting sample_name
means that mikrokondo will fail to run sample names don't match the above pattern (and spaces and periods are allowed in sample names in IRIDA Next). Allowing all patterns through but cleaning them up in the workflow code means that mikrokondo will still run for samples with any name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this check here to verify that only the sample name does not start with a period as there are issues with nf-prov later on when it aggregates files for the providence reports so here was my intention.
But with your clarification above about sample_name
vs sample
I think I can revert this cahnge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed here: 71260a9
bin/report_summaries.py
Outdated
default_samp_suffix = "_flat_sample.json" | ||
parser = argparse.ArgumentParser("Table Summary") | ||
parser.add_argument("-f", "--file-in", help="Path to the mikrokondo json summary") | ||
parser.add_argument("-s", "--sample-tag", help="Optional suffix and extension to name output samples.", default=default_samp_suffix) | ||
parser.add_argument("-o", "--out-file", help="output name plus the .tsv extension e.g. prefix.tsv") | ||
parser.add_argument("-x", "--inx-id-token", help="A token to insert into the flattened json file names for separation of the irida next sample id.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious if, rather than inserting the IRIDA Next id into the flattened JSON file names, it can be used to create a folder with contents being the flattened JSON report? That way, you don't have to worry about inserting tokens into a filename and then parsing them out from the string later on. You can then also insert the sample name as part of the flattened report file name.
That is name output files like:
FlattenedReports/IRIDA_NEXT_ID/SAMPLE_NAME.flat_sample.json
Then, you can iterate over all subdirectories in FlattenedReports/
, and parse the IRIDA Next identifier from the sub-directory. That is in:
Lines 113 to 119 in c036fb5
def inx_string_suffix = params.report_aggregate.inx_string_insertion | |
def name_trim = sample.getName() | |
def trimmed_name = name_trim.substring(0, name_trim.length() - params.report_aggregate.sample_flat_suffix.length()) | |
def output_map = [ | |
"id": trimmed_name, | |
"sample": trimmed_name, | |
"external_id": trimmed_name] |
Pull out the IRIDA Next id (external_id
) from the directory name instead of from the file name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea!! I think I will do that, that is much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: 39c8505
The output directory structure did not change, just the outputs from the script are structured.
subworkflows/local/input_check.nf
Outdated
@@ -20,7 +22,6 @@ workflow INPUT_CHECK { | |||
meta -> tuple(meta.id[0], meta[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should use the meta.external_id
for checking for the when there are reads that need to be combined. As we created the sample_name
column to allow for repeat values. meta.id
is used to here to find reads to be merged.
grouped_tuples = reads_in.groupTuple(by: 0).branch {
it ->
merge_data: it[1].size() > 1
format: true
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good question, as the only way (after reverting sample_name
to sample
) to merge reads now would be if the IRIDANext ID is the same. But going forward reads are only getting merged within IRIDANext now? @apetkau
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking with Aaron, we decided to not let mikrokonod merge reads by default. We added a parameter to allow you too, but this will be a CLI feature. As in IRIDANext it is better to merge reads in the system where it is an auditable event and not something that may occur accidentally in a pipeline.
but it is fixed here: db5f420
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on my comment, there needs to be a renaming of meta.id
with meta.external_id
when no sample_name
is provided because it becomes null
and then wants to group everything in the COMBINE_DATA() process. I tried using the map{}
we have used in other pipelines but it wasn't working. I can give it more of a try.
What I tried doing was:
// Track processed IDs
def processedIDs = [] as Set
input = Channel.fromSamplesheet("input")
// and remove non-alphanumeric characters in sample_names (meta.id), whilst also correcting for duplicate sample_names (meta.id)
.map { meta ->
if (!meta.id) {
meta.id = meta.external_id
} else {
// Non-alphanumeric characters (excluding _,-,.) will be replaced with "_"
meta.id = meta.id.replaceAll(/[^A-Za-z0-9_.\-]/, '_')
}
// Ensure ID is unique by appending meta.external_id if needed
while (processedIDs.contains(meta.id)) {
meta.id = "${meta.id}_${meta.external_id}"
}
// Add the ID to the set of processed IDs
processedIDs << meta.id
tuple(meta)}.view()
in the input_check subworkflow but it tells me it cannot perform replaceAll
because it is an ArrayList type.
One last comment! I promise, and a suggestion. Could we use |
Just to provide the rationale for the name. I had used |
subworkflows/local/input_check.nf
Outdated
meta -> | ||
|
||
// Remove any unallowed charactars in the meta.id field | ||
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this. meta.id
only needs to be scrubbed of unallowed characters if sample_name
is provided in the samplesheet. This relates to my next comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great. Thanks so much for all your work @mattheww95 . A few inline comments.
CHANGELOG.md
Outdated
|
||
### `Changed` | ||
|
||
- Added a `sample_name` field, `sample` still exists but is used for different purposes [PR 140](https://github.com/phac-nml/mikrokondo/pull/140) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be under Added. Also, maybe state that sample_name
is used primarily to incorporate an additional name/identifier when running the pipeline through IRIDA Next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: 899e35b
CHANGELOG.md
Outdated
|
||
- RASUSA now used for down sampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125) | ||
|
||
- Sample names (`sample_name` field) can no longer begin with a period. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could remove this statement since sample_name
was added as a new field in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: a1c3f3e
CHANGELOG.md
Outdated
|
||
- Added RASUSA for down sampling of Nanopore or PacBio data. [PR 125](https://github.com/phac-nml/mikrokondo/pull/125) | ||
|
||
- Added a new field to the `schema_input.json` file to allow for sample ID's from external systems such as IRIDA Next: [PR 140](https://github.com/phac-nml/mikrokondo/pull/140) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove this statement and just have one statement about adding sample_name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: a2c56a8
"errorMessage": "Sample name must be provided and cannot contain spaces", | ||
"meta": ["id"] | ||
"meta": ["id"], | ||
"errorMessage": "Sample name to be used in report generation. Invalid characters are replaces with underscores." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change the error message to state: Default sample identifier used by the pipeline. Also, invalid characters should not be replaced by underscores for sample
, so you can remove that statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: a2c56a8
We made need to review this though. I was implementing what was discussed in our meeting so if anything is wrong apologies!
}, | ||
"sample_name": { | ||
"type": "string", | ||
"errorMessage": "Optional. Used to override sample when used in tools like IRIDA-Next. Invalid characters will be replaced with underscores.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you list the valid characters (e.g., valid characters include alphanumeric and .
and _
. All other characters will be replaced by underscores).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: b1e60dd
subworkflows/local/input_check.nf
Outdated
meta -> | ||
|
||
// Remove any unallowed charactars in the meta.id field | ||
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed in our call, this should be changed to be more similar to the way SNVPhyl handles this: https://github.com/phac-nml/snvphylnfc/blob/f1e5fae76af276acf0a8c98174978cb21ca5d7e0/workflows/snvphylnfc.nf#L98-L103
That is, meta.id
should correspond to the sample_name
by default, but if that column is empty it should be set to sample
. The meta.external_id
should instead correspond (when run through IRIDA Next) to the IRIDA Next identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: b1e60dd
subworkflows/local/input_check.nf
Outdated
// Remove any unallowed charactars in the meta.id field | ||
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_') | ||
|
||
if (meta[0].external_id != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sample_name
(i.e. meta.id
) that is optional, and contains the ability to have unallowed characters. So the if/else should be:
if (meta[0].id != null) {
// remove any charactars in the external_id that should not be used
meta[0].id = meta[0].id.replaceAll(/[^A-Za-z0-9_\-]/, '_')
}else{
meta[0].id = meta[0].external_id
}
Everything is named with meta.id
but if not provided use the old-fashioned sample
. Basically keep it as is for non-IRIDA users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: b1e60dd
subworkflows/local/input_check.nf
Outdated
if (meta[0].external_id != null) { | ||
// remove any charactars in the external_id that should not be used | ||
meta[0].id = meta[0].external_id.replaceAll(/[^A-Za-z0-9_\-]/, '_') | ||
}else{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the other comments suggested, I won't need this else
clause as grouping is by meta.id
which if duplicated by either sample
or sample_name
will take place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in: b1e60dd
Added support for the irida_next sample id.