Merge pull request #1 from ebi-gdp/dev
0.2.0 -> 1.0.0
nebfield authored Sep 19, 2024
2 parents 4026147 + 511059c commit 731a0b6
Showing 4 changed files with 90 additions and 92 deletions.
91 changes: 61 additions & 30 deletions README.md
@@ -3,7 +3,7 @@
We needed a way to:

1) Reliably download files from a Globus collection over HTTPS
2) (Optionally) decrypt them on the fly ([crypt4gh](https://github.com/EGA-archive/crypt4gh))
2) Decrypt them on the fly ([crypt4gh](https://github.com/EGA-archive/crypt4gh))
3) Store the plaintext files in an object store (bucket), ready for cloud-based data science workflows

The [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) takes care of 1) and 2).
@@ -35,34 +35,73 @@ Downloaded files can also be saved to a local filesystem.
`--config_secrets` must be a path to a Spring Boot application properties file with the following structure:

```
globus.guest-collection.domain=SECRET
globus.aai.client-id=SECRET
globus.aai.client-secret=SECRET
globus.aai.scopes=SECRET
#####################################################################################
# Application config
#####################################################################################
data.copy.buffer-size=8192
#####################################################################################
# Apache HttpClient connection config
#####################################################################################
webclient.connection.pipe-size=4096
webclient.connection.connection-timeout=5
webclient.connection.socket-timeout=0
webclient.connection.read-write-timeout=30000
#####################################################################################
# File download retry config
#####################################################################################
# EXPONENTIAL/FIXED
file.download.retry.strategy=FIXED
file.download.retry.attempts.max=3
# Exponential
file.download.retry.attempts.delay=1000
file.download.retry.attempts.maxDelay=30000
file.download.retry.attempts.multiplier=2
# Fixed
file.download.retry.attempts.back-off-period=2000
#####################################################################################
# Globus config
#####################################################################################
globus.guest-collection.domain=<url>
#Oauth
globus.aai.access-token.uri=https://auth.globus.org/v2/oauth2/token
globus.aai.client-id=<id>
globus.aai.client-secret=<token>
globus.aai.scopes=<url>
#####################################################################################
# Crypt4gh config
#####################################################################################
crypt4gh.binary-path=/opt/bin/crypt4gh
crypt4gh.shell-path=/bin/bash -c
#####################################################################################
# Logging config
#####################################################################################
logging.level.uk.ac.ebi.intervene=INFO
logging.level.org.springframework=WARN
logging.level.org.apache.http=WARN
logging.level.org.apache.http.wire=WARN
#####################################################################################
# key handler service config
#####################################################################################
intervene.key-handler.basic-auth=Basic <token>
intervene.key-handler.secret-key.password=<password>
intervene.key-handler.base-url=https://<url>/key-handler
intervene.key-handler.keys.uri=/key/{secretId}/version/{secretIdVersion}
```

(replace SECRET with your sensitive data)
See the [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) README for a description of the configuration.
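
As a rough illustration of how the `file.download.retry.*` settings above might interact, here is a minimal sketch in Python. It is not the file handler CLI's implementation: the function name `retry_delays` is hypothetical, and it assumes that `attempts.max` counts the first try and that the exponential strategy multiplies `delay` by `multiplier` on each retry, capped at `maxDelay`.

```python
def retry_delays(strategy, max_attempts, delay=1000, max_delay=30000,
                 multiplier=2, back_off_period=2000):
    """Return the assumed waits (in ms) between download attempts.

    Defaults mirror the example properties file. The semantics are an
    assumption, not taken from the CLI's source code.
    """
    waits = []
    # max_attempts tries means at most max_attempts - 1 waits in between
    for retry in range(max_attempts - 1):
        if strategy == "FIXED":
            waits.append(back_off_period)
        elif strategy == "EXPONENTIAL":
            waits.append(min(delay * multiplier ** retry, max_delay))
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return waits


print(retry_delays("FIXED", 3))
print(retry_delays("EXPONENTIAL", 3))
```

Under these assumptions the `FIXED` strategy ignores `delay`, `maxDelay` and `multiplier`, and the `EXPONENTIAL` strategy ignores `back-off-period`, which matches how the example file groups the settings.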

`--key` must be the secret key pair of the recipients public key. It should probably be made by the crypt4gh CLI.

## Example use cases

### Downloading files to local storage in parallel
`--key` can be a crypt4gh private key path or a JSON file with the following structure:

```
$ nextflow run main.nf -profile <docker/singularity> \
--config_secrets assets/secret.properties \
--input assets/example_input.json \
--outdir downloads \
--threads 10
{
"secretId": "8D705854-9EEA-44C5-9937-E4E5228B8457",
"secretIdVersion": "1"
}
```

It's a good idea to:

* set --threads to do multiple downloads
* use a local executor, the overhead of submitting jobs to a grid executor like SLURM isn't worth it
which integrates with the key handler service.
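
To make the relationship between the JSON key file and the `intervene.key-handler.*` properties more concrete, here is a minimal Python sketch. It assumes the request URL is formed by appending `intervene.key-handler.keys.uri` to `intervene.key-handler.base-url` and filling in the two placeholders; how the CLI actually builds the request is not shown in this repository. The function name `key_handler_url` and the host `example.org` are hypothetical.

```python
import json

# Template copied from intervene.key-handler.keys.uri in the example config
KEYS_URI = "/key/{secretId}/version/{secretIdVersion}"


def key_handler_url(base_url, key_json_text):
    """Fill the URI template from the JSON key file (assumed behaviour)."""
    ref = json.loads(key_json_text)
    path = KEYS_URI.format(secretId=ref["secretId"],
                           secretIdVersion=ref["secretIdVersion"])
    return base_url.rstrip("/") + path


key_file = '{"secretId": "8D705854-9EEA-44C5-9937-E4E5228B8457", "secretIdVersion": "1"}'
# example.org is a placeholder host, not a real key handler deployment
print(key_handler_url("https://example.org/key-handler", key_file))
```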

By default parallel downloads are disabled (`--threads 1`).
## Example use cases

### Downloading files with crypt4gh decryption on the fly

@@ -73,17 +112,9 @@ $ nextflow run main.nf -profile <docker/singularity> \
--config_secrets assets/secret.properties \
--input assets/example_input.json \
--outdir downloads \
--secret_key key \
--threads 10
--secret_key key
```

When using a grid executor, `--threads` will control the number of jobs submitted to the scheduler.

If you're running globflow on a desktop computer, try setting `--threads` to the number of CPUs you have.

Globflow will only try to decrypt files with a `.crypt4gh` extension, and will download other files normally.

### Downloading files to an object store (bucket)

It's possible to use nextflow's support for object storage to transfer files from Globus directly to a bucket:
Empty file removed assets/NO_FILE
Empty file.
80 changes: 25 additions & 55 deletions main.nf
@@ -10,92 +10,62 @@ if (!params.input) {
error "Error: missing mandatory parameter --input"
}

if (!params.secret_key) {
error "Error: missing --secret_key"
}

process download_decrypt {
errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
maxForks params.threads
maxRetries 3

tag "${in_map.filename}"
publishDir "$params.outdir", mode: "move"
// drops the "output" directory from path when publishing
publishDir "$params.outdir", mode: "move", saveAs: { "${file(it).getName()}" }
container "${ workflow.containerEngine == 'singularity' ?
"oras://ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.0-singularity" :
"ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.0" }"
"oras://ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.4-singularity" :
"ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.4" }"

input:
val in_map
path config_path
path secret_path
path secret_key
val in_map
path secret_config, stageAs: "secret.properties"
path secret_key, stageAs: "secret-config.json"

output:
path "${file(in_map.filename).baseName}"
path "output/*"

when:
in_map.filename.endsWith(".crypt4gh") && secret_key.name != "NO_FILE"

script:
"""
java -jar /opt/globus-file-handler-cli-1.0.0.jar \
-Dspring.config.location=${config_path},${secret_path} \
-s "${in_map.dir_path_on_guest_collection}/${in_map.filename}" \
-d "\$PWD/${file(in_map.filename).baseName}" \
-l ${in_map.size} \
mkdir output
java -jar /opt/globus-file-handler-cli-1.0.4.jar \
--spring.config.location=./secret.properties \
--globus_file_transfer_source_path "globus:///${in_map.dir_path_on_guest_collection}/${in_map.filename}" \
--globus_file_transfer_destination_path "file:///\$PWD/output/${file(in_map.filename).baseName}" \
--file_size ${in_map.size} \
--crypt4gh \
--sk ${secret_key}
"""
}

process download {
errorStrategy { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
maxForks params.threads
tag "${in_map.filename}"
publishDir "$params.outdir", mode: "move"
container "${ workflow.containerEngine == 'singularity' ?
"oras://ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.0-singularity" :
"ghcr.io/ebi-gdp/globus-file-handler-cli:1.0.0" }"
--sk "file:///\$PWD/secret-config.json"
input:
val in_map
path config_path
path secret_path
path secret_key

output:
path "${in_map.filename}"

when:
!in_map.filename.endsWith(".crypt4gh") || secret_key.name == "NO_FILE"

script:
"""
java -jar /opt/globus-file-handler-cli-1.0.0.jar \
-Dspring.config.location=${config_path},${secret_path} \
-s "${in_map.dir_path_on_guest_collection}/${in_map.filename}" \
-d "\$PWD/${in_map.filename}" \
-l ${in_map.size}
rm -f ./* 2>/dev/null || true # delete everything except output directory
"""
}

workflow {
// using first() to create reusable value channels
Channel.fromPath(params.secret_key, checkIfExists: true).first().set { secret_key }
Channel.fromPath(params.config_path, checkIfExists: true).first().set { config_path }
Channel.fromPath(params.config_secrets, checkIfExists: true).first().set { secrets_path }
Channel.fromPath(params.config_secrets, checkIfExists: true).first().set { secrets_config_path }

// this channel is a list of hashmaps, one for each file to be downloaded
Channel.fromPath(params.input, checkIfExists: true).map { parseInput(it) }.flatten().set { ch_input }

// decryption on the fly will automatically happen if a filename ends with .crypt4gh and a --key is provided
download_decrypt(ch_input, config_path, secrets_path, secret_key)
// if --key is missing or a file doesn't end with .crypt4gh, just download
download(ch_input, config_path, secrets_path, secret_key)
download_decrypt(ch_input, secrets_config_path, secret_key)
}


def parseInput(json_file) {
slurp = new JsonSlurper()
def slurped = slurp.parseText(json_file.text)
def meta = slurped.subMap("dir_path_on_guest_collection")
def parsed = slurped.files.collect { meta + it }
def parsed = slurped.files.collect { meta + it }

return parsed
}
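
For readers less familiar with Groovy, the behaviour of `parseInput` can be sketched in Python: the top-level `dir_path_on_guest_collection` is copied into every entry of `files`, so each download task receives one self-contained map. The field names come from `main.nf`; the example values and the function name `parse_input` are made up for illustration.

```python
import json


def parse_input(json_text):
    """Python equivalent of parseInput: one dict per file to download."""
    slurped = json.loads(json_text)
    meta = {"dir_path_on_guest_collection": slurped["dir_path_on_guest_collection"]}
    # like Groovy's `meta + it`, keys from the file entry win on conflict
    return [{**meta, **entry} for entry in slurped["files"]]


# Hypothetical input; real paths, filenames and sizes will differ
example = """
{
  "dir_path_on_guest_collection": "user/example-dir",
  "files": [
    {"filename": "a.txt.crypt4gh", "size": 1024},
    {"filename": "b.txt.crypt4gh", "size": 2048}
  ]
}
"""
for in_map in parse_input(example):
    print(in_map["dir_path_on_guest_collection"], in_map["filename"], in_map["size"])
```
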
11 changes: 4 additions & 7 deletions nextflow.config
@@ -4,10 +4,7 @@ params {
input = null

// optional params
threads = 1
config_path = "$baseDir/assets/application-dev.properties"
outdir = "results"
secret_key = "$baseDir/assets/NO_FILE"
}

profiles {
@@ -16,7 +13,7 @@ profiles {
singularity.enabled = false
}
arm {
docker.runOptions = '--platform=linux/arm64'
docker.runOptions = '--platform=linux/arm64'
}
singularity {
singularity.enabled = true
@@ -39,8 +36,8 @@ manifest {
author = 'Benjamin Wingfield'
defaultBranch = 'main'
homePage = 'https://github.com/ebi-gdp/globflow'
description = 'Download files from Globus over HTTPS, with optional decryption on the fly'
description = 'Download files from Globus over HTTPS, with decryption on the fly'
mainScript = 'main.nf'
nextflowVersion = '>=23.10.1'
version = '0.2.0'
}
version = '1.0.0'
}
