Skip to content

A workflow to download files from Globus over HTTPS, with optional decryption on the fly

License

Notifications You must be signed in to change notification settings

ebi-gdp/globflow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

34 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

What's this useful for?

We needed a way to:

  1. Reliably download files from a Globus collection over HTTPS
  2. Optionally decrypt them on the fly (crypt4gh)
  3. Store the plaintext files in an object store (bucket), ready for cloud based data science workflows

The file handler CLI takes care of 1) and 2).

globflow wraps the CLI using Nextflow to support parallel downloads and Fusion file system to transparently upload plaintext data to an object store.

Downloaded files can also be saved to a local filesystem.

Note

This workflow grabs crypt4gh secret keys from the INTERVENE key handler service, but could be adapted to work with local crypt4gh key pairs

Table of Contents

Parameters

File input

Important

This is parameter is mandatory

--input must be a JSON array with the following structure:

{
    "dir_path_on_guest_collection": "[email protected]/test_hapnest/",
    "files": [
        {
            "filename": "hapnest.pgen.crypt4gh",
            "size": 278825058
        }
    ]
}

Secret key

Important

This is parameter is optional

--secret_key must be a JSON file with the following structure:

{"secretId": "77451C57-0FCC-460F-91A3-E0DED05B440F", "secretIdVersion": "1"}

The secret key is used to contact the platform key handler service and grab the correct crypt4gh secret key.

Application properties

Important

This parameter is mandatory

Tip

Be careful of trailing whitespace in properties files

--config_application must be a path to a spring boot application properties file with the following structure:

#####################################################################################
# Application config
#####################################################################################
spring.main.web-application-type=none
data.copy.buffer-size=8192
#####################################################################################
# Apache HttpClient connection config
#####################################################################################
webclient.connection.pipe-size=${data.copy.buffer-size}
webclient.connection.connection-timeout=5
webclient.connection.socket-timeout=0
webclient.connection.read-write-timeout=30000
#####################################################################################
# File download retry config
#####################################################################################
# EXPONENTIAL/FIXED
file.download.retry.strategy=FIXED
file.download.retry.attempts.max=3
# Exponential
file.download.retry.attempts.delay=1000
file.download.retry.attempts.maxDelay=30000
file.download.retry.attempts.multiplier=2
# Fixed
file.download.retry.attempts.back-off-period=2000
#####################################################################################
# Globus config
#####################################################################################
[email protected]@
#Oauth
globus.aai.access-token.uri=https://auth.globus.org/v2/oauth2/token
[email protected]@
[email protected]@
globus.aai.scopes=https://auth.globus.org/scopes/c1e6310c-11d5-4e8a-9443-211884f04c6f/https
#####################################################################################
# Logging config
#####################################################################################
logging.level.uk.ac.ebi.intervene=INFO
logging.level.org.springframework=WARN
logging.level.org.apache.http=WARN
logging.level.org.apache.http.wire=WARN

See the file handler CLI README for a description of the configuration.

crypt4gh application properties

Important

This is parameter is optional

--config_crypt4gh must be a path to a spring boot application properties file with the following structure:

#####################################################################################
# Crypt4gh config
#####################################################################################
crypt4gh.binary-path=/opt/bin/crypt4gh
crypt4gh.shell-path=/bin/bash -c
#####################################################################################
# Intervene service config
#####################################################################################
intervene.key-handler.base-url=http://localhost:8040/bff/key-handler
intervene.key-handler.keys.uri=/key/{secretId}/version/{secretIdVersion}
intervene.key-handler.basic-auth=${KEY_HANDLER_BASIC_AUTH:basic-auth}
intervene.key-handler.secret-key.password=${SEC_KEY_PASSWD:test-password}

See the file handler CLI README for a description of the configuration.

--key can be a crypt4gh private key path or a JSON file with the following structure:

{
  "secretId": "8D705854-9EEA-44C5-9937-E4E5228B8457",
  "secretIdVersion": "1"
}

which integrates with the key handler service.

Example use cases

Tip

--debug can be helpful to keep files containing sensitive data if you're having problems with a transfer (disabled by default)

Download files from a Globus collection over HTTPS

$ nextflow run main.nf -profile docker \
  --input input.json \
  --config_application application.properties \
  --outdir downloads

Downloading files with crypt4gh decryption on the fly

It makes sense to submit these jobs to a grid executor, like SLURM or cloud batch, because decryption on the fly will use ~1 CPU for each file:

$ nextflow run main.nf -profile docker \
  --input input.json \
  --secret_key key.json \
  --config_application application.properties \
  --config_crypt4gh application-crypt4gh-secret-manager.properties \
  --config_secrets assets/secret.properties \
  --outdir downloads \
  --decrypt

Downloading files to an object store

It's possible to use nextflow's support for object storage to transfer files from Globus directly to a bucket:

$ nextflow run main.nf -profile docker \
  -c cloud.config \
  --input input.json \
  --secret_key key.json \
  --config_application application.properties \
  --config_crypt4gh application-crypt4gh-secret-manager.properties \
  --config_secrets assets/secret.properties \
  --outdir gs://pathtobucket/downloads \
  -w gs://pathworkbucket/work

For best performance use a cloud executor and enable fusion in the nextflow configuration:

process {
    executor = 'google-batch'
}

wave {
    enabled = true
}

fusion {
    enabled = true
}

tower {
    accessToken = 'token'
    workspaceId = 'work'
    enabled = true
}

google {
    project = 'prj-id'
    location = 'europe-west2'
    batch {
        spot = true
    }
}

Helm support

helm/ contains a helm chart which can install a Job to a Kubernetes cluster.

In the helm chart worker processes run in Cloud Batch by default with crypt4gh decryption on the fly enabled.

About

A workflow to download files from Globus over HTTPS, with optional decryption on the fly

Resources

License

Stars

Watchers

Forks