Merge pull request #2 from ebi-gdp/dev
Release 2.0.0
nebfield authored Oct 31, 2024
2 parents 615466f + 40fddd5 commit cb74a04
Showing 13 changed files with 454 additions and 65 deletions.
134 changes: 105 additions & 29 deletions README.md
@@ -3,7 +3,7 @@
We needed a way to:

1) Reliably download files from a Globus collection over HTTPS
2) Decrypt them on the fly ([crypt4gh](https://github.com/EGA-archive/crypt4gh))
2) Optionally decrypt them on the fly ([crypt4gh](https://github.com/EGA-archive/crypt4gh))
3) Store the plaintext files in an object store (bucket), ready for cloud based data science workflows

The [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) takes care of 1) and 2).
@@ -12,18 +12,35 @@ The [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) takes

Downloaded files can also be saved to a local filesystem.

> [!NOTE]
> This workflow grabs crypt4gh secret keys from the INTERVENE key handler service, but could be adapted to work with local crypt4gh key pairs
### Table of Contents

- [Parameters](#parameters)
* [File input](#file-input)
* [Secret key](#secret-key)
* [Application properties](#application-properties)
* [crypt4gh application properties](#crypt4gh-application-properties)
- [Example use cases](#example-use-cases)
* [Download files from a Globus collection over HTTPS](#download-files-from-a-globus-collection-over-https)
* [Downloading files with crypt4gh decryption on the fly](#downloading-files-with-crypt4gh-decryption-on-the-fly)
* [Downloading files to an object store (bucket)](#downloading-files-to-an-object-store)
- [Helm support](#helm-support)

## Parameters

### File input

> [!IMPORTANT]
> This parameter is mandatory
`--input` must be a path to a JSON file with the following structure:

```
{
"dir_path_on_guest_collection": "[email protected]/test_hapnest/",
"files": [
{
"filename": "hapnest.pvar",
"size": 278705850
},
{
"filename": "hapnest.pgen.crypt4gh",
"size": 278825058
@@ -32,17 +49,39 @@ Downloaded files can also be saved to a local filesystem.
}
```
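
A quick structural check of the input file before starting a transfer can catch simple mistakes early. The sketch below is not part of the workflow: the validation rules (a non-empty `files` list, non-negative integer sizes) are assumptions inferred from the example above, and the directory path is a made-up placeholder.

```python
import json


def validate_input(doc):
    """Return a list of problems found in a parsed --input document."""
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    if not isinstance(doc.get("dir_path_on_guest_collection"), str):
        problems.append("dir_path_on_guest_collection must be a string")
    files = doc.get("files")
    if not isinstance(files, list) or not files:
        problems.append("files must be a non-empty list")
        return problems
    for i, entry in enumerate(files):
        if not isinstance(entry, dict):
            problems.append(f"files[{i}] must be an object")
            continue
        name = entry.get("filename")
        if not isinstance(name, str) or not name:
            problems.append(f"files[{i}].filename must be a non-empty string")
        size = entry.get("size")
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(size, bool) or not isinstance(size, int) or size < 0:
            problems.append(f"files[{i}].size must be a non-negative integer")
    return problems


def total_bytes(doc):
    """Sum the declared file sizes of a valid input document."""
    return sum(entry["size"] for entry in doc["files"])


# Hypothetical directory path; file names and sizes are taken from the example above.
example = json.loads("""
{
  "dir_path_on_guest_collection": "example-dir/test_hapnest/",
  "files": [
    {"filename": "hapnest.pvar", "size": 278705850},
    {"filename": "hapnest.pgen.crypt4gh", "size": 278825058}
  ]
}
""")

print(validate_input(example))  # an empty list means no problems were found
print(total_bytes(example))
```

The summed size gives a rough idea of how much storage the transfer will need.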

`--config_secrets` must be a path to a Spring Boot application properties file with the following structure:
### Secret key

> [!IMPORTANT]
> This parameter is optional
`--secret_key` must be a JSON file with the following structure:

```
{"secretId": "77451C57-0FCC-460F-91A3-E0DED05B440F", "secretIdVersion": "1"}
```

The secret key is used to contact the platform key handler service and grab the correct crypt4gh secret key.
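
As a rough illustration of how the two fields might be used, the sketch below fills the `intervene.key-handler.keys.uri` template from the crypt4gh application properties with the values from the key file. Joining `base-url` and `keys.uri` by simple concatenation is an assumption; the file handler CLI may build the request differently.

```python
import json


def key_url(base_url, keys_uri, secret):
    """Fill the {secretId} and {secretIdVersion} slots and append to the base URL.

    Assumption: the service URL is base_url followed by the expanded keys_uri.
    """
    return base_url.rstrip("/") + keys_uri.format(**secret)


secret = json.loads(
    '{"secretId": "77451C57-0FCC-460F-91A3-E0DED05B440F", "secretIdVersion": "1"}'
)
url = key_url(
    "http://localhost:8040/bff/key-handler",
    "/key/{secretId}/version/{secretIdVersion}",
    secret,
)
print(url)
```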

### Application properties

> [!IMPORTANT]
> This parameter is mandatory
> [!TIP]
> Be careful of trailing whitespace in properties files
`--config_application` must be a path to a Spring Boot application properties file with the following structure:

```
#####################################################################################
# Application config
#####################################################################################
spring.main.web-application-type=none
data.copy.buffer-size=8192
#####################################################################################
# Apache HttpClient connection config
#####################################################################################
webclient.connection.pipe-size=4096
webclient.connection.pipe-size=${data.copy.buffer-size}
webclient.connection.connection-timeout=5
webclient.connection.socket-timeout=0
webclient.connection.read-write-timeout=30000
@@ -61,31 +100,43 @@ file.download.retry.attempts.back-off-period=2000
#####################################################################################
# Globus config
#####################################################################################
globus.guest-collection.domain=<url>
globus.guest-collection.domain=@globus.guest-collection.url@
#Oauth
globus.aai.access-token.uri=https://auth.globus.org/v2/oauth2/token
globus.aai.client-id=<id>
globus.aai.client-secret=<token>
globus.aai.scopes=<url>
#####################################################################################
# Crypt4gh config
#####################################################################################
crypt4gh.binary-path=/opt/bin/crypt4gh
crypt4gh.shell-path=/bin/bash -c
[email protected]@
[email protected]@
globus.aai.scopes=https://auth.globus.org/scopes/c1e6310c-11d5-4e8a-9443-211884f04c6f/https
#####################################################################################
# Logging config
#####################################################################################
logging.level.uk.ac.ebi.intervene=INFO
logging.level.org.springframework=WARN
logging.level.org.apache.http=WARN
logging.level.org.apache.http.wire=WARN
```

See the [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) README for a description of the configuration.
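
The tip about trailing whitespace can be checked mechanically. This is a minimal sketch, not part of the workflow; `find_trailing_whitespace` is a hypothetical helper name.

```python
def find_trailing_whitespace(text):
    """Return (line_number, line) pairs for lines that end in spaces or tabs."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line != line.rstrip(" \t"):
            hits.append((number, line))
    return hits


# The second line deliberately ends with a stray space.
sample = "spring.main.web-application-type=none\ndata.copy.buffer-size=8192 \n"
for number, line in find_trailing_whitespace(sample):
    print(f"line {number}: {line!r}")
```

To check a real file, pass its contents, for example `find_trailing_whitespace(open("application.properties").read())`.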

### crypt4gh application properties

> [!IMPORTANT]
> This parameter is optional
`--config_crypt4gh` must be a path to a Spring Boot application properties file with the following structure:

```
#####################################################################################
# key handler service config
# Crypt4gh config
#####################################################################################
intervene.key-handler.basic-auth=Basic <token>
intervene.key-handler.secret-key.password=<password>
intervene.key-handler.base-url=https://<url>/key-handler
crypt4gh.binary-path=/opt/bin/crypt4gh
crypt4gh.shell-path=/bin/bash -c
#####################################################################################
# Intervene service config
#####################################################################################
intervene.key-handler.base-url=http://localhost:8040/bff/key-handler
intervene.key-handler.keys.uri=/key/{secretId}/version/{secretIdVersion}
intervene.key-handler.basic-auth=${KEY_HANDLER_BASIC_AUTH:basic-auth}
intervene.key-handler.secret-key.password=${SEC_KEY_PASSWD:test-password}
```

See the [file handler CLI](https://github.com/ebi-gdp/globus-file-handler-cli) README for a description of the configuration.
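
Values such as `${KEY_HANDLER_BASIC_AUTH:basic-auth}` use Spring's `${name:default}` placeholder syntax: the name is looked up first, and the text after the colon is used when nothing is found. The sketch below imitates that lookup in a simplified form to show which value would win. It is an illustration only; it ignores Spring's relaxed binding, nested placeholders, and the other property sources Spring consults.

```python
import os
import re

# Matches ${name} and ${name:default}
PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(value, properties, environ=None):
    """Expand placeholders using environment variables, then file properties,
    then the inline default. Simplified compared with Spring's real resolver."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name, default = match.group(1), match.group(2)
        if name in environ:
            return environ[name]
        if name in properties:
            return properties[name]
        if default is not None:
            return default
        raise KeyError(f"no value for placeholder {name!r}")

    return PLACEHOLDER.sub(replace, value)


props = {"data.copy.buffer-size": "8192"}
print(resolve("${data.copy.buffer-size}", props, environ={}))
print(resolve("${SEC_KEY_PASSWD:test-password}", props, environ={}))
```

In practice this means the defaults shown above (`basic-auth`, `test-password`) apply only when the corresponding environment variables are unset.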
@@ -103,29 +154,47 @@ which integrates with the key handler service.

## Example use cases

> [!TIP]
> `--debug` (disabled by default) keeps files containing sensitive data, which can be helpful if you're having problems with a transfer
### Download files from a Globus collection over HTTPS

```
$ nextflow run main.nf -profile docker \
--input input.json \
--config_application application.properties \
--outdir downloads
```

### Downloading files with crypt4gh decryption on the fly

It makes sense to submit these jobs to [a grid executor](https://www.nextflow.io/docs/latest/executor.html), like SLURM or cloud batch, because decryption on the fly will use ~1 CPU for each file:

```
$ nextflow run main.nf -profile <docker/singularity> \
$ nextflow run main.nf -profile docker \
--input input.json \
--secret_key key.json \
--config_application application.properties \
--config_crypt4gh application-crypt4gh-secret-manager.properties \
--config_secrets assets/secret.properties \
--input assets/example_input.json \
--outdir downloads \
--secret_key key
--decrypt
```
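
When launching many transfers from a script, the two command shapes above can be assembled programmatically. This sketch uses only the flags shown in this README. The rule that `--decrypt` needs both `--secret_key` and `--config_crypt4gh` is an assumption based on the examples, and `build_command` is a hypothetical helper.

```python
import shlex


def build_command(input_json, config_application, outdir, profile="docker",
                  secret_key=None, config_crypt4gh=None, decrypt=False):
    """Build the argument list for a plain or decrypting download."""
    cmd = [
        "nextflow", "run", "main.nf", "-profile", profile,
        "--input", input_json,
        "--config_application", config_application,
        "--outdir", outdir,
    ]
    if decrypt:
        # Assumption: decryption requires both the key file and the crypt4gh config.
        if secret_key is None or config_crypt4gh is None:
            raise ValueError("decrypt=True needs secret_key and config_crypt4gh")
        cmd += [
            "--secret_key", secret_key,
            "--config_crypt4gh", config_crypt4gh,
            "--decrypt",
        ]
    return cmd


plain = build_command("input.json", "application.properties", "downloads")
print(shlex.join(plain))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.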

### Downloading files to an object store (bucket)
### Downloading files to an object store

It's possible to use nextflow's support for object storage to transfer files from Globus directly to a bucket:

```
$ nextflow run main.nf -profile <docker/singularity> \
$ nextflow run main.nf -profile docker \
-c cloud.config \
--input input.json \
--secret_key key.json \
--config_application application.properties \
--config_crypt4gh application-crypt4gh-secret-manager.properties \
--config_secrets assets/secret.properties \
--input assets/example_input.json \
--secret_key key \
--outdir gs://test-bucket/downloads \
-w gs://test-bucket/work
--outdir gs://pathtobucket/downloads \
-w gs://pathworkbucket/work
```

For best performance use a cloud executor and enable fusion in the nextflow configuration:
@@ -145,6 +214,7 @@ fusion {
tower {
accessToken = 'token'
workspaceId = 'work'
enabled = true
}
@@ -156,3 +226,9 @@ google {
}
}
```

## Helm support

`helm/` contains a [helm chart](https://helm.sh/docs/topics/charts/) which can install a [Job](https://kubernetes.io/docs/concepts/workloads/controllers/job/) to a Kubernetes cluster.

In the helm chart, worker processes run in Cloud Batch by default, with crypt4gh decryption on the fly enabled.
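
Because `helm/.gitignore` excludes `values.yaml`, each user has to supply their own values. The sketch below checks a parsed values mapping for the keys referenced by `helm/templates/config.yaml` and `helm/templates/job.yaml` in this commit. It covers only those two templates, so other templates in the chart may need more keys; the helper name and the sample values are hypothetical.

```python
# Dotted paths referenced as .Values.<path> in config.yaml and job.yaml
REQUIRED_PATHS = [
    "baseImage", "dockerTag", "pullPolicy",
    "globflowInput", "keyHandlerSecret", "globflowParams",
    "nxfParams.workBucketPath", "nxfParams.gcpProject", "nxfParams.location",
    "nxfParams.spot", "nxfParams.wave", "nxfParams.fusion",
    "secrets.towerToken", "secrets.towerId",
]


def missing_values(values):
    """Return the dotted paths from REQUIRED_PATHS that are absent in values."""
    missing = []
    for path in REQUIRED_PATHS:
        node = values
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Hypothetical, incomplete values: only the image settings are filled in.
partial = {"baseImage": "example/image", "dockerTag": "latest", "pullPolicy": "Always"}
print(missing_values(partial))
```

The mapping could come from any YAML parser; a plain dictionary is used here to keep the example free of third-party dependencies.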
4 changes: 0 additions & 4 deletions assets/example_input.json
@@ -1,10 +1,6 @@
{
"dir_path_on_guest_collection": "[email protected]/test_hapnest/",
"files": [
{
"filename": "hapnest.pvar",
"size": 278705850
},
{
"filename": "hapnest.pgen.crypt4gh",
"size": 278825058
4 changes: 4 additions & 0 deletions assets/key.json
@@ -0,0 +1,4 @@
{
"secretId": "8D705854-9EEA-44C5-9937-E4E5228B8457",
"secretIdVersion": "1"
}
1 change: 1 addition & 0 deletions helm/.gitignore
@@ -0,0 +1 @@
values.yaml
23 changes: 23 additions & 0 deletions helm/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions helm/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: globflow
description: A Helm chart for a globflow file transfer with crypt4gh decryption on the fly

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "2.0.0"
47 changes: 47 additions & 0 deletions helm/templates/config.yaml
@@ -0,0 +1,47 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: {{ .Release.Name }}-transfer-config
data:
input.json: {{ toJson .Values.globflowInput | quote }}
key.json: {{ toJson .Values.keyHandlerSecret | quote }}
params.yml: |
{{- range $key, $value := .Values.globflowParams }}
{{ $key }}: {{ $value }}
{{- end }}
nxf.config: |
workDir = {{ .Values.nxfParams.workBucketPath | quote }}
process {
executor = 'google-batch'
maxRetries = 1
}
google {
project = {{ .Values.nxfParams.gcpProject | quote }}
location = {{ .Values.nxfParams.location | quote }}
batch {
spot = {{ .Values.nxfParams.spot }}
}
}
wave {
enabled = {{ .Values.nxfParams.wave }}
}
fusion {
enabled = {{ .Values.nxfParams.fusion }}
}
tower {
accessToken = {{ .Values.secrets.towerToken | quote }}
workspaceId = {{ .Values.secrets.towerId | quote }}
enabled = true
}
scm: |
providers {
ebi {
server = 'https://gitlab.ebi.ac.uk'
platform = 'gitlab'
}
}
51 changes: 51 additions & 0 deletions helm/templates/job.yaml
@@ -0,0 +1,51 @@
apiVersion: batch/v1
kind: Job
metadata:
name: {{ .Release.Name }}
spec:
ttlSecondsAfterFinished: 3600
backoffLimit: 0
template:
metadata:
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
serviceAccountName: nextflow
containers:
- name: globflow
image: {{ .Values.baseImage }}:{{ .Values.dockerTag }}
imagePullPolicy: {{ .Values.pullPolicy }}
command: ['sh', '-c', "nextflow run https://gitlab.ebi.ac.uk/gdp-public/globflow.git -params-file /opt/nxf/params.yml -c /opt/nxf/nxf.config --decrypt"]
env:
- name: NXF_SCM_FILE
value: /opt/nxf/scm
resources:
requests:
cpu: "1"
memory: 2G
ephemeral-storage: 10G
volumeMounts:
- name: transfer-config
mountPath: /opt/nxf
- name: globflow-secrets
mountPath: /opt/globflow/
readOnly: true
volumes:
- name: transfer-config
configMap:
name: {{ .Release.Name }}-transfer-config
items:
- key: nxf.config
path: nxf.config
- key: scm
path: scm
- key: params.yml
path: params.yml
- key: input.json
path: input.json
- key: key.json
path: key.json
- name: globflow-secrets
secret:
secretName: {{ .Release.Name }}-transfer-secrets
restartPolicy: Never