Skip to content

seandavi/curatedMetagenomicsNextflow

Repository files navigation

README

The pipeline, in brief

  • fastq-dump
  • setup databases
    • metaphlan
    • chocophlan
    • uniref
  • metaphlan bug list
  • metaphlan markers
  • humann (including various aggregations)

Inputs and Outputs

The metadata_tsv file must be:

  • tab-separated
  • must contain columns
    • sample_id
    • study_name
    • NCBI_accession, a semicolon-separated list of SRRs
  • Can be a file or a web url

If using a Google Bucket, the name bucket must not have underscores.

On sample ids

We use a simple approach to create sample ids. The study_name and sample_id are first concatenated by ::. Then, we base64 encode. For example:

echo 'study_name1::sample_name1' | base64

This yields:

c3R1ZHlfbmFtZTE6OnNhbXBsZV9uYW1lMQo=

To decode a sample id:

echo 'c3R1ZHlfbmFtZTE6OnNhbXBsZV9uYW1lMQo=' | base64 -d

which gives back the original string:

study_name1::sample_name1

Install

export NXF_MODE=google
curl https://get.nextflow.io | bash

Google setup

You will need to be able to access google cloud storage as well as to run the Google Cloud Pipeline API. This requires credentials to do so. You can either use Google Default Application Credentials or a key file. The latter is the recommended approach. If you need a keyfile, contact the person who owns the Google Project you'll be using.

Keyfile setup

Once you get a keyfile (which is a json file), run the following:

# just examples:
export SVC_ACCOUNT='nextflow-service-account@curatedmetagenomicdata.iam.gserviceaccount.com' #example name
export GOOGLE_APPLICATION_CREDENTIALS=/data/curatedmetagenomicdata-f7fc1489b036.json
export GCP_PROJECT=curatedmetagenomicdata
gcloud auth activate-service-account \
   $SVC_ACCOUNT \
   --key-file=$GOOGLE_APPLICATION_CREDENTIALS \
   --project=$GCP_PROJECT

Execution

This assumes that you are running on Google, that credentials are set up correctly, and that you have a Google Storage Bucket already created. Note that bucket names must NOT contain the _ or other special characters.

# No '_' or other non-url-safe characters in bucket names
export GOOGLE_BUCKET_NAME='your-bucket-name'

# if bucket does not exist:
gsutil mb gs://$GOOGLE_BUCKET_NAME

You can now run test data. This will take a few hours the first time, so run on a system that will remain on during that time (laptops are not a good choice if you are going to close it and go home, for example).

./nextflow run seandavi/curatedMetagenomicsNextflow \
  -profile google \
  -work-dir gs://$GOOGLE_BUCKET_NAME/work \
  --publish_dir=gs://$GOOGLE_BUCKET_NAME/results \
  --store_dir=gs://$GOOGLE_BUCKET_NAME/store \
  -resume \
  -r main \
  --metadata_tsv https://raw.githubusercontent.com/seandavi/curatedMetagenomicsNextflow/main/samplesheet.test.tsv

To view results:

gsutil ls -larh gs://$GOOGLE_BUCKET_NAME/results

To view an individual file:

gsutil cat PATH_TO_GOOGLE_OBJECT # from above list

To cleanup:

gsutil -m rm -r gs://$GOOGLE_BUCKET_NAME

nf-core tools integration

pip install nf-core
nf-core launch seandavi/curatedMetagenomicsNextflow

About

Curated Metagenomics Data Nextflow workflows

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •