nextstrain.org/monkeypox/ingest

This is the ingest pipeline for Monkeypox virus sequences.

Usage

NOTE: All command examples assume you are within the ingest directory. If running commands from the outer monkeypox directory, replace the "." argument with "ingest".

Fetch sequences with:

nextstrain build --cpus 1 . data/sequences.ndjson

Run the complete ingest pipeline with:

nextstrain build --cpus 1 .

This will produce two files (within the ingest directory):

data/metadata.tsv
data/sequences.fasta

Run the complete ingest pipeline and upload results to AWS S3 with:

nextstrain build . --configfiles config/config.yaml config/optional.yaml

Adding new sequences not from GenBank

Static Files

Do the following to include sequences from static FASTA files.

Convert the FASTA files to NDJSON files with the following command (the paths shown are relative to the outer monkeypox directory):

./ingest/bin/fasta-to-ndjson \
    --fasta {path-to-fasta-file} \
    --fields {fasta-header-field-names} \
    --separator {field-separator-in-header} \
    --exclude {fields-to-exclude-in-output} \
    > ingest/data/{file-name}.ndjson

Add the following to the .gitignore to allow the file to be included in the repo:
```
!ingest/data/{file-name}.ndjson
```
Add the file-name (without the .ndjson extension) as a source in ingest/config/config.yaml. This tells the ingest pipeline to concatenate the records with the GenBank sequences and run them through the same transform pipeline.
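
To make the conversion step more concrete, here is a minimal Python sketch of what turning FASTA records into NDJSON could look like. It is not the implementation of bin/fasta-to-ndjson; the function names (fasta_to_records, to_ndjson) and the "sequence" output key are hypothetical and chosen only for illustration. The fields, separator, and exclude parameters mirror the command-line options shown above.

```python
import json


def fasta_to_records(lines, fields, separator="|", exclude=()):
    """Yield one dict per FASTA record.

    The header (without the leading ">") is split on `separator` and the
    parts are paired with `fields` in order. Fields listed in `exclude`
    are dropped. The "sequence" key is an assumption made for this sketch.
    """
    def build(header, chunks):
        values = header.split(separator)
        record = {
            name: value.strip()
            for name, value in zip(fields, values)
            if name not in exclude
        }
        record["sequence"] = "".join(chunks)
        return record

    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield build(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield build(header, chunks)


def to_ndjson(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```

With a header such as ">MPXV-1|2022-05-20|USA" and fields strain, date, country, each record becomes one JSON object on its own line, which is the shape the pipeline can concatenate with other NDJSON sources.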

Configuration

Configuration takes place in config/config.yaml by default. Optional configs for uploading files and Slack notifications are in config/optional.yaml.

Environment Variables

The complete ingest pipeline with AWS S3 uploads and Slack notifications uses the following environment variables:

Required

AWS_DEFAULT_REGION
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
SLACK_TOKEN
SLACK_CHANNELS
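
Before starting a full run with uploads and notifications, it can help to confirm that the required variables are set. The following is a small sketch of such a check, not part of the pipeline itself; the helper name missing_env_vars is hypothetical, and treating an empty value as missing is an assumption.

```python
import os

# Required variables as listed above.
REQUIRED_VARS = [
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "SLACK_TOKEN",
    "SLACK_CHANNELS",
]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments inspects the current environment; an empty list means all five variables are present.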

Optional

These are optional environment variables used in our automated pipeline for providing detailed Slack notifications.

GITHUB_RUN_ID - provided via github.run_id in a GitHub Actions workflow
AWS_BATCH_JOB_ID - provided via AWS Batch Job environment variables

Input data

GenBank data

GenBank sequences and metadata are fetched via NCBI Virus. The exact URL used to fetch data is constructed in bin/genbank-url.
