Commonly used GMS Commands
In the following sections we will go through many genome model system (GMS) commands that are useful for creating, running, monitoring and exploring analyses. This tutorial assumes you have already completed the GMS installation. Some commands will be relevant with a fresh install. Some commands will assume that you have also downloaded our test data set (see installation document for details). The test data consists of whole genome (WGS), exome, and RNA-seq data from HCC1395 and HCC1395/BL breast cancer cell lines. Finally, some commands will only be relevant after you have successfully generated results from the test builds. You should complete the installation and simple tutorial before getting into some of the more advanced commands here. This document is provided as a reference. There are many more possibilities than what is depicted here but the hope is that between these examples and the help documentation for each command, you will be able to figure out most things needed to use the standalone GMS (sGMS).
GMS commands follow a simple pattern. The top-level program is genome, and all other commands are subcommands of the genome command. Typing a command at the command line with either --help or no arguments will display the available subcommands and a brief description of each. Subcommands with additional subcommands are indicated by an ellipsis (...) next to the name.
For example:
$ genome
Sub-commands for genome:
analysis-project ... work with analysis projects
config ... commands that deal with analysis project configuration
db ... external database interfaces
disk ... commands that work with allocations, volumes, etc
druggable-gene ... commands that work with druggable gene objects
feature-list ... work with feature-lists
individual ... work with individuals
instrument-data ... work with instrument data
library ... work with libraries
model ... commands that act on models
model-group ... work with model-groups
population-group ... work with population groups
processing-profile ... work with processing profiles.
project ... work with projects
project-part ... work with project parts
report work with reports
sample ... work with samples
search no description!!!: define 'doc' in the class definition for
Genome::Command::Search
software-result ... commands that work with software results
subject ... works with subjects
sys ... work with OS integration
task ... work with tasks
taxon ... work with taxons
tools ... bioinformatics tools for genomics
ERROR: Please specify valid params for 'genome'.
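The listing format above lends itself to simple filtering. As a minimal sketch, assuming the output keeps the layout shown (name, optional "..." marker, description), the helper below prints only the subcommands that have further subcommands. The function name list_command_groups is made up for this example, and the sample lines are copied from the listing above.

```shell
# Print the names of sub-commands that have further sub-commands,
# i.e. lines whose second whitespace-separated field is the "..." marker.
list_command_groups() {
    awk '$2 == "..." { print $1 }'
}

# Demonstration on three lines copied from the listing above.
# "report" has no "..." marker, so it is not printed.
list_command_groups <<'EOF'
model ... commands that act on models
report work with reports
taxon ... work with taxons
EOF

# Against a live install (assumption: the listing may go to stderr):
#   genome 2>&1 | list_command_groups
```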
For a visual summary of some commonly used GMS commands, please refer to the GMS Commands Cheat sheet
Some commonly used commands have abbreviated versions. For example, genome model build view can also be executed as 'gmbv'. This convention is followed throughout the GMS and this document.
To view documentation on any GMS command, simply run the command without options and add --help to the end. For example: genome model build view --help
Many commands, sub-commands, and options will also display by tab-completion. In many of the GMS lister commands used below, multiple output styles can be selected. Most listers allow filtering where a query can be constructed with an 'SQL like' syntax.
- Install exercises
- General usage
- Taxa
- Individuals
- Libraries
- Models
- Model groups
- Processing profiles
- Feature lists
- Instrument data
- Databases
- Defining new models
- Finding the reference sequence fasta files for analysis
- Finding reference annotation files used for analysis
- Updating models and builds
- Finding result directories, BAM files, and other output
- Querying input models from a somatic-variation or clin-seq/med-seq model
- Genome model tools
Install exercises
Review the output from these commands:
lsid # You should see the openlava cluster identification
lsload # You should see a report of available resources
bjobs # You should not have any unfinished jobs yet
bsub 'sleep 60' # You should be able to submit a job to openlava (run bjobs again to see it)
bhosts # You should see one host
bqueues # You should see four queues
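Before reviewing the output of each command, it can help to confirm that all of them are installed. The following is a minimal sketch that only checks whether the six commands above are on the PATH; it does not check that the cluster itself is healthy. The function name missing_commands is made up for this example.

```shell
# Print every command from the argument list that is not found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Check the openlava commands used in this section.
missing=$(missing_commands lsid lsload bjobs bsub bhosts bqueues)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All openlava commands found"
fi
```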
Review the output from these commands:
genome disk group list
genome disk volume list
genome sys gateway list
Once you have completed the installation, if you have not already done so, don't forget to sync over the test data as follows:
wget https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/GMS1/export/18177dd5eca44514a47f367d9804e17a-2014.1.16.dat
genome model import metadata 18177dd5eca44514a47f367d9804e17a-2014.1.16.dat
genome sys gateway attach GMS1 --protocol ftp --rsync
If you have already synced over the example data, or imported your own data, skip this step.
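The "skip this step if already done" advice can be built into the download itself. This is a sketch only: fetch_once is a hypothetical helper, and it guards just the wget step. Whether the import and gateway attach commands are safe to repeat is not stated here, so they are left as comments.

```shell
# Download a file only if a non-empty local copy does not already exist.
#   $1 = URL, $2 = local file name
fetch_once() {
    if [ -s "$2" ]; then
        echo "already present: $2"
    else
        wget -O "$2" "$1"
    fi
}

# Usage with the metadata file named above:
#   f=18177dd5eca44514a47f367d9804e17a-2014.1.16.dat
#   fetch_once "https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/GMS1/export/$f" "$f"
#   genome model import metadata "$f"
```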
Taxa
A taxon object is how a species is defined in the GMS.
List all taxon records in the system:
genome taxon list
Individuals
An individual in the context of the GMS is usually a patient in a clinical study, a subject consented to a research project, or a cell line. Multiple samples derived from this person or cell line will all be associated with a single individual in the system.
List all individuals in the system:
genome individual list
View a detailed report on metadata and instrument data associated with a particular individual:
genome individual view --individual=H_NJ-HCC1395
Samples
Samples are typically nucleic acid material (e.g. DNA or RNA) associated with an individual. For example, the HCC1395 individual has four samples: normal DNA, tumor DNA, normal RNA, tumor RNA.
List all samples in the system:
genome sample list
List only tumor samples for a particular patient common name and refine the info shown:
genome sample list --filter 'patient_common_name=TST1,common_name=tumor' --show name,patient_common_name,common_name,extraction_type,extraction_label
List attributes of a particular sample:
genome sample attribute list --filter sample.name='H_NJ-HCC1395-HCC1395'
Use a pattern match to find all HCC1395 related samples:
genome sample list --filter 'name like "%HCC1395%"' --show id,name,common_name,tissue_desc,extraction_type,extraction_label
Libraries
A library is a derivative of a sample that has undergone processing (i.e. a series of molecular biology steps) in preparation for sequencing or other high-throughput analysis. Each sequencing library is associated with a single sample or pool of samples (in the case of pooling of indexed libraries). Multiple libraries may be created from a single sample. For example, for whole genome sequencing, DNA might be fragmented, divided into more than one size fraction, and a sequencing library created from each.
List all libraries in the system:
genome library list
List more detailed information for libraries made from tumor RNA samples:
genome library list --filter 'sample.common_name=tumor,sample.extraction_type=rna' --show 'name,original_insert_size,library_insert_size,protocol,transcript_strand,sample_source'
Models
Genome models are a way of organizing and conceptualizing analyses. A model represents a particular conclusion about a genome given a particular set of input data and a given "processing profile". A model may take instrument data (e.g. lanes of Illumina sequence data), annotation, reference sequences or other relevant inputs. For example, a reference-alignment model might consist of a processing-profile describing alignment with BWA, three lanes of Illumina instrument data to be aligned, and a reference sequence representing our prior expectations about humans in general. (The reference sequence is itself a model constructed by other means.) A processing-profile is a description or refinement of how the analysis of a model is to be performed. A 'build' is an attempt to perform the analysis described in a model. Builds, processing-profiles and other features of the GMS will be described in more detail below. Please also refer to the GMS manuscript and supplementary materials for detailed descriptions of GMS concepts.
List all models in the system:
genome model list
List all models with a specific type:
genome model list --filter type_name='imported reference sequence'
List all models but show a custom set of information on each:
genome model list --show name,type_name,creation_date
List all models but change the output format to one of 'text', 'pretty', 'html', 'xml', 'tsv', 'csv':
genome model list --show name,type_name,creation_date --style tsv
List models by filtering on a group of values. The 'in' syntax is very helpful for filtering on a set of valid values:
genome model reference-alignment list --filter 'subject.extraction_label in ["TCGA-A2-A04P-10A-01D-A128-09","TCGA-A2-A04Q-10A-01D-A128-09"]'
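When the list of values comes from a script rather than being typed by hand, the 'in' expression can be assembled programmatically. The sketch below builds the same filter string as the command above; build_in_filter is a hypothetical helper name, and the bracketed, double-quoted list format is copied from that command.

```shell
# Build a filter of the form:  field in ["v1","v2",...]
#   $1 = property name, remaining arguments = allowed values
build_in_filter() {
    field=$1
    shift
    list=""
    for value in "$@"; do
        # Append a comma before every value except the first.
        list="${list:+$list,}\"$value\""
    done
    printf '%s in [%s]\n' "$field" "$list"
}

# Reproduces the filter used in the command above.
build_in_filter subject.extraction_label \
    TCGA-A2-A04P-10A-01D-A128-09 TCGA-A2-A04Q-10A-01D-A128-09

# Usage:
#   genome model reference-alignment list --filter "$(build_in_filter subject.extraction_label ID1 ID2)"
```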
Show the inputs on a model (in addition to 'show', refer to documentation for input 'add', 'remove', 'update'):
genome model input show '$model'
Replace '$model' with a valid model id or name from the list you displayed with the genome model list command.
Builds
In the following commands you will need to replace $build_id or $model_id with a valid id from your system. The first command lists all build IDs currently in the system.
List all builds in the system:
genome model build list
List all models again and find one that needs a build:
genome model list
Start a new build for an existing model:
genome model build start $model_id
View the status of a build:
genome model build status $build_id
View detailed run info of a build:
genome model build view $build_id
Abandon a build (if it has crashed or was created by accident perhaps):
genome model build abandon $build_id
If you find that a build has crashed (e.g. due to a disk outage during the run) you can simply start a new build of that model and abandon the failed build. Many parts of the analysis workflows are stored as independent software results. The GMS will attempt wherever possible to detect what steps of complex workflows were successful in a previous attempt and shortcut on these steps to prevent repetition of work.
Model groups
A model group is a collection of models placed into a group to help organize analysis, converge data across multiple samples, etc.
List models of the type 'reference-alignment':
genome model list --filter type_name="reference alignment"
Create a new model group with four reference-alignment models:
genome model-group create --models="2891325873,2891325882,2891377978,2891377997" "HCC1395 reference models"
List the members of a model-group:
genome model-group member list --filter model_group.name="HCC1395 reference models"
Remove a model from the model group:
genome model-group member remove --model-group="HCC1395 reference models" --models=2891377978
Add a model to the model group:
genome model-group member add --model-group="HCC1395 reference models" --models=2891377978
Get the last complete builds for all models of a model group:
genome model-group get-last-completed-builds --model-group="HCC1395 reference models" --print-output
Get the status and ID of the latest build (succeeded or failed) for each model of a model group:
genome model-group member list model_group.id=73905 --show +model.latest_build.status,+model.latest_build.id
Use the reference-alignment model lister and filter on model-group name:
genome model reference-alignment list --filter model_groups.name="HCC1395 reference models"
Create a model group of somatic-variation models, then use the somatic-variation model lister to filter on model-group name and display input models:
genome model-group create --models="34c6469f77cd47de9f5731394594dada,77ae30e13e154ddb918b8903fb02ff8d" "HCC1395 somatic models"
genome model somatic-variation list --filter model_groups.name="HCC1395 somatic models" --show id,name,tumor_model,normal_model
Copy an existing model-group to a new model-group and update the processing-profile:
genome model-group copy 73905 "HCC ClinSeq v4" processing_profile=ac594e8c10ac420aaf8e6cfa64888e92
Processing profiles
Processing profiles are the detailed plans, parameters, and options of an analysis. For example, a reference-alignment processing profile describes what aligner will be used with what options, how variants will be called from the resulting BAM, etc. Genome 'models' of each type have a single processing-profile associated with them. To run an analysis on a set of data in a different way, one could create a new processing-profile and create a new model that uses it.
List reference-alignment processing profiles:
genome processing-profile list reference-alignment
Describe a specific processing profile in detail:
genome processing-profile describe --processing-profiles 'Default Reference Alignment'
View a specific processing profile in more human readable format:
genome processing-profile view --processing-profile 'Default Reference Alignment'
Create a new reference-alignment processing profile based on an existing one, increasing the number of threads used by BWA from 4 to 8:
genome processing-profile create reference-alignment --based-on=2635769 --read-aligner-params='-t 8 -q 5::' --name='Feb 2014 Test Reference Alignment'
Create a new rna-seq processing profile based on an existing one, with the fusion detector options set to empty values (note that a name containing spaces must be quoted):
genome processing-profile create rna-seq --based-on=56d4842f4db64d199cf18c55ff20705a --name='December 2014 OvationV2 RNA-seq - No Chimerascan - Candidate 1' --fusion-detector "" --fusion-detector-params "" --fusion-detector-version ""
Now view this new reference-alignment processing profile:
genome processing-profile view --processing-profile 'Feb 2014 Test Reference Alignment'
Now compare this new processing profile to the original one:
genome processing-profile diff --from 2635769 --to 'Feb 2014 Test Reference Alignment'
List the somatic-variation processing profiles currently defined in the system:
genome processing-profile list somatic-variation --show name,id --style pretty
View details of the WGS somatic-variation processing profile:
genome processing-profile view --processing-profile 2762562
Feature lists
A feature list is an arbitrary set of coordinates of interest within a reference genome. For example, when analyzing Exome data we specify a set of regions of interest that correspond to the exons targeted by the Exome capture reagent. The feature list might also correspond to a lift-over of such a list from one version of the reference genome to another. For more information on the concept of lift-over refer to this tutorial: http://www.biostars.org/p/65558/
List all feature lists in the system:
genome feature-list list
List information for a specific ROI (region of interest) file:
genome feature-list list --filter name="11111001 capture chip set"
Instrument data
Instrument data are single pieces of data from some high-throughput platform. In the GMS we use primarily Illumina sequence data and genotype microarray data. A single piece of Illumina instrument data is usually a single lane from a flowcell. If multiple samples were pooled, indexed and sequenced in a single lane, there will be one piece of instrument data for each lane/index combination. Note that exome or other capture data is defined as such by the presence of a target_region_set_name associated with the instrument data. Whole genome data and RNA-seq data will not have a target region set associated with them.
List all Illumina instrument data in the system:
genome instrument-data list solexa
List all genotype microarray data in the system:
genome instrument-data list imported --filter sequencing_platform=infinium,import_format='genotype file'
List only exome data from the NimbleGen v3 platform:
genome instrument-data list solexa --filter target_region_set_name='NimbleGen v3 Capture Chip Set'
List only tumor Illumina data:
genome instrument-data list solexa --filter sample.common_name=tumor --show id,flow_cell_id,lane,index_sequence,sample_name,library_name,clusters | grep -v Pooled
Show input BAM paths for all data and include some extra sample metadata:
genome instrument-data list solexa --show id,flow_cell_id,lane,index_sequence,sample_name,target_region_set_name,sample.common_name,sample.extraction_type,bam_path
Show instrument data associated with a particular reference alignment model:
genome model list --filter type_name="reference alignment"
genome model instrument-data list --model='hcc1395-tumor-refalign-wgs'
Show all instrument data that is technically compatible with a model:
genome model instrument-data list --model='hcc1395-tumor-refalign-wgs' --compatible
Remove a lane of instrument data from a model:
genome model instrument-data unassign --model='hcc1395-tumor-refalign-wgs' --instrument-data=2891323147
Assign all instrument data to a model:
genome model instrument-data assign all-compatible --model='hcc1395-tumor-refalign-wgs'
Databases
Various file-based databases are imported and used in the GMS. These are stored on GitHub and imported during the installation and basic tutorial steps. Some of these databases are used as inputs on certain model types. Once imported, the databases present in your system can be queried as follows:
genome db list
Defining new models
Using the concepts and inputs described above, this section reviews the creation of models of each type. For illustration purposes we will re-create the models used in the HCC1395 demonstration analysis.
Genotype microarray models
Define tumor and normal genotype-microarray models to perform microarray based genotyping and copy number analysis:
#normal
genome model define genotype-microarray --processing-profile='2575175' --variation-list-build='127786607' --subject-name='H_NJ-HCC1395-HCC1395_BL' --model-name='HCC1395 Normal SNP Array'
genome instrument-data list imported --filter sample_name='H_NJ-HCC1395-HCC1395_BL'
genome model instrument-data assign expression --model='HCC1395 Normal SNP Array' --instrument-data=2891080776
#tumor
genome model define genotype-microarray --processing-profile='2575175' --variation-list-build='127786607' --subject-name='H_NJ-HCC1395-HCC1395' --model-name='HCC1395 Tumor SNP Array'
genome instrument-data list imported --filter sample_name='H_NJ-HCC1395-HCC1395'
genome model instrument-data assign expression --model='HCC1395 Tumor SNP Array' --instrument-data=2891080777
DNA reference alignment models - WGS
Define tumor and normal WGS reference-alignment models to align all reads to the human genome reference sequence, perform germline variant calling, summarize coverage, and complete extensive quality assessments of the data:
#normal
genome model define reference-alignment --reference-sequence-build='GRCh37-lite-build37' --annotation-reference-build='124434505' --subject='H_NJ-HCC1395-HCC1395_BL' --processing-profile='2635769' --dbsnp-build='127786607' --genotype-microarray-model='HCC1395 Normal SNP Array' --model-name='HCC1395 Normal Ref Align WGS'
genome model instrument-data assign all-compatible --model='HCC1395 Normal Ref Align WGS'
genome model build start 'HCC1395 Normal Ref Align WGS'
#tumor
genome model define reference-alignment --reference-sequence-build='GRCh37-lite-build37' --annotation-reference-build='124434505' --subject='H_NJ-HCC1395-HCC1395' --processing-profile='2635769' --dbsnp-build='127786607' --genotype-microarray-model='HCC1395 Tumor SNP Array' --model-name='HCC1395 Tumor Ref Align WGS'
genome model instrument-data assign all-compatible --model='HCC1395 Tumor Ref Align WGS'
genome model build start 'HCC1395 Tumor Ref Align WGS'
DNA reference alignment models - Exome
Define tumor and normal Exome reference-alignment models. Note the main difference from the previous models is that we must now specify additional inputs to define the regions targeted by the Exome capture reagent:
#normal
genome model define reference-alignment --reference-sequence-build='GRCh37-lite-build37' --annotation-reference-build='124434505' --subject='H_NJ-HCC1395-HCC1395_BL' --processing-profile='2635769' --dbsnp-build='127786607' --target-region-set-names='NimbleGen v3 Capture Chip Set' --region-of-interest-set-name='NimbleGen v3 Capture Chip Set' --genotype-microarray-model='HCC1395 Normal SNP Array' --model-name='HCC1395 Normal Ref Align Exome'
genome model instrument-data assign all-compatible --model='HCC1395 Normal Ref Align Exome'
genome model build start 'HCC1395 Normal Ref Align Exome'
#tumor
genome model define reference-alignment --reference-sequence-build='GRCh37-lite-build37' --annotation-reference-build='124434505' --subject='H_NJ-HCC1395-HCC1395' --processing-profile='2635769' --dbsnp-build='127786607' --target-region-set-names='NimbleGen v3 Capture Chip Set' --region-of-interest-set-name='NimbleGen v3 Capture Chip Set' --genotype-microarray-model='HCC1395 Tumor SNP Array' --model-name='HCC1395 Tumor Ref Align Exome'
genome model instrument-data assign all-compatible --model='HCC1395 Tumor Ref Align Exome'
genome model build start 'HCC1395 Tumor Ref Align Exome'
Somatic variation models
Define a somatic-variation model that compares tumor and normal to identify somatic variants. Note that this model uses the reference-alignment models created above as inputs:
#WGS data
genome model define somatic-variation --processing-profile='Default Somatic Variation WGS' --normal-model='HCC1395 Normal Ref Align WGS' --tumor-model='HCC1395 Tumor Ref Align WGS' --annotation-build='124434505' --previously-discovered-variations-build='127786607' --subject='H_NJ-HCC1395-HCC1395' --model-name='HCC1395 Somatic Variation WGS'
genome model build start 'HCC1395 Somatic Variation WGS'
#Exome data
genome model define somatic-variation --processing-profile='Default Somatic Variation Exome' --normal-model='HCC1395 Normal Ref Align Exome' --tumor-model='HCC1395 Tumor Ref Align Exome' --annotation-build='124434505' --previously-discovered-variations-build='127786607' --subject='H_NJ-HCC1395-HCC1395' --model-name='HCC1395 Somatic Variation Exome'
genome model build start 'HCC1395 Somatic Variation Exome'
RNA-seq models
Define tumor and normal RNA-seq models to align all RNA-seq reads to the human reference sequence, perform transcript assembly, estimate isoform and gene expression, and complete extensive quality assessments of the data:
#normal
genome model define rna-seq --reference-sequence-build='GRCh37-lite-build37' --annotation-build='124434505' --subject='H_NJ-HCC1395-HCC1395_BL_RNA' --processing-profile='Default Ovation V2 RNA-seq' --model-name='HCC1395 Normal RNA-seq'
genome model instrument-data assign all-compatible --model='HCC1395 Normal RNA-seq'
genome model build start 'HCC1395 Normal RNA-seq'
#tumor
genome model define rna-seq --reference-sequence-build='GRCh37-lite-build37' --annotation-build='124434505' --subject='H_NJ-HCC1395-HCC1395_RNA' --processing-profile='Default Ovation V2 RNA-seq' --model-name='HCC1395 Tumor RNA-seq'
genome model instrument-data assign all-compatible --model='HCC1395 Tumor RNA-seq'
genome model build start 'HCC1395 Tumor RNA-seq'
Differential expression models
Define a differential expression model to compare expression of genes between tumor and normal:
genome model define differential-expression --processing-profile='cuffcompare/cuffdiff 2.0.2 protein_coding only' --condition-labels-string='normal,tumor' --condition-model-ids-string='$normal_rnaseq_model_id $tumor_rnaseq_model_id' --reference-sequence-build='GRCh37-lite-build37' --annotation-build='124434505' --model-name='HCC1395 Differential Expression' --subject='H_NJ-HCC1395'
genome model build start 'HCC1395 Differential Expression'
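In the above command, --condition-model-ids-string accepts a space-separated list of samples and, optionally, replicates of those samples. If replicates are available, the replicates of each sample are grouped together and comma-separated. For example: 'sample1_rep1,sample1_rep2 sample2_rep1,sample2_rep2'. Replace $normal_rnaseq_model_id and $tumor_rnaseq_model_id with the IDs of your own RNA-seq models; because the value is single-quoted, the shell will not expand them as variables.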
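The string format for --condition-model-ids-string can be assembled from lists of replicates. The following is a minimal sketch: join_replicates is a hypothetical helper, the numeric IDs are placeholders rather than real model IDs, and listing normal before tumor simply mirrors the order used in the command above.

```shell
# Join the replicate model IDs of one condition with commas.
join_replicates() {
    joined=""
    for id in "$@"; do
        joined="${joined:+$joined,}$id"
    done
    printf '%s\n' "$joined"
}

normal=$(join_replicates 1001 1002)   # placeholder normal replicate model IDs
tumor=$(join_replicates 2001 2002)    # placeholder tumor replicate model IDs

# Conditions are separated by a single space.
condition_model_ids="$normal $tumor"
echo "$condition_model_ids"
```

The resulting variable can be passed in double quotes, e.g. --condition-model-ids-string="$condition_model_ids", so that the shell expands it.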
Med-Seq models (aka clin-seq)
Define a med-seq (aka clin-seq) model to integrate WGS, exome and transcriptome data, to perform additional annotations, to create extensive visualizations of all somatic mutation types, and to assess druggability and clinical relevance of these events:
genome model define clin-seq --wgs-model='HCC1395 Somatic Variation WGS' --exome-model='HCC1395 Somatic Variation Exome' --normal-rnaseq-model='HCC1395 Normal RNA-seq' --tumor-rnaseq-model='HCC1395 Tumor RNA-seq' --de-model='HCC1395 Differential Expression' --name='HCC1395 Clin-Seq' --processing-profile='Default Clinical Sequencing'
genome model build start 'HCC1395 Clin-Seq'
Finding the reference sequence fasta files for analysis
List all models, show inputs on a reference-alignment model, find the reference sequence build associated with this model, and query the system for the location of the reference sequence fasta file:
genome model list
genome model input show --model='hcc1395-normal-refalign-wgs'
genome model build list --filter id=106942997 --show model_name,data_directory
ls /opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/
perl -e 'use Genome; $reference_build=Genome::Model::Build->get(106942997); $ref_fasta_path = $reference_build->full_consensus_path("fa"); print "\n\n$ref_fasta_path\n\n";'
Finding reference annotation files used for analysis
List all models, show inputs on a reference-alignment model, find the reference annotation build associated with this model, and query the system for the location of the reference annotation GTF file:
genome model list
genome model input show --model='hcc1395-normal-refalign-wgs'
genome model build list --filter id=124434505 --show model_name,data_directory
ls /opt/gms/GMS1/fs/gc12001/info/model_data/2772828715/build124434505/annotation_data/rna_annotation/
perl -e 'use Genome; $annotation_build=Genome::Model::Build->get(124434505); $gtf_path = $annotation_build->annotation_file("gtf","106942997"); print "\n\n$gtf_path\n\n";'
Updating models and builds
You can create a model, launch a build and later decide that you would like to update that model. For example, you might decide to use a different reference genome sequence, or newer transcript annotations, or add more lanes of data, etc. To do such an update, you can make a copy of the model and change values or you can update the original model and simply launch a new build.
Copy an existing model and apply a new processing-profile to the new model:
genome model input show 'hcc1395-tumor-refalign-wgs'
genome model reference-alignment list --filter name='hcc1395-tumor-refalign-wgs'
genome processing-profile create reference-alignment --based-on=2635769 --read-aligner-params='-t 8 -q 5::' --name='Feb 2014 Test Reference Alignment'
genome processing-profile list reference-alignment --filter name='Feb 2014 Test Reference Alignment' --show id
genome model copy "hcc1395-tumor-refalign-wgs" "name=hcc1395-tumor-refalign-wgs-new" "processing_profile=$processing_profile_id"
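The copy command above expects $processing_profile_id to be set already. One way to set it from the lister is sketched below. It assumes that --show id --style tsv prints a header line followed by the matching ID, so that the last line of output is the ID; check the lister output on your system before relying on this. The function name processing_profile_id_for is made up for this example.

```shell
# Print the ID of the reference-alignment processing profile with the given name.
# Assumption: the tsv output ends with the line holding the ID.
processing_profile_id_for() {
    genome processing-profile list reference-alignment \
        --filter "name=$1" --show id --style tsv | tail -n 1
}

# Usage:
#   processing_profile_id=$(processing_profile_id_for 'Feb 2014 Test Reference Alignment')
#   genome model copy "hcc1395-tumor-refalign-wgs" "name=hcc1395-tumor-refalign-wgs-new" "processing_profile=$processing_profile_id"
```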
Update inputs on an existing model (note that processing-profile is not an input and can not be updated in this manner):
genome model input show hcc1395-tumor-refalign-wgs
genome model input remove hcc1395-tumor-refalign-wgs --name=instrument_data --value=2891322951
genome model input add hcc1395-tumor-refalign-wgs --name=instrument_data --value=2891322951
genome model input update hcc1395-tumor-refalign-wgs --name=annotation_reference_build --value=124434505
Update models using the 'advise' tool:
genome model clin-seq advise --individual "common_name = 'TST1'" --samples='id in [2889981253,2889981254,2889953341,2889953342]'
Finding result directories, BAM files, and other output
Find the input BAMs used for a reference-alignment model and then the resulting, merged, de-duplicated alignment BAM produced by BWA. Note how we must access the alignment result via the 'last succeeded build' of the model:
#obtain input bam files (unaligned data) via an instrument-data object
genome model instrument-data list --model='hcc1395-normal-refalign-wgs' --show id,full_name,sample_name,bam_path
#obtain result bam files (aligned data) via a reference-alignment object
genome model reference-alignment list --filter name='hcc1395-normal-refalign-wgs' --show last_succeeded_build.whole_rmdup_bam_file
Find the normal and tumor BAMs used as input to a somatic-variation model:
#obtain normal and tumor bam files (aligned data) via a somatic-variation object
genome model somatic-variation list --filter name='hcc1395-somatic-wgs' --show last_succeeded_build.normal_bam,last_succeeded_build.tumor_bam
Find the input BAMs used for an RNA-seq alignment model and then the resulting alignment BAM produced by tophat:
#obtain input bam files (unaligned data) via an instrument-data object
genome model instrument-data list --model='hcc1395-normal-rnaseq' --show id,full_name,sample_name,bam_path
#obtain result bam files (aligned data) via an rna-seq model
genome model rna-seq list --filter name='hcc1395-normal-rnaseq' --show last_succeeded_build.alignment_result.bam_file
Use a generic lister to get resulting alignment BAMs for any model with alignments:
#obtain result bam files (aligned data) via a merged_alignment_result (software result) object and starting with a reference-alignment model
genome model list --filter name='hcc1395-normal-refalign-wgs' --show id,name,last_complete_build.merged_alignment_result.bam_path
#obtain result bam files (aligned data) via a merged_alignment_result (software result) object and starting with an rna-seq model
genome model list --filter name='hcc1395-normal-rnaseq' --show id,name,last_complete_build.merged_alignment_result.bam_path
#obtain result bam files (aligned data) via a merged_alignment_result (software result) object and starting with a somatic-variation model
genome model list --filter name='hcc1395-somatic-wgs' --show id,name,tumor_model.last_complete_build.merged_alignment_result.bam_path
#obtain result bam files (aligned data) via a merged_alignment_result (software result) object and starting with a clin-seq model
genome model list --filter name='hcc1395-clinseq' --show id,name,wgs_model.tumor_model.last_complete_build.merged_alignment_result.bam_path
Show the top-level path to all results for the last succeeded build of each model type. Again note that we must access data directories via a specific build ... in this case the last succeeded build:
genome model list --filter 'type_name in ["reference alignment", "rna seq","somatic variation", "differential expression", "clin seq"]' --show name,last_succeeded_build.data_directory
Show the top-level data directory for a single build:
genome model reference-alignment list --filter name='hcc1395-tumor-refalign-wgs' --show name,last_succeeded_build.data_directory
genome model rna-seq list --filter name='hcc1395-tumor-rnaseq' --show name,last_succeeded_build.data_directory
genome model somatic-variation list --filter name='hcc1395-somatic-wgs' --show name,last_succeeded_build.data_directory
genome model differential-expression list --filter name='hcc1395-differential-expression' --show name,last_succeeded_build.data_directory
genome model clin-seq list --filter name='hcc1395-clinseq' --show name,last_succeeded_build.data_directory
Refer to the following link for more details on critical output files of each GMS pipeline: Important results files in GMS builds
Querying input models from a somatic-variation or clin-seq/med-seq model
In genome listing commands you can traverse objects by using '.' between method names. For example, in this way you can do the following: med-seq model -> underlying exome somatic variation model -> underlying tumor reference alignment model -> last succeeded build of that model -> bam file of that build.
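The traversal described above can be written as a single --show expression. The sketch below chains property names that each appear in other commands in this document (exome_model, tumor_model, last_succeeded_build, whole_rmdup_bam_file). This exact combination is an assumption and has not been checked against a live system; tumor_exome_bam is a hypothetical wrapper name.

```shell
# med-seq model -> exome somatic-variation model -> tumor reference-alignment
# model -> last succeeded build -> de-duplicated BAM file of that build
show_path="exome_model.tumor_model.last_succeeded_build.whole_rmdup_bam_file"

# List that BAM path for the clin-seq (med-seq) model with the given name.
tumor_exome_bam() {
    genome model clin-seq list --filter "name=$1" --show "$show_path" --style tsv
}

# Usage:
#   tumor_exome_bam hcc1395-clinseq
```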
Show the tumor and normal Exome reference alignment models used as input to an Exome somatic variation model:
genome model somatic-variation list --filter name='hcc1395-somatic-exome' --show normal_model,tumor_model
genome model somatic-variation list --filter name='hcc1395-somatic-exome' --show normal_model.last_complete_build.whole_rmdup_bam_file,tumor_model.last_complete_build.whole_rmdup_bam_file --style pretty
Show the tumor and normal WGS reference alignment models used as input to the WGS somatic variation model of a med-seq model:
genome model clin-seq list --filter name='hcc1395-clinseq' --show wgs_model.normal_model,wgs_model.tumor_model --style pretty
Check the reference sequence builds being used by tumor and normal WGS alignments:
genome model clin-seq list --filter name='hcc1395-clinseq' --show wgs_model.normal_model.reference_sequence_build,wgs_model.tumor_model.reference_sequence_build --style pretty
List the tumor and normal exome BAM files associated with a med-seq (clin-seq) model:
genome model clin-seq list --style csv --filter name='hcc1395-clinseq' --show exome_model.last_succeeded_build.normal_build.subject.name,exome_model.last_succeeded_build.normal_build.whole_rmdup_bam_file,exome_model.last_succeeded_build.tumor_build.subject.name,exome_model.last_succeeded_build.tumor_build.whole_rmdup_bam_file
genome model clin-seq list --style csv --filter id=$my_id --show exome_model.last_succeeded_build.normal_build.subject.name,exome_model.last_succeeded_build.normal_build.whole_rmdup_bam_file,exome_model.last_succeeded_build.tumor_build.subject.name,exome_model.last_succeeded_build.tumor_build.whole_rmdup_bam_file
Show the status of all input models for a clin-seq model:
genome model clin-seq list --filter name='hcc1395-clinseq' --show wgs_model.normal_model.latest_build.status,wgs_model.tumor_model.latest_build.status,exome_model.normal_model.latest_build.status,exome_model.tumor_model.latest_build.status,normal_rnaseq_model.latest_build.status,tumor_rnaseq_model.latest_build.status,exome_model.latest_build.status,wgs_model.latest_build.status,de_model.latest_build.status --style pretty
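Long `--show` lists like the ones above are easy to break with a stray space after a comma, because the shell then passes the rest of the list as a separate argument. The following is a minimal sketch of one way to assemble the list from individual property paths. The helper name `join_show` is hypothetical and not part of the GMS, and the command is only printed, not run, so it can be checked first.

```shell
# join_show: hypothetical helper (not part of the GMS) that joins its
# arguments with commas so the result can be passed to --show as one word.
join_show() {
  # "$*" joins arguments with the first character of IFS; the subshell
  # keeps the caller's IFS untouched.
  ( IFS=','; printf '%s\n' "$*" )
}

# List the dotted property paths one per line for readability.
show_list=$(join_show \
  wgs_model.normal_model.latest_build.status \
  wgs_model.tumor_model.latest_build.status \
  exome_model.normal_model.latest_build.status \
  exome_model.tumor_model.latest_build.status)

# Print the assembled command for inspection instead of executing it.
echo genome model clin-seq list --filter "name='hcc1395-clinseq'" --show "$show_list" --style pretty
```

Once the printed command looks right, remove the leading `echo` to run it.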
There are many genome model tools ('gmt's) used in the GMS pipelines and many more that may not be used in a pipeline. The following are some useful examples:
Run the GMS annotator on an example list of variants:
wget https://xfer.genome.wustl.edu/gxfer1/project/gms/examples/example_variants.tsv
grep -v chr example_variants.tsv > example_variants.noheader.tsv
gmt annotate transcript-variants --annotation-filter=top --variant-file=example_variants.noheader.tsv --reference-transcripts='NCBI-human.ensembl/67_37l_v2' --output-file='example_variants.annotated.tsv'
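Note that `grep -v chr` removes every line containing the string "chr", not only the header. If a variant file uses chr-prefixed chromosome names, the data rows would be removed as well. The following is a minimal sketch of an alternative, assuming the header occupies exactly the first line of the file. The helper name `strip_header` and the sample rows are made up for illustration.

```shell
# strip_header: hypothetical helper that drops the first line of a file,
# assuming the header occupies exactly one line.
strip_header() {
  tail -n +2 "$1"
}

# Demonstrate on a small made-up file (placeholder values, not real variants).
tmp=$(mktemp)
printf 'chromosome_name\tstart\tstop\treference\tvariant\n' > "$tmp"
printf 'chr1\t100\t100\tA\tG\n' >> "$tmp"
printf '2\t200\t200\tC\tT\n' >> "$tmp"

strip_header "$tmp"   # keeps both data rows
grep -v chr "$tmp"    # keeps only the row without a "chr" prefix
rm -f "$tmp"
```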
Add read counts and variant allele frequencies to the annotated variants using some Exome BAMs:
genome model list --filter 'name like "%refalign-exome%"' --show last_complete_build.whole_rmdup_bam_file
perl -e 'use Genome; $reference_build=Genome::Model::Build->get(106942997); $ref_fasta_path = $reference_build->full_consensus_path("fa"); print "\n\n$ref_fasta_path\n\n";'
gmt analysis coverage add-readcounts --genome-build=/opt/gms/GMS1/fs/ams1102/info/model_data/2869585698/build106942997/all_sequences.fa --variant-file=example_variants.annotated.tsv --header-prefixes='normal,tumor' --bam-files='/opt/gms/KE56B14/fs/KE56B14/info/build_merged_alignments/merged-alignment-clia1.gsc.wustl.edu-gmsuser-15727-6305caabaf594e6bb28878c4ff116913/6305caabaf594e6bb28878c4ff116913.bam,/opt/gms/KE56B14/fs/KE56B14/info/build_merged_alignments/merged-alignment-clia1.gsc.wustl.edu-gmsuser-36717-86e2649d86fe4be1808e0b9a79bf2aba/86e2649d86fe4be1808e0b9a79bf2aba.bam' --output-file=example_variants.annotated.readcounts.tsv
Create a mutation diagram or 'lollipop plot' for each gene mutation:
cat example_variants.annotated.tsv | grep -v chromosome_name > example_variants.annotated.noheader.tsv
gmt graph mutation-diagram --reference-transcripts='NCBI-human.ensembl/67_37l_v2' --annotation='example_variants.annotated.noheader.tsv' --annotation-format=tgi
Create a cn-view plot to visualize somatic copy-number results:
gmt copy-number cn-view --annotation-build='d00a39c84382427fa0efdec3229e8f5f' --cancer-annotation-db='tgi/cancer-annotation/human/build37-20140205.1' --output-dir='relapse2/' --segments-file='/gscmnt/gc13027/info/model_data/52f3b8ad88fa4df79196d179aa29e00b/build4c73477e164b470eab1348c90396b435/AML103/cnv/wgs_cnv/cnview/CNView_All/cnaseq.cnvhmm.tsv' --image-type='pdf' --somatic-build='da309160185a464c9707ad6ccb2fed9d' --gene-targets-file='/gscmnt/sata132/techd/mgriffit/reference_annotations/GeneSymbolLists/CancerGeneCensusPlus_Sanger.txt' --name='Cancer_Genes' --cnv-file='/gscmnt/gc13027/info/model_data/52f3b8ad88fa4df79196d179aa29e00b/build4c73477e164b470eab1348c90396b435/AML103/cnv/wgs_cnv/cnvs.hq'
List available menu items:
genome config analysis-menu item list --show id,name,file_path
genome config analysis-menu item list --show name,file_path --filter name='Human Reference Alignment'
Software results are associated with most pipelines in the GMS. These are independent results that are linked into particular builds and that, where appropriate, can be reused by subsequent builds of the same model or even by builds of different models. This prevents unnecessary duplication of analyses. For example, when a new aligner is used, an index of the reference genome must be built for that aligner, and this index is stored as a software result. If a later alignment uses the same aligner with the same parameters, the existing index is retrieved as a software result instead of being rebuilt.
List all software results in the system:
ur list objects --subj Genome::SoftwareResult --show id,class
ur list objects --subj Genome::SoftwareResult --show id,class,subclass_name,inputs,output_dir
Get all results for a particular reference sequence build and display basic results:
perl -e 'use Genome; my @r = Genome::Model::Build::ReferenceSequence::AlignerIndex->get(reference_build_id=>106942997); foreach my $r (@r){$a_id=$r->id; $a_name=$r->aligner_name; $a_version=$r->aligner_version; $a_params=$r->aligner_params; $test_name=$r->test_name; print "$a_id\t$a_name\t$a_version\t$a_params\t$test_name\n"}';
Listed below are the locations of various GMS components.
- /opt/gms/$GENOME_SYS_ID is the location of all GMS software and results, where $GENOME_SYS_ID is a unique 7-character alphanumeric string, randomly generated at install time.
- /opt/gms/GMS1 is the location of the test data.
- /opt/gms/$GENOME_SYS_ID/db is the location of certain file-based databases used during analysis (e.g. COSMIC).
- /opt/gms/$GENOME_SYS_ID/fs is where disk allocations for all analysis results are stored.
- /opt/gms/$GENOME_SYS_ID/sw is where all software, git repositories, packages, etc. related to the GMS system are stored.
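The layout above can be checked on a given machine with a short loop. This is only a sketch: the helper name `gms_dirs` is hypothetical, and it assumes `GENOME_SYS_ID` is available in your environment. The fallback `KE56B14` is simply the ID that appears in the example paths earlier in this document, so replace it with your own.

```shell
# gms_dirs: hypothetical helper that prints the component directories
# described above for a given system ID ($1).
gms_dirs() {
  sys_id="$1"
  for sub in db fs sw; do
    printf '%s\n' "/opt/gms/$sys_id/$sub"
  done
}

# Report which component directories exist on this machine.
# GENOME_SYS_ID is assumed to be set; KE56B14 is only a placeholder fallback.
for dir in $(gms_dirs "${GENOME_SYS_ID:-KE56B14}"); do
  if [ -d "$dir" ]; then
    echo "found:   $dir"
  else
    echo "missing: $dir"
  fi
done
```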
The above examples are just that, examples. An almost limitless number of potentially useful commands can be composed from the ideas represented above. Determining what is possible with GMS commands can require experimentation. Refer to the help documentation for each command to determine what basic options and associated objects are available. Everything done above can also be done in a Perl script by simply adding 'use Genome' to the script and calling various classes by name. To know what is possible you may have to review the code itself.