Genomics England Clinical Interpretation Partnership Conference 04_Nov_2019

This repository recreates the poster presented at the first GeCIP conference and provides the associated information required for reproducibility.

Poster Production

The poster was created using the posterdown package. To reproduce it, render the Posterdown.Rmd file; the output is in HTML format. A PDF version is also provided in this repo. If you don't have the required libraries, run Rscript install.R from the repository directory.
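
As a minimal sketch of the render step, assuming the rmarkdown and posterdown packages are already installed (for example via install.R) and that R is started in the repository directory. The helper name render_poster is hypothetical and is not part of this repository.

```r
# Hypothetical helper: render the poster and return the output path.
render_poster <- function(rmd = "Posterdown.Rmd") {
  if (!file.exists(rmd)) {
    stop("Cannot find ", rmd, "; run this from the repository directory.")
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("Package 'rmarkdown' is missing; try `Rscript install.R` first.")
  }
  # rmarkdown::render() uses the output format declared in the Rmd header.
  rmarkdown::render(rmd)
}
```

Calling render_poster() with no arguments should then write the HTML poster next to the Rmd file.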

SEPATH Pipeline Parameters:

The SEPATH pipeline is available as a Singularity container on GitHub.

This pipeline was run on all cancer whole genome sequences (v6 release) from the 100,000 Genomes Project with the following parameters (all other parameters were left at their defaults):

krakendb - All genomes in NCBI RefSeq at scaffold level and above (bacterial, viral, fungal, protozoal). Dustmasked to remove low-complexity sequence. See resources/names.dmp and resources/nodes.dmp for taxonomy information. database.kdb MD5: 1e5bbafdef19775b6a9a9055598fb709
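
A rebuilt or downloaded database.kdb can be compared against the checksum quoted above. The following is a minimal sketch using tools::md5sum from base R; the helper name verify_md5 and the file path are hypothetical.

```r
# Hypothetical helper: compare a file's MD5 checksum against an expected value.
verify_md5 <- function(path, expected) {
  if (!file.exists(path)) stop("File not found: ", path)
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected))
}

# Checksum quoted above for database.kdb.
expected_md5 <- "1e5bbafdef19775b6a9a9055598fb709"

# The path below is a placeholder; point it at your own copy of the database.
# verify_md5("path/to/database.kdb", expected_md5)
```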

kraken_confidence - 0.2 - at least 20% of the assigned k-mers within a read must support a taxon before the read is classified to it

min_clade_reads - 0 - left unfiltered until analysis

bbduk_db - Human reference genome 38 (no decoys) with additional cancer sequences from the COSMIC database

minimum_quality - 20 - in addition to quality control from Illumina

minimum_length - 35 - sufficient for k=31 runs with kraken. Set low to preserve data
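
To illustrate why 35 is sufficient for k=31: a read of length L contains L - k + 1 overlapping k-mers, so the shortest retained read still yields 5 k-mers. A small sketch of that arithmetic (the function name kmers_per_read is hypothetical):

```r
# Number of overlapping k-mers in a read of a given length.
kmers_per_read <- function(read_length, k = 31) {
  pmax(read_length - k + 1, 0)
}

kmers_per_read(35)         # shortest retained read: 5 k-mers
kmers_per_read(c(30, 31))  # 0 and 1: reads shorter than k yield no k-mers
```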

Data Formatting

Sample metadata was accessed via the R labkey API in the Genomics England Research Environment. Main programme data were taken from the v7 data release as of 2019-07-25. Associated scripts for data cleaning and formatting can be provided to GeCIP members from within the research environment upon request.

Principal Coordinates Analysis

Community matrix filtering:

All samples were aligned to genome build GRCh38, passed Illumina internal QC, were obtained from fresh-frozen tissue and used PCR-free library preparation
Taxa within samples with fewer than 50 assigned sequencing reads were set to 0 to reduce false positives
Converted to binary with kraken_pa <- decostand(kraken_prep, method='pa')
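
The read-count threshold and binary conversion above can be sketched on a toy matrix. The counts and sample/taxon names are made up, and the conversion uses base R so the example runs without vegan; the decostand call quoted above should give the same presence/absence result.

```r
# Toy community matrix: rows are samples, columns are taxa (read counts).
kraken_counts <- matrix(
  c(120,  3,  0,
     49, 50, 75,
      0, 10, 51),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("sample_A", "sample_B", "sample_C"),
                  c("taxon_1", "taxon_2", "taxon_3"))
)

# Set taxa with fewer than 50 assigned reads in a sample to 0.
kraken_prep <- kraken_counts
kraken_prep[kraken_prep < 50] <- 0

# Presence/absence: 1 where any reads remain, 0 otherwise.
# Base R stand-in for decostand(kraken_prep, method = "pa").
kraken_pa <- (kraken_prep > 0) * 1
```

Note that a count of exactly 50 is kept, since only values below the threshold are zeroed.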

PCoA:

Distance matrix obtained with jacc <- vegdist(kraken_pa, method='jaccard', binary=TRUE)
Multidimensional scaling performed with cmdscale(jacc, k=2, eig=TRUE, x.ret=TRUE)
vegan package version: 2.5.3
Data points and principal MDS axes variance were extracted, remerged with metadata by sequencing plate_key and plotted with ggplot.
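
The PCoA steps above can be sketched end to end on toy data. This is an illustration under stated assumptions, not the analysis script: the matrix, the plate_key values and the metadata column are made up, and base R's dist(method = "binary") is swapped in for vegdist so the example runs without vegan (on presence/absence data it computes the Jaccard distance). Expressing variance relative to the positive eigenvalues is also an assumption about how the axis variance was computed.

```r
# Toy presence/absence matrix; row names play the role of plate_key.
kraken_pa <- matrix(
  c(1, 1, 0, 0,
    1, 0, 1, 0,
    0, 1, 1, 1,
    1, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("plate_", 1:4), paste0("taxon_", 1:4))
)

# Jaccard distance between samples (stand-in for vegdist(..., "jaccard")).
jacc <- dist(kraken_pa, method = "binary")

# Classical multidimensional scaling, as in the call quoted above.
pcoa <- cmdscale(jacc, k = 2, eig = TRUE, x.ret = TRUE)

# Share of variance on each axis, relative to the positive eigenvalues.
pos_eig <- pcoa$eig[pcoa$eig > 0]
var_explained <- pcoa$eig[1:2] / sum(pos_eig)

# Re-merge coordinates with (hypothetical) metadata by plate_key.
coords <- data.frame(plate_key = rownames(pcoa$points),
                     MDS1 = pcoa$points[, 1],
                     MDS2 = pcoa$points[, 2])
metadata <- data.frame(plate_key = paste0("plate_", 1:4),
                       tissue = c("a", "b", "a", "b"))
plot_df <- merge(coords, metadata, by = "plate_key")
```

The resulting plot_df has one row per sample with both MDS coordinates and metadata, which is the shape a ggplot scatter plot expects.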

Boruta Feature Selection

Scripts are available from within the research environment. Boruta settings required for reproducibility:

R version: 3.5.1
Random seed setting: set.seed(1122)
Boruta package: 6.0.0
Boruta Parameters: maxRuns=500. All other parameters as default

The importance decisions from each Boruta run were extracted for each technical variable. Genera were ranked by the combined number of times each was confirmed as important for predicting a technical variable, and the top 15 were compared to Table 1 in Eisenhofer et al. 2019. The resulting dataframe is provided in the code of the poster markdown file.
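
The tallying step described above can be sketched in base R. The data frame below is a made-up stand-in for the decisions extracted from the Boruta runs; its column names and values are hypothetical, and only the counting and ranking logic is illustrated.

```r
# Toy stand-in for decisions extracted from several Boruta runs: one row per
# (technical variable, genus) pair. All values here are made up.
decisions <- data.frame(
  technical_variable = rep(c("var_1", "var_2", "var_3"), each = 3),
  genus = rep(c("genus_A", "genus_B", "genus_C"), times = 3),
  decision = c("Confirmed", "Rejected",  "Confirmed",
               "Confirmed", "Tentative", "Rejected",
               "Confirmed", "Confirmed", "Rejected"),
  stringsAsFactors = FALSE
)

# Count how often each genus was confirmed across all technical variables.
confirmed <- decisions[decisions$decision == "Confirmed", ]
counts <- sort(table(confirmed$genus), decreasing = TRUE)

# Keep the 15 most frequently confirmed genera (fewer if not enough exist).
top_genera <- head(counts, 15)
```

Only "Confirmed" decisions are counted here; "Tentative" and "Rejected" decisions are ignored.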

Agihawi/Gel_Conference_Poster_1