This repository provides access to the predicted microbial pathways of non-human contigs assembled from cancer sequence data.
An HTML version with an interactive table is available at https://UEA-Cancer-Genetics-Lab.github.io/Pancancer_Microbial_Pathways/
SEPATH was employed on approximately 10,000 cancer WGS samples from the Genomics England 100,000 Genomes Project.
Cancer types include: bladder, breast, childhood, endocrine, endometrial, adult glioma, haematological, hepatopancreatobiliary, lung, melanoma, nasopharyngeal, oral/oropharyngeal, other, ovarian, prostate, renal, sarcoma, sinonasal, testicular, unknown, uppergi.
The resulting non-human reads were pooled by cancer type and subjected to metagenomic assembly using MEGAHIT. Colorectal cancer has been excluded from the analysis due to technical difficulties and limitations in assembling the data.
Taxonomic classifications were assigned with DIAMOND using the NCBI non-redundant protein database. All contigs classified as a mammalian genus were removed, as were all contigs identified as contaminants according to Salter et al. Prokka was then used to predict proteins from the assembled contigs. The putative proteins were subjected to pathway prediction with InterProScan using multiple databases. The number of pathway 'hits' (the number of times a particular function matched a certain protein across all databases) was summed for all cancer types to form the TSV file in the data directory.
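The hit-summing step described above could be sketched roughly as follows. This is an illustration only, not the code used to build the TSV file: the function names, the `pathway`/`hits` column headers, and the `PWY-A`-style identifiers are hypothetical placeholders, and the sketch reads "summed for all cancer types" as one total per pathway across every cancer type. The actual layout of the file in the data directory may differ.

```python
from collections import Counter


def sum_pathway_hits(hits_by_cancer_type):
    """Sum pathway hits over all cancer types.

    hits_by_cancer_type maps a cancer type to an iterable of pathway
    identifiers, one entry per hit (hypothetical input structure).
    """
    totals = Counter()
    for pathway_ids in hits_by_cancer_type.values():
        totals.update(pathway_ids)  # each occurrence counts as one hit
    return totals


def to_tsv(totals):
    """Render totals as tab-separated text, most frequent pathway first."""
    lines = ["pathway\thits"]  # assumed header, not the real file's header
    for pathway, count in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{pathway}\t{count}")
    return "\n".join(lines) + "\n"


# Placeholder identifiers, not real pathway accessions.
example = {
    "bladder": ["PWY-A", "PWY-B", "PWY-A"],
    "breast": ["PWY-A", "PWY-C"],
}
totals = sum_pathway_hits(example)
print(to_tsv(totals), end="")
```

Because a hit is counted once per matching database entry, a protein annotated by several databases contributes several hits to the same pathway, which is one reason the totals should not be read as a measure of abundance.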
There are some caveats to this analysis that should be considered. First and foremost, the contigs are highly likely to contain sequencing contaminants. On the other hand, biologically informative contigs may have been removed when removing common sequencing contaminants. The number of pathway 'hits' is not necessarily indicative of how actively a function is being carried out, and likewise a number of pathway 'hits' is not definitive evidence for the existence of a metabolic pathway in a sample. The data presented in this repository should be considered for hypothesis generation purposes only.
This research was made possible through access to the data and findings generated by the 100,000 Genomes Project; http://www.genomicsengland.co.uk.