MachineLearningTCGAHNSC-BINF

Supplementary materials and code for the BMC Bioinformatics journal article "Machine learning with the TCGA-HNSC dataset: Improving performance by addressing inconsistency, sparsity, and high-dimensionality" by authors Michael C. Rendleman, B.S.E.; John M. Buatti, MD; Terry A. Braun, Ph.D.; Brian J. Smith, Ph.D.; Bart Brown; Chibuzo Nwakama; Reinhard Beichel, Ph.D.; Thomas L. Casavant, Ph.D.

The install_prereqs.R script installs the dependencies required by the R scripts. Questions about this analysis or the manuscript can be sent to [email protected].

Supplied Data

Clinical Data

Preprocessed pre-imputation and post-imputation datasets are provided in .arff format (WEKA's attribute-relation file format) in clinical_NO_imp.arff and clinical_rf_imp.arff, respectively. Importance values for these datasets are provided in raw_importance_noimp.csv and raw_importance_rfimp.csv.
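For readers who want to inspect the .arff files outside WEKA, the following is a minimal sketch of reading a simple dense ARFF file with the Python standard library and counting missing values per attribute. The attribute names and rows in `SAMPLE` are hypothetical and are not taken from clinical_NO_imp.arff. The parser assumes unquoted attribute names and values without embedded commas, so a full ARFF reader may be needed for the real files.

```python
import io

def read_arff(stream):
    """Parse a simple dense ARFF stream into (attribute names, data rows)."""
    names, rows, in_data = [], [], False
    for raw in stream:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            # '?' is ARFF's missing-value marker; store it as None
            rows.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
        elif lower.startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            names.append(line.split(None, 2)[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows

# Hypothetical miniature file, for illustration only
SAMPLE = """\
@relation example
@attribute age numeric
@attribute radiation {yes,no}
@attribute outcome {alive,dead}
@data
61,yes,alive
58,?,dead
"""

names, rows = read_arff(io.StringIO(SAMPLE))
# Count missing values per attribute
missing = {n: sum(r[i] is None for r in rows) for i, n in enumerate(names)}
print(names)
print(missing)
```

To try this on the supplied data, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open("clinical_NO_imp.arff")`. Comparing the `missing` counts between the two files gives a quick view of which variables were imputed.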

Tumor grading variables and corresponding patient outcomes are stored in clintum_tx_grade.Rda for convenience of use in R-based SPCA experiments. Only the 520 patients with tumor expression data are included in this data frame.

Raw Solid-Tumor RNA Expression Data and Transformations

RNA expression data for the 520 patients (alongside tumor grading and treatment information) are supplied in rnatum_tx_grade_surv.Rda.

Transformations of the RNA expression data via SPCA can be found in spcaXXcomponents.Rda, where XX is the number of components (5, 10, 15, 20, 25, or 50). These data were generated from rnatum_tx_grade_surv.Rda using the SPCA_generation.r script.

Experiments

Treatment variable imputation experiments

Classifier training on pre- and post-imputation data can be done in WEKA as described in our manuscript: https://pubmed.ncbi.nlm.nih.gov/31208324/

Importance values for these variables can be calculated with CIRF_importance.r. The raw (pre-averaged) results are also available in the raw_importance_noimp.csv and raw_importance_rfimp.csv files.
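As a rough illustration of how raw importance values might be averaged, the sketch below computes a per-variable mean and ranks the variables. It assumes a layout with one column per variable and one row per run, which has not been checked against the supplied CSV files. The column names and values in `SAMPLE` are hypothetical.

```python
import csv
import io
from statistics import mean

def mean_importance(stream):
    """Average each column of a CSV and return (name, mean) pairs, largest first."""
    reader = csv.DictReader(stream)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    averaged = {name: mean(values) for name, values in columns.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical raw importance values: two runs, three variables
SAMPLE = "var_a,var_b,var_c\n0.5,0.25,0.0\n1.5,0.75,0.0\n"

ranking = mean_importance(io.StringIO(SAMPLE))
for name, value in ranking:
    print(name, value)
```

If the real files use a different orientation (for example, one row per variable), the same idea applies after transposing.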

Full RNA training

Classifier training on the full set of solid-tumor RNA expression data can be replicated with the Full_RNA_training.R script. The models from this script are not supplied, as they can be quite large (on the order of hundreds of MB to GB). This script requires a high-performance computing environment, and we recommend at least 100 GB of memory to ensure that training completes.

SPCA transformation of RNA: training, importance, and timing

Classifier training, variable importance calculation, and training-time measurement for the SPCA-transformed data can be performed using the SPCA_training_and_importance.R script. The resulting models are also available in the model_fits/ directory.

Gene Ontology Enrichment Analysis

SPC gene weights can be obtained from the SPCA_generation.r file. For our analysis, genes with absolute weight greater than 0.1 are considered contributors. After obtaining the genes comprising the SPC under consideration, GOEA can be performed here: http://geneontology.org/page/go-enrichment-analysis
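The contributor rule above can be sketched as a simple filter. This is an illustration in Python, not the code used in SPCA_generation.r. The gene names and weights in `example_weights` are hypothetical, and the weights are assumed to have been exported from R as a name-to-weight mapping.

```python
def spc_contributors(weights, threshold=0.1):
    """Return genes whose absolute weight exceeds the threshold, largest first."""
    selected = [(gene, w) for gene, w in weights.items() if abs(w) > threshold]
    selected.sort(key=lambda gw: abs(gw[1]), reverse=True)
    return [gene for gene, _ in selected]

# Hypothetical weights for one sparse principal component
example_weights = {"GENE_A": 0.32, "GENE_B": -0.15, "GENE_C": 0.1, "GENE_D": 0.04}

contributors = spc_contributors(example_weights)
print("\n".join(contributors))  # one gene per line
```

The comparison is strict, so a gene with a weight of exactly 0.1 is not treated as a contributor. Negative weights are kept when their magnitude passes the threshold.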
