diff --git a/.Rbuildignore b/.Rbuildignore index 91114bf..bcf1535 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -1,2 +1,5 @@ ^.*\.Rproj$ ^\.Rproj\.user$ +^data-raw$ +^README\.Rmd$ +^README-.*\.png$ diff --git a/.gitignore b/.gitignore index 5b6a065..a17274e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,16 @@ +# Hidden files automatically created by mac os or autosaved from +# some other program +.DS_Store +._* +*~ +~$* + +# R-files and data/directories not to keep .Rproj.user .Rhistory +.Rapp.history .RData .Ruserdata +temp*.R +inst/doc +data-raw diff --git a/DESCRIPTION b/DESCRIPTION new file mode 100644 index 0000000..d121d3c --- /dev/null +++ b/DESCRIPTION @@ -0,0 +1,23 @@ +Package: EPIC +Type: Package +Title: Estimate the Proportion of Immune and Cancer cells +Version: 1.0.0 +Authors@R: as.person(c( + "Julien Racle [aut, cre]", + "David Gfeller [aut]" + )) +Description: Package implementing EPIC method to estimate the proportion of + immune and cancer cells from bulk gene expression data. It is based on + reference gene expression profiles for the main immune cell types and it + predicts the proportion of these cells and of the remaining "other cells" + that are the remaining cells (cancer, stromal, endothelial) for which no + reference profile is given. +Depends: R (>= 3.2.0) +License: file LICENSE +LazyData: TRUE +RoxygenNote: 5.0.1 +Suggests: testthat, + knitr, + rmarkdown +Imports: stats +VignetteBuilder: knitr diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..86cf543 --- /dev/null +++ b/LICENSE @@ -0,0 +1 @@ +SOFTWARE LICENSE AGREEMENT FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY This Agreement is made between Ludwig Institute for Cancer Research Ltd, a not-for-profit organization with its registered office at Stadelhoferstrasse 22, 8001 Zurich, Switzerland, and having a place of business at 666 Third Avenue, New York, NY 10017,USA (“LICR”) and the (“LICENSEE”) and is effective at the date the downloading is completed (“EFFECTIVE DATE”). WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and LICR wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license; and WHEREAS, LICENSEE desires to license the PROGRAM and LICR desires to grant a license on the following terms and conditions. NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: 1. DEFINITIONS 1.1 PROGRAM shall mean the source code known as EPIC (version 1.0), as it exists on the EFFECTIVE DATE. 2. LICENSE 2.1 Grant. Subject to the terms of this Agreement, LICR hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to LICR a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to LICR promptly upon their creation. The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from LICR. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this Agreement. 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of LICR, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. 3. OWNERSHIP OF INTELLECTUAL PROPERTY LICENSEE acknowledges that title to the PROGRAM shall remain with LICR. The PROGRAM shall be marked with the notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. Notice of attribution: The EPIC (version 1.0) program was made available through the generosity of Ludwig Institute for Cancer Research Ltd. LICENSEE shall not use any trademark or trade name of LICR, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of LICR except as states above for attribution purposes. 4. INDEMNIFICATION LICENSEE shall indemnify, defend, and hold harmless LICR, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. 5. NO REPRESENTATIONS OR WARRANTIES THE PROGRAM IS DELIVERED AS IS. LICR MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. LICR EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. IN NO EVENT SHALL LICR OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER LICR SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. 6. ASSIGNMENT This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of LICR shall be null and void. 7. MISCELLANEOUS 7.1 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to LICR. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, LICR may terminate this Agreement immediately. Upon termination, LICENSEE shall provide LICR with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from LICR, LICENSEE may retain a copy for archive purposes. 7.2 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. 7.3 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. 7.4 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. 7.5 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. 7.6 Governing Law. This Agreement is governed by the laws of Switzerland and the Canton of Vaud. \ No newline at end of file diff --git a/NAMESPACE b/NAMESPACE new file mode 100644 index 0000000..0909525 --- /dev/null +++ b/NAMESPACE @@ -0,0 +1,3 @@ +# Generated by roxygen2: do not edit by hand + +export(EPIC) diff --git a/NEWS b/NEWS new file mode 100644 index 0000000..b050f55 --- /dev/null +++ b/NEWS @@ -0,0 +1,3 @@ +Version 1.0.0 +------------------------------------------------------------------------ +* Public release of EPIC package. diff --git a/R/EPIC_descr.R b/R/EPIC_descr.R new file mode 100644 index 0000000..92e5745 --- /dev/null +++ b/R/EPIC_descr.R @@ -0,0 +1,126 @@ +#' EPIC: a package to Estimate the Proportion of Immune and Cancer cells from +#' tumor gene expression data. +#' +#' EPIC package provides the function and immune cell reference profiles to +#' estimate the proportion of immune and other cells from bulk gene expression +#' data. +#' +#' @section EPIC functions: +#' \code{\link{EPIC}} is the main function to call to estimate the +#' various cells proportions from a bulk sample. +#' +#' @section Included datasets: +#' \code{\link{BRef}}, \code{\link{BRef.tpm}}: reference profiles from +#' circulating immune cells. +#' +#' \code{\link{TRef.tpm}}: reference profiles from immune cells obtained +#' from single cell data of tumor infiltrating cells from melanoma patients. +#' +#' \code{\link{hoek_data}}: example dataset containing data from Hoek et al, +#' 2015, PLoS One. +#' +#' \code{\link{mRNA_cell_default}}: values of mRNA per cell for the main cell +#' types. +#' +#' @section Authors: +#' Julien Racle <\email{julien.racle@unil.ch}> and David Gfeller <\email{ +#' david.gfeller@unil.ch}>. +#' +#' @docType package +#' @name EPIC.package +NULL + + +#' Reference profiles from circulating immune cells. +#' +#' A dataset containing the reference profiles obtained from immune cell +#' samples of \emph{B-}, \emph{NK-}, \emph{T-cells}, \emph{Monocytes} +#' and \emph{Neutrophils} purified from PBMC or whole blood. +#' +#' The original samples were obtained from healthy donors and donors after +#' influenza vaccination or with diabetes, sepsis or multiple sclerosis. +#' +#' @section Similar datasets: +#' \code{BRef} (main reference profiles, using data from sources 1-3 below) +#' +#' +#' @format A list of 3 elements: \describe{ \item{$refProfiles, +#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene +#' expression from the reference cells and the variability of this gene +#' expression for each gene and each cell type} \item{$sigGenes}{A list of 100 +#' signature genes used to deconvolve the cell proportions} } +#' +#' @source \enumerate{ +#' \item \url{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}, +#' \item \url{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-60424/}, +#' \item \url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51984} +#' } +"BRef" + +#' @section Similar datasets: +#' \code{BRef.tpm} (reference profiles based on same data as \code{BRef}, +#' but given in TPM counts instead of raw counts) +#' @rdname BRef +"BRef.tpm" + +#' Reference profiles obtained from single cell data of tumor infiltrating +#' cells. +#' +#' A dataset containing the reference profiles given in TPM from various cell +#' types: \emph{B-}, \emph{NK-}, \emph{T-cells} and \emph{Macrophages}. +#' +#' These were obtained from single-cell RNA-seq data from 9 donors from +#' the publication of Tirosh et al., 2016, Science. The samples +#' come from cancer metastases of melanoma (extracted from primary tumors +#' and non-lymphoid tissue metastases). The classification for each sample with +#' respect to each cell type is the one given by Tirosh et al. +#' +#' @format A list of 3 elements: \describe{ \item{$refProfiles, +#' $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene +#' expression from the reference cells and the variability of this gene +#' expression for each gene and each cell type} \item{$sigGenes}{A list of 80 +#' signature genes used to deconvolve the cell proportions} } +#' +#' @source \url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056} +"TRef.tpm" + +#' Values of mRNA / cell for the main cell types. +#' +#' These values have been obtained from experiments (see \cite{EPIC} publication). +#' For the other uncharacterized cells, we use a value of 0.4 as described +#' in \cite{EPIC} publication. For macrophages we don't have specific values but +#' assumed here it is the same value as for monocytes. +#' +#' @format A named numeric vector of the relative amount of mRNA per cell type. +#' There are two additional "special cell types": the \emph{otherCells} which +#' correspond to the uncharacterized cells present in the sample but without +#' any reference profile and the \emph{default} which is the default value used +#' for cells with reference profiles but without a value specified in the +#' \code{mRNA_cell_default} vector. +"mRNA_cell_default" + +#' Example dataset containing data from Hoek et al, 2015, PLoS One. +#' +#' This dataset contains a subset of the full Hoek et al data. It contains only +#' the data from the two healthy donors PBMC before influenza vaccination. +#' +#' @format This is a list of 3 elements: \describe{ +#' \item{$rawCounts}{(matrix of 51574 genes x 2 donors) The raw read counts +#' from the two donors. It has been obtained by mapping the original +#' fastq files to \emph{hg19} genome with help of \emph{tophat} and +#' \emph{htseq-count}.} +#' \item{$cellFractions.obs}{(matrix of 2 donors x 6 cell types) The +#' proportions of the different immune cells in the PBMC from the 2 donors, +#' as measured by FACS by Hoek et al.} +#' \item{$cellFractions.pred}{(matrix of 2 donors x 7 cell types) The +#' proportions of the different immune cells and of a potential additional +#' uncharacterized cell, as predicted by EPIC based on the rawCounts and +#' the profiles \code{reference=BRef}.} +#' } +#' +#' @source The description of this data can be found here: +#' \href{http://dx.doi.org/10.1371/journal.pone.0118528}{link to paper} +#' and \href{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}{link +#' to data}. +"hoek_data" + diff --git a/R/EPIC_fun.R b/R/EPIC_fun.R new file mode 100644 index 0000000..60324cd --- /dev/null +++ b/R/EPIC_fun.R @@ -0,0 +1,352 @@ +# ####################################' +# +# Package EPIC - developed by Julien Racle from David Gfeller's group of the +# University of Lausanne and Ludwig Institute for Cancer Research. Copyrights +# remain reserved as described in the included LICENSE file. +# +# ####################################' + +#' Estimate the proportion of immune and cancer cells. +#' +#' \code{EPIC} takes as input bulk gene expression data (RNA-seq) and returns +#' the proportion of mRNA and cells composing the various samples. +#' +#' This function uses a constrained least square minimization to estimate the +#' proportion of each cell type with a reference profile and another +#' uncharacterized cell type in bulk gene expression samples. +#' +#' The names of the genes in the bulk samples, the reference samples and in the +#' gene signature list need to be the same format (gene symbols are used in the +#' predefined reference profiles). The full list of gene names don't need to be +#' exactly the same between the reference and bulk samples: \emph{EPIC} +#' will use the intersect of the genes. +#' +#' @param bulk A matrix (\code{nGenes} x \code{nSamples}) of the raw genes +#' expression from each bulk sample. This matrix needs to have rownames +#' telling the gene names. In principle given as raw read counts, but it +#' could be given in tpm as well if using corresponding reference profiles +#' normalized in tpm as well. +#' @param reference (optional): A string or a list defining the reference cells. +#' It can take multiple formats, either: \itemize{ +#' \item\code{NULL}: to use the default reference profiles and genes +#' signature \code{\link{BRef}}. +#' \item a char: one of \emph{"BRef"}, \emph{"BRef.tpm"} or \emph{"TRef.tpm"} +#' to use the reference cells and genes signature of the corresponding +#' datasets (see \code{\link{BRef}}, \code{\link{BRef.tpm}} and +#' \code{\link{TRef.tpm}}). +#' \item a list. When a list it should include: \describe{ +#' \item{\code{$refProfiles}}{a matrix (\code{nGenes} x \code{nCellTypes}) +#' of the reference cells genes expression (without the cancer cell type); +#' the rownames needs to be defined as well as the colnames giving the +#' names of each reference cell types; +#' } +#' \item{\code{$sigGenes}}{a character vector of the gene names to use as +#' signature - sigGenes can also be given as a direct input to EPIC +#' function;} +#' \item{\code{$refProfiles.var}}{(optional): a matrix (\code{nGenes} x +#' \code{nCellTypes}) of the variability of each gene expression for each +#' cell type (if absent, we assume an identical variability for all genes +#' in all cells) - it needs to have the same dimnames than refProfiles;} +#' } +#' } +#' @param mRNA_cell (optional): A named numeric vector: tells (in arbitrary +#' units) the amount of mRNA for each of the reference cells and of the +#' other uncharacterized (cancer) cell. Two names are of special meaning: +#' \emph{"otherCells"} - used for the mRNA/cell value of the "other cells" +#' from the sample (i.e. the cell type that don't have any reference gene +#' expression profile) ; and \emph{default} - used for the mRNA/cell of the +#' cells from the reference profiles for which no specific value is given +#' in mRNA_cell (i.e. if mRNA_cell=c(Bcells=2, NKcells=2.1, otherCells=3.5, +#' default=1), then if the refProfiles described Bcells, NKcells and Tcells, +#' we would use a value of 3.5 for the "otherCells" that didn't have any +#' reference profile and a default value of 1 for the Tcells when computing +#' the cell fractions). +#' To note: if the data is given as raw counts, then mRNA per cell should +#' correspond to some weight of mRNA per cell (or to a total number of mRNA +#' nucleotides per cell); while if data is in tpm, then this mRNA per cell +#' would ideally correspond more to some number of transcripts per cell. +#' The default values correspond to data in raw counts. +#' @param mRNA_cell_sub (optional): This can be given instead of \code{mRNA_cell} (or +#' in addition to it). It is also a named numeric vector, used to replace +#' only the mRNA/cell values from some cell types (or to add values for new +#' cell types). The values given in mRNA_cell_sub will overwrite the default +#' values as well as those that might have been given by mRNA_cell. +#' @param sigGenes (optional): a character vector of the gene names to use as +#' signature for the deconvolution. In principle this is given with the +#' reference as the "reference$sigGenes" but if we give a value for this +#' input variable, it is these signature genes that will be used instead of +#' the ones given with the reference profile. +#' @return A list of 3 matrices:\describe{ +#' \item{\code{mRNAProportions}}{(\code{nSamples} x (\code{nCellTypes+1})) the +#' proportion of mRNA coming from all cell types with a ref profile + the +#' uncharacterized other cell.} +#' \item{\code{cellFractions}}{(\code{nSamples} x (\code{nCellTypes+1})) this +#' gives the proportion of cells from each cell type after accounting for +#' the mRNA / cell value.} +#' \item{\code{fit.gof}}{(\code{nSamples} x 12) a matrix telling the quality +#' for the fit of the signature genes in each sample. It tells if the +#' minimization converged, and other info about this fit comparing the +#' measured gene expression in the sigGenes vs predicted gene expression in +#' the sigGenes.} +#' } +#' +#' @examples +#' res1 <- EPIC(hoek_data$rawCounts) +#' res1$cellFractions +#' res2 <- EPIC(hoek_data$rawCounts, BRef) +#' res3 <- EPIC(bulk=hoek_data$rawCounts, reference=BRef) +#' res4 <- EPIC(hoek_data$rawCounts, reference="BRef") +#' res5 <- EPIC(hoek_data$rawCounts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) +#' res6 <- EPIC(bulk=hoek_data$rawCounts, reference="BRef.tpm") +#' # Various possible ways of calling EPIC function. res 1 to 4 should +#' # give exactly the same outputs, and the elements res1$cellFractions +#' # should be equal to the example predictions found in +#' # hoek_data$cellFractions.pred for these first 4 results. +#' # The values of cellFraction for res5 will be different due to the use of +#' # other mRNA per cell values for the B and other cells. +#' # And res6 will also give different results and is not advised: the reference +#' # BRef.tpm corresponds to tpm while the bulk was given as raw counts. +#' +#' @export +EPIC <- function(bulk, reference=NULL, mRNA_cell=NULL, mRNA_cell_sub=NULL, + sigGenes=NULL){ + # First get the value of the reference profiles depending on the input + # 'reference'. + if (is.null(reference)){ + reference <- EPIC::BRef + } else if (is.character(reference)){ + if (reference %in% prebuiltRefNames){ + reference <- get(reference, pos="package:EPIC") + # Replace the char defining the reference name by the corresponding + # pre-built reference values. + } else + stop("The reference, '", reference, "' is not part of the allowed ", + "references:", paste(prebuiltRefNames, collapse=", ")) + } else if (is.list(reference)){ + refListNames <- names(reference) + if ( (!all(c("refProfiles", "sigGenes") %in% refListNames)) || + (("refProfiles" %in% refListNames) && !is.null(sigGenes)) ) + stop("Reference, when given as a list needs to contain at least the ", + "fields 'refProfiles' and 'sigGenes' (sigGenes could also be ", + "given as input to EPIC instead)") + if (!("refProfiles.var" %in% refListNames)){ + warning("'refProfiles.var' not defined; using identical variability ", + "for all genes and cell types") + reference$refProfiles.var <- 1 + } + if ((length(reference$refProfiles.var) > 1) && + (!identical(dim(reference$refProfiles.var), dim(reference$refProfiles)) + || !identical(dimnames(reference$refProfiles.var), + dimnames(reference$refProfiles)))) + stop("The dimensions and dimnames of 'reference$refProfiles' and ", + "'reference$refProfiles.var' need to be the same") + } else { + stop("Unknown format for 'reference'") + } + + refProfiles <- reference$refProfiles + refProfiles.var <- reference$refProfiles.var + + nSamples <- NCOL(bulk); samplesNames <- colnames(bulk) + if (is.null(samplesNames)){ + samplesNames <- 1:nSamples + colnames(bulk) <- samplesNames + } + nRefCells <- NCOL(refProfiles); refCellsNames <- colnames(refProfiles) + + # Checking the correct format of the input variables + if (!is.matrix(bulk) && !is.data.frame(bulk)) + stop("'bulk' needs to be given as a matrix or data.frame") + if (!is.matrix(refProfiles) && !is.data.frame(refProfiles)) + stop("'reference$refProfiles' needs to be given as a matrix or data.frame") + if (!is.matrix(refProfiles.var) && !is.data.frame(refProfiles.var)) + stop("'reference$refProfiles.var' needs to be given as a matrix or ", + "data.frame") + + # Keeping only common genes and normalizing the counts based on these common + # genes + bulkGenes <- rownames(bulk) + if (anyDuplicated(bulkGenes)) + stop("There are some duplicated gene names in 'bulk'") + refGenes <- rownames(refProfiles) + if (anyDuplicated(refGenes)) + stop("There are some duplicated gene names in 'refGenes'") + commonGenes <- intersect(bulkGenes, refGenes) + + if (is.null(sigGenes)) + sigGenes <- reference$sigGenes + sigGenes <- sigGenes[sigGenes %in% commonGenes] + nSigGenes <- length(sigGenes) + if (nSigGenes < nRefCells) + stop("There are only ", nSigGenes, " signature genes", + " matching common genes between bulk and reference profiles,", + " but there should be more signature genes than reference cells") + + if (length(commonGenes) < 2e3) + warning("there are few genes in common between the bulk samples and ", + "reference cells:", length(commonGenes), ", so the normalization ", + "might be an issue") + # The value of 2e3 is arbitrary, but should be a respectable number for the + # data renormalization. + + bulk <- scaleCounts(bulk, sigGenes, commonGenes)$counts + temp <- scaleCounts(refProfiles, sigGenes, commonGenes) + refProfiles <- temp$counts + refProfiles.var <- scaleCounts(refProfiles.var, sigGenes, + normFact=temp$normFact)$counts + # the refProfiles.var is normalized by the same factors as refProfiles. + + if (is.null(mRNA_cell)) + mRNA_cell <- EPIC::mRNA_cell_default + + if (!is.null(mRNA_cell_sub)){ + if (is.null(names(mRNA_cell_sub)) || !is.numeric(mRNA_cell_sub)) + stop("When mRNA_cell_sub is given, it needs to be a named numeric vector") + mRNA_cell[names(mRNA_cell_sub)] <- mRNA_cell_sub + } + + minFun1 <- function(x, A, b, w){ + # basic minimization function used to minimize the squared sum of the error + # between the fit and observed value (A*x - b). We also give a weight, w, + # for each gene to give more or less importance to the fit of each (can use + # a scale 1 if don't want to give any weight). + return(sum( (w * (A %*% x - b)^2 ), na.rm = TRUE)) + } + + # Computing the weight to give to each gene + w <- rowSums(refProfiles / (refProfiles.var + 1e-12), na.rm=TRUE) + # 1e-12 to avoid divisions by 0: like this if refProfiles and refProfiles.var + # both are 0 for a given element, it will result in a weight of 0. + med_w <- stats::median(w[w>0], na.rm=TRUE) + w[w> 100*med_w] <- 100*med_w + # Set a limit for the big w to still not give too much weights to these genes + + # Defining the constraints for the fit of the proportions. + cMin <- 0; cMax <- 1 + ui <- rbind(diag(nRefCells), rep(1, nRefCells), rep(-1, nRefCells)) + ci <- c(rep(0,nRefCells), cMin, -cMax) + # ui and ci define the constraints, in the form "ui %*% x - ci >= 0". + # The first constraints are that the proportion of each cell must be + # positive; then we want the sum of all proportions to be bigger than + # cMin and this sum must also be smaller or equal to cMax. + cInitProp <- (min(1, cMax)-1e-5) / nRefCells + # used as an initial guess for the optimized - start with all cell fractions + # equal. We added the -1e-5 in cInitProp because the optimizer needs to have + # the initial guess inside the admissible region and not on its boundary + + minFun <- minFun1 + + # Estimating for each sample the proportion of the mRNA per cell type. + tempPropPred <- lapply(1:nSamples, FUN=function(cSample){ + b <- bulk[,cSample] + fit <- stats::constrOptim(theta = rep(cInitProp, nRefCells), f=minFun, + grad=NULL, ui=ui, ci=ci, A=refProfiles, b=b, w=w) + fit$x <- fit$par + + # Checking how well the estimated proportions predict the gene expression + b_estimated <- refProfiles %*% fit$x + + if (nSigGenes > 2){ + corSp.test <- stats::cor.test(b, b_estimated, method="spearman") + corPear.test <- stats::cor.test(b, b_estimated, method="pearson") + } else { + # cannot compute correlations with less than 3 observations. + corSp.test <- corPear.test <- list() + corSp.test$estimate <- corSp.test$p.value <- corPear.test$estimate <- + corPear.test$p.value <- NA + } + regLine <- stats::lm(b_estimated ~ b) + regLine_through0 <- stats::lm(b_estimated ~ b+0) + gof <- data.frame(fit$convergence, ifelse(is.null(fit$message), "", fit$message), + sqrt(minFun(x=fit$x, A=refProfiles, b=b, w=w)/nSigGenes), + sqrt(minFun(x=0, A=0, b=b, w=w)/nSigGenes), + # to only have sum((w*b)^2) + corSp.test$estimate, corSp.test$p.value, + corPear.test$estimate, corPear.test$p.value, + regLine$coefficients[2], regLine$coefficients[1], + regLine_through0$coefficients[1], sum(fit$x), + stringsAsFactors=F) + + return(list(mRNAProportions=fit$x, fit.gof=gof)) + } ) + mRNAProportions <- do.call(rbind, lapply(tempPropPred, + function(x) x$mRNAProportions)) + dimnames(mRNAProportions) <- list(samplesNames, refCellsNames) + + fit.gof <- do.call(rbind, lapply(tempPropPred, function(x) x$fit.gof)) + dimnames(fit.gof) <- list(samplesNames, c( + "convergeCode", "convergeMessage", "RMSE_weighted", + "Root_mean_squared_geneExpr_weighted", + "spearmanR", "spearmanP", "pearsonR", "pearsonP", "regline_a_x", + "regline_b", "regline_a_x_through0", "sum_mRNAProportions")) + # Some matrix giving information on the goodness of the fit. + + if (any(fit.gof$convergeCode != 0)) + warning("The optimization didn't converge for some samples:\n", + paste(samplesNames[fit.gof$convergeCode!=0], collapse="; "), + "\n - check fit.gof for the convergeCode and convergeMessage") + + # Adding a row to the proportion matrix corresponding to the other / + # uncharacterized / cancer cells. + mRNAProportions <- cbind(mRNAProportions, otherCells=1-rowSums(mRNAProportions)) + + tInds <- match(colnames(mRNAProportions), names(mRNA_cell)) + if (anyNA(tInds)){ + defaultInd <- match("default", names(mRNA_cell)) + if (is.na(defaultInd)){ + tStr <- paste(" and no default value is given for this mRNA per cell,", + "so we cannot estimate the cellFractions, only", + "the mRNA proportions") + } else { + tStr <- paste(" - using the default value of", mRNA_cell[defaultInd], + "for these but this might bias the true cell proportions from", + "all cell types.") + } + warning("mRNA_cell value unknown for some cell types: ", + paste(colnames(mRNAProportions)[is.na(tInds)], collapse=", "), + tStr) + tInds[is.na(tInds)] <- defaultInd + } + cellFractions <- t( t(mRNAProportions) / mRNA_cell[tInds]) + cellFractions <- cellFractions / rowSums(cellFractions, na.rm=FALSE) + # So that the cell fractions sum up to 1. In case some values were NA (due + # to unknown mRNA_cell value for some cell types without default value), then + # the full cellFractions will be NA - if we used na.rm=T, then we'd have the + # sum of the other than "NA" cells equal to 1 as if this "NA" cell was not + # present in the sample. + + return(list(mRNAProportions=mRNAProportions, cellFractions=cellFractions, + fit.gof=fit.gof)) +} + +#' Scaling raw counts from each sample. +#' +#' Normalizing the sum of counts from each sample to 1e6. +#' +#' Function taking a matrix (\emph{genes} x \emph{samples}) of counts as input +#' and returning the scaled counts for the subset signature genes (or all genes +#' if it sigGenes is \code{NULL}), with the scaled counts computed based on all +#' the 'renormGenes' (or all genes if it is NULL). The renormalization is made +#' independently for each sample, to have the sum of each columns over the +#' renormGenes equal to 1e6. +#' +#' normFact, if not null, is used as the normalization factor instead of +#' the colSums (used to renormalize the refProfiles.var by the same amount +#' than the refProfiles). +#' +#' @keywords internal +scaleCounts <- function(counts, sigGenes=NULL, renormGenes=NULL, normFact=NULL){ + if (is.null(sigGenes)) + sigGenes <- 1:nrow(counts) + + if (is.null(normFact)){ + if (is.null(renormGenes)) + renormGenes <- 1:nrow(counts) + normFact <- colSums(counts[renormGenes,,drop=FALSE], na.rm=TRUE) + } + counts <- t( t(counts[sigGenes,,drop=FALSE]) / normFact) * 1e6 + # Need to take the transpose so that the division is made on the correct + # elements + return(list(counts=counts, normFact=normFact)) +} + diff --git a/R/sysdata.rda b/R/sysdata.rda new file mode 100644 index 0000000..76e61bf Binary files /dev/null and b/R/sysdata.rda differ diff --git a/README.Rmd b/README.Rmd new file mode 100644 index 0000000..b438aaa --- /dev/null +++ b/README.Rmd @@ -0,0 +1,45 @@ +--- +output: github_document +--- + + + +```{r, echo = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>", + fig.path = "README-" +) +``` + + +# EPIC - package +## Description +Package implementing EPIC method to estimate the proportion of immune and cancer +cells from bulk gene expression data. It is based on reference gene expression +profiles for the main immune cell types and it predicts the proportion of these +cells and of the remaining "other cells" that are the remaining cells (cancer, +stromal, endothelial) for which no reference profile is given. + +## Usage +The main function in this package is `EPIC`. It needs as input a matrix of the raw gene expression counts from the samples for which to estimate cell proportions. One can also define the reference cells to use +```{r, eval = FALSE} +out <- EPIC(bulk = bulkSamplesMatrix) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList) +``` + +`out` is a list containing the various mRNA and cell fractions in each samples as well as some *data.frame* of the goodness of fit. + +Values of mRNA per cell and signature genes to use can also be changed: +```{r, eval = FALSE} +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell = mRNA_cell_vector, sigGenes = sigGenes_vector) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell_sub = mRNA_cell_sub_vector) +``` + + +## Installation +```{r, eval = FALSE} +install.packages("devtools") +devtools::install_github("GfellerLab/EPIC") +``` + diff --git a/README.md b/README.md new file mode 100644 index 0000000..adef9b2 --- /dev/null +++ b/README.md @@ -0,0 +1,36 @@ + + +EPIC - package +============== + +Description +----------- + +Package implementing EPIC method to estimate the proportion of immune and cancer cells from bulk gene expression data. It is based on reference gene expression profiles for the main immune cell types and it predicts the proportion of these cells and of the remaining "other cells" that are the remaining cells (cancer, stromal, endothelial) for which no reference profile is given. + +Usage +----- + +The main function in this package is `EPIC`. It needs as input a matrix of the raw gene expression counts from the samples for which to estimate cell proportions. One can also define the reference cells to use + +``` r +out <- EPIC(bulk = bulkSamplesMatrix) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList) +``` + +`out` is a list containing the various mRNA and cell fractions in each samples as well as some *data.frame* of the goodness of fit. + +Values of mRNA per cell and signature genes to use can also be changed: + +``` r +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell = mRNA_cell_vector, sigGenes = sigGenes_vector) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell_sub = mRNA_cell_sub_vector) +``` + +Installation +------------ + +``` r +install.packages("devtools") +devtools::install_github("GfellerLab/EPIC") +``` diff --git a/data/BRef.rda b/data/BRef.rda new file mode 100644 index 0000000..526af21 Binary files /dev/null and b/data/BRef.rda differ diff --git a/data/BRef.tpm.rda b/data/BRef.tpm.rda new file mode 100644 index 0000000..5815ea0 Binary files /dev/null and b/data/BRef.tpm.rda differ diff --git a/data/TRef.tpm.rda b/data/TRef.tpm.rda new file mode 100644 index 0000000..a47ef7f Binary files /dev/null and b/data/TRef.tpm.rda differ diff --git a/data/hoek_data.rda b/data/hoek_data.rda new file mode 100644 index 0000000..c9fd0b1 Binary files /dev/null and b/data/hoek_data.rda differ diff --git a/data/mRNA_cell_default.rda b/data/mRNA_cell_default.rda new file mode 100644 index 0000000..5263f80 Binary files /dev/null and b/data/mRNA_cell_default.rda differ diff --git a/man/BRef.Rd b/man/BRef.Rd new file mode 100644 index 0000000..c8309e9 --- /dev/null +++ b/man/BRef.Rd @@ -0,0 +1,43 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{BRef} +\alias{BRef} +\alias{BRef.tpm} +\title{Reference profiles from circulating immune cells.} +\format{A list of 3 elements: \describe{ \item{$refProfiles, + $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene + expression from the reference cells and the variability of this gene + expression for each gene and each cell type} \item{$sigGenes}{A list of 100 + signature genes used to deconvolve the cell proportions} }} +\source{ +\enumerate{ + \item \url{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}, + \item \url{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-60424/}, + \item \url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE51984} + } +} +\usage{ +BRef + +BRef.tpm +} +\description{ +A dataset containing the reference profiles obtained from immune cell +samples of \emph{B-}, \emph{NK-}, \emph{T-cells}, \emph{Monocytes} +and \emph{Neutrophils} purified from PBMC or whole blood. +} +\details{ +The original samples were obtained from healthy donors and donors after +influenza vaccination or with diabetes, sepsis or multiple sclerosis. +} +\section{Similar datasets}{ + + \code{BRef} (main reference profiles, using data from sources 1-3 below) + + +\code{BRef.tpm} (reference profiles based on same data as \code{BRef}, + but given in TPM counts instead of raw counts) +} +\keyword{datasets} + diff --git a/man/EPIC.Rd b/man/EPIC.Rd new file mode 100644 index 0000000..9c4bbb8 --- /dev/null +++ b/man/EPIC.Rd @@ -0,0 +1,119 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_fun.R +\name{EPIC} +\alias{EPIC} +\title{Estimate the proportion of immune and cancer cells.} +\usage{ +EPIC(bulk, reference = NULL, mRNA_cell = NULL, mRNA_cell_sub = NULL, + sigGenes = NULL) +} +\arguments{ +\item{bulk}{A matrix (\code{nGenes} x \code{nSamples}) of the raw genes +expression from each bulk sample. This matrix needs to have rownames +telling the gene names. In principle given as raw read counts, but it +could be given in tpm as well if using corresponding reference profiles +normalized in tpm as well.} + +\item{reference}{(optional): A string or a list defining the reference cells. +It can take multiple formats, either: \itemize{ + \item\code{NULL}: to use the default reference profiles and genes + signature \code{\link{BRef}}. + \item a char: one of \emph{"BRef"}, \emph{"BRef.tpm"} or \emph{"TRef.tpm"} + to use the reference cells and genes signature of the corresponding + datasets (see \code{\link{BRef}}, \code{\link{BRef.tpm}} and + \code{\link{TRef.tpm}}). + \item a list. When a list it should include: \describe{ + \item{\code{$refProfiles}}{a matrix (\code{nGenes} x \code{nCellTypes}) + of the reference cells genes expression (without the cancer cell type); + the rownames needs to be defined as well as the colnames giving the + names of each reference cell types; + } + \item{\code{$sigGenes}}{a character vector of the gene names to use as + signature - sigGenes can also be given as a direct input to EPIC + function;} + \item{\code{$refProfiles.var}}{(optional): a matrix (\code{nGenes} x + \code{nCellTypes}) of the variability of each gene expression for each + cell type (if absent, we assume an identical variability for all genes + in all cells) - it needs to have the same dimnames than refProfiles;} + } +}} + +\item{mRNA_cell}{(optional): A named numeric vector: tells (in arbitrary +units) the amount of mRNA for each of the reference cells and of the +other uncharacterized (cancer) cell. Two names are of special meaning: +\emph{"otherCells"} - used for the mRNA/cell value of the "other cells" +from the sample (i.e. the cell type that don't have any reference gene +expression profile) ; and \emph{default} - used for the mRNA/cell of the +cells from the reference profiles for which no specific value is given +in mRNA_cell (i.e. if mRNA_cell=c(Bcells=2, NKcells=2.1, otherCells=3.5, +default=1), then if the refProfiles described Bcells, NKcells and Tcells, +we would use a value of 3.5 for the "otherCells" that didn't have any +reference profile and a default value of 1 for the Tcells when computing +the cell fractions). +To note: if the data is given as raw counts, then mRNA per cell should +correspond to some weight of mRNA per cell (or to a total number of mRNA +nucleotides per cell); while if data is in tpm, then this mRNA per cell +would ideally correspond more to some number of transcripts per cell. +The default values correspond to data in raw counts.} + +\item{mRNA_cell_sub}{(optional): This can be given instead of \code{mRNA_cell} (or +in addition to it). It is also a named numeric vector, used to replace +only the mRNA/cell values from some cell types (or to add values for new +cell types). The values given in mRNA_cell_sub will overwrite the default +values as well as those that might have been given by mRNA_cell.} + +\item{sigGenes}{(optional): a character vector of the gene names to use as +signature for the deconvolution. In principle this is given with the +reference as the "reference$sigGenes" but if we give a value for this +input variable, it is these signature genes that will be used instead of +the ones given with the reference profile.} +} +\value{ +A list of 3 matrices:\describe{ + \item{\code{mRNAProportions}}{(\code{nSamples} x (\code{nCellTypes+1})) the + proportion of mRNA coming from all cell types with a ref profile + the + uncharacterized other cell.} + \item{\code{cellFractions}}{(\code{nSamples} x (\code{nCellTypes+1})) this + gives the proportion of cells from each cell type after accounting for + the mRNA / cell value.} + \item{\code{fit.gof}}{(\code{nSamples} x 12) a matrix telling the quality + for the fit of the signature genes in each sample. It tells if the + minimization converged, and other info about this fit comparing the + measured gene expression in the sigGenes vs predicted gene expression in + the sigGenes.} +} +} +\description{ +\code{EPIC} takes as input bulk gene expression data (RNA-seq) and returns + the proportion of mRNA and cells composing the various samples. +} +\details{ +This function uses a constrained least square minimization to estimate the +proportion of each cell type with a reference profile and another +uncharacterized cell type in bulk gene expression samples. + +The names of the genes in the bulk samples, the reference samples and in the +gene signature list need to be the same format (gene symbols are used in the +predefined reference profiles). The full list of gene names don't need to be +exactly the same between the reference and bulk samples: \emph{EPIC} +will use the intersect of the genes. +} +\examples{ +res1 <- EPIC(hoek_data$rawCounts) +res1$cellFractions +res2 <- EPIC(hoek_data$rawCounts, BRef) +res3 <- EPIC(bulk=hoek_data$rawCounts, reference=BRef) +res4 <- EPIC(hoek_data$rawCounts, reference="BRef") +res5 <- EPIC(hoek_data$rawCounts, mRNA_cell_sub=c(Bcells=1, otherCells=5)) +res6 <- EPIC(bulk=hoek_data$rawCounts, reference="BRef.tpm") +# Various possible ways of calling EPIC function. res 1 to 4 should +# give exactly the same outputs, and the elements res1$cellFractions +# should be equal to the example predictions found in +# hoek_data$cellFractions.pred for these first 4 results. +# The values of cellFraction for res5 will be different due to the use of +# other mRNA per cell values for the B and other cells. +# And res6 will also give different results and is not advised: the reference +# BRef.tpm corresponds to tpm while the bulk was given as raw counts. + +} + diff --git a/man/EPIC.package.Rd b/man/EPIC.package.Rd new file mode 100644 index 0000000..34870d5 --- /dev/null +++ b/man/EPIC.package.Rd @@ -0,0 +1,40 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{package} +\name{EPIC.package} +\alias{EPIC.package} +\alias{EPIC.package-package} +\title{EPIC: a package to Estimate the Proportion of Immune and Cancer cells from + tumor gene expression data.} +\description{ +EPIC package provides the function and immune cell reference profiles to +estimate the proportion of immune and other cells from bulk gene expression +data. +} +\section{EPIC functions}{ + +\code{\link{EPIC}} is the main function to call to estimate the + various cells proportions from a bulk sample. +} + +\section{Included datasets}{ + +\code{\link{BRef}}, \code{\link{BRef.tpm}}: reference profiles from + circulating immune cells. + + \code{\link{TRef.tpm}}: reference profiles from immune cells obtained + from single cell data of tumor infiltrating cells from melanoma patients. + +\code{\link{hoek_data}}: example dataset containing data from Hoek et al, + 2015, PLoS One. + +\code{\link{mRNA_cell_default}}: values of mRNA per cell for the main cell + types. +} + +\section{Authors}{ + +Julien Racle <\email{julien.racle@unil.ch}> and David Gfeller <\email{ + david.gfeller@unil.ch}>. +} + diff --git a/man/TRef.tpm.Rd b/man/TRef.tpm.Rd new file mode 100644 index 0000000..8810a9d --- /dev/null +++ b/man/TRef.tpm.Rd @@ -0,0 +1,31 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{TRef.tpm} +\alias{TRef.tpm} +\title{Reference profiles obtained from single cell data of tumor infiltrating +cells.} +\format{A list of 3 elements: \describe{ \item{$refProfiles, + $refProfiles.var}{Matrices (nGenes x nRefCells) of the normalized gene + expression from the reference cells and the variability of this gene + expression for each gene and each cell type} \item{$sigGenes}{A list of 80 + signature genes used to deconvolve the cell proportions} }} +\source{ +\url{http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE72056} +} +\usage{ +TRef.tpm +} +\description{ +A dataset containing the reference profiles given in TPM from various cell +types: \emph{B-}, \emph{NK-}, \emph{T-cells} and \emph{Macrophages}. +} +\details{ +These were obtained from single-cell RNA-seq data from 9 donors from +the publication of Tirosh et al., 2016, Science. The samples +come from cancer metastases of melanoma (extracted from primary tumors +and non-lymphoid tissue metastases). The classification for each sample with +respect to each cell type is the one given by Tirosh et al. +} +\keyword{datasets} + diff --git a/man/hoek_data.Rd b/man/hoek_data.Rd new file mode 100644 index 0000000..b38d072 --- /dev/null +++ b/man/hoek_data.Rd @@ -0,0 +1,34 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{hoek_data} +\alias{hoek_data} +\title{Example dataset containing data from Hoek et al, 2015, PLoS One.} +\format{This is a list of 3 elements: \describe{ + \item{$rawCounts}{(matrix of 51574 genes x 2 donors) The raw read counts + from the two donors. It has been obtained by mapping the original + fastq files to \emph{hg19} genome with help of \emph{tophat} and + \emph{htseq-count}.} + \item{$cellFractions.obs}{(matrix of 2 donors x 6 cell types) The + proportions of the different immune cells in the PBMC from the 2 donors, + as measured by FACS by Hoek et al.} + \item{$cellFractions.pred}{(matrix of 2 donors x 7 cell types) The + proportions of the different immune cells and of a potential additional + uncharacterized cell, as predicted by EPIC based on the rawCounts and + the profiles \code{reference=BRef}.} +}} +\source{ +The description of this data can be found here: + \href{http://dx.doi.org/10.1371/journal.pone.0118528}{link to paper} + and \href{https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-64655}{link + to data}. +} +\usage{ +hoek_data +} +\description{ +This dataset contains a subset of the full Hoek et al data. It contains only +the data from the two healthy donors PBMC before influenza vaccination. +} +\keyword{datasets} + diff --git a/man/mRNA_cell_default.Rd b/man/mRNA_cell_default.Rd new file mode 100644 index 0000000..3a2a21b --- /dev/null +++ b/man/mRNA_cell_default.Rd @@ -0,0 +1,23 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_descr.R +\docType{data} +\name{mRNA_cell_default} +\alias{mRNA_cell_default} +\title{Values of mRNA / cell for the main cell types.} +\format{A named numeric vector of the relative amount of mRNA per cell type. + There are two additional "special cell types": the \emph{otherCells} which + correspond to the uncharacterized cells present in the sample but without + any reference profile and the \emph{default} which is the default value used + for cells with reference profiles but without a value specified in the + \code{mRNA_cell_default} vector.} +\usage{ +mRNA_cell_default +} +\description{ +These values have been obtained from experiments (see \cite{EPIC} publication). +For the other uncharacterized cells, we use a value of 0.4 as described +in \cite{EPIC} publication. For macrophages we don't have specific values but +assumed here it is the same value as for monocytes. +} +\keyword{datasets} + diff --git a/man/scaleCounts.Rd b/man/scaleCounts.Rd new file mode 100644 index 0000000..e238260 --- /dev/null +++ b/man/scaleCounts.Rd @@ -0,0 +1,25 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/EPIC_fun.R +\name{scaleCounts} +\alias{scaleCounts} +\title{Scaling raw counts from each sample.} +\usage{ +scaleCounts(counts, sigGenes = NULL, renormGenes = NULL, normFact = NULL) +} +\description{ +Normalizing the sum of counts from each sample to 1e6. +} +\details{ +Function taking a matrix (\emph{genes} x \emph{samples}) of counts as input +and returning the scaled counts for the subset signature genes (or all genes +if it sigGenes is \code{NULL}), with the scaled counts computed based on all +the 'renormGenes' (or all genes if it is NULL). The renormalization is made +independently for each sample, to have the sum of each columns over the +renormGenes equal to 1e6. + +normFact, if not null, is used as the normalization factor instead of + the colSums (used to renormalize the refProfiles.var by the same amount + than the refProfiles). +} +\keyword{internal} + diff --git a/tests/testthat.R b/tests/testthat.R new file mode 100644 index 0000000..04c7d8e --- /dev/null +++ b/tests/testthat.R @@ -0,0 +1,4 @@ +library(testthat) +library(EPIC) + +test_check("EPIC") diff --git a/tests/testthat/test_EPIC_fun.R b/tests/testthat/test_EPIC_fun.R new file mode 100644 index 0000000..120820c --- /dev/null +++ b/tests/testthat/test_EPIC_fun.R @@ -0,0 +1,22 @@ +context("Testing EPIC function") + +test_that("Test of bad inputs", { + a <- matrix(1:20, nrow=5) + expect_error(EPIC(bulk=a), "There are only 0 signature") + refStr <- "unknownRef" + expect_error(EPIC(bulk=hoek_data$rawCounts, reference=refStr), + paste0(refStr, ".* not part of the allowed references")) + tRef <- BRef; tRef$sigGenes <- NULL + expect_error(EPIC(bulk=hoek_data$rawCounts, reference=tRef), + "needs to contain .* 'sigGenes'") + tRef <- BRef; + tRef$refProfiles.var <- tRef$refProfiles.var[ + nrow(tRef$refProfiles.var):1,] + expect_error(EPIC(bulk=hoek_data$rawCounts, reference=tRef), + "dimnames of .*refProfiles.*refProfiles.var.*same") +}) + +test_that("Test for correct result on hoek data with default input", { + testFract <- EPIC(hoek_data$rawCounts)$cellFractions + expect_equal(testFract, hoek_data$cellFractions.pred) +}) diff --git a/vignettes/info.Rmd b/vignettes/info.Rmd new file mode 100644 index 0000000..ae4a198 --- /dev/null +++ b/vignettes/info.Rmd @@ -0,0 +1,40 @@ +--- +title: "EPIC package" +author: "Julien Racle and David Gfeller" +date: "`r Sys.Date()`" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{EPIC package} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +## Description +Package implementing EPIC method to estimate the proportion of immune and cancer +cells from bulk gene expression data. It is based on reference gene expression +profiles for the main immune cell types and it predicts the proportion of these +cells and of the remaining "other cells" that are the remaining cells (cancer, +stromal, endothelial) for which no reference profile is given. + +## Usage +The main function in this package is `EPIC`. It needs as input a matrix of the raw gene expression counts from the samples for which to estimate cell proportions. One can also define the reference cells to use +```{r, eval = FALSE} +out <- EPIC(bulk = bulkSamplesMatrix) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList) +``` + +`out` is a list containing the various mRNA and cell fractions in each samples as well as some *data.frame* of the goodness of fit. + +Values of mRNA per cell and signature genes to use can also be changed: +```{r, eval = FALSE} +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell = mRNA_cell_vector, sigGenes = sigGenes_vector) +out <- EPIC(bulk = bulkSamplesMatrix, reference = referenceCellsList, mRNA_cell_sub = mRNA_cell_sub_vector) +``` + + +## Installation +```{r, eval = FALSE} +install.packages("devtools") +devtools::install_github("GfellerLab/EPIC") +``` +