diff --git a/.Rbuildignore b/.Rbuildignore index dcd837c0..35b12209 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -13,6 +13,7 @@ ^.gitlab-ci.yml$ ^CONTRIBUTION.md$ ^_pkgdown.yml$ +^pkgdown$ ^code-of-conduct.md$ ^LICENSE$ ^immunarch-citation.xml$ diff --git a/CONTRIBUTION.md b/CONTRIBUTION.md index 8011cdf2..54a68a67 100644 --- a/CONTRIBUTION.md +++ b/CONTRIBUTION.md @@ -1,14 +1,19 @@ # Contributing to immunarch -This outlines how to propose a change to immunarch. +This document outlines how to propose a change to immunarch. -### Fixing typos -Small typos or grammatical errors in documentation may be edited directly using -the GitLab web IDE functionality / GitHub web interface, so long as the changes are made in the _source_ file. +## How to contribute? -* YES: you edit a roxygen comment in a `.R` file below `R/`. -* NO: you edit an `.Rd` file below `man/`. +There are three general ways to contribute to the package. + +1. Code contribution via pull requests. Helps us improve the codebase of the package, add new features or fix bugs and typos in the documentation. + +2. Bug reports and feature requests via GitHub issues. Helps us notice important issues and improvements to address. + +3. Helping others by answering tickets in GitHub issues. Greatly helps build the community and accelerates the immune repertoire research progress. + +## Code contribution via pull requests ### Prerequisites @@ -20,23 +25,79 @@ that via GitHub, please note it in the issue. ### Pull request process +* Always start by forking the `dev` branch [from here](https://github.com/immunomind/immunarch/tree/dev) to make sure you have the latest pre-release version of `immunarch`. Name it accordingly to your contributions. `repClonality refactor` or `fix-issue-12` would work. * We recommend that you create a Git branch for each pull request (PR). -* Look at the CI (Continuous Integration) build status before and after making changes. -The `README` should contain badges for any continuous integration services used -by the package. -* New code should follow our style guide that is the tidyverse [style guide](http://style.tidyverse.org) -with a very minor addition of using `=` instead of `<-` in variable assignment. -You can use the [styler](https://CRAN.R-project.org/package=styler) package to +* We follow the [following guidelines for commit naming](https://medium.com/@kevinkreuzer/the-way-to-fully-automated-releases-in-open-source-projects-44c015f38fd6). +* New code should follow our style guide that is the tidyverse [style guide](http://style.tidyverse.org) You can use the [styler](https://CRAN.R-project.org/package=styler) package to apply these styles, but please don't restyle code that has nothing to do with your PR. * We use [roxygen2](https://cran.r-project.org/package=roxygen2) for documentation. * We use [testthat](https://cran.r-project.org/package=testthat). Contributions -with test cases included are easier to accept. -* For user-facing changes, add a bullet to the top of `NEWS.md` below the current -development version header describing the changes made followed by your GitHub -username, and links to relevant issue(s)/PR(s). +with test cases included are easier to accept. +* Look at the CI (Continuous Integration) build status before and after making changes. +The `README` should contain badges for any continuous integration services used +by the package. + +### Fixing typos + +Small typos or grammatical errors in documentation may be edited directly using +the GitLab web IDE functionality / GitHub web interface, so long as the changes are made in the _source_ file. Please make sure to create a Pull Request instead of commiting directly to the `master` branch. + +* YES: you edit a roxygen comment in a `.R` file below `R/`. +* NO: you edit an `.Rd` file below `man/`. + +### Version naming + +We employ Major.Minor.Patch. E.g., 0.4.0. + +Hotfixes: Major.Minor.Patch → Major.Minor.(Patch + 1) + +Guidelines: [http://r-pkgs.had.co.nz/release.html](http://r-pkgs.had.co.nz/release.html) + +### Commit naming + +We follow the [following guidelines for commit naming](https://medium.com/@kevinkreuzer/the-way-to-fully-automated-releases-in-open-source-projects-44c015f38fd6). + +#### Commit types + +There are eight types of commits: `chore`, `docs`, `feat`, `fix`, `refactor`, `test`, `perf`, `style`. Most used are `feat` for implementation of a new feauture, `docs` for updating the documentation, `fix` for fixing a bug. + +Commit name examples: `feat(diversity): added the Chao1 method for diversity estimations`, `fix(clonality): fixed a bug in clonality computations #12`, where `#12` is a link to the issue on the immunarch issue page. + +#### Commit scopes + +- Changes in analysis- and visualisation-specific functions: `diversity`, `overlap`, `pub-rep`, `clonality`, `gene-usage`, `explore`, `kmers`, `spectratype`, `dynamics`, `tools`, `single-cell`. + +- General changes in visualisation functions (e.g., replace one package with another, or change a non-specific visualisation function such as `vis_bar`): `vis` + +- Changes in parsing: `io` + +- Changes in databases support: `db` + +- Changes in additional functions such as general statistics functions: `utility` + +- Changes in data: `data` + +- Changes in NAMESPACE, DESCRIPTION, citations, ISSUE_TEMPLATE.md, etc., without README: `upkeep` + +- Changes in README and vignettes: `vignette` + +- Changes in Continuous Integration: `ci` + +- Changes in Shiny applications: `shiny` + + +## Bug reports and feature requests + +### How to create an Issue + +We have a rich list of templates for Issues [here](https://github.com/immunomind/immunarch/tree/master/.github/ISSUE_TEMPLATE). Go to the [GitHub Issues page](https://github.com/immunomind/immunarch/issues) for `immunarch` and create a new Issue ticket from there. + +## Helping others by answering tickets + +Got to [GitHub Issues page](https://github.com/immunomind/immunarch/issues) and find Issues that you are familiar with to answer. -### Code of Conduct +## Code of Conduct Please note that this project is released with a [Contributor Code of Conduct](code-of-conduct.md). By participating in this project you agree to diff --git a/DESCRIPTION b/DESCRIPTION index a7af1aaa..6c43c51c 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,18 +1,18 @@ Package: immunarch Type: Package Title: Bioinformatics Analysis of T-Cell and B-Cell Immune Repertoires -Version: 0.6.4 +Version: 0.6.5 Authors@R: c( person("Vadim I.", "Nazarov", , "vdm.nazarov@gmail.com", c("aut", "cre")), person("Vasily O.", "Tsvetkov", , role = "aut"), person("Eugene", "Rumynskiy", , role = "aut"), - person("Anna", "Lorenc", , role = "aut"), - person("Daniel J.", "Moore", , role = "aut"), - person("Victor", "Greiff", , role = "aut"), + person("Anna", "Lorenc", , role = "ctb"), + person("Daniel J.", "Moore", , role = "ctb"), + person("Victor", "Greiff", , role = "ctb"), person("ImmunoMind", role = c("cph", "fnd")) ) Contact: support@immunomind.io -Description: A comprehensive framework for bioinformatics analysis of bulk and single-cell +Description: A comprehensive framework for bioinformatics exploratory analysis of bulk and single-cell T-cell receptor and antibody repertoires. It provides seamless data loading, analysis and visualisation for AIRR (Adaptive Immune Receptor Repertoire) data, both bulk immunosequencing (RepSeq) and single-cell sequencing (scRNAseq). It implements most of the widely used AIRR analysis methods, @@ -24,11 +24,12 @@ License: AGPL-3 URL: https://immunarch.com/, https://github.com/immunomind/immunarch BugReports: https://github.com/immunomind/immunarch/issues Imports: + factoextra (>= 1.0.4), + fpc, + UpSetR (>= 1.4.0), pheatmap (>= 1.0.12), ggrepel (>= 0.8.0), reshape2 (>= 1.4.2), - factoextra (>= 1.0.4), - fpc, circlize, MASS (>= 7.3), Rtsne (>= 0.15), @@ -36,10 +37,8 @@ Imports: readxl (>= 1.3.1), shiny (>= 1.4.0), shinythemes, - treemap, airr, ggseqlogo, - UpSetR (>= 1.4.0), stringr (>= 1.4.0), ggalluvial (>= 0.10.0), Rcpp (>= 1.0), @@ -57,7 +56,7 @@ Depends: dplyr (>= 0.8.0), dtplyr (>= 1.0.0), data.table (>= 1.12.6), - gridExtra (>= 2.2.1) + patchwork LinkingTo: Rcpp Suggests: knitr (>= 1.8), diff --git a/NAMESPACE b/NAMESPACE index e4a6e98c..f9df9f02 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -43,7 +43,6 @@ export(cross_entropy) export(dbAnnotate) export(dbLoad) export(entropy) -export(filter_barcodes) export(fixVis) export(geneUsage) export(geneUsageAnalysis) @@ -79,6 +78,8 @@ export(repOverlap) export(repOverlapAnalysis) export(repSample) export(repSave) +export(select_barcodes) +export(select_clusters) export(spectratype) export(split_to_kmers) export(top) @@ -93,7 +94,6 @@ export(vis_hist) export(vis_immunr_kmer_profile_main) export(vis_seqlogo) export(vis_textlogo) -export(vis_treemap) import(ggplot2) importFrom(Rcpp,cppFunction) importFrom(Rcpp,sourceCpp) @@ -105,6 +105,7 @@ importFrom(data.table,as.data.table) importFrom(data.table,data.table) importFrom(data.table,melt.data.table) importFrom(data.table,set) +importFrom(data.table,setDF) importFrom(data.table,setDT) importFrom(data.table,setcolorder) importFrom(data.table,setnames) @@ -115,6 +116,7 @@ importFrom(dplyr,contains) importFrom(dplyr,desc) importFrom(dplyr,distinct) importFrom(dplyr,filter) +importFrom(dplyr,first) importFrom(dplyr,full_join) importFrom(dplyr,group_by) importFrom(dplyr,group_by_at) @@ -149,10 +151,10 @@ importFrom(grDevices,colorRampPalette) importFrom(graphics,plot) importFrom(grid,gpar) importFrom(grid,rectGrob) -importFrom(gridExtra,arrangeGrob) -importFrom(gridExtra,grid.arrange) importFrom(magrittr,"%>%") importFrom(methods,as) +importFrom(patchwork,plot_annotation) +importFrom(patchwork,wrap_plots) importFrom(pheatmap,pheatmap) importFrom(plyr,mapvalues) importFrom(readr,col_character) @@ -213,7 +215,6 @@ importFrom(stringr,str_replace_all) importFrom(stringr,str_sort) importFrom(stringr,str_split) importFrom(tibble,tibble) -importFrom(treemap,treemap) importFrom(utils,packageVersion) importFrom(utils,read.table) importFrom(utils,setTxtProgressBar) diff --git a/R/search.R b/R/annotation.R similarity index 100% rename from R/search.R rename to R/annotation.R diff --git a/R/data.R b/R/data.R deleted file mode 100644 index fb3ea0ee..00000000 --- a/R/data.R +++ /dev/null @@ -1,14 +0,0 @@ -#' Immune repertoire dataset -#' -#' @concept data -#' -#' @description A testing dataset containing clonotypes. -#' -#' @format A list of two elements. First element "data" is a list with data frames with clonotype tables. -#' Second element "meta" is a metadata file. -#' \describe{ -#' \item{data}{List of immune repertoire data frames.} -#' \item{meta}{Metadata} -#' ... -#' } -"immdata" diff --git a/R/data_docs.R b/R/data_docs.R index 10df36f1..d07cad43 100644 --- a/R/data_docs.R +++ b/R/data_docs.R @@ -60,3 +60,38 @@ AA_TABLE_REVERSED <- sapply(unique(AA_TABLE), function(aa) { names(AA_TABLE)[AA_TABLE == aa] }) AA_TABLE_REVERSED <- AA_TABLE_REVERSED[order(names(AA_TABLE_REVERSED))] + + +#' Single chain immune repertoire dataset +#' +#' @concept data +#' +#' @description A dataset with single chain TCR data for testing and examplatory purposes. +#' +#' @format A list of two elements. First element ("data") is a list with data frames with clonotype tables. +#' Second element ("meta") is a metadata table. +#' \describe{ +#' \item{data}{List of immune repertoire data frames.} +#' \item{meta}{Metadata} +#' ... +#' } +"immdata" + + +#' Paired chain immune repertoire dataset +#' +#' @concept data +#' +#' @description A dataset with paired chain IG data for testing and examplatory purposes. +#' +#' @format A list of four elements. +#' "data" is a list with data frames with clonotype tables. +#' "meta" is a metadata table. +#' "bc_patients" is a list of barcodes corresponding to specific patients. +#' "bc_clusters" is a list of barcodes corresponding to specific cell clusters. +#' \describe{ +#' \item{data}{List of immune repertoire data frames.} +#' \item{meta}{Metadata} +#' ... +#' } +"scdata" diff --git a/R/diversity.R b/R/diversity.R index 31b4837e..62f69d39 100644 --- a/R/diversity.R +++ b/R/diversity.R @@ -169,7 +169,7 @@ repDiversity <- function(.data, .method = "chao1", .col = "aa", .max.q = 6, .min .extrapolation = .extrapolation, .perc = .perc, .verbose = .verbose, .do.norm = .do.norm, .laplace = .laplace ) - new_class <- tail(class(res[[1]]), 1) + new_class <- head(class(res[[1]]), 1) res <- do.call(rbind, res) if (.method == "hill") { res <- reshape2::melt(res) diff --git a/R/immunr_data_format.R b/R/immunr_data_format.R index d8a572f2..1b3f5662 100644 --- a/R/immunr_data_format.R +++ b/R/immunr_data_format.R @@ -73,4 +73,9 @@ IMMCOL$type <- c( "integer", "integer", "integer", "character" ) +# Additional information +IMMCOL_ADD <- new.env() +# separator for paired chain data +IMMCOL_ADD$scsep <- ";" + IMMUNR_ERROR_NOT_IMPL <- "ERROR: not implemented yet. Please contact us via support@immunomind.io if you need it in your research." diff --git a/R/io.R b/R/io.R index be7835af..d5eb3210 100644 --- a/R/io.R +++ b/R/io.R @@ -19,8 +19,9 @@ if (getRversion() >= "2.15.1") { #' @importFrom readr read_delim read_tsv read_csv col_integer col_character col_double col_logical col_guess cols write_lines #' @importFrom stringr str_split str_detect str_replace_all #' @importFrom methods as -#' @importFrom dplyr contains +#' @importFrom dplyr contains first #' @importFrom utils read.table +#' @importFrom data.table setDF #' #' @description The \code{repLoad} function loads repertoire files #' into R workspace in the immunarch format where you can immediately use them for @@ -49,10 +50,18 @@ if (getRversion() >= "2.15.1") { #' R data frames with only one type of chain and cell presented. The metadata file will have additional columns specifying #' cell and chain types for different samples. #' -#' @param .format A character string specifying what format to use. See "Details" for more information on supported formats. +#' @param .format A character string specifying what format to use. Do NOT use it. See "Details" for more information on supported formats. #' #' Leave NA (which is default) if you want `immunarch` to detect formats automatically. #' +#' @param .mode Either "single" for single chain data or "paired" for paired chain data. +#' +#' Currently "single" works for every format, and "paired" works only for 10X Genomics data. +#' +#' By default, 10X Genomics data will be loaded as paired chain data, and other files will be loaded as single chain data. +#' +#' @param .coding A logical value. Pass TRUE to get coding-only clonotypes (by defaul). Pass FALSE to get all clonotypes. +#' #' @details #' The metadata has to be a tab delimited file with first column named "Sample". #' It can have any number of additional columns with arbitrary names. @@ -66,9 +75,10 @@ if (getRversion() >= "2.15.1") { #' } #' #' \code{repLoad} has the ".format" argument that sets the format for input repertoire files. -#' Do not pass it if you want immunarch to detect the formats and parse files automatically! -#' In case you want to force the package to parse the data in a specific format, -#' you can choose one of the several options for the argument: +#' Immunarch detects the file format automatically, and the argument is left only for the compatability +#' purposes. It will be soon removed. Do not pass it or your code will stop working! +#' +#' Currently, Immunarch support the following formats: #' #' - "immunoseq" - ImmunoSEQ of any version. http://www.adaptivebiotech.com/immunoseq #' @@ -132,13 +142,18 @@ if (getRversion() >= "2.15.1") { #' # > names(immdata) #' # [1] "data" "meta" #' @export repLoad -repLoad <- function(.path, .format = NA) { +repLoad <- function(.path, .format = NA, .mode = "paired", .coding = TRUE) { + if (!is.na(.format)) { + warning("Please don't provide the .format argument, + immunarch detects the format automatically. + The .format argument will soon be removed.") + } - exclude_extensions <- c("so", "exe", "bam", "fasta", "fai", "fastq", "bed", "rds") + exclude_extensions <- c("so", "exe", "bam", "fasta", "fai", "fastq", "bed", "rds", "report", "vdjca") # Process a repertoire file: detect format and load the data # Return: a named list with a repertoire data frame and it's name - .read_repertoire <- function(.path, .format) { + .read_repertoire <- function(.path, .format, .mode, .coding) { parse_res <- list() # Detect format @@ -175,12 +190,16 @@ repLoad <- function(.path, .format = NA) { message("unknown format, skipping") } else { - parse_res <- parse_fun(.path) + parse_res <- parse_fun(.path, .mode) if (is.null(parse_res)) { message(" [!] Warning: zero clonotypes found, skipping") parse_res <- list() } else { + if (.coding) { + parse_res <- coding(parse_res) + } + if (!has_class(parse_res, "list")) { parse_res <- list(parse_res) names(parse_res) <- .remove.ext(.path) @@ -197,7 +216,7 @@ repLoad <- function(.path, .format = NA) { # just load all repertoire files. # Do NOT (!) create a dummy metadata, return en empty data frame instead # Return: list with data, metadata and barcodes (if necessary) - .process_batch <- function(.files, .format) { + .process_batch <- function(.files, .format, .mode, .coding) { parsed_batch <- list() metadata <- tibble() @@ -219,10 +238,10 @@ repLoad <- function(.path, .format = NA) { } } else if (stringr::str_detect(.filepath, "barcode")) { - + # TODO: add the barcode processing subroutine to split by samples } else { - repertoire <- .read_repertoire(.filepath, .format) + repertoire <- .read_repertoire(.filepath, .format, .mode, .coding) if (length(repertoire) != 0) { parsed_batch <- c(parsed_batch, repertoire) } @@ -325,14 +344,14 @@ repLoad <- function(.path, .format = NA) { for (batch_i in 1:length(batches)) { if (length(batches[[batch_i]])) { message('Processing "', names(batches)[batch_i], '" ...') - parsed_batches[[names(batches)[batch_i]]] <- .process_batch(batches[[batch_i]], .format) + parsed_batches[[names(batches)[batch_i]]] <- .process_batch(batches[[batch_i]], .format, .mode, .coding) } } # # Step 2: check metadata files # - message("\n== Step 2/3: checking metadata files and merging... ==\n") + message("\n== Step 2/3: checking metadata files and merging files... ==\n") for (batch_i in 1:length(parsed_batches)) { message('Processing "', names(parsed_batches)[batch_i], '" ...') @@ -350,7 +369,7 @@ repLoad <- function(.path, .format = NA) { # # Step 3: split by chains and barcodes # - message("\n== Step 3/3: splitting data by barcodes and chains... ==\n") + message("\n== Step 3/3: processing paired chain data... ==\n") # Check for mixed chains repertoire files and split them if necessary rep_names <- names(parsed_batches$data) @@ -364,29 +383,40 @@ repLoad <- function(.path, .format = NA) { } if ("chain" %in% colnames(parsed_batches$data[[source_name]])) { - df_split <- split(parsed_batches$data[[source_name]], parsed_batches$data[[source_name]]$chain) - if (length(df_split) > 1) { - chain_col <- "Chain" - source_col <- "Source" - - if (!(chain_col %in% colnames(metadata))) { - message("Splitting repertoires by their chain type. Columns ", chain_col, " and ", source_col, " added to the metadata.") - message(" -- Column ", chain_col, " specifies chain types: TRA, TRB, etc.") - message(" -- Column ", source_col, " specifies the initial sample name for a repertoire after splitting by chain types") - } + # + # If mode is "single" then split paired chain data to two chain-specific immune repertoires + # + if (.mode == "single") { + df_split <- split(parsed_batches$data[[source_name]], parsed_batches$data[[source_name]]$chain) + if (length(df_split) > 1) { + chain_col <- "Chain" + source_col <- "Source" + + if (!(chain_col %in% colnames(metadata))) { + message("Splitting repertoires by their chain type. Columns ", chain_col, " and ", source_col, " added to the metadata.") + message(" -- Column ", chain_col, " specifies chain types: TRA, TRB, etc.") + message(" -- Column ", source_col, " specifies the initial sample name for a repertoire after splitting by chain types") + } - for (i in 1:length(df_split)) { - sample_name <- paste0(source_name, "_", names(df_split)[i]) - parsed_batches$data[[sample_name]] <- df_split[[i]] - new_record <- metadata[which(metadata$Sample == source_name), ] - new_record$Sample <- sample_name - new_record[[chain_col]] <- names(df_split)[i] - new_record[[source_col]] <- source_name - metadata <- merge(metadata, new_record, all = TRUE) - } + for (i in 1:length(df_split)) { + sample_name <- paste0(source_name, "_", names(df_split)[i]) + parsed_batches$data[[sample_name]] <- df_split[[i]] + new_record <- metadata[which(metadata$Sample == source_name), ] + new_record$Sample <- sample_name + new_record[[chain_col]] <- names(df_split)[i] + new_record[[source_col]] <- source_name + metadata <- merge(metadata, new_record, all = TRUE) + } - parsed_batches$data[[source_name]] <- NULL - metadata <- metadata[-(which(metadata$Sample == source_name)), ] + parsed_batches$data[[source_name]] <- NULL + metadata <- metadata[-(which(metadata$Sample == source_name)), ] + } + } + # + # If mode is "paired" then nothing to do here + # + else { + # pass } } } @@ -428,10 +458,9 @@ repLoad <- function(.path, .format = NA) { #' repSave(immdata, dirpath) #' # Load it and check if it is the same #' new_immdata <- repLoad(dirpath) -#' sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) -#' sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) -#' sum(immdata$meta != new_immdata$meta, na.rm = TRUE) -#' +#' # sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) +#' # sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) +#' # sum(immdata$meta != new_immdata$meta, na.rm = TRUE) #' @export repSave repSave <- function(.data, .path, .format = c("immunarch", "vdjtools"), .compress = TRUE) { @@ -646,7 +675,7 @@ repSave <- function(.data, .path, .format = c("immunarch", "vdjtools"), } } -.postprocess <- function(.data) { +.postprocess <- function(.data, .mode) { .data[[IMMCOL$cdr3nt]][.data[[IMMCOL$cdr3nt]] == "NONE"] <- NA logic <- is.na(.data[[IMMCOL$cdr3aa]]) & !is.na(.data[[IMMCOL$cdr3nt]]) if (any(logic)) { @@ -657,39 +686,11 @@ repSave <- function(.data, .path, .format = c("immunarch", "vdjtools"), if (any(logic)) { warn_msg <- c(" [!] Removed ", sum(logic)) warn_msg <- c(warn_msg, " clonotypes with no nucleotide and amino acid CDR3 sequence.") - # warn_msg = c(warn_msg, "\n Please check if your files are the final repertoire files, not the intermediate ones before filtering out bad clonotypes.\n") message(warn_msg) } .data <- .data[!logic, ] if (nrow(.data)) { - # Process 10xGenomics filtered contigs files - count barcodes, merge consensues ids, clonotype ids and contig ids - if (all(c("raw_clonotype_id", "barcode") %in% names(.data))) { - .data <- .data %>% - group_by(CDR3.nt, V.name, J.name) %>% - summarise( - Clones = length(unique(barcode)), - CDR3.aa = head(CDR3.aa, 1), - D.name = head(D.name, 1), - Sequence = head(CDR3.nt, 1), - V.end = head(V.end, 1), - J.start = head(J.start, 1), - D.end = head(D.end, 1), - D.start = head(D.start, 1), - VD.ins = head(VD.ins, 1), - DJ.ins = head(DJ.ins, 1), - VJ.ins = head(VJ.ins, 1), - chain = head(chain, 1), - barcode = paste0(unique(barcode), collapse = ";"), - raw_clonotype_id = gsub("clonotype", "", paste0(raw_clonotype_id, collapse = ";")), - raw_consensus_id = gsub("clonotype|consensus", "", paste0(raw_consensus_id, collapse = ";")), - contig_id = paste0(contig_id, collapse = ";") - ) %>% - ungroup() - .data[[IMMCOL$prop]] <- .data[[IMMCOL$count]] / sum(.data[[IMMCOL$count]]) - setcolorder(.data, IMMCOL$order) - } - for (colname in c(IMMCOL$ve, IMMCOL$ds, IMMCOL$de, IMMCOL$js, IMMCOL$vnj, IMMCOL$vnd, IMMCOL$dnj)) { if (colname %in% colnames(.data)) { logic <- is.na(.data[[colname]]) @@ -725,7 +726,7 @@ repSave <- function(.data, .path, .format = c("immunarch", "vdjtools"), ##### Parsers ##### -parse_repertoire <- function(.filename, .nuc.seq, .aa.seq, .count, +parse_repertoire <- function(.filename, .mode, .nuc.seq, .aa.seq, .count, .vgenes, .jgenes, .dgenes, .vend, .jstart, .dstart, .dend, .vd.insertions, .dj.insertions, .total.insertions, @@ -752,6 +753,7 @@ parse_repertoire <- function(.filename, .nuc.seq, .aa.seq, .count, .skip = .skip, .sep = "\t", .add ) + # IO_REFACTOR suppressMessages(df <- readr::read_delim(.filename, col_names = TRUE, col_types = col.classes, delim = .sep, @@ -759,6 +761,7 @@ parse_repertoire <- function(.filename, .nuc.seq, .aa.seq, .count, comment = "", trim_ws = TRUE, skip = .skip, na = c("", "NA", ".") )) + # suppressMessages(df <- fread(.filename, skip = .skip, data.table = FALSE, na.strings = c("", "NA", "."))) names(df) <- tolower(names(df)) recomb_type <- .which_recomb_type(df[[.vgenes]]) @@ -887,10 +890,10 @@ parse_repertoire <- function(.filename, .nuc.seq, .aa.seq, .count, colnames(df)[13] <- IMMCOL$vnd colnames(df)[14] <- IMMCOL$dnj - .postprocess(df) + .postprocess(df, .mode) } -parse_immunoseq <- function(.filename, .wash.alleles = TRUE) { +parse_immunoseq <- function(.filename, .mode, .wash.alleles = TRUE) { .fix.immunoseq.genes <- function(.col) { # fix "," .col <- gsub(",", ", ", .col, fixed = TRUE, useBytes = TRUE) @@ -971,12 +974,14 @@ parse_immunoseq <- function(.filename, .wash.alleles = TRUE) { file_cols[[IMMCOL$js]] <- IMMCOL$js file_cols[[IMMCOL$seq]] <- IMMCOL$seq + # IO_REFACTOR suppressMessages(df <- readr::read_delim(.filename, col_names = TRUE, col_types = cols(), delim = "\t", quote = "", escape_double = FALSE, comment = "", trim_ws = TRUE, skip = 0 )) + # suppressMessages(df <- fread(.filename, data.table = FALSE)) names(df) <- tolower(names(df)) @@ -1045,7 +1050,7 @@ parse_immunoseq <- function(.filename, .wash.alleles = TRUE) { } } -parse_mitcr <- function(.filename) { +parse_mitcr <- function(.filename, .mode) { .skip <- 0 f <- file(.filename, "r") l <- readLines(f, 1) @@ -1103,14 +1108,14 @@ parse_mitcr <- function(.filename) { } parse_repertoire( - .filename = filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, + .filename = filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, .total.insertions = total.insertions, .skip = .skip, .sep = .sep ) } -parse_mixcr <- function(.filename) { +parse_mixcr <- function(.filename, .mode) { fix.alleles <- function(.data) { .data[[IMMCOL$v]] <- gsub("[*][[:digit:]]*", "", .data[[IMMCOL$v]]) .data[[IMMCOL$d]] <- gsub("[*][[:digit:]]*", "", .data[[IMMCOL$d]]) @@ -1214,11 +1219,14 @@ parse_mixcr <- function(.filename) { message("Error: can't find a column with D genes") } + + # IO_REFACTOR df <- read_delim( file = .filename, col_types = cols(), delim = .sep, skip = 0, comment = "", quote = "", escape_double = FALSE, trim_ws = TRUE ) + # df <- fread(.filename, data.table = FALSE) # # return NULL if there is no clonotypes in the data frame @@ -1320,14 +1328,14 @@ parse_mixcr <- function(.filename) { warn_msg <- c(" [!] Warning: can't find a column with clonal counts. Setting all clonal counts to 1.") warn_msg <- c(warn_msg, "\n Did you apply repLoad to MiXCR file *_alignments.txt?") warn_msg <- c(warn_msg, " If so please consider moving all *.clonotypes.*.txt MiXCR files to") - warn_msg <- c(warn_msg, " a separate folder and applying repLoad to it.") + warn_msg <- c(warn_msg, " a separate folder and apply repLoad to the folder.") warn_msg <- c(warn_msg, "\n Note: The *_alignments.txt file IS NOT a repertoire file suitable for any analysis.") message(warn_msg) df[[.count]] <- 1 } .freq <- "Proportion" - df$Proportion <- df[[.count]] / sum(df[[.count]]) + df$Proportion <- df[[.count]] / sum(df[[.count]], na.rm = TRUE) .aa.seq <- IMMCOL$cdr3aa df[[.aa.seq]] <- bunch_translate(df[[.nuc.seq]]) @@ -1374,7 +1382,7 @@ parse_mixcr <- function(.filename) { .postprocess(fix.alleles(df)) } -parse_migec <- function(.filename) { +parse_migec <- function(.filename, .mode) { filename <- .filename nuc.seq <- "CDR3 nucleotide sequence" aa.seq <- "CDR3 amino acid sequence" @@ -1393,7 +1401,7 @@ parse_migec <- function(.filename) { .sep <- "\t" parse_repertoire( - .filename = filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, + .filename = filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, @@ -1401,7 +1409,7 @@ parse_migec <- function(.filename) { ) } -parse_migmap <- function(.filename) { +parse_migmap <- function(.filename, .mode) { filename <- .filename nuc.seq <- "cdr3nt" aa.seq <- "cdr3aa" @@ -1420,14 +1428,14 @@ parse_migmap <- function(.filename) { .sep <- "\t" parse_repertoire( - .filename = filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, + .filename = filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, .total.insertions = total.insertions, .skip = .skip, .sep = .sep ) } -parse_tcr <- function(.filename) { +parse_tcr <- function(.filename, .mode) { f <- file(.filename, "r") l <- readLines(f, 2)[2] close(f) @@ -1453,14 +1461,14 @@ parse_tcr <- function(.filename) { } parse_repertoire( - .filename = .filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, + .filename = .filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, .total.insertions = total.insertions, .skip = .skip, .sep = .sep ) } -parse_vdjtools <- function(.filename) { +parse_vdjtools <- function(.filename, .mode) { skip <- 0 # Check for different VDJtools outputs @@ -1510,14 +1518,14 @@ parse_vdjtools <- function(.filename) { } parse_repertoire( - .filename = filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, + .filename = filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, .total.insertions = total.insertions, .skip = .skip, .sep = .sep ) } -parse_imgt <- function(.filename) { +parse_imgt <- function(.filename, .mode) { .fix.imgt.alleles <- function(.col) { sapply(strsplit(.col, " "), function(x) { if (length(x) > 1) { @@ -1550,7 +1558,7 @@ parse_imgt <- function(.filename) { junc_start <- .make_names("JUNCTION start") df <- parse_repertoire( - .filename = .filename, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, + .filename = .filename, .mode = .mode, .nuc.seq = nuc.seq, .aa.seq = aa.seq, .count = .count, .vgenes = vgenes, .jgenes = jgenes, .dgenes = dgenes, .vend = vend, .jstart = jstart, .dstart = dstart, .dend = dend, .vd.insertions = vd.insertions, .dj.insertions = dj.insertions, .total.insertions = total.insertions, .skip = .skip, .sep = .sep, .add = junc_start @@ -1587,7 +1595,7 @@ parse_imgt <- function(.filename) { # stop(IMMUNR_ERROR_NOT_IMPL) # } -parse_airr <- function(.filename) { +parse_airr <- function(.filename, .mode) { df <- airr::read_rearrangement(.filename) df <- df %>% @@ -1638,7 +1646,7 @@ parse_airr <- function(.filename) { df } -parse_immunarch <- function(.filename) { +parse_immunarch <- function(.filename, .mode) { df <- readr::read_tsv(.filename, col_types = cols(), comment = "#") if (ncol(df) == 1) { # "," in the files, parse differently then @@ -1648,8 +1656,9 @@ parse_immunarch <- function(.filename) { df } -parse_10x_consensus <- function(.filename) { +parse_10x_consensus <- function(.filename, .mode) { df <- parse_repertoire(.filename, + .mode = .mode, .nuc.seq = "cdr3_nt", .aa.seq = NA, .count = "umis", .vgenes = "v_gene", .jgenes = "j_gene", .dgenes = "d_gene", .vend = NA, .jstart = NA, .dstart = NA, .dend = NA, @@ -1661,23 +1670,79 @@ parse_10x_consensus <- function(.filename) { df } -parse_10x_filt_contigs <- function(.filename) { +parse_10x_filt_contigs <- function(.filename, .mode) { df <- parse_repertoire(.filename, + .mode = .mode, .nuc.seq = "cdr3_nt", .aa.seq = NA, .count = "umis", .vgenes = "v_gene", .jgenes = "j_gene", .dgenes = "d_gene", .vend = NA, .jstart = NA, .dstart = NA, .dend = NA, .vd.insertions = NA, .dj.insertions = NA, .total.insertions = NA, - .skip = 0, .sep = ",", .add = c("chain", "raw_clonotype_id", "raw_consensus_id", "barcode", "contig_id") + .skip = 0, .sep = ",", # .add = c("chain", "raw_clonotype_id", "raw_consensus_id", "barcode", "contig_id") + .add = c("chain", "barcode", "raw_clonotype_id", "contig_id") ) + # setnames(df, "raw_clonotype_id", "RawClonotypeID") + # setnames(df, "raw_consensus_id", "RawConsensusID") + + # Process 10xGenomics filtered contigs files - count barcodes, merge consensues ids, clonotype ids and contig ids + df <- df[order(df$chain),] + setDT(df) + + if (.mode == "paired") { + df <- df %>% + lazy_dt() %>% + group_by(barcode, raw_clonotype_id) %>% + summarise( + CDR3.nt = paste0(CDR3.nt, collapse = IMMCOL_ADD$scsep), + CDR3.aa = paste0(CDR3.aa, collapse = IMMCOL_ADD$scsep), + V.name = paste0(V.name, collapse = IMMCOL_ADD$scsep), + J.name = paste0(J.name, collapse = IMMCOL_ADD$scsep), + D.name = paste0(D.name, collapse = IMMCOL_ADD$scsep), + chain = paste0(chain, collapse = IMMCOL_ADD$scsep), + # raw_clonotype_id = gsub("clonotype", "", paste0(raw_clonotype_id, collapse = IMMCOL_ADD$scsep)), + # raw_consensus_id = gsub("clonotype|consensus", "", paste0(raw_consensus_id, collapse = IMMCOL_ADD$scsep)), + contig_id = gsub("_contig_", "", paste0(contig_id, collapse = IMMCOL_ADD$scsep)) + ) %>% + as.data.table() + } + df <- df %>% + lazy_dt() %>% + group_by(CDR3.nt, V.name, J.name) %>% + summarise( + Clones = length(unique(barcode)), + CDR3.aa = first(CDR3.aa), + D.name = first(D.name), + chain = first(chain), + barcode = paste0(unique(barcode), collapse = IMMCOL_ADD$scsep), + raw_clonotype_id = gsub("clonotype|None", "", paste0(unique(raw_clonotype_id), collapse = IMMCOL_ADD$scsep)), + # raw_clonotype_id = gsub("clonotype", "", paste0(raw_clonotype_id, collapse = IMMCOL_ADD$scsep)), + # raw_consensus_id = gsub("clonotype|consensus", "", paste0(raw_consensus_id, collapse = IMMCOL_ADD$scsep)), + contig_id = paste0(contig_id, collapse = IMMCOL_ADD$scsep) + ) %>% + as.data.table() + + df$V.end <- NA + df$J.start <- NA + df$D.end <- NA + df$D.start <- NA + df$VD.ins <- NA + df$DJ.ins <- NA + df$VJ.ins <- NA + df$Sequence <- df$CDR3.nt + setnames(df, "contig_id", "ContigID") - setnames(df, "raw_clonotype_id", "RawClonotypeID") - setnames(df, "raw_consensus_id", "RawConsensusID") setnames(df, "barcode", "Barcode") - df + + df[[IMMCOL$prop]] <- df[[IMMCOL$count]] / sum(df[[IMMCOL$count]]) + setcolorder(df, IMMCOL$order) + + setDF(df) + + .postprocess(df) } -parse_archer <- function(.filename) { +parse_archer <- function(.filename, .mode) { parse_repertoire(.filename, + .mode = .mode, .nuc.seq = "Clonotype Sequence", .aa.seq = NA, .count = "Clone Abundance", .vgenes = "Predicted V Region", .jgenes = "Predicted J Region", .dgenes = "Predicted D Region", .vend = NA, .jstart = NA, .dstart = NA, .dend = NA, diff --git a/R/preprocessing.R b/R/preprocessing.R new file mode 100644 index 00000000..0a52eac4 --- /dev/null +++ b/R/preprocessing.R @@ -0,0 +1,167 @@ +#' Get the N most abundant clonotypes +#' +#' @concept preprocessing +#' +#' @importFrom dplyr top_n collect +#' +#' @param .data The data to be processed. Can be \link{data.frame}, +#' \link{data.table}, or a list of these objects. +#' +#' Every object must have columns in the immunarch compatible format. +#' \link{immunarch_data_format} +#' +#' Competent users may provide advanced data representations: +#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list +#' of these objects. They are supported with the same limitations as basic objects. +#' +#' Note: each connection must represent a separate repertoire. +#' @param .n Numeric. Number of the most abundant clonotypes to return. +#' +#' @return +#' Data frame with the \code{.n} most abundant clonotypes only. +#' +#' @examples +#' data(immdata) +#' top(immdata$data) +#' top(immdata$data[[1]]) +#' @export top +top <- function(.data, .n = 10) { + if (has_class(.data, "list")) { + lapply(.data, top, .n = .n) + } else { + # ToDo: fix Clones with IMMDATA$count + top_n(.data, .n, Clones) %>% collect(n = Inf) + } +} + + +#' Filter out coding and non-coding clonotype sequences +#' +#' @concept preprocessing +#' +#' @aliases coding noncoding inframes outofframes +#' +#' @description Filter out clonotypes with non-coding, coding, in-frame or out-of-frame CDR3 sequences: +#' +#' `coding()` - remove all non-coding sequences (i.e., remove all sequences with stop codons and frame shifts); +#' +#' `noncoding()` - remove all coding sequences (i.e., leave sequences with stop codons and frame shifts only); +#' +#' `inframes()` - remove all out-of-frame sequences (i.e., remove all sequences with frame shifts); +#' +#' `outofframes()` - remove all in-frame sequences (i.e., leave sequences with frame shifts only). +#' +#' Note: the function will remove all clonotypes sequences with NAs in the CDR3 amino acid column. +#' +#' @usage +#' coding(.data) +#' +#' noncoding(.data) +#' +#' inframes(.data) +#' +#' outofframes(.data) +#' +#' @param .data The data to be processed. Can be \link{data.frame}, +#' \link{data.table}, or a list of these objects. +#' +#' Every object must have columns in the immunarch compatible format. +#' \link{immunarch_data_format} +#' +#' Competent users may provide advanced data representations: +#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list +#' of these objects. They are supported with the same limitations as basic objects. +#' +#' Note: each connection must represent a separate repertoire. +#' +#' @return +#' Filtered data frame. +#' +#' @examples +#' data(immdata) +#' immdata_cod <- coding(immdata$data) +#' immdata_cod1 <- coding(immdata$data[[1]]) +#' @export coding noncoding inframes outofframes +coding <- function(.data) { + if (has_class(.data, "list")) { + lapply(.data, coding) + } else { + # immdata$data[[1]] %>% mutate_(Len = "nchar(CDR3.nt)", Nonc = "CDR3.nt %like% '[*, ~]'") %>% filter((Len %% 3 == 0) & (!Nonc)) + + dt_flag <- FALSE + if (has_class(.data, "data.table")) { + dt_flag <- TRUE + .data <- .data %>% lazy_dt() + } + d <- collect(.data, n = Inf) + d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) + d <- d[grep("[*, ~]", d[[IMMCOL$cdr3aa]], invert = TRUE), ] + if (dt_flag) { + data.table(d) + } else { + d + } + } +} + +noncoding <- function(.data) { + if (has_class(.data, "list")) { + lapply(.data, noncoding) + } else { + dt_flag <- FALSE + if (has_class(.data, "data.table")) { + dt_flag <- TRUE + .data <- .data %>% lazy_dt() + } + d <- collect(.data, n = Inf) + d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) + d <- d[grep("[*, ~]", d[[IMMCOL$cdr3aa]], invert = FALSE), ] + if (dt_flag) { + data.table(d) + } else { + d + } + } +} + +inframes <- function(.data) { + if (has_class(.data, "list")) { + lapply(.data, inframes) + } else { + dt_flag <- FALSE + if (has_class(.data, "data.table")) { + dt_flag <- TRUE + .data <- .data %>% lazy_dt() + } + d <- collect(.data, n = Inf) + d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) + # subset(.data, nchar(.data[[IMMCOL$cdr3nt]]) %% 3 == 0) + d <- d[grep("[~]", d[[IMMCOL$cdr3aa]], invert = TRUE), ] + if (dt_flag) { + data.table(d) + } else { + d + } + } +} + +outofframes <- function(.data) { + if (has_class(.data, "list")) { + lapply(.data, outofframes) + } else { + dt_flag <- FALSE + if (has_class(.data, "data.table")) { + dt_flag <- TRUE + .data <- .data %>% lazy_dt() + } + d <- collect(.data, n = Inf) + d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) + # subset(.data, nchar(.data[[IMMCOL$cdr3nt]]) %% 3 != 0) + d <- d[grep("[~]", d[[IMMCOL$cdr3aa]], invert = FALSE), ] + if (dt_flag) { + data.table(d) + } else { + d + } + } +} diff --git a/R/shared.R b/R/public.R similarity index 100% rename from R/shared.R rename to R/public.R diff --git a/R/processing.R b/R/sampling.R similarity index 78% rename from R/processing.R rename to R/sampling.R index d0be11fd..f81b67f2 100644 --- a/R/processing.R +++ b/R/sampling.R @@ -217,63 +217,3 @@ sample_clonotypes <- function(.data, .n, .prob) { .data[[IMMCOL$prop]] <- .data[[IMMCOL$count]] / sum(.data[[IMMCOL$count]]) .data[order(.data[[IMMCOL$prop]], decreasing = TRUE), ] } - - -#' Filter clonotypes using barcodes from single-cell metadata -#' -#' @concept single_cell -#' -#' @description Given the input data and a vector of barcodes, -#' the function returns the input with cells which has -#' at least one of the given barcode. Columns with clonotype counts -#' and proportions are changed accordingly to the filtered barcodes. -#' -#' @param .data The data to be processed. Can be \link{data.frame}, -#' \link{data.table}, or a list of these objects. -#' -#' Every object must have columns in the immunarch compatible format. -#' \link{immunarch_data_format} -#' -#' Competent users may provide advanced data representations: -#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list -#' of these objects. They are supported with the same limitations as basic objects. -#' -#' Note: each connection must represent a separate repertoire. -#' -#' @param .barcodes Character vector. Vector of barcodes to use for -#' filtering. -#' -#' @return An immune repertoire or a list of immune repertoires with clonotypes which have -#' one or more barcodes in the input barcode vector and with clonotype abundances corrected. -#' -#' @examples -#' data(immdata) -#' # Create a fake single-cell data -#' df <- immdata$data[[1]] -#' df$Barcode <- "AAAAACCCCC" -#' df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" -#' barcodes <- "AAAAACCCCC" -#' df <- filter_barcodes(df, barcodes) -#' nrow(df) -#' @export -filter_barcodes <- function(.data, .barcodes) { - if (!has_class(.data, "list")) { - if (!("Barcode" %in% colnames(.data))) { - stop('Can\'t find the "Barcode" column in the input data. Did you apply the function to a single-cell repertoire?') - } - found_barcodes <- lapply( - strsplit(.data[["Barcode"]], ";"), - function(x) .barcodes %in% x - ) - barcode_count <- sapply(found_barcodes, sum) - - .data <- .data %>% - filter(barcode_count > 0) - barcode_count <- barcode_count[barcode_count > 0] - .data[[IMMCOL$count]] <- barcode_count - .data[[IMMCOL$prop]] <- barcode_count / sum(barcode_count) - .data - } else { - lapply(.data, filter_barcodes, .barcodes = .barcodes) - } -} diff --git a/R/singlecell.R b/R/singlecell.R new file mode 100644 index 00000000..3822d77b --- /dev/null +++ b/R/singlecell.R @@ -0,0 +1,188 @@ +if (getRversion() >= "2.15.1") { + utils::globalVariables(c( + "Barcode", "V1" + )) +} + + +#' Select specific clonotypes using barcodes from single-cell metadata +#' +#' @concept single_cell +#' +#' @description Subset the input immune repertoire by barcodes. Pass a vector of +#' barcodes to subset or a vector cluster IDs and corresponding barcodes to +#' get a list of immune repertoires corresponding to cluster IDs. +#' Columns with clonotype counts +#' and proportions are changed accordingly to the filtered barcodes. +#' +#' @param .data The data to be processed. Can be \link{data.frame}, +#' \link{data.table}, or a list of these objects. +#' +#' Every object must have columns in the immunarch compatible format. +#' \link{immunarch_data_format} +#' +#' Competent users may provide advanced data representations: +#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list +#' of these objects. They are supported with the same limitations as basic objects. +#' +#' Note: each connection must represent a separate repertoire. +#' +#' @param .barcodes Either a character vector with barcodes or a named character/factor vector with +#' barcodes as names and cluster IDs a vector elements. The output of Seurat's \code{Idents} function works. +#' +#' @param .force.list Logical. If TRUE then always return a list, even if the result is one data frame. +#' +#' @return An immune repertoire (if ".barcodes" is a barcode vector) or a list of immune repertoires +#' (if ".barcodes" is named vector or an output from Seurat::Idents()). Each element is an immune repertoire +#' with clonotype barcodes corresponding to the input barcodes. The output list's names are cluster names +#' in the ".barcode" argument (Seurat::Idents() case only). +#' +#' @seealso \link{select_clusters} +#' +#' @examples +#' \dontrun{ +#' data(immdata) +#' # Create a fake single-cell data +#' df <- immdata$data[[1]] +#' df$Barcode <- "AAAAACCCCC" +#' df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" +#' barcodes <- "AAAAACCCCC" +#' df <- select_barcodes(df, barcodes) +#' nrow(df) +#' } +#' @export +select_barcodes <- function(.data, .barcodes, .force.list = FALSE) { + if (has_class(.data, "list")) { + stop('Please apply the function to a repertoire instead of a list of repertoires. + In order to update both the data and metadata, consider using the select_clusters() function.') + } + if (!("Barcode" %in% colnames(.data))) { + stop('No "Barcode" column in the input data. Did you apply the function to a single-cell repertoire?') + } + + if (is.null(names(.barcodes))) { + tmp <- rep("_", length(.barcodes)) + names(tmp) <- .barcodes + .barcodes <- tmp + } + + # + # TODO: make the efficient version without un-nesting the whole data table + # + + # TODO: is there a way to optimise this without unnecessary conversion? + convert_to_df <- FALSE + if (!has_class(.data, "data.table")) { + df <- as.data.table(.data) + convert_to_df <- TRUE + } + + # Un-nest the data table + df <- df[, c(strsplit(Barcode, IMMCOL_ADD$scsep, useBytes = TRUE, fixed=TRUE), .SD), by = .(Barcode)] + df[[IMMCOL$count]] <- 1 + df[, Barcode := NULL] + setnames(df, "", "Barcode") + + # Split by barcodes + df_list <- split(df, .barcodes[df$Barcode]) + + if (length(df_list)) { + # TODO: is there a way to optimise this without data copying? + # Group clonotypes + for (i in 1:length(df_list)) { + df <- df_list[[i]] + df <- df[, .(sum(Clones), paste0(Barcode, collapse = IMMCOL_ADD$scsep)), by = setdiff(names(df), c("Clones", "Proportion", "Barcode"))] + setnames(df, "V1", "Clones") + setnames(df, "V2", "Barcode") + + df$Proportion <- df$Clones / sum(df$Clones) + df <- df[order(-Clones)] + + if (convert_to_df) { + setDF(df) + } + + df_list[[i]] <- df + } + } + + if (.force.list) { + df_list + } else if (length(df_list) == 1) { + df_list[[1]] + } else { + df_list + } +} + + +#' Split the immune repertoire data to clusters from single-cell barcodes +#' +#' @concept single_cell +#' +#' @description Given the vector of barcodes from Seurat, split the input repertoires +#' to separate subsets following the barcodes' assigned IDs. Useful when you want to +#' split immune repertoires by patients or clusters. +#' +#' @param .data List of two elements "data" and "meta", with "data" being a list of +#' immune repertoires, and "meta" being a metadata table. +#' +#' @param .clusters Factor vector with barcodes as vector names and cluster IDs as vector elements. +#' The output of the Seurat \code{Idents} function works. +#' +#' @param .field A string specifying the name of the field in the input metadata. New immune +#' repertoire subsets will have cluster IDs in this field. +#' +#' @return A list with two elements "data" and "meta" with updated immune repertoire tables and +#' metadata. +#' +#' @seealso \link{select_barcodes} +#' +#' @examples +#' \dontrun{ +#' library(Seurat) +#' Idents(pbmc_small) +#' new_cluster_ids <- c("A", "B", "C") +#' new_cluster_ids <- levels(pbmc_small) +#' new_cluster_ids +#' pbmc_small <- RenameIdents(pbmc_small, new_cluster_ids) +#' } +#' @export +select_clusters <- function(.data, .clusters, .field = "Cluster") { + if (!has_class(.data, "list")) { + stop("Please provide a list with both 'data' and 'meta' elements.") + } else if (!("data" %in% names(.data) && "meta" %in% names(.data))) { + stop("Your list is missing one of the elements required. + Please provide a list with both 'data' and 'meta' elements.") + } else if (!all(sapply(.data$data, function (x) "Barcode" %in% colnames(x)))) { + stop('No "Barcode" column in the input data. Did you apply the function to a single-cell repertoire?') + } + + new_data_list <- list() + new_metadata <- list() + + for (df_i in 1:length(.data$data)) { + # Select barcodes + source_name <- names(.data$data)[df_i] + df_list <- select_barcodes(.data$data[[df_i]], .clusters, TRUE) + + # Update the data and metadata if everything is OK + if (length(df_list)) { + for (new_df_i in 1:length(df_list)) { + cluster_id <- names(df_list)[new_df_i] + new_name <- paste0(names(.data$data)[df_i], "_", cluster_id) + + new_data_list[[new_name]] <- df_list[[new_df_i]] + + new_metarow <- .data$meta[.data$meta$Sample == source_name, ] + new_metarow$Sample <- new_name + new_metarow[[paste0(.field, ".source")]] <- source_name + new_metarow[[.field]] <- cluster_id + + new_metadata <- c(new_metadata, list(new_metarow)) + } + } + } + + list(data = new_data_list, meta = do.call(rbind, new_metadata)) +} diff --git a/R/tools.R b/R/tools.R index 32e9fe41..5dcc36f9 100644 --- a/R/tools.R +++ b/R/tools.R @@ -81,7 +81,7 @@ check_distribution <- function(.data, .do.norm = NA, .laplace = 1, .na.val = 0, #' tmp <- immunarch:::add_class(tmp, "new_class") #' class(tmp) add_class <- function(.obj, .class) { - class(.obj) <- c(class(.obj), .class) + class(.obj) <- c(.class, class(.obj)) .obj } @@ -408,175 +408,6 @@ apply_asymm <- function(.datalist, .fun, ..., .diag = NA, .verbose = TRUE) { } -#' Get the N most abundant clonotypes -#' -#' @concept preprocessing -#' -#' @importFrom dplyr top_n collect -#' -#' @param .data The data to be processed. Can be \link{data.frame}, -#' \link{data.table}, or a list of these objects. -#' -#' Every object must have columns in the immunarch compatible format. -#' \link{immunarch_data_format} -#' -#' Competent users may provide advanced data representations: -#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list -#' of these objects. They are supported with the same limitations as basic objects. -#' -#' Note: each connection must represent a separate repertoire. -#' @param .n Numeric. Number of the most abundant clonotypes to return. -#' -#' @return -#' Data frame with the \code{.n} most abundant clonotypes only. -#' -#' @examples -#' data(immdata) -#' top(immdata$data) -#' top(immdata$data[[1]]) -#' @export top -top <- function(.data, .n = 10) { - if (has_class(.data, "list")) { - lapply(.data, top, .n = .n) - } else { - # ToDo: fix Clones with IMMDATA$count - top_n(.data, .n, Clones) %>% collect(n = Inf) - } -} - - -#' Filter out coding and non-coding clonotype sequences -#' -#' @concept preprocessing -#' -#' @aliases coding noncoding inframes outofframes -#' -#' @description Filter out clonotypes with non-coding, coding, in-frame or out-of-frame CDR3 sequences: -#' -#' `coding()` - remove all non-coding sequences (i.e., remove all sequences with stop codons and frame shifts); -#' -#' `noncoding()` - remove all coding sequences (i.e., leave sequences with stop codons and frame shifts only); -#' -#' `inframes()` - remove all out-of-frame sequences (i.e., remove all sequences with frame shifts); -#' -#' `outofframes()` - remove all in-frame sequences (i.e., leave sequences with frame shifts only). -#' -#' Note: the function will remove all clonotypes sequences with NAs in the CDR3 amino acid column. -#' -#' @usage -#' coding(.data) -#' -#' noncoding(.data) -#' -#' inframes(.data) -#' -#' outofframes(.data) -#' -#' @param .data The data to be processed. Can be \link{data.frame}, -#' \link{data.table}, or a list of these objects. -#' -#' Every object must have columns in the immunarch compatible format. -#' \link{immunarch_data_format} -#' -#' Competent users may provide advanced data representations: -#' DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list -#' of these objects. They are supported with the same limitations as basic objects. -#' -#' Note: each connection must represent a separate repertoire. -#' -#' @return -#' Filtered data frame. -#' -#' @examples -#' data(immdata) -#' immdata_cod <- coding(immdata$data) -#' immdata_cod1 <- coding(immdata$data[[1]]) -#' @export coding noncoding inframes outofframes -coding <- function(.data) { - if (has_class(.data, "list")) { - lapply(.data, coding) - } else { - # immdata$data[[1]] %>% mutate_(Len = "nchar(CDR3.nt)", Nonc = "CDR3.nt %like% '[*, ~]'") %>% filter((Len %% 3 == 0) & (!Nonc)) - - dt_flag <- FALSE - if (has_class(.data, "data.table")) { - dt_flag <- TRUE - .data <- .data %>% lazy_dt() - } - d <- collect(.data, n = Inf) - d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) - d <- d[grep("[*, ~]", d[[IMMCOL$cdr3aa]], invert = TRUE), ] - if (dt_flag) { - data.table(d) - } else { - d - } - } -} - -noncoding <- function(.data) { - if (has_class(.data, "list")) { - lapply(.data, noncoding) - } else { - dt_flag <- FALSE - if (has_class(.data, "data.table")) { - dt_flag <- TRUE - .data <- .data %>% lazy_dt() - } - d <- collect(.data, n = Inf) - d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) - d <- d[grep("[*, ~]", d[[IMMCOL$cdr3aa]], invert = FALSE), ] - if (dt_flag) { - data.table(d) - } else { - d - } - } -} - -inframes <- function(.data) { - if (has_class(.data, "list")) { - lapply(.data, inframes) - } else { - dt_flag <- FALSE - if (has_class(.data, "data.table")) { - dt_flag <- TRUE - .data <- .data %>% lazy_dt() - } - d <- collect(.data, n = Inf) - d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) - # subset(.data, nchar(.data[[IMMCOL$cdr3nt]]) %% 3 == 0) - d <- d[grep("[~]", d[[IMMCOL$cdr3aa]], invert = TRUE), ] - if (dt_flag) { - data.table(d) - } else { - d - } - } -} - -outofframes <- function(.data) { - if (has_class(.data, "list")) { - lapply(.data, outofframes) - } else { - dt_flag <- FALSE - if (has_class(.data, "data.table")) { - dt_flag <- TRUE - .data <- .data %>% lazy_dt() - } - d <- collect(.data, n = Inf) - d <- filter(d, !is.na(d[[IMMCOL$cdr3aa]])) - # subset(.data, nchar(.data[[IMMCOL$cdr3nt]]) %% 3 != 0) - d <- d[grep("[~]", d[[IMMCOL$cdr3aa]], invert = FALSE), ] - if (dt_flag) { - data.table(d) - } else { - d - } - } -} - - #' Return a column's name #' #' @concept utility_private diff --git a/R/vis.R b/R/vis.R index 84ad1d5b..36cb2fd6 100644 --- a/R/vis.R +++ b/R/vis.R @@ -152,7 +152,7 @@ theme_cleveland2 <- function(rotate = TRUE) { #' #' Gene usage: #' -#' - Gene usage statistics (from \link{geneUsage}) using bar plots, box plots, treemaps - see \link{vis.immunr_gene_usage}; +#' - Gene usage statistics (from \link{geneUsage}) using bar plots, box plots - see \link{vis.immunr_gene_usage}; #' #' - Gene usage distances (from \link{geneUsageAnalysis}) using heatmaps, circos plots, polar area plots - see \link{vis.immunr_ov_matrix}; #' @@ -246,7 +246,6 @@ vis <- function(.data, ...) { #' vis(ov, "heatmap") #' vis(ov, "heatmap2") #' vis(ov, "circos") -#' #' @export vis.immunr_ov_matrix <- function(.data, .plot = c("heatmap", "heatmap2", "circos"), ...) { args <- list(...) @@ -334,7 +333,6 @@ vis.immunr_gu_matrix <- function(.data, .plot = c("heatmap", "heatmap2", "circos #' vis_heatmap(ov) #' gu <- geneUsage(immdata$data, "hs.trbj") #' vis_heatmap(gu) -#' #' @export vis_heatmap <- function(.data, .text = TRUE, .scientific = FALSE, .signif.digits = 2, .text.size = 4, .labs = c("Sample", "Sample"), .title = "Overlap", @@ -433,7 +431,6 @@ vis_heatmap <- function(.data, .text = TRUE, .scientific = FALSE, .signif.digits #' data(immdata) #' ov <- repOverlap(immdata$data) #' vis_heatmap2(ov) -#' #' @export vis_heatmap2 <- function(.data, .title = NA, .labs = NA, .color = colorRampPalette(c("#67001f", "#d6604d", "#f7f7f7", "#4393c3", "#053061"))(1024), ...) { args <- list() @@ -610,7 +607,7 @@ vis.immunr_inc_overlap <- function(.data, .target = 1, .grid = FALSE, .ncol = 2, p }) - return(do.call(grid.arrange, c(p_list, list(ncol = .ncol)))) + return(do.call(wrap_plots, c(p_list, list(ncol = .ncol)))) } else { if (data_is_bootstrapped) { sample_names <- colnames(.data[[1]][[1]]) @@ -694,7 +691,6 @@ vis.immunr_inc_overlap <- function(.data, .target = 1, .grid = FALSE, .ncol = 2, #' vis(pr, "freq", .type = "none") #' #' vis(pr, "clonotypes", 1, 2) -#' #' @export vis.immunr_public_repertoire <- function(.data, .plot = c("freq", "clonotypes"), ...) { .plot <- .plot[1] @@ -728,9 +724,8 @@ vis.immunr_public_repertoire <- function(.data, .plot = c("freq", "clonotypes"), #' @examples #' data(immdata) #' immdata$data <- lapply(immdata$data, head, 2000) -#' pr <- pubRep(immdata$data, .verbose=FALSE) +#' pr <- pubRep(immdata$data, .verbose = FALSE) #' pubRepStatistics(pr) %>% vis() -#' #' @export vis.immunr_public_statistics <- function(.data, ...) { upsetr_data <- as.list(.data$Count) @@ -838,6 +833,7 @@ vis_public_frequencies <- function(.data, .by = NA, .meta = NA, #' #' @importFrom grid rectGrob gpar #' @importFrom stats lm +#' @importFrom patchwork wrap_plots plot_annotation #' #' @description Visualise correlation of public clonotype frequencies in pairs of repertoires. #' @@ -867,8 +863,6 @@ vis_public_frequencies <- function(.data, .by = NA, .meta = NA, #' @param .radj.size An integer value, that defines the size of the The text #' for the R adjusted coefficient. #' -#' @param .return.grob If TRUE then returh the gridArrange grob instead of the plot. -#' #' @return #' A ggplot2 object. #' @@ -881,7 +875,7 @@ vis_public_frequencies <- function(.data, .by = NA, .meta = NA, vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, .title = NA, .ncol = 3, .point.size.modif = 1, .cut.axes = TRUE, - .density = TRUE, .lm = TRUE, .radj.size = 3.5, .return.grob = FALSE) { + .density = TRUE, .lm = TRUE, .radj.size = 3.5) { .shared.rep <- .data mat <- public_matrix(.shared.rep) @@ -892,12 +886,11 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, for (j in 1:ncol(mat)) { ps <- c(ps, list(vis_public_clonotypes(.shared.rep, i, j, "", .point.size.modif = .point.size.modif, - .cut.axes = .cut.axes, .density = .density, .lm = .lm, .radj.size = .radj.size, - .return.grob = TRUE + .cut.axes = .cut.axes, .density = .density, .lm = .lm, .radj.size = .radj.size ))) } } - grid.arrange(grobs = ps, ncol = .ncol, top = .title) + do.call(wrap_plots, c(ps, ncol = .ncol)) + plot_annotation(title = .title) } else if (is.na(.x.rep)) { ps <- lapply(1:ncol(mat), function(i) { vis_public_clonotypes(.shared.rep, i, .y.rep, "", @@ -905,7 +898,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, .cut.axes = .cut.axes, .density = .density, .lm = .lm ) }) - do.call(grid.arrange, c(ps, ncol = .ncol, top = .title)) + do.call(wrap_plots, c(ps, ncol = .ncol)) + plot_annotation(title = .title) } else if (is.na(.y.rep)) { ps <- lapply(1:ncol(mat), function(j) { vis_public_clonotypes(.shared.rep, .x.rep, j, "", @@ -913,7 +906,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, .cut.axes = .cut.axes, .density = .density, .lm = .lm ) }) - do.call(grid.arrange, c(ps, ncol = .ncol, top = .title)) + do.call(wrap_plots, c(ps, ncol = .ncol)) + plot_annotation(title = .title) } else { if (!is.character(.x.rep)) { .x.rep <- colnames(mat)[.x.rep] @@ -1028,11 +1021,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, ) + scale_fill_manual(values = colorRampPalette(c(.colourblind_vector()[1], grey_col))(2)) - if (!.return.grob) { - grid.arrange(top_plot, empty, points, right_plot, ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4)) - } else { - arrangeGrob(top_plot, empty, points, right_plot, ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4)) - } + wrap_plots(top_plot, empty, points, right_plot, ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4)) } else { points } @@ -1040,7 +1029,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, } -##### Gene usage & histogram plot, boxplot and treemap ##### +##### Gene usage & histogram plot, boxplot ##### #' Histograms and boxplots (general case / gene usage) @@ -1049,7 +1038,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, #' #' @name vis.immunr_gene_usage #' -#' @description Visualise distributions of genes using heatmaps, treemaps or other plots. +#' @description Visualise distributions of genes using heatmaps or other plots. #' #' @param .data Output from the \link{geneUsage} function. #' @@ -1057,8 +1046,6 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, #' #' - "hist" for histograms using \link{vis_hist}; #' -#' - "tree" for histograms using \link{vis_treemap}; -#' #' - "heatmap" for heatmaps using \link{vis_heatmap}; #' #' - "heatmap2" for heatmaps using \link{vis_heatmap2}; @@ -1071,8 +1058,6 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, #' #' - "box" - passes arguments to \link{vis_box}; #' -#' - "tree" - passes arguments to \link{vis_treemap} and \link{treemap} from the treemap package; -#' #' - "heatmap" - passes arguments to \link{vis_heatmap}; #' #' - "heatmap2" - passes arguments to \link{vis_heatmap2} and \link{heatmap} from the "pheatmap" package; @@ -1080,7 +1065,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, #' - "circos" - passes arguments to \link{vis_circos} and \link{chordDiagram} from the "circlize" package. #' #' @return -#' A ggplot2, pheatmap, circlize or treemap object. +#' A ggplot2 object, pheatmap or circlize object. #' #' @examples #' data(immdata) @@ -1094,7 +1079,7 @@ vis_public_clonotypes <- function(.data, .x.rep = NA, .y.rep = NA, #' @seealso \link{geneUsage} #' #' @export -vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatmap", "heatmap2", "circos"), ...) { +vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "heatmap", "heatmap2", "circos"), ...) { .plot <- .plot[1] if (.plot == "hist") { vis_hist(.data, ...) @@ -1104,8 +1089,6 @@ vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatm .labs = c("Gene", "Usage"), .title = "Gene usage", .subtitle = NULL, ... ) - } else if (.plot == "tree") { - vis_treemap(.data, ...) } else if (.plot == "heatmap") { row.names(.data) <- .data[[1]] .data <- t(as.matrix(.data[2:ncol(.data)])) @@ -1119,7 +1102,7 @@ vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatm .data <- t(as.matrix(.data[2:ncol(.data)])) vis_circos(.data, ...) } else { - stop("Error: Unknown value of the .plot parameter. Please provide one of the following: 'hist', 'box', 'tree'.") + stop("Error: Unknown value of the .plot parameter. Please provide one of the following: 'hist', 'box', 'heatmap', 'heatmap2', 'circos'.") } } @@ -1128,8 +1111,6 @@ vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatm #' #' @concept vis #' -#' @importFrom gridExtra arrangeGrob grid.arrange -#' #' @name vis_hist #' #' @description Visualisation of distributions using ggplot2-based histograms. @@ -1164,8 +1145,6 @@ vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatm #' #' @param .labs A character vector of length two with names for x-axis and y-axis, respectively. #' -#' @param .return.grob If TRUE then returh the gridArrange grob instead of the plot. -#' #' @param .melt If TRUE then apply \link{melt} to the ".data" before plotting. #' In this case ".data" is supposed to be a data frame with the first character column reserved #' for names of genes and other numeric columns reserved to counts or frequencies of genes. @@ -1208,7 +1187,7 @@ vis.immunr_gene_usage <- function(.data, .plot = c("hist", "box", "tree", "heatm #' @export vis_hist <- function(.data, .by = NA, .meta = NA, .title = "Gene usage", .ncol = NA, .points = TRUE, .test = TRUE, .coord.flip = FALSE, - .grid = FALSE, .labs = c("Gene", NA), .return.grob = FALSE, + .grid = FALSE, .labs = c("Gene", NA), .melt = TRUE, .legend = NA, .add.layer = NULL, ...) { res <- .data @@ -1286,27 +1265,24 @@ vis_hist <- function(.data, .by = NA, .meta = NA, .title = "Gene usage", .ncol = .add.layer } - if (.return.grob) { - return(do.call(gridExtra::arrangeGrob, c(ps, ncol = .ncol, top = .title))) - } else { - return(do.call(gridExtra::grid.arrange, c(ps, ncol = .ncol, top = .title))) - } + p <- do.call(wrap_plots, c(ps, ncol = .ncol)) + p <- p + plot_annotation(title = .title) + return(p) + } else { res <- split(res, res$Sample) ps <- list() for (i in 1:length(res)) { ps[[i]] <- vis_hist(res[[i]], .title = names(res)[i], .ncol = NA, .coord.flip = .coord.flip, - .grid = FALSE, .labs = c("Gene", NA), .return.grob = FALSE, + .grid = FALSE, .labs = c("Gene", NA), .melt = FALSE, .legend = .legend, .add.layer = .add.layer, ... ) } - if (.return.grob) { - return(do.call(gridExtra::arrangeGrob, c(ps, ncol = .ncol, top = .title))) - } else { - return(do.call(gridExtra::grid.arrange, c(ps, ncol = .ncol, top = .title))) - } + p <- do.call(wrap_plots, c(ps, ncol = .ncol)) + p <- p + plot_annotation(title = .title) + return(p) } } else { res$Value <- res$Freq @@ -1373,7 +1349,6 @@ vis_hist <- function(.data, .by = NA, .meta = NA, .title = "Gene usage", .ncol = #' #' @examples #' vis_box(data.frame(Sample = sample(c("A", "B", "C"), 100, TRUE), Value = rnorm(100)), .melt = FALSE) -#' #' @export vis_box <- function(.data, .by = NA, .meta = NA, .melt = TRUE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .defgroupby = "Sample", .grouping.var = "Group", @@ -1500,33 +1475,6 @@ vis_box <- function(.data, .by = NA, .meta = NA, .melt = TRUE, rotate_x_text() + theme_cleveland2() } -#' Visualisation of data frames and matrices using treemaps -#' -#' @concept vis -#' -#' @name vis_treemap -#' -#' @importFrom treemap treemap -#' -#' @param .data Input data frame or matrix. -#' -#' @param ... Other arguments passed to \link{treemap}. -#' -#' @return -#' A treemap object. -#' -#' @examples -#' data(immdata) -#' gu <- geneUsage(immdata$data) -#' vis_treemap(gu) -#' -#' @export -vis_treemap <- function(.data, ...) { - melted <- reshape2::melt(.data) - colnames(melted) <- c("name", "group", "value") - treemap::treemap(melted, index = c("group", "name"), vSize = "value", ...) -} - ##### Clustering ##### @@ -1546,11 +1494,10 @@ vis_treemap <- function(.data, ...) { #' @param .plot A character vector of length one or two specifying which plots to visualise. #' If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. #' If both then plot both. -#' @param .return.grob If TRUE then return grob instead of plot. #' @param ... Not used here. #' #' @return -#' Ggplot2 objects inside the gridExtra container. +#' Ggplot2 objects inside the patchwork container. #' #' @seealso \link{vis}, \link{repOverlapAnalysis}, \link{geneUsageAnalysis} #' @@ -1558,9 +1505,8 @@ vis_treemap <- function(.data, ...) { #' data(immdata) #' ov <- repOverlap(immdata$data) #' repOverlapAnalysis(ov, "mds+hclust") %>% vis() -#' #' @export -vis.immunr_hclust <- function(.data, .rect = FALSE, .plot = c("clust", "best"), .return.grob = FALSE, ...) { +vis.immunr_hclust <- function(.data, .rect = FALSE, .plot = c("clust", "best"), ...) { p1 <- NULL if ("clust" %in% .plot) { p1 <- fviz_dend(.data[[1]], main = "Hierarchical clustering", rect = .rect) @@ -1578,11 +1524,8 @@ vis.immunr_hclust <- function(.data, .rect = FALSE, .plot = c("clust", "best"), p1 } } else { - if (.return.grob) { - gridExtra::arrangeGrob(p1, p2, ncol = 2) - } else { - gridExtra::grid.arrange(p1, p2, ncol = 2) - } + p <- p1 + p2 + p } } @@ -1606,11 +1549,10 @@ vis.immunr_hclust <- function(.data, .rect = FALSE, .plot = c("clust", "best"), #' @param .plot A character vector of length one or two specifying which plots to visualise. #' If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. #' If both then plot both. -#' @param .return.grob If TRUE then return grob instead of plot. #' @param ... Not used here. #' #' @return -#' Ggplot2 objects inside the gridExtra container. +#' Ggplot2 objects inside the pathwork container. #' #' @seealso \link{vis}, \link{repOverlapAnalysis}, \link{geneUsageAnalysis} #' @@ -1618,11 +1560,10 @@ vis.immunr_hclust <- function(.data, .rect = FALSE, .plot = c("clust", "best"), #' data(immdata) #' ov <- repOverlap(immdata$data) #' repOverlapAnalysis(ov, "mds+kmeans") %>% vis() -#' #' @export vis.immunr_kmeans <- function(.data, .point = TRUE, .text = TRUE, .ellipse = TRUE, .point.size = 2, .text.size = 10, .plot = c("clust", "best"), - .return.grob = FALSE, ...) { + ...) { p1 <- NULL if ("clust" %in% .plot) { p1 <- fviz_cluster(.data[[1]], @@ -1647,11 +1588,7 @@ vis.immunr_kmeans <- function(.data, .point = TRUE, .text = TRUE, .ellipse = TRU } # p = ifelse(is.null(p1), p2, p1) # doesnt work for some reason } else { - if (.return.grob) { - gridExtra::arrangeGrob(p1, p2, ncol = 2) - } else { - gridExtra::grid.arrange(p1, p2, ncol = 2) - } + p1 + p2 } } @@ -1716,7 +1653,6 @@ vis.immunr_dbscan <- function(.data, .point = TRUE, .text = TRUE, .ellipse = TRU #' data(immdata) #' ov <- repOverlap(immdata$data) #' repOverlapAnalysis(ov, "mds") %>% vis() -#' #' @export vis.immunr_mds <- function(.data, .by = NA, .meta = NA, .point = TRUE, .text = TRUE, .ellipse = TRUE, @@ -1925,7 +1861,6 @@ vis_bar_stacked <- function(.data, .by = NA, .meta = NA, #' data(immdata) #' clp <- repClonality(immdata$data, "clonal.prop") #' vis(clp) -#' #' @export vis.immunr_clonal_prop <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ...) { # ToDo: this and other repClonality and repDiversity functions doesn't work on a single repertoire. Fix it @@ -2101,7 +2036,6 @@ vis.immunr_rare_prop <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.0 #' #' @examples #' vis_bar(data.frame(Sample = c("A", "B", "C"), Value = c(1, 2, 3))) -#' #' @export vis_bar <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .stack = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, .errorbar.width = 0.2, .defgroupby = "Sample", .grouping.var = "Group", @@ -2164,13 +2098,13 @@ vis_bar <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), . if (group_res$is_grouped) { if (.stack) { p <- ggplot() + - geom_bar(aes(x = Group, y = Value.mean, fill = Grouping.var), - stat = "identity", position = "stack", data = .data_proc, col = "black" + geom_col(aes(x = Group, y = Value.mean, fill = Grouping.var), + position = "stack", data = .data_proc, col = "black" ) } else { - p <- ggplot(aes(x = Grouping.var, y = Value, fill = Group, color = Group, group = Group), data = .data) + - geom_bar(aes(x = Grouping.var, y = Value.mean), - stat = "identity", position = position_name, data = .data_proc, col = "black" + p <- ggplot(aes(x = Grouping.var, y = Value, color = Group, fill = Group, group = Group), data = .data) + + geom_col(aes(x = Grouping.var, y = Value.mean), + position = position_name, data = .data_proc, col = "black" ) } @@ -2187,7 +2121,7 @@ vis_bar <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), . if (!.errorbars.off) { p <- p + geom_errorbar(aes(x = Grouping.var, y = Value.mean, ymin = Value.min, ymax = Value.max, color = Group), - color = "black", data = .data_proc, width = .errorbar.width, position = position_dodge(.9) + color = "black", data = .data_proc, width = .errorbar.width, position = position_dodge(.9) ) } @@ -2322,7 +2256,6 @@ vis_bar <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), . #' data(immdata) #' dv <- repDiversity(immdata$data, "chao1") #' vis(dv) -#' #' @export vis.immunr_chao1 <- function(.data, .by = NA, .meta = NA, .errorbars = c(0.025, 0.975), .errorbars.off = FALSE, .points = TRUE, .test = TRUE, .signif.label.size = 3.5, ...) { .data <- data.frame(Sample = row.names(.data), Value = .data[, 1]) @@ -2698,14 +2631,13 @@ vis.immunr_exp_clones <- function(.data, .by = NA, .meta = NA, #' #' @examples #' # Load necessary data and package. -#' library(gridExtra) #' data(immdata) #' # Get 5-mers. #' imm.km <- getKmers(immdata$data[[1]], 5) #' # Plots for kmer proportions in each data frame in immdata. #' p1 <- vis(imm.km, .position = "stack") #' p2 <- vis(imm.km, .position = "fill") -#' grid.arrange(p1, p2) +#' p1 + p2 #' @export vis.immunr_kmer_table <- function(.data, .head = 100, .position = c("stack", "dodge", "fill"), .log = FALSE, ...) { .position <- switch(substr(.position[1], 1, 1), s = "stack", d = "dodge", f = "fill") @@ -2975,6 +2907,7 @@ vis.immunr_dynamics <- function(.data, .plot = c("smooth", "area", "line"), .ord melted$Sample <- factor(melted$Sample, levels = ordered_sample_names, ordered = TRUE) } + melted$Count <- melted$Count + min(melted$Count[melted$Count != 0]) / 1000 p <- ggplot(melted, aes( x = Sample, fill = Clonotype, stratum = Clonotype, alluvium = Clonotype, y = Count, label = Clonotype diff --git a/README.md b/README.md index 4a3ca8fd..09d5ee20 100644 --- a/README.md +++ b/README.md @@ -1,30 +1,40 @@ [![Follow](https://img.shields.io/twitter/follow/immunomind.svg?style=social)](https://twitter.com/intent/follow?screen_name=immunomind) -[![CRAN](http://www.r-pkg.org/badges/version/immunarch?style=flat-square)](https://cran.r-project.org/package=immunarch) +[![CRAN](http://www.r-pkg.org/badges/version-ago/immunarch?style=flat-square)](https://cran.r-project.org/package=immunarch) [![Downloads_all](http://cranlogs.r-pkg.org/badges/grand-total/immunarch)](http://www.r-pkg.org/pkg/immunarch) [![Downloads_week](http://cranlogs.r-pkg.org/badges/last-week/immunarch)](http://www.r-pkg.org/pkg/immunarch) [![Issues](https://img.shields.io/github/issues/immunomind/immunarch?style=flat-square)](http://github.com/immunomind/immunarch/issues) [![CI](https://gitlab.com/immunomind/immunarch/badges/master/pipeline.svg?style=flat-square)](https://gitlab.com/immunomind/immunarch/-/jobs) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3367200.svg)](https://doi.org/10.5281/zenodo.3367200) +![Visitors](https://visitor-badge.glitch.me/badge?page_id=immunomind.immunarch) +[![Downloads_all](http://cranlogs.r-pkg.org/badges/grand-total/tcR)](http://www.r-pkg.org/pkg/tcR) +[![Downloads_week](http://cranlogs.r-pkg.org/badges/last-week/tcR)](http://www.r-pkg.org/pkg/tcR) -# `immunarch` --- An R Package for Painless Bioinformatics Analysis of T-cell and B-cell Immune Repertoire Data - +# `immunarch` --- Fast and Seamless Exploration of Single-cell and Bulk T-cell/Antibody Immune Repertoires in R ## Why `immunarch`? - **Work with any type of data:** single-cell, bulk, data tables, databases --- you name it. -- **Community at the heart:** thrive in the community of almost 30,000 researchers and medical scientists worldwide, including researchers from **Pfizer, Novartis, Regeneron, Stanford, UCSF** and **MIT**. +- **Community at the heart:** ask questions, share knowledge and thrive in the community of almost 30,000 researchers and medical scientists worldwide. **Pfizer, Novartis, Regeneron, Stanford, UCSF** and **MIT** trust us. - **One plot --- one line:** write a [whole PhD thesis in 8 lines of code](https://twitter.com/Nusob88/status/1127601201112129536) or reproduce almost any publication in 5-10 lines of `immunarch` code. - **Be on the bleeding edge of science:** we regularly update `immunarch` with the latest methods. [Let us know what you need!](#help-the-community) - **Automatic format detection and parsing** for all popular immunosequencing formats: from **MiXCR** and **ImmunoSEQ** to **10XGenomics** and **ArcherDX**. -And the **most important**: `immunarch` is not just a tool --- it is an **ecosystem**. + +### Lightning-fast Start +```r +install.packages("immunarch") # Install the package +library(immunarch); data(immdata) # Load the package and the test dataset +repOverlap(immdata$data) %>% vis() # Compute and visualise the most important statistics: +geneUsage(immdata$data[[1]]) %>% vis() # public clonotypes, gene usage, sample diversity +repDiversity(immdata$data) %>% vis(.by = "Status", .meta = immdata$meta) # Group samples +``` -## From Berkeley with devotion +### From Berkeley with devotion -`immunarch` is brought to you by [ImmunoMind](https://immunomind.io) --- a [UC Berkeley SkyDeck](https://www.forbes.com/sites/avivalegatt/2019/01/07/launch-your-startup-at-these-five-college-incubators/) startup. +`immunarch` is brought to you by [ImmunoMind](https://immunomind.io) --- a [UC Berkeley SkyDeck](https://www.forbes.com/sites/avivalegatt/2019/01/07/launch-your-startup-at-these-five-college-incubators/) startup. ImmunoMind Data Science tools for single-cell and immunomics [exploration](https://immunarch.com) and [biomarker discovery](https://immunomind.io) are trusted by researchers from top pharma companies and universities, including 10X Genomics, Pfizer, Regeneron, UCSF, MIT, Stanford, John Hopkins School of Medicine and Vanderbilt University. -We have been helping researchers to extract insights from sequencing data of T-cell and antibody repertoires since the inception of the RepSeq domain. Our bioinformatics tools are trusted by top universities including Stanford, UCSF, MIT, King's College London and big pharma companies including Pfizer and Novartis. +[![Follow](https://img.shields.io/twitter/follow/immunomind.svg?style=social)](https://twitter.com/intent/follow?screen_name=immunomind) ## Stay connected! @@ -44,8 +54,6 @@ We have been helping researchers to extract insights from sequencing data of T-c -Feel free to follow us on [Twitter](https://twitter.com/immunomind) as well. - --- ## Table of Contents @@ -61,7 +69,7 @@ Feel free to follow us on [Twitter](https://twitter.com/immunomind) as well. ## Introduction -`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. Follow us on [Twitter](https://twitter.com/immunomind) for news and updates. +`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. ## Contact @@ -96,7 +104,7 @@ Since releasing on CRAN is limited to one release per one-two months, you can in ```r install.packages("devtools") # skip this if you already installed devtools -devtools::install_github("immunomind/immunarch", ref="develop") +devtools::install_github("immunomind/immunarch", ref="dev") ``` You can find the list of releases of `immunarch` here: https://github.com/immunomind/immunarch/releases diff --git a/_pkgdown.yml b/_pkgdown.yml index a0b84894..116be35a 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -18,7 +18,8 @@ articles: contents: - '`v1_introduction`' - '`v2_data`' - - '`v3_basic_analysis`' + - '`web_only/v21_singlecell`' + - '`web_only/v3_basic_analysis`' - '`web_only/v4_overlap`' - '`web_only/v5_gene_usage`' - '`web_only/v6_diversity`' @@ -29,16 +30,14 @@ articles: navbar: structure: - left: - - intro - - articles - - reference - - news - right: - - im_link - - twitter - - github + left: [covid19, intro, articles, reference] + right: [im_link, twitter, github] components: + home: ~ + news: ~ + covid19: + text: "COVID-19" + href: https://github.com/immunomind/covid19 intro: text: "Get started" href: articles/v1_introduction.html @@ -54,7 +53,9 @@ navbar: - text: Working with data in immunarch href: articles/v2_data.html - text: Basic analysis and clonality - href: articles/v3_basic_analysis.html + href: articles/web_only/v3_basic_analysis.html + - text: Single-cell and paired chain data + href: articles/web_only/v21_singlecell.html - text: '---' @@ -81,9 +82,6 @@ navbar: href: articles/web_only/v8_tracking.html - text: Kmer and sequence motif analysis and visualisation href: articles/web_only/v9_kmers.html - news: - text: Changelog - href: https://github.com/immunomind/immunarch/releases twitter: icon: fa-lg fa-twitter href: http://twitter.com/immunomind diff --git a/data/immdata.rda b/data/immdata.rda index c127c5f4..67311927 100644 Binary files a/data/immdata.rda and b/data/immdata.rda differ diff --git a/data/scdata.rda b/data/scdata.rda new file mode 100644 index 00000000..76ab27d8 Binary files /dev/null and b/data/scdata.rda differ diff --git a/inst/extdata/sc/flu.csv.gz b/inst/extdata/sc/flu.csv.gz new file mode 100644 index 00000000..0f32f6db Binary files /dev/null and b/inst/extdata/sc/flu.csv.gz differ diff --git a/man/coding.Rd b/man/coding.Rd index 9ec9c628..0283251d 100644 --- a/man/coding.Rd +++ b/man/coding.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/tools.R +% Please edit documentation in R/preprocessing.R \name{coding} \alias{coding} \alias{noncoding} diff --git a/man/dbAnnotate.Rd b/man/dbAnnotate.Rd index a0cad42f..4e4e2979 100644 --- a/man/dbAnnotate.Rd +++ b/man/dbAnnotate.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/search.R +% Please edit documentation in R/annotation.R \name{dbAnnotate} \alias{dbAnnotate} \title{Annotate clonotypes in immune repertoires using clonotype databases such as VDJDB and MCPAS} diff --git a/man/dbLoad.Rd b/man/dbLoad.Rd index cde61bc7..ba729075 100644 --- a/man/dbLoad.Rd +++ b/man/dbLoad.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/search.R +% Please edit documentation in R/annotation.R \name{dbLoad} \alias{dbLoad} \title{Load clonotype databases such as VDJDB and McPAS into the R workspace} diff --git a/man/filter_barcodes.Rd b/man/filter_barcodes.Rd deleted file mode 100644 index f5e93d7a..00000000 --- a/man/filter_barcodes.Rd +++ /dev/null @@ -1,45 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/processing.R -\name{filter_barcodes} -\alias{filter_barcodes} -\title{Filter clonotypes using barcodes from single-cell metadata} -\usage{ -filter_barcodes(.data, .barcodes) -} -\arguments{ -\item{.data}{The data to be processed. Can be \link{data.frame}, -\link{data.table}, or a list of these objects. - -Every object must have columns in the immunarch compatible format. -\link{immunarch_data_format} - -Competent users may provide advanced data representations: -DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list -of these objects. They are supported with the same limitations as basic objects. - -Note: each connection must represent a separate repertoire.} - -\item{.barcodes}{Character vector. Vector of barcodes to use for -filtering.} -} -\value{ -An immune repertoire or a list of immune repertoires with clonotypes which have -one or more barcodes in the input barcode vector and with clonotype abundances corrected. -} -\description{ -Given the input data and a vector of barcodes, -the function returns the input with cells which has -at least one of the given barcode. Columns with clonotype counts -and proportions are changed accordingly to the filtered barcodes. -} -\examples{ -data(immdata) -# Create a fake single-cell data -df <- immdata$data[[1]] -df$Barcode <- "AAAAACCCCC" -df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" -barcodes <- "AAAAACCCCC" -df <- filter_barcodes(df, barcodes) -nrow(df) -} -\concept{single_cell} diff --git a/man/immdata.Rd b/man/immdata.Rd index 2ff51131..b9e2d865 100644 --- a/man/immdata.Rd +++ b/man/immdata.Rd @@ -1,12 +1,12 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/data.R +% Please edit documentation in R/data_docs.R \docType{data} \name{immdata} \alias{immdata} -\title{Immune repertoire dataset} +\title{Single chain immune repertoire dataset} \format{ -A list of two elements. First element "data" is a list with data frames with clonotype tables. -Second element "meta" is a metadata file. +A list of two elements. First element ("data") is a list with data frames with clonotype tables. +Second element ("meta") is a metadata table. \describe{ \item{data}{List of immune repertoire data frames.} \item{meta}{Metadata} @@ -17,7 +17,7 @@ Second element "meta" is a metadata file. immdata } \description{ -A testing dataset containing clonotypes. +A dataset with single chain TCR data for testing and examplatory purposes. } \concept{data} \keyword{datasets} diff --git a/man/pubRep.Rd b/man/pubRep.Rd index 71b0e1f3..6cc6fe75 100644 --- a/man/pubRep.Rd +++ b/man/pubRep.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/shared.R +% Please edit documentation in R/public.R \name{pubRep} \alias{pubRep} \alias{publicRepertoire} diff --git a/man/pubRepApply.Rd b/man/pubRepApply.Rd index 93f4a546..b9bc2763 100644 --- a/man/pubRepApply.Rd +++ b/man/pubRepApply.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/shared.R +% Please edit documentation in R/public.R \name{pubRepApply} \alias{pubRepApply} \alias{publicRepertoireApply} diff --git a/man/pubRepFilter.Rd b/man/pubRepFilter.Rd index 94a6751a..8308704d 100644 --- a/man/pubRepFilter.Rd +++ b/man/pubRepFilter.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/shared.R +% Please edit documentation in R/public.R \name{pubRepFilter} \alias{pubRepFilter} \alias{publicRepertoireFilter} diff --git a/man/pubRepStatistics.Rd b/man/pubRepStatistics.Rd index 224242fd..b9f1da1a 100644 --- a/man/pubRepStatistics.Rd +++ b/man/pubRepStatistics.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/shared.R +% Please edit documentation in R/public.R \name{pubRepStatistics} \alias{pubRepStatistics} \title{Statistics of number of public clonotypes for each possible combinations of repertoires} diff --git a/man/public_matrix.Rd b/man/public_matrix.Rd index af0bc85f..01c3d31c 100644 --- a/man/public_matrix.Rd +++ b/man/public_matrix.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/shared.R +% Please edit documentation in R/public.R \name{public_matrix} \alias{public_matrix} \title{Get a matrix with public clonotype frequencies} diff --git a/man/repLoad.Rd b/man/repLoad.Rd index b75ded9f..1bbc3ce3 100644 --- a/man/repLoad.Rd +++ b/man/repLoad.Rd @@ -4,7 +4,7 @@ \alias{repLoad} \title{Load immune repertoire files into the R workspace} \usage{ -repLoad(.path, .format = NA) +repLoad(.path, .format = NA, .mode = "paired", .coding = TRUE) } \arguments{ \item{.path}{A character string specifying the path to the input data. @@ -26,9 +26,17 @@ If input data has multiple chains or cell types stored in the same file R data frames with only one type of chain and cell presented. The metadata file will have additional columns specifying cell and chain types for different samples.} -\item{.format}{A character string specifying what format to use. See "Details" for more information on supported formats. +\item{.format}{A character string specifying what format to use. Do NOT use it. See "Details" for more information on supported formats. Leave NA (which is default) if you want `immunarch` to detect formats automatically.} + +\item{.mode}{Either "single" for single chain data or "paired" for paired chain data. + +Currently "single" works for every format, and "paired" works only for 10X Genomics data. + +By default, 10X Genomics data will be loaded as paired chain data, and other files will be loaded as single chain data.} + +\item{.coding}{A logical value. Pass TRUE to get coding-only clonotypes (by defaul). Pass FALSE to get all clonotypes.} } \value{ A list with two named elements: @@ -59,9 +67,10 @@ your folder. Example: } \code{repLoad} has the ".format" argument that sets the format for input repertoire files. -Do not pass it if you want immunarch to detect the formats and parse files automatically! -In case you want to force the package to parse the data in a specific format, -you can choose one of the several options for the argument: +Immunarch detects the file format automatically, and the argument is left only for the compatability +purposes. It will be soon removed. Do not pass it or your code will stop working! + +Currently, Immunarch support the following formats: - "immunoseq" - ImmunoSEQ of any version. http://www.adaptivebiotech.com/immunoseq diff --git a/man/repSample.Rd b/man/repSample.Rd index f9f228d3..75a98e29 100644 --- a/man/repSample.Rd +++ b/man/repSample.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/processing.R +% Please edit documentation in R/sampling.R \name{repSample} \alias{repSample} \title{Downsampling and resampling of immune repertoires} diff --git a/man/repSave.Rd b/man/repSave.Rd index 2a47c6cd..d5d9dadc 100644 --- a/man/repSave.Rd +++ b/man/repSave.Rd @@ -37,9 +37,8 @@ dirpath <- tempdir() repSave(immdata, dirpath) # Load it and check if it is the same new_immdata <- repLoad(dirpath) -sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) -sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) -sum(immdata$meta != new_immdata$meta, na.rm = TRUE) - +# sum(immdata$data[[1]] != new_immdata$data[[1]], na.rm = TRUE) +# sum(immdata$data[[2]] != new_immdata$data[[2]], na.rm = TRUE) +# sum(immdata$meta != new_immdata$meta, na.rm = TRUE) } \concept{io} diff --git a/man/scdata.Rd b/man/scdata.Rd new file mode 100644 index 00000000..27b95712 --- /dev/null +++ b/man/scdata.Rd @@ -0,0 +1,26 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/data_docs.R +\docType{data} +\name{scdata} +\alias{scdata} +\title{Paired chain immune repertoire dataset} +\format{ +A list of four elements. +"data" is a list with data frames with clonotype tables. +"meta" is a metadata table. +"bc_patients" is a list of barcodes corresponding to specific patients. +"bc_clusters" is a list of barcodes corresponding to specific cell clusters. +\describe{ + \item{data}{List of immune repertoire data frames.} + \item{meta}{Metadata} + ... +} +} +\usage{ +scdata +} +\description{ +A dataset with paired chain IG data for testing and examplatory purposes. +} +\concept{data} +\keyword{datasets} diff --git a/man/select_barcodes.Rd b/man/select_barcodes.Rd new file mode 100644 index 00000000..95f884dd --- /dev/null +++ b/man/select_barcodes.Rd @@ -0,0 +1,55 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/singlecell.R +\name{select_barcodes} +\alias{select_barcodes} +\title{Select specific clonotypes using barcodes from single-cell metadata} +\usage{ +select_barcodes(.data, .barcodes, .force.list = FALSE) +} +\arguments{ +\item{.data}{The data to be processed. Can be \link{data.frame}, +\link{data.table}, or a list of these objects. + +Every object must have columns in the immunarch compatible format. +\link{immunarch_data_format} + +Competent users may provide advanced data representations: +DBI database connections, Apache Spark DataFrame from \link{copy_to} or a list +of these objects. They are supported with the same limitations as basic objects. + +Note: each connection must represent a separate repertoire.} + +\item{.barcodes}{Either a character vector with barcodes or a named character/factor vector with +barcodes as names and cluster IDs a vector elements. The output of Seurat's \code{Idents} function works.} + +\item{.force.list}{Logical. If TRUE then always return a list, even if the result is one data frame.} +} +\value{ +An immune repertoire (if ".barcodes" is a barcode vector) or a list of immune repertoires +(if ".barcodes" is named vector or an output from Seurat::Idents()). Each element is an immune repertoire +with clonotype barcodes corresponding to the input barcodes. The output list's names are cluster names +in the ".barcode" argument (Seurat::Idents() case only). +} +\description{ +Subset the input immune repertoire by barcodes. Pass a vector of +barcodes to subset or a vector cluster IDs and corresponding barcodes to +get a list of immune repertoires corresponding to cluster IDs. +Columns with clonotype counts +and proportions are changed accordingly to the filtered barcodes. +} +\examples{ +\dontrun{ +data(immdata) +# Create a fake single-cell data +df <- immdata$data[[1]] +df$Barcode <- "AAAAACCCCC" +df$Barcode[51:nrow(df)] <- "GGGGGCCCCC" +barcodes <- "AAAAACCCCC" +df <- select_barcodes(df, barcodes) +nrow(df) +} +} +\seealso{ +\link{select_clusters} +} +\concept{single_cell} diff --git a/man/select_clusters.Rd b/man/select_clusters.Rd new file mode 100644 index 00000000..27feb878 --- /dev/null +++ b/man/select_clusters.Rd @@ -0,0 +1,41 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/singlecell.R +\name{select_clusters} +\alias{select_clusters} +\title{Split the immune repertoire data to clusters from single-cell barcodes} +\usage{ +select_clusters(.data, .clusters, .field = "Cluster") +} +\arguments{ +\item{.data}{List of two elements "data" and "meta", with "data" being a list of +immune repertoires, and "meta" being a metadata table.} + +\item{.clusters}{Factor vector with barcodes as vector names and cluster IDs as vector elements. +The output of the Seurat \code{Idents} function works.} + +\item{.field}{A string specifying the name of the field in the input metadata. New immune +repertoire subsets will have cluster IDs in this field.} +} +\value{ +A list with two elements "data" and "meta" with updated immune repertoire tables and +metadata. +} +\description{ +Given the vector of barcodes from Seurat, split the input repertoires +to separate subsets following the barcodes' assigned IDs. Useful when you want to +split immune repertoires by patients or clusters. +} +\examples{ +\dontrun{ +library(Seurat) +Idents(pbmc_small) +new_cluster_ids <- c("A", "B", "C") +new_cluster_ids <- levels(pbmc_small) +new_cluster_ids +pbmc_small <- RenameIdents(pbmc_small, new_cluster_ids) +} +} +\seealso{ +\link{select_barcodes} +} +\concept{single_cell} diff --git a/man/top.Rd b/man/top.Rd index 89d88d2e..ec7f63d0 100644 --- a/man/top.Rd +++ b/man/top.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/tools.R +% Please edit documentation in R/preprocessing.R \name{top} \alias{top} \title{Get the N most abundant clonotypes} diff --git a/man/vis.Rd b/man/vis.Rd index 7ce0d4ba..134a95ab 100644 --- a/man/vis.Rd +++ b/man/vis.Rd @@ -43,7 +43,7 @@ Overlaps and public clonotypes: Gene usage: -- Gene usage statistics (from \link{geneUsage}) using bar plots, box plots, treemaps - see \link{vis.immunr_gene_usage}; +- Gene usage statistics (from \link{geneUsage}) using bar plots, box plots - see \link{vis.immunr_gene_usage}; - Gene usage distances (from \link{geneUsageAnalysis}) using heatmaps, circos plots, polar area plots - see \link{vis.immunr_ov_matrix}; diff --git a/man/vis.immunr_chao1.Rd b/man/vis.immunr_chao1.Rd index fd5c5eca..acb0c94a 100644 --- a/man/vis.immunr_chao1.Rd +++ b/man/vis.immunr_chao1.Rd @@ -72,7 +72,6 @@ You can execute the command \code{?p.adjust} in the R console to see more. data(immdata) dv <- repDiversity(immdata$data, "chao1") vis(dv) - } \seealso{ \link{repDiversity} \link{vis} diff --git a/man/vis.immunr_clonal_prop.Rd b/man/vis.immunr_clonal_prop.Rd index a44badea..e098e896 100644 --- a/man/vis.immunr_clonal_prop.Rd +++ b/man/vis.immunr_clonal_prop.Rd @@ -69,7 +69,6 @@ You can execute the command \code{?p.adjust} in the R console to see more. data(immdata) clp <- repClonality(immdata$data, "clonal.prop") vis(clp) - } \seealso{ \link{repClonality} \link{vis} diff --git a/man/vis.immunr_gene_usage.Rd b/man/vis.immunr_gene_usage.Rd index f7d92577..939fc26c 100644 --- a/man/vis.immunr_gene_usage.Rd +++ b/man/vis.immunr_gene_usage.Rd @@ -4,11 +4,7 @@ \alias{vis.immunr_gene_usage} \title{Histograms and boxplots (general case / gene usage)} \usage{ -\method{vis}{immunr_gene_usage}( - .data, - .plot = c("hist", "box", "tree", "heatmap", "heatmap2", "circos"), - ... -) +\method{vis}{immunr_gene_usage}(.data, .plot = c("hist", "box", "heatmap", "heatmap2", "circos"), ...) } \arguments{ \item{.data}{Output from the \link{geneUsage} function.} @@ -17,8 +13,6 @@ - "hist" for histograms using \link{vis_hist}; -- "tree" for histograms using \link{vis_treemap}; - - "heatmap" for heatmaps using \link{vis_heatmap}; - "heatmap2" for heatmaps using \link{vis_heatmap2}; @@ -31,8 +25,6 @@ - "box" - passes arguments to \link{vis_box}; -- "tree" - passes arguments to \link{vis_treemap} and \link{treemap} from the treemap package; - - "heatmap" - passes arguments to \link{vis_heatmap}; - "heatmap2" - passes arguments to \link{vis_heatmap2} and \link{heatmap} from the "pheatmap" package; @@ -40,10 +32,10 @@ - "circos" - passes arguments to \link{vis_circos} and \link{chordDiagram} from the "circlize" package.} } \value{ -A ggplot2, pheatmap, circlize or treemap object. +A ggplot2 object, pheatmap or circlize object. } \description{ -Visualise distributions of genes using heatmaps, treemaps or other plots. +Visualise distributions of genes using heatmaps or other plots. } \examples{ data(immdata) diff --git a/man/vis.immunr_hclust.Rd b/man/vis.immunr_hclust.Rd index ddac4397..93bb2a6f 100644 --- a/man/vis.immunr_hclust.Rd +++ b/man/vis.immunr_hclust.Rd @@ -4,13 +4,7 @@ \alias{vis.immunr_hclust} \title{Visualisation of hierarchical clustering} \usage{ -\method{vis}{immunr_hclust}( - .data, - .rect = FALSE, - .plot = c("clust", "best"), - .return.grob = FALSE, - ... -) +\method{vis}{immunr_hclust}(.data, .rect = FALSE, .plot = c("clust", "best"), ...) } \arguments{ \item{.data}{Clustering results from \link{repOverlapAnalysis} or \link{geneUsageAnalysis}.} @@ -21,12 +15,10 @@ If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. If both then plot both.} -\item{.return.grob}{If TRUE then return grob instead of plot.} - \item{...}{Not used here.} } \value{ -Ggplot2 objects inside the gridExtra container. +Ggplot2 objects inside the patchwork container. } \description{ Visualisation of the results of hierarchical clustering. @@ -36,7 +28,6 @@ For other clustering visualisations see \link{vis.immunr_kmeans}. data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+hclust") \%>\% vis() - } \seealso{ \link{vis}, \link{repOverlapAnalysis}, \link{geneUsageAnalysis} diff --git a/man/vis.immunr_kmeans.Rd b/man/vis.immunr_kmeans.Rd index 1911acb8..736c133e 100644 --- a/man/vis.immunr_kmeans.Rd +++ b/man/vis.immunr_kmeans.Rd @@ -13,7 +13,6 @@ .point.size = 2, .text.size = 10, .plot = c("clust", "best"), - .return.grob = FALSE, ... ) } @@ -34,12 +33,10 @@ If "clust" then plot only the clustering. If "best" then plot the number of optimal clusters. If both then plot both.} -\item{.return.grob}{If TRUE then return grob instead of plot.} - \item{...}{Not used here.} } \value{ -Ggplot2 objects inside the gridExtra container. +Ggplot2 objects inside the pathwork container. } \description{ Visualisation of the results of K-means and DBSCAN clustering. @@ -49,7 +46,6 @@ For hierarhical clustering visualisations see \link{vis.immunr_hclust}. data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds+kmeans") \%>\% vis() - } \seealso{ \link{vis}, \link{repOverlapAnalysis}, \link{geneUsageAnalysis} diff --git a/man/vis.immunr_kmer_table.Rd b/man/vis.immunr_kmer_table.Rd index dee28d2c..7b00ea7d 100644 --- a/man/vis.immunr_kmer_table.Rd +++ b/man/vis.immunr_kmer_table.Rd @@ -31,14 +31,13 @@ Plot a distribution (bar plot) of the most frequent kmers in a data. } \examples{ # Load necessary data and package. -library(gridExtra) data(immdata) # Get 5-mers. imm.km <- getKmers(immdata$data[[1]], 5) # Plots for kmer proportions in each data frame in immdata. p1 <- vis(imm.km, .position = "stack") p2 <- vis(imm.km, .position = "fill") -grid.arrange(p1, p2) +p1 + p2 } \seealso{ \code{get.kmers} diff --git a/man/vis.immunr_mds.Rd b/man/vis.immunr_mds.Rd index 31345c8b..9528c305 100644 --- a/man/vis.immunr_mds.Rd +++ b/man/vis.immunr_mds.Rd @@ -66,6 +66,5 @@ Other visualisation methods: data(immdata) ov <- repOverlap(immdata$data) repOverlapAnalysis(ov, "mds") \%>\% vis() - } \concept{post_analysis} diff --git a/man/vis.immunr_ov_matrix.Rd b/man/vis.immunr_ov_matrix.Rd index cb35ed75..8b789428 100644 --- a/man/vis.immunr_ov_matrix.Rd +++ b/man/vis.immunr_ov_matrix.Rd @@ -40,6 +40,5 @@ vis(ov) vis(ov, "heatmap") vis(ov, "heatmap2") vis(ov, "circos") - } \concept{overlap} diff --git a/man/vis.immunr_public_repertoire.Rd b/man/vis.immunr_public_repertoire.Rd index 88a63777..b0872511 100644 --- a/man/vis.immunr_public_repertoire.Rd +++ b/man/vis.immunr_public_repertoire.Rd @@ -34,6 +34,5 @@ vis(pr, "freq") vis(pr, "freq", .type = "none") vis(pr, "clonotypes", 1, 2) - } \concept{pubrep} diff --git a/man/vis.immunr_public_statistics.Rd b/man/vis.immunr_public_statistics.Rd index 3deec0bf..3ceb35ed 100644 --- a/man/vis.immunr_public_statistics.Rd +++ b/man/vis.immunr_public_statistics.Rd @@ -20,8 +20,7 @@ Visualise public clonotype frequencies. \examples{ data(immdata) immdata$data <- lapply(immdata$data, head, 2000) -pr <- pubRep(immdata$data, .verbose=FALSE) +pr <- pubRep(immdata$data, .verbose = FALSE) pubRepStatistics(pr) \%>\% vis() - } \concept{pubrep} diff --git a/man/vis_bar.Rd b/man/vis_bar.Rd index 33daa8a9..a36ebc41 100644 --- a/man/vis_bar.Rd +++ b/man/vis_bar.Rd @@ -84,6 +84,5 @@ Bar plots } \examples{ vis_bar(data.frame(Sample = c("A", "B", "C"), Value = c(1, 2, 3))) - } \concept{vis} diff --git a/man/vis_box.Rd b/man/vis_box.Rd index d3fb43ff..fdaa8b14 100644 --- a/man/vis_box.Rd +++ b/man/vis_box.Rd @@ -73,7 +73,6 @@ Visualisation of distributions using ggplot2-based boxplots. } \examples{ vis_box(data.frame(Sample = sample(c("A", "B", "C"), 100, TRUE), Value = rnorm(100)), .melt = FALSE) - } \seealso{ \link{vis.immunr_gene_usage}, \link{geneUsage} diff --git a/man/vis_heatmap.Rd b/man/vis_heatmap.Rd index d9b5e3fa..4aff77d9 100644 --- a/man/vis_heatmap.Rd +++ b/man/vis_heatmap.Rd @@ -63,7 +63,6 @@ ov <- repOverlap(immdata$data) vis_heatmap(ov) gu <- geneUsage(immdata$data, "hs.trbj") vis_heatmap(gu) - } \seealso{ \link{vis}, \link{repOverlap}. diff --git a/man/vis_heatmap2.Rd b/man/vis_heatmap2.Rd index 51107f38..595dcbbf 100644 --- a/man/vis_heatmap2.Rd +++ b/man/vis_heatmap2.Rd @@ -36,7 +36,6 @@ package with minimum amount of arguments. data(immdata) ov <- repOverlap(immdata$data) vis_heatmap2(ov) - } \seealso{ \link{vis}, \link{repOverlap} diff --git a/man/vis_hist.Rd b/man/vis_hist.Rd index 20654e9a..b15cb753 100644 --- a/man/vis_hist.Rd +++ b/man/vis_hist.Rd @@ -15,7 +15,6 @@ vis_hist( .coord.flip = FALSE, .grid = FALSE, .labs = c("Gene", NA), - .return.grob = FALSE, .melt = TRUE, .legend = NA, .add.layer = NULL, @@ -53,8 +52,6 @@ to automatically detect the optimal number of columns.} \item{.labs}{A character vector of length two with names for x-axis and y-axis, respectively.} -\item{.return.grob}{If TRUE then returh the gridArrange grob instead of the plot.} - \item{.melt}{If TRUE then apply \link{melt} to the ".data" before plotting. In this case ".data" is supposed to be a data frame with the first character column reserved for names of genes and other numeric columns reserved to counts or frequencies of genes. diff --git a/man/vis_public_clonotypes.Rd b/man/vis_public_clonotypes.Rd index 70584c36..1b0e1a4d 100644 --- a/man/vis_public_clonotypes.Rd +++ b/man/vis_public_clonotypes.Rd @@ -14,8 +14,7 @@ vis_public_clonotypes( .cut.axes = TRUE, .density = TRUE, .lm = TRUE, - .radj.size = 3.5, - .return.grob = FALSE + .radj.size = 3.5 ) } \arguments{ @@ -44,8 +43,6 @@ that shows how similar samples are in terms of shared clonotypes.} \item{.radj.size}{An integer value, that defines the size of the The text for the R adjusted coefficient.} - -\item{.return.grob}{If TRUE then returh the gridArrange grob instead of the plot.} } \value{ A ggplot2 object. diff --git a/man/vis_treemap.Rd b/man/vis_treemap.Rd deleted file mode 100644 index bf54a282..00000000 --- a/man/vis_treemap.Rd +++ /dev/null @@ -1,26 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/vis.R -\name{vis_treemap} -\alias{vis_treemap} -\title{Visualisation of data frames and matrices using treemaps} -\usage{ -vis_treemap(.data, ...) -} -\arguments{ -\item{.data}{Input data frame or matrix.} - -\item{...}{Other arguments passed to \link{treemap}.} -} -\value{ -A treemap object. -} -\description{ -Visualisation of data frames and matrices using treemaps -} -\examples{ -data(immdata) -gu <- geneUsage(immdata$data) -vis_treemap(gu) - -} -\concept{vis} diff --git a/vignettes/v1_introduction.Rmd b/vignettes/v1_introduction.Rmd index e54380cf..016feff4 100644 --- a/vignettes/v1_introduction.Rmd +++ b/vignettes/v1_introduction.Rmd @@ -1,6 +1,6 @@ --- title: "Introduction to `immunarch`" -author: 'ImmunoMind' +author: 'ImmunoMind' date: "support@immunomind.io" output: html_document: @@ -23,94 +23,79 @@ output: # Introduction -`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. Follow us on [Twitter](https://twitter.com/immunomind) for news and updates. +`immunarch` is an R package designed to analyse T-cell receptor (TCR) and B-cell receptor (BCR) repertoires, aimed at medical scientists and bioinformaticians. The mission of `immunarch` is to make immune sequencing data analysis as effortless as possible and help you focus on research instead of coding. Follow us on [Twitter](http://twitter.com/immunomind) for news and updates. ## Installation -In order to install `immunarch` execute the following R command: + +### Latest release on CRAN +In order to install `immunarch` execute the following command: ```r -install.packages("devtools") # skip this if you already installed devtools -devtools::install_url("https://github.com/immunomind/immunarch/raw/master/immunarch.tar.gz") +install.packages("immunarch") ``` -That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble with installation, take a look at the [Installation Troubleshooting](#installation-troubleshooting) section below. +That's it, you can start using `immunarch` now! See the [Quick Start](#quick-start) section below to dive into immune repertoire data analysis. If you run in any trouble with installation, take a look at the [Installation Troubleshooting](http://immunarch.com/articles/v1_introduction.html#installation-troubleshooting) section. -Note that there are quite a lot of dependencies to install with the package because it installs all the widely-used packages for data analysis and visualisation. You got both the AIRR data analysis framework and Data Science package eco-system with only one command! +Note: there are quite a lot of dependencies to install with the package because it installs all the widely-used packages for data analysis and visualisation. You got both the AIRR data analysis framework and the full Data Science package ecosystem with only one command, making `immunarch` the entry-point for single-cell & immune repertoire Data Science. - -You can find the list of releases of `immunarch` here: https://github.com/immunomind/immunarch/releases -## Quick start -The gist of the typical TCR or BCR data analysis workflow can be reduced to the next few lines of code. +### Latest pre-release on GitHub +Since releasing on CRAN is limited to one release per one-two months, you can install the latest pre-release version with bleeding edge features and optimisations directly from the code repository. In order to install the latest pre-release version, you need to execute only two commands: -**1) Load the package and the data** +```r +install.packages("devtools") # skip this if you already installed devtools +devtools::install_github("immunomind/immunarch", ref="dev") +``` -```{r eval=FALSE} -# 1.1) Load the package into R: -library(immunarch) +You can find the list of releases of `immunarch` here: http://github.com/immunomind/immunarch/releases -# 1.2a) To quickly test immunarch, load the test dataset: -data(immdata) -# 1.2b) To try immunarch on your own data, use the `repLoad` function on your data folder: -immdata = repLoad("path/to/your/folder/with/repertoires") -``` +## Quick start +The gist of the typical TCR or BCR data analysis workflow can be reduced to the next few lines of code. -**2) Analyse repertoire similarity at the clonotype level** +### Use `immunarch` data -```{r eval=FALSE} -# 2.1) Find the number of shared clonotypes and visualise it: -ov = repOverlap(immdata$data) -vis(ov) +**1) Load the package and the data** -# 2.2) Cluster samples by their similarity: -ov.kmeans = repOverlapAnalysis(ov, .method = "mds+kmeans") -vis(ov.kmeans) +```r +library(immunarch) # Load the package into R +data(immdata) # Load the test dataset ``` -**3) Find repertoire differences in the Variable gene usage** +**2) Calculate and visualise basic statistics** -```{r eval=FALSE} -# 3.1) Compute V gene usage and and highlight gene differences in groups with different clinical status: -gu = geneUsage(immdata$data) -vis(gu, .by="Status", .meta=immdata$meta) - -# 3.2) Cluster samples by their V gene usage similarity: -gu.clust = geneUsageAnalysis(gu, .method = "js+hclust") -vis(gu.clust) +```r +repExplore(immdata$data, "lens") %>% vis() # Visualise the length distribution of CDR3 +repClonality(immdata$data, "homeo") %>% vis() # Visualise the relative abundance of clonotypes ``` -**4) Find differences in the diversity of repertoires** - -```{r eval=F} -# 4.1) Compare diversity of repertoires and visualise samples, grouped by both clinical status and sequencing Lane: -div = repDiversity(immdata$data, .method = "chao1") -vis(div, .by=c("Status", "Lane"), .meta=immdata$meta) +**3) Explore and compare T-cell and B-cell repertoires** +```r +repOverlap(immdata$data) %>% vis() # Build the heatmap of public clonotypes shared between repertoires +geneUsage(immdata$data[[1]]) %>% vis() # Visualise the V-gene distribution for the first repertoire +repDiversity(immdata$data) %>% vis(.by = "Status", .meta = immdata$meta) # Visualise the Chao1 diversity of repertoires, grouped by the patient status ``` -**5) Manipulate plots to make them publication-ready** +### Use your own data -```{r eval=FALSE} -# 5.1) Manipulate the visualisation of diversity estimates to make the plot publication-ready: -div = repDiversity(immdata$data, .method = "chao1") -div.plot = vis(div, .by=c("Status", "Lane"), .meta=immdata$meta) -fixVis(div.plot) +```r +library(immunarch) # Load the package into R +immdata <- repLoad("path/to/your/data") # Replace it with the path to your data. Immunarch automatically detects the file format. ``` -**6) Advanced methods** +### Advanced methods -For advanced methods such as clonotype tracking, kmer analysis and public repertoire analysis see tutorials on the [immunarch website](https://immunarch.com/). +For advanced methods such as clonotype annotation, clonotype tracking, kmer analysis and public repertoire analysis see "Tutorials". ## Installation troubleshooting @@ -154,7 +139,7 @@ Execution halted 3. For Mac users. Make sure to install XCode from App Store first and command line developers tools second by executing the following command in Terminal: `xcode-select –install` -4. For Mac users. If you have issues like old packages can't be updated, or error messages such as `ld: warning: directory not found for option` or `ld: library not found for -lgfortran`, [this link](https://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/) will help you to fix the issue. +4. For Mac users. If you have issues like old packages can't be updated, or error messages such as `ld: warning: directory not found for option` or `ld: library not found for -lgfortran`, [this link](http://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks--lgfortran-and--lquadmath-error/) will help you to fix the issue. 5. For Mac Mojave (1.14) users. If you run into the following error: @@ -179,7 +164,7 @@ Error: package or namespace load failed for 'igraph' in dyn.load(file, DLLpath = libgfortran.so.4: cannot open shared object file: No such file or directory ``` - See [this link](https://ashokragavendran.wordpress.com/2017/10/24/error-installing-rigraph-unable-to-load-shared-object-igraph-so-libgfortran-so-4-cannot-open-shared-object-file-no-such-file-or-directory/) for help. + See [this link](http://ashokragavendran.wordpress.com/2017/10/24/error-installing-rigraph-unable-to-load-shared-object-igraph-so-libgfortran-so-4-cannot-open-shared-object-file-no-such-file-or-directory/) for help. 7. For Linux users. If you have issues with the `rgl` package: @@ -194,7 +179,7 @@ ERROR: configuration failed for package ‘rgl’ apt-get install mesa-common-dev ``` - Check [this link](https://stackoverflow.com/questions/15292905/how-to-solve-the-error-missing-required-header-gl-gl-h-while-installing-the-p) for more information and other possible workarounds. + Check [this link](http://stackoverflow.com/questions/15292905/how-to-solve-the-error-missing-required-header-gl-gl-h-while-installing-the-p) for more information and other possible workarounds. 7. If you have error messages with `rlang` in them such as: @@ -221,25 +206,15 @@ ERROR: lazy loading failed for package 'immunarch' Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true") ``` -9. For Windows users. If you have issues with the package installation, or if you want to change the folder for R packages, feel free to check [this forum post](https://community.rstudio.com/t/help-regarding-package-installation-renviron-rprofile-r-libs-r-libs-site-and-r-libs-user-oh-my/13888/8). +9. For Windows users. If you have issues with the package installation, or if you want to change the folder for R packages, feel free to check [this forum post](http://community.rstudio.com/t/help-regarding-package-installation-renviron-rprofile-r-libs-r-libs-site-and-r-libs-user-oh-my/13888/8). -9. For Windows users. Make sure to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). Before installation close RStudio, install Rtools and re-open it afterwards. To check if Rtools installed correctly, run the `devtools::find_rtools()` command (after installing the devtools package). If you have an error, check [this link](https://github.com/r-lib/devtools/issues/1941) for help. +9. For Windows users. Make sure to install [Rtools](https://cran.r-project.org/bin/windows/Rtools/). Before installation close RStudio, install Rtools and re-open it afterwards. To check if Rtools installed correctly, run the `devtools::find_rtools()` command (after installing the devtools package). If you have an error, check [this link](http://github.com/r-lib/devtools/issues/1941) for help. 9. If you can not install dependencies for `immunarch`, please try manual installation of all dependencies by executing the following command in R console: ``` install.packages(c("rematch", "prettyunits", "forcats", "cellranger", "progress", "zip", "backports", "ellipsis", "zeallot", "SparseM", "MatrixModels", "sp", "haven", "curl", "readxl", "openxlsx", "minqa", "nloptr", "RcppEigen", "utf8", "vctrs", "carData", "pbkrtest", "quantreg", "maptools", "rio", "lme4", "labeling", "munsell", "cli", "fansi", "pillar", "viridis", "car", "ellipse", "flashClust", "leaps", "scatterplot3d", "modeltools", "DEoptimR", "digest", "gtable", "lazyeval", "rlang", "scales", "tibble", "viridisLite", "withr", "assertthat", "glue", "magrittr", "pkgconfig", "R6", "tidyselect", "BH", "plogr", "purrr", "ggsci", "cowplot", "ggsignif", "polynom", "fastcluster", "plyr", "abind", "dendextend", "FactoMineR", "mclust", "flexmix", "prabclus", "diptest", "robustbase", "kernlab", "GlobalOptions", "shape", "colorspace", "stringi", "hms", "clipr", "crayon", "httpuv", "mime", "jsonlite", "xtable", "htmltools", "sourcetools", "later", "promises", "gridBase", "RColorBrewer", "yaml", "ggplot2", "dplyr", "dtplyr", "dbplyr", "data.table", "gridExtra", "ggpubr", "pheatma3", "ggrepel", "reshape2", "DBI", "factoextra", "fpc", "circlize", "tidyr", "Rtsne", "readr", "readxl", "shiny", "shinythemes", "treemap", "igraph", "airr", "ggseqlogo", "UpSetR", "stringr", "ggalluvial", "Rcpp")) ``` - -9. If you can not install `immunarch` using `install_url()`, try the manual installation. First, you need to download the package file from [our GitHub](https://github.com/immunomind/immunarch/raw/master/immunarch.tar.gz). - - Note that you **should not un-archive it**! - - After downloading the file, you need to run R command from the `devtools` package to install `immunarch`. Upon completion the dependencies will have been already downloaded and installed. Run the following command: - - ``` -devtools::install_local("path/to/your/folder/with/immunarch.tar.gz", dependencies=T) - ``` 9. If you encounter the following error while running the `devtools::install_local` function: @@ -254,4 +229,4 @@ devtools::install_local("path/to/your/folder/with/immunarch.tar.gz", dependencie Check your path to the downloaded package archive file. It should not be "path/to/your/folder/with/immunarch.tar.gz", but a path on your PC to the downloaded file, e.g., "C:/Users/UserName/Downloads/immunarch.tar.gz" or "/Users/UserName/Downloads/immunarch.tar.gz". -9. If troubles still persist, message us on support@immunomind.io or create an issue in [our GitHub]( https://github.com/immunomind/immunarch/issues) with the code that represents the issue and the output you get in the console. +9. If troubles still persist, let us know via [GitHub](http://github.com/immunomind/immunarch/issues) (preferably) or [support@immunomind.io](mailto:support@immunomind.io) (in case of private data). diff --git a/vignettes/v2_data.Rmd b/vignettes/v2_data.Rmd index 691c4355..a06f41e9 100644 --- a/vignettes/v2_data.Rmd +++ b/vignettes/v2_data.Rmd @@ -77,10 +77,9 @@ The package provides several IO functions: - `repLoad` - to load the repertoires, having compatible format. -- `repSave` - to save changes and to write the repertoire data to a file in a specific format (`immunarch`, VDJtools or MonetDB). +- `repSave` - to save changes and to write the repertoire data to a file in a specific format (`immunarch`, VDJtools). -`repLoad` has the `.format` argument that sets the format for input repertoire files. -Do not provide it if you want `immunarch` to detect the formats and parse files automatically! In case you want to force the package to parse the data in a specific format, you can choose one of the several options for the argument: +`repLoad` detects the input file format automatically. `immunarch` currently support the following immune repertoire data formats: - `"immunarch"` - current software tool, in case you forgot it :) diff --git a/vignettes/web_only/v21_singlecell.Rmd b/vignettes/web_only/v21_singlecell.Rmd new file mode 100644 index 00000000..659b68cd --- /dev/null +++ b/vignettes/web_only/v21_singlecell.Rmd @@ -0,0 +1,134 @@ +--- +title: "Single-cell and paired chain data: scTCRseq and scBCRseq exploration" +author: 'ImmunoMind' +date: "support@immunomind.io" +output: + html_document: + fig_height: 8 + fig_width: 10 + theme: spacelab + toc: yes + pdf_document: + toc: yes + word_document: + toc: yes +--- + + + + + + +```{r setup, include=FALSE, echo=FALSE} +# knitr::knit_hooks$set(optipng = knitr::hook_optipng) +# knitr::opts_chunk$set(optipng = '-o7') + +knitr::opts_chunk$set(echo = TRUE) +knitr::opts_chunk$set(fig.align = "center") +knitr::opts_chunk$set(fig.width = 18) +knitr::opts_chunk$set(fig.height = 12) + +library(immunarch) +data(scdata) +``` + + +# Executive Summary + +> This is a vignette dedicated to provide an overview on how to work with single-cell paired chain data in `immunarch` + +> Single-cell support is currently in the development version. In order to access it, you need to install the latest development version of the package by executing the following command: + +```r +install.packages("devtools"); devtools::install_github("immunomind/immunarch", ref="dev") +``` + +> To read paired chain data into `immunarch` use the `repLoad` function with `.mode = "paired"`. Currently we support 10X Genomics only. + +> To subset immune repertoires by specific barcodes use the `select_barcodes` function. Output of `Seurat::Idents()` as a barcode vector works. + +> To create cluster-specific and patient-specific datasets using barcodes from the output of `Seurat::Idents()` use the `select_clusters` function. + + +# Use the data packaged with `immunarch` + +Load the package into the R enviroment: +```{r} +library(immunarch) +``` + +For testing purposes we attached a new paired chain dataset to `immunarch`. Load it by executing the following command: + +```{r} +data(scdata) +``` + +# Load the paired chain data + +To load your own datasets, use the `repLoad` function. Currently we implemented paired chain data support for 10X Genomics data only. A working example of loading datasets into R: + +```{r} +file_path <- paste0(system.file(package = "immunarch"), "/extdata/sc/flu.csv.gz") +igdata <- repLoad(file_path, .mode = "paired") + +igdata$meta + +head(igdata$data[[1]][c(1:7, 16, 17)]) +``` + +# Subset by barcodes + +To subset the data by barcodes, use the `select_barcodes` function. + +```{r} +barcodes <- c("AGTAGTCAGTGTACTC-1", "GGCGACTGTACCGAGA-1", "TTGAACGGTCACCTAA-1") + +new_df <- select_barcodes(scdata$data[[1]], barcodes) + +new_df +``` + +## Patient-specific datasets + +To create a new dataset with cluster-specific immune repertoires, use the `select_clusters` function: + +```{r} +scdata_pat <- select_clusters(scdata, scdata$bc_patient, "Patient") + +names(scdata_pat$data) + +scdata_pat$meta +``` + +## Cluster-specific datasets + +To create a new dataset with cluster-specific immune repertoires, use the `select_clusters` function. You can apply this function after you created patient-specific datasets to get patient-specific cell cluster-specific immune repertoires, e.g., a Memory B Cell repertoire for a specific patient: + +```{r} +scdata_cl <- select_clusters(scdata_pat, scdata$bc_cluster, "Cluster") + +names(scdata_cl$data) + +scdata_cl$meta +``` + +# Explore and compute statistics + +Most functions will work out-of-the-box with paired chain data. + +```{r} +p1 <- repOverlap(scdata_cl$data) %>% vis() +p2 <- repDiversity(scdata_cl$data) %>% vis() + +target <- c("CARAGYLRGFDYW;CQQYGSSPLTF", "CARATSFYYFHHW;CTSYTTRTTLIF", "CARDLSRGDYFPYFSYHMNVW;CQSDDTANHVIF", "CARGFDTNAFDIW;CTAWDDSLSGVVF", "CTREDYW;CMQTIQLRTF") +p3 <- trackClonotypes(scdata_cl$data, target, .col = "aa") %>% vis() + +(p1 + p2) / p3 +``` + +Several functions may work incorrectly with paired chain data in this release of `immunarch`. Let us know via [GitHub Issues](http://github.com/immunomind/immunarch/issues)! + diff --git a/vignettes/v3_basic_analysis.Rmd b/vignettes/web_only/v3_basic_analysis.Rmd similarity index 78% rename from vignettes/v3_basic_analysis.Rmd rename to vignettes/web_only/v3_basic_analysis.Rmd index bc02ccd4..30de2cfa 100644 --- a/vignettes/v3_basic_analysis.Rmd +++ b/vignettes/web_only/v3_basic_analysis.Rmd @@ -44,12 +44,12 @@ For each task in this section `immunarch` includes separate functions that are g Note: all functions in immunarch require that the input immune repertoire data list have names. If you use the `repLoad` function, you will have no issues. If you compose your list by-hand, you must name elements in the list, e.g.: ```{r rename-list, eval=F} -your_data # Your list with repertoires without names +your_data # Your list with repertoires without names names(your_data) # Output: NULL -names(your_data) = sapply(1:length(your_data), function (i) paste0("Sample", i)) +names(your_data) <- sapply(1:length(your_data), function(i) paste0("Sample", i)) names(your_data) # Output: Sample1 Sample2 ... Sample10 ``` @@ -83,18 +83,18 @@ Almost all visualisations of analysis results are support grouping data by their 1) You can pass `.by` as a character vector with one or several column names from the metadata table to group your data before plotting. In this case you should provide also the `.meta` argument with the metadata table. ```{r eda-by-meta, warning=F, fig.width=10, fig.height=4.5} -exp_vol = repExplore(immdata$data, .method = "volume") -p1 = vis(exp_vol, .by = c("Status"), .meta = immdata$meta) -p2 = vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta) -grid.arrange(p1, p2, ncol=2) +exp_vol <- repExplore(immdata$data, .method = "volume") +p1 <- vis(exp_vol, .by = c("Status"), .meta = immdata$meta) +p2 <- vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta) +p1 + p2 ``` 2) You can pass `.by` as a character vector that exactly matches the number of samples in your data, each value should correspond to a sample's property. It will be used to group data based on the values provided. Note that in this case you should pass NA to `.meta`. ```{r eda-by-only, warning=F, fig.width=6, fig.height=4.5} -exp_vol = repExplore(immdata$data, .method = "volume") -by_vec = c('C', 'C', 'C', 'C', 'C', 'C', 'MS', 'MS', 'MS', 'MS', 'MS', 'MS') -p = vis(exp_vol, .by = by_vec) +exp_vol <- repExplore(immdata$data, .method = "volume") +by_vec <- c("C", "C", "C", "C", "C", "C", "MS", "MS", "MS", "MS", "MS", "MS") +p <- vis(exp_vol, .by = by_vec) p ``` @@ -106,10 +106,10 @@ Adjusted for multiple comparisons P-values are plotted on the top of groups. P-v Plots generated by the `vis` function as well as any ggplot2-based plots can be passed to `fixVis`---built-in software tool for making publication-ready plots: ```{r eda-fixvis, eval=F} # 1. Analyse -exp_len = repExplore(immdata$data, .method = "len", .col = "aa") +exp_len <- repExplore(immdata$data, .method = "len", .col = "aa") # 2. Visualise -p1 = vis(exp_len) +p1 <- vis(exp_len) # 3. Fix and make publication-ready results fixVis(p1) @@ -120,63 +120,65 @@ See the `fixVis` tutorial [here](https://immunarch.com/articles/web_only/v7_fixv ## Exploratory analysis For the basic exploratory analysis such as comparing of number of reads / UMIs per repertoire or distribution use the function `repExplore`. ```{r eda-1, warning=F, fig.width=12, fig.height=4} -exp_len = repExplore(immdata$data, .method = "len", .col = "aa") -exp_cnt = repExplore(immdata$data, .method = "count") -exp_vol = repExplore(immdata$data, .method = "volume") +exp_len <- repExplore(immdata$data, .method = "len", .col = "aa") +exp_cnt <- repExplore(immdata$data, .method = "count") +exp_vol <- repExplore(immdata$data, .method = "volume") -p1 = vis(exp_len) -p2 = vis(exp_cnt) -p3 = vis(exp_vol) +p1 <- vis(exp_len) +p2 <- vis(exp_cnt) +p3 <- vis(exp_vol) p1 ``` ```{r eda-2, warning=F, fig.width=14, fig.height=4} -grid.arrange(p2, p3, ncol = 2) +p2 + p3 ``` ```{r eda-3, warning=F, fig.width=10, fig.height=4} # You can group samples by their metainformation -p4 = vis(exp_len, .by="Status", .meta=immdata$meta) -p5 = vis(exp_cnt, .by="Sex", .meta=immdata$meta) -p6 = vis(exp_vol, .by=c("Status", "Sex"), .meta=immdata$meta) +p4 <- vis(exp_len, .by = "Status", .meta = immdata$meta) +p5 <- vis(exp_cnt, .by = "Sex", .meta = immdata$meta) +p6 <- vis(exp_vol, .by = c("Status", "Sex"), .meta = immdata$meta) p4 ``` ```{r eda-4, warning=F, fig.width=10, fig.height=5} -grid.arrange(p5, p6, ncol = 2) +p5 + p6 ``` ## Clonality One of the ways to estimate the diversity of samples is to evaluate clonality. `repClonality` measures the amount of the most or the least frequent clonotypes. There are several methods to assess clonality, let us take a view of them. The `clonal.prop` method computes the proportion of repertoire occupied by the pools of cell clones: ```{r clonality-pr} -imm_pr = repClonality(immdata$data, .method = "clonal.prop") +imm_pr <- repClonality(immdata$data, .method = "clonal.prop") imm_pr ``` The `top` method considers the most abundant cell clonotypes: ```{r clonality-top} -imm_top = repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000)) +imm_top <- repClonality(immdata$data, .method = "top", .head = c(10, 100, 1000, 3000, 10000)) imm_top ``` While the `rare` method deals with the least prolific clonotypes: ```{r clonality-rare} -imm_rare = repClonality(immdata$data, .method = "rare") +imm_rare <- repClonality(immdata$data, .method = "rare") imm_rare ``` Finally, the `homeo` method assesses the clonal space homeostasis, i.e., the proportion of the repertoire occupied by the clones of a given size: ```{r clonality-hom-vis-1, message=F, warning=F, fig.width=11, fig.height=4.5} -imm_hom = repClonality(immdata$data, .method = "homeo", - .clone.types = c(Small = .0001, Medium = .001, Large = .01, Hyperexpanded = 1)) +imm_hom <- repClonality(immdata$data, + .method = "homeo", + .clone.types = c(Small = .0001, Medium = .001, Large = .01, Hyperexpanded = 1) +) imm_hom -grid.arrange(vis(imm_top), vis(imm_top, .by="Status", .meta=immdata$meta), ncol = 2) +vis(imm_top) + vis(imm_top, .by = "Status", .meta = immdata$meta) -grid.arrange(vis(imm_rare), vis(imm_rare, .by="Status", .meta=immdata$meta), ncol = 2) +vis(imm_rare) + vis(imm_rare, .by = "Status", .meta = immdata$meta) -grid.arrange(vis(imm_hom), vis(imm_hom, .by=c("Status", "Sex"), .meta=immdata$meta), ncol = 2) +vis(imm_hom) + vis(imm_hom, .by = c("Status", "Sex"), .meta = immdata$meta) ``` diff --git a/vignettes/web_only/v4_overlap.Rmd b/vignettes/web_only/v4_overlap.Rmd index 555c846f..28a70a36 100644 --- a/vignettes/web_only/v4_overlap.Rmd +++ b/vignettes/web_only/v4_overlap.Rmd @@ -57,17 +57,23 @@ Repertoire overlap is the most common approach to measure repertoire similarity. The function that includes described methods is `repOverlap`. Again the output is easily visualised when passed to `vis()` function that does all the work: ```{r overlap, message=F, warning=FALSE, fig.width=12, fig.height=7} -imm_ov1 = repOverlap(immdata$data, .method = "public", .verbose = F) -imm_ov2 = repOverlap(immdata$data, .method = "morisita", .verbose = F) +imm_ov1 <- repOverlap(immdata$data, .method = "public", .verbose = F) +imm_ov2 <- repOverlap(immdata$data, .method = "morisita", .verbose = F) -grid.arrange(vis(imm_ov1), vis(imm_ov2, .text.size=2), ncol = 2) +p1 <- vis(imm_ov1) +p2 <- vis(imm_ov2, .text.size = 2) + +p1 + p2 vis(imm_ov1, "heatmap2") ``` You can easily change the number of significant digits: ```{r overlap-signif-digits, message=F, warning=FALSE, fig.width=12, fig.height=7} -grid.arrange(vis(imm_ov2, .text.size=2.5, .signif.digits=1), vis(imm_ov2, .text.size=2, .signif.digits=2), ncol = 2) +p1 <- vis(imm_ov2, .text.size = 2.5, .signif.digits = 1) +p2 <- vis(imm_ov2, .text.size = 2, .signif.digits = 2) + +p1 + p2 ```