Skip to content

Commit

Permalink
Merge branch 'main' into dev
Browse files Browse the repository at this point in the history
  • Loading branch information
shakson-isaac committed Jul 26, 2023
2 parents 65368b6 + 0e25ea4 commit 027e26e
Show file tree
Hide file tree
Showing 3 changed files with 48 additions and 32 deletions.
9 changes: 9 additions & 0 deletions R/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
inst/doc
improvements
READMEupdate.md
RData/
.DS_Store
71 changes: 39 additions & 32 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,11 @@ The manuscript will soon appear at medRxiv! (Sakaue et al. "**Tissue-specific en
SCENT uses single-cell multimodal data (e.g., 10X Multiome RNA/ATAC) and links ATAC-seq peaks (putative enhancers) to their target genes by modeling association between chromatin accessibility and gene expression across individual single cells.

<div align="center">
<img src="https://github.com/immunogenomics/SCENT/blob/e0f14cd59a7a148d94383e2a825f3546e2045d41/fig/cover_image.png" width=90%>
<img src="https://raw.githubusercontent.com/immunogenomics/SCENT/main/fig/cover_image2.png" width=90%>
</div>



We use Poisson regression to associate gene expression (raw) count and (binarized) peak accessibility, and estimate errors in coefficients by bootstrapping framework to control for type I error.


Expand Down Expand Up @@ -62,17 +63,19 @@ Vignettes are posted in this github repo to show 2 potential uses of the SCENT p
In summary, the main functionality is the SCENT object construction:

```r
library(SCENT)

SCENT_obj <- CreateSCENTObj(rna = mrna, atac = atac, meta.data = meta,
peak.info = gene_peak,
covariates = c("log(nUMI)","percent.mito","sample", "batch"),
celltypes = "newCT")
celltypes = "celltype")
```

Followed by SCENT algorithm:
```r
SCENT_obj <- SCENT_algorithm(object = SCENT_obj, celltype = "Tcells", ncores = 6)
SCENT_obj <- SCENT_algorithm(object = SCENT_obj, celltype = "Tcell", ncores = 6)
```
Where the user specifies a celltype for association analysis and the number of cores for parallelized bootstrapping.
Where the user specifies a `celltype` (in this case "Tcell") for association analysis (in `meta.data` slot in SCENT object) and the number of cores for parallelized bootstrapping.

The output of the SCENT algorithm will be contained in the field:
```r
Expand All @@ -83,62 +86,67 @@ which can be saved as a textfile for further downstream analysis.

Further information on Inputs and Outputs of SCENT are detailed below:

#### Inputs To SCENT Object:
#### Arguments To `CreateSCENTObj`:

| # | Argument name (format) | Descriptions |
| ---- | ---------------------------- | ------------------------------------------------------------ |
| 1 | rna (.rds) | A gene-by-cell matrix from multimodal RNA-seq data. This is a raw count matrix without any normalization. The column names should be the gene names used in the `file_gene_peak_tested` file. Sparse matrix format is required. |
| 2 | atac (.rds) | A peak-by-cell matrix from multimodal ATAC-seq data. The row names should be the peak names used in the `file_gene_peak_tested` file. The column names are the cell names which should be the same names used in `rna_matrix` and the `cell`column of `metafile`. The matrix may not be binarized while it will be binarized within the function. Sparse matrix format is required. |
| 3 | meta.data (.txt) | A metadata for cells (rows are cells, and cell names should be in the column named as "cell"; see below example). Additionally, this text should include covariates to use in the model. Examples include: % mitochondrial reads, nUMI, sample, and batch as covariates. Dataframe format is required. |
| 4 | peak.info (.txt) | A textfile indicating which gene-peak pairs you want to test in this chunk (see below example). We highly recommend splitting gene-peak pairs into many chunks to increase computational efficiency (See Parallelized Jobs Info in Section 2). Dataframe or List(Dataframe) format is required. |
| 5 | covariates (character) | A vector of character fields that denote the covariates listed in the meta.data. For example, a set of covariates can be: %mitochondrial reads, nUMI, sample, and batch. Additionally the user can specify transformations to the covariates such as log transformation on nUMI counts for direct usage in the SCENT algorithm invoking poisson glm. |
| 6 | cell_types (character) | User specified naming of the celltype column in the meta.data file. This column should contain the names of the celltypes you want to test in this association analysis. |
| 1 | rna (sparse matrix) | A gene-by-cell count matrix from multimodal RNA-seq data. This is a raw count matrix without any normalization. The row names should be the gene names used in the `peak.info` file. The column names are the cell names which should be the same names used in the `cell`column of the dataframe specified for `meta.data`. Sparse matrix format is required. |
| 2 | atac (sparse matrix) | A peak-by-cell count matrix from multimodal ATAC-seq data. This is a raw count matrix without any normalization. The row names should be the peak names used in the `peak.info` file. The column names are the cell names which should be the same names used in `rna` and the `cell`column of dataframe specified for `meta.data`. The matrix may not be binarized while it will be binarized within the function. Sparse matrix format is required. |
| 3 | meta.data (dataframe) | A meta data frame for cells (rows are cells, and **cell names should be in the column named as "cell"**; see below example). Additionally, this text should include covariates to use in the model. Examples include: % mitochondrial reads, log(nUMI), sample, and batch as covariates. Dataframe format is required. |
| 4 | peak.info (dataframe) | A textfile indicating which gene-peak pairs you want to test in this chunk (see below example). We highly recommend splitting gene-peak pairs into many chunks to increase computational efficiency (See Parallelized Jobs Info in Section 2). List(Dataframe) format which is a list of multiple data frames for parallelization is required. \* |
| 5 | covariates (a vector of character) | A vector of character fields that denote the covariates listed in the meta.data. For example, a set of covariates can be: %mitochondrial reads, log_nUMI, sample, and batch. Additionally the user can specify transformations to the covariates such as log transformation on nUMI counts for direct usage in the SCENT algorithm invoking poisson glm. **We recommend users to at least use log(number_of_total_RNA_UMI_count_per_cell) as the base model is Poisson regression and we do not include the offset term into the default model.** |
| 6 | celltypes (character) | User specified naming of the celltype column in the meta.data file. This column should contain the names of the celltypes you want to test in this association analysis. |

Alternatives: The peak.info field can be left blank and created using the CreatePeakToGeneList function in the SCENT package. This function requires the user to specify a bed file that specifies ~500 kb windows of multiple gene loci to identify cis gene-peak pairs to test.
\* Extra Argument: The peak.info.list field can be left blank initially and a created List(Dataframes) can be constructed using the CreatePeakToGeneList function in the SCENT package. This function requires the user to specify a bed file that specifies ~500 kb windows of multiple gene loci to identify cis gene-peak pairs to test. The vignette, SCENT_parallelize.Rmd, will show steps to produce a SCENT object with a peak.info.list field that is used for parallelization in the SCENT_parallelization.R script.



#### Example Formats:
The example format of `file_gene_peak_tested` file in text format.
The example format of `peak.info` argument:

```bash
$ head ${file_gene_peak_tested}
A1BG chr19-57849279-57850722
A1BG chr19-57888160-57889279
A1BG chr19-57915851-57917093
A1BG chr19-57934422-57935603
> gene_peak <- read.table("/path/to/your_gene_peak_text_file.txt")
> head(gene_peak)

V1 V2
1 A1BG chr19-57849279-57850722
2 A1BG chr19-57888160-57889279
3 A1BG chr19-57915851-57917093
4 A1BG chr19-57934422-57935603
5 A1BG chr19-57946848-57948062
```

We usually only select peaks of which the center falls within 500 kb from the target gene (*cis* analysis). Also, while we have a function to QC peaks and genes so that they are present in at least 5% of all cells within `SCENT.R`, it is more efficient to only include these QCed peaks and genes in `file_gene_peak_tested` to reduce the number of tests.
We usually only select peaks of which the center falls within 500 kb from the target gene (*cis* analysis). Also, while we have a function to QC peaks and genes so that they are present in at least 5% of all cells within `SCENT.R`, **it is more efficient to only include these QCed peaks and genes in `peak.info` to reduce the number of tests**.


The example format of `metafile` file in rds format.
The example format of `meta.data` argument:

```r
meta <- readRDS(metafile)
meta$`log(nUMI)` <- log(meta$nUMI)
head(meta)

cell nUMI percent_mito sample batch
cell nUMI percent.mito sample batch
AAACAGCCAAGGAATC-1 AAACAGCCAAGGAATC-1 8380 0.01503428 sample_1 batch_a
AAACAGCCAATCCCTT-1 AAACAGCCAATCCCTT-1 3771 0.02207505 sample_1 batch_a
AAACAGCCAATGCGCT-1 AAACAGCCAATGCGCT-1 6876 0.01435579 sample_1 batch_a
AAACAGCCACACTAAT-1 AAACAGCCACACTAAT-1 1733 0.03881841 sample_1 batch_a
AAACAGCCACCAACCG-1 AAACAGCCACCAACCG-1 5415 0.01600768 sample_1 batch_a
AAACAGCCAGGATAAC-1 AAACAGCCAGGATAAC-1 2759 0.02485340 sample_1 batch_a
celltype
AAACAGCCAAGGAATC-1 Tcell
AAACAGCCAATCCCTT-1 Tcell
AAACAGCCAATGCGCT-1 Tcell
AAACAGCCACACTAAT-1 Tcell
AAACAGCCACCAACCG-1 Tcell
AAACAGCCAGGATAAC-1 Tcell
celltype log(nUMI)
AAACAGCCAAGGAATC-1 Tcell 9.033603
AAACAGCCAATCCCTT-1 Tcell 8.235095
AAACAGCCAATGCGCT-1 Tcell 8.835792
AAACAGCCACACTAAT-1 Tcell 7.457609
AAACAGCCACCAACCG-1 Tcell 8.596928
AAACAGCCAGGATAAC-1 Tcell 7.922624
```


#### Output of SCENT (SCENT.result field)
#### Output of SCENT (`SCENT.result` slot)

```bash
$ head ${file_output}
> head([email protected])
gene peak beta se z p boot_basic_p
A1BG chr19-57849279-57850722 0.587060911718621 0.227961010352348 2.57526894977009 0.0100162168431262 0.0192
A1BG chr19-57888160-57889279 -0.0842330294127105 0.232845263030106 -0.3617553920425660.717534829528597 0.688
Expand All @@ -162,8 +170,7 @@ Each column indicates ...

### 2.) Using SCENT with parallelized jobs.


`SCENT_parallelization.R` is the code necessary for running parallelized SCENT jobs.
`SCENT_parallelization.R` is the example code necessary for running parallelized SCENT jobs.
This code needs a `SCENT_Object.rds` file that contains a list of gene-peak pairs.
To generate this object please follow the SCENT_parallelize.Rmd vignette file.

Expand Down
Binary file added fig/cover_image2.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 027e26e

Please sign in to comment.