For this example we will use PBMC (Peripheral Blood Mononuclear Cell) data from the paper Integrated analysis of multimodal single-cell data by Hao et al. You can find the original data here, in MatrixMarker (.mtx) format. For convenience, you can download the samples recompressed as .h5 files. Direct links:
First we load SingleCellProjections
and the packages DataFrames
and CSV
for handling annotations.
julia> using SingleCellProjections, DataFrames, CSV
Then we load samples "P1" and "P2", by specifiying the paths to the files and naming them.
julia> base_path = "/path/to/downloads/";
julia> sample_paths = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P1.h5", "GSE164378_RNA_ADT_3P_P2.h5"]);
julia> counts = load_counts(sample_paths; sample_names=["P1","P2"])
DataMatrix (33766 variables and 35340 observations)
+ SparseMatrixCSC{Int64, Int32}
+ Variables: id, feature_type, name, genome, read, pattern, sequence
+ Observations: id, sampleName, barcode
Data sets in SingleCellProjections
are represented as DataMatrix
objects, which are matrices with annotations for var
(variables/genes/features) and obs
(observations, typically cells). Above, counts
is a DataMatrix
where the counts are stored in a sparse matrix. You can also see the available annotations for variables and observations. To access the different parts, use:
counts.matrix
- For the matrixcounts.var
- Variable annotations (DataFrame
)counts.obs
- Observation annotations (DataFrame
)
Here we compute a new obs
annotation where we count the fraction of reads coming from Mitochondrial genes for each cell:
julia> var_counts_fraction!(counts, "name"=>contains(r"^MT-"), "feature_type"=>isequal("Gene Expression"), "fraction_mt")
DataMatrix (33766 variables and 35340 observations)
+ SparseMatrixCSC{Int64, Int32}
+ Variables: id, feature_type, name, genome, read, pattern, sequence
+ Observations: id, sampleName, barcode, fraction_mt
+ Models: VarCountsFractionModel(subset_size=13, total_size=33538, col="fraction_mt")
Note that the new annotation fraction_mt
is present in the output.
We will also load some more cell annotations from the provided file.
julia> cell_annotations = CSV.read(joinpath(base_path, "GSE164378_RNA_ADT_3P.csv.gz"), DataFrame);
To merge, we use the DataFrames
function leftjoin!
, since it takes care of matching the cells in counts
to the cells in cell_annotations
based on the :barcode
column.
julia> leftjoin!(counts.obs, cell_annotations; on=:barcode);
Let's look at some annotations for the first few cells:
julia> counts.obs[1:5,["id","sampleName","barcode","fraction_mt","celltype.l1"]]
5×5 DataFrame
+ Row │ id sampleName barcode fraction_mt celltype.l1
+ │ String String String Float64 String7
+─────┼───────────────────────────────────────────────────────────────────────────────────
+ 1 │ P1_L1_AAACCCAAGACATACA P1 L1_AAACCCAAGACATACA 0.0330832 CD4 T
+ 2 │ P1_L1_AAACCCACATCGGTTA P1 L1_AAACCCACATCGGTTA 0.0649309 Mono
+ 3 │ P1_L1_AAACCCAGTGGAACAC P1 L1_AAACCCAGTGGAACAC 0.0541372 NK
+ 4 │ P1_L1_AAACCCATCTGCGGAC P1 L1_AAACCCATCTGCGGAC 0.0712244 CD4 T
+ 5 │ P1_L1_AAACGAAAGTTACTCG P1 L1_AAACGAAAGTTACTCG 0.0625228 CD4 T
The raw counts data is not suitable for analyses like PCA, since the data is far from normally distributed. A common strategy to handle this is to transform the data. Here we will use SCTransform (see also original sctransform implementation in R).
julia> transformed = sctransform(counts)
ERROR: MethodError: no method matching SCTransformModel(::DataMatrix{SparseMatrixCSC{Int64, Int32}, DataFrames.DataFrame, DataFrames.DataFrame}; use_cache::Bool, verbose::Bool)
+
+Closest candidates are:
+ SCTransformModel(::Type{T}, ::DataMatrix; var_filter, rtol, atol, annotate, post_var_filter, post_obs_filter, obs, kwargs...) where T
+ @ SingleCellProjections ~/work/SingleCellProjections.jl/SingleCellProjections.jl/src/transform.jl:281
From the output, we see that the number of variables have been reduced, since the default sctransform
options remove variables present in very few cells and only keeps variables with feature_type
set to "Gene Expression"
.
The matrix is now shown as A+B₁B₂B₃
. This is normally not very important from the user's point of view, but it is critical for explaining how SingleCellProjections
can be fast and not use too much memory. Instead of storing the SCTransformed matrix as a huge dense matrix, it is stored in memory as a MatrixExpression
, in this case a sparse matrix A
plus a product of three smaller matrices B₁
,B₂
and B₃
.
After transformation we always want to normalize the data. At the very least, data should be centered for PCA to work properly, this can be achieved by just running normalize_matrix
with the default parameters. Here, we also want to regress out "fraction_mt"
. You can add more obs
annotations (categorical and/or numerical) to regress out if you need.
julia> normalized = normalize_matrix(transformed, "fraction_mt")
ERROR: UndefVarError: `transformed` not defined
Now the matrix is shown as A+B₁B₂B₃+(-β)X'
, i.e. another low-rank term was added to handle the normalization/regression. The first two terms are reused to make sure memory is not wasted.
It is possible to filter variables and observations. Here we keep all cells that are not labeled as "other"
.
julia> filtered = filter_obs("celltype.l1"=>!isequal("other"), normalized)
ERROR: UndefVarError: `normalized` not defined
Now we are ready to perform Principal Component Analysis (PCA). This is computed by the Singular Value Decomposition (SVD), so we should call the svd
function. The number of dimensions is specified using the nsv
parameter.
julia> reduced = svd(filtered; nsv=20)
ERROR: UndefVarError: `filtered` not defined
The matrix is now stored as an SVD
object, which includes low dimensional representations of the observations and variables. To retrieve the low dimensional coordinates, use obs_coordinates
and var_coordinates
respectively.
Download interactive PCA plot.
+Expand this to show some example PlotlyJS plotting code.
You can of course use your own favorite plotting library instead. Use obs_coordinates
to get the coordinates for each cell, and data.obs
to access cell annotations for coloring.
using PlotlyJS
+function plot_categorical_3d(data, annotation; marker_size=3)
+ points = obs_coordinates(data)
+ traces = GenericTrace[]
+ for sub in groupby(data.obs, annotation; sort=true)
+ value = sub[1,annotation]
+ ind = parentindices(sub)[1]
+ push!(traces, scatter3d(;x=points[1,ind], y=points[2,ind], z=points[3,ind], mode="markers", marker_size, name=value))
+ end
+ plot(traces, Layout(;legend=attr(itemsizing="constant")))
+end
Use it like this:
julia> plot_categorical_3d(reduced, "celltype.l1")
+
For visualization purposes, it is often useful to further reduce the dimension after running PCA. (In contrast, analyses are generally run on the PCA/normalized/original data, since the methods below necessarily distort the data to force it down to 2 or 3 dimensions.)
Force Layout plots (also known as SPRING Plots) are created like this:
julia> fl = force_layout(reduced; ndim=3, k=100)
ERROR: UndefVarError: `reduced` not defined
Download interactive Force Layout plot.
SingleCellProjections
can be used together with UMAP.jl:
julia> using UMAP
julia> umapped = umap(reduced, 3)
ERROR: UndefVarError: `reduced` not defined
Download interactive UMAP plot.
Similarly, t-SNE plots are supported using TSne.jl. In this example, we just run it one every 10ᵗʰ cell, because t-SNE doesn't scale very well with the number of cells:
julia> using TSne
julia> t = tsne(reduced[:,1:10:end], 3)
ERROR: UndefVarError: `reduced` not defined
Download interactive t-SNE plot.
It is of course possible to use your own favorite dimension reduction method/package. The natural input for most cases are the coordinates after dimension reduction by PCA (obs_coordinates(reduced)
).
SingleCellProjections
is build to make it very easy to project one dataset onto another.
Let's load count data for two more samples:
julia> sample_paths_proj = joinpath.(base_path, ["GSE164378_RNA_ADT_3P_P5.h5", "GSE164378_RNA_ADT_3P_P6.h5"]);
julia> counts_proj = load_counts(sample_paths_proj; sample_names=["P5","P6"]);
julia> leftjoin!(counts_proj.obs, cell_annotations; on=:barcode);
julia> counts_proj
DataMatrix (33766 variables and 42553 observations)
+ SparseMatrixCSC{Int64, Int32}
+ Variables: id, feature_type, name, genome, read, pattern, sequence
+ Observations: id, sampleName, barcode, nCount_ADT, nFeature_ADT, nCount_RNA, nFeature_RNA, orig.ident, lane, donor, ...
And project them onto the Force Layout we created above:
julia> fl_proj = project(counts_proj, fl)
ERROR: UndefVarError: `fl` not defined
The result looks similar to the force layout plot above, since the donors "P5" and "P6" are similar to donors "P1" and "P2".
Download interactive Force Layout projection plot.
Under the hood, SingleCellProjections
recorded a ProjectionModel
for every step of the analysis leading up to the Force Layout. Let's take a look:
julia> fl.models
ERROR: UndefVarError: `fl` not defined
When projecting, these models are applied one by one (C.f. output from project
above), ensuring that the projected data is processed correctly. In most cases, projecting is not the same as running the same analysis independently, since information about the data set is recorded in the model.