This project used the Scanpy Python library to cluster and visualize single-cell RNA sequencing (scRNA-seq) data from peripheral blood mononuclear cells (PBMCs). Key steps included data preprocessing, dimensionality reduction via PCA, neighborhood graph construction and UMAP visualization, cell clustering using the Leiden algorithm, and marker gene identification to annotate the cell types.
pip install scanpy
pip install numpy
pip install pandas
pip install matplotlib
pip install anndata
pip install leidenalg
pip install seaborn
Scanpy is a Python-based single-cell analysis package. It integrates analysis workflows such as multidimensional data exploration, machine-learning-based classification and prediction, and built-in visualization into a single framework driven from Python scripts. Scanpy uses a data structure called AnnData for the analysis. An AnnData object represents a data matrix with annotations. It gives access to machine-learning tools to easily extract information without occupying much memory. It is similar to R’s ExpressionSet but supports sparse data and saves data on disk in an HDF5-based format, which does not depend on the platform, framework, or language. This allows operating on an AnnData object without fully loading it into memory. Scanpy's graph of neighborhood relations among data points is reported to be computed more efficiently than with the popular scikit-learn package.
The dataset (with intronic reads) is a single-cell gene expression dataset processed with Cell Ranger version 6.1.2 and taken from the 10x Genomics database. Human peripheral blood mononuclear cells (PBMCs) from a healthy female donor aged 25-30 years were obtained by 10x Genomics from AllCells. Libraries were generated from about 16,000 cells (11,984 cells recovered) as described in the Chromium Single Cell 3' Reagent Kits User Guide (v3.1 Chemistry Dual Index) (CG000315 Rev C) using the Chromium X. The libraries were sequenced on an Illumina NovaSeq 6000 to a read depth of approximately 40,000 mean reads per cell.
Data are read from the unzipped files and stored as an AnnData count matrix. Preprocessing consists of identifying the genes with the highest fraction of counts, filtering out cells with fewer than 200 detected genes and genes detected in fewer than 3 cells, and removing mitochondrial genes.
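The filtering logic can be sketched in plain NumPy on a toy count matrix. This is only an illustration, not the project's actual code: the function names `filter_counts` and `percent_mito` are hypothetical, the "MT-" prefix for mitochondrial genes is an assumption about the gene naming, and the real analysis would use the thresholds reported for this dataset (200 genes per cell, 3 cells per gene). The corresponding Scanpy calls are named in the comments.

```python
import numpy as np

# Scanpy equivalents (not run here):
#   sc.pp.filter_cells(adata, min_genes=200)
#   sc.pp.filter_genes(adata, min_cells=3)
#   sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

def filter_counts(counts, min_genes, min_cells):
    """Drop cells with too few detected genes, then genes seen in too few cells."""
    counts = np.asarray(counts)
    keep_cells = (counts > 0).sum(axis=1) >= min_genes
    kept = counts[keep_cells]
    keep_genes = (kept > 0).sum(axis=0) >= min_cells
    return kept[:, keep_genes], keep_cells, keep_genes

def percent_mito(counts, gene_names, prefix="MT-"):
    """Percentage of each cell's counts that come from mitochondrial genes."""
    counts = np.asarray(counts)
    is_mt = np.char.startswith(np.asarray(gene_names, dtype=str), prefix)
    totals = counts.sum(axis=1)
    return 100.0 * counts[:, is_mt].sum(axis=1) / np.maximum(totals, 1)

# Toy data: 4 cells x 4 genes, with hypothetical gene names.
toy = np.array([[1, 0, 0, 0],
                [2, 3, 0, 1],
                [0, 4, 5, 1],
                [1, 1, 1, 0]])
names = ["MT-A", "B", "C", "D"]
filtered, keep_cells, keep_genes = filter_counts(toy, min_genes=2, min_cells=3)
mito = percent_mito(toy, names)
```

The toy thresholds are deliberately small so that the effect is visible on a 4 x 4 matrix.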
The above image shows the genes that yield the highest fraction of counts in each single cell, across all cells.
This violin plot shows the number of genes expressed in the count matrix, the total counts per cell, and the percentage of counts from mitochondrial genes per cell.
These scatter plots show the total counts per cell against the percentage of mitochondrial counts and against the number of genes detected.
Total-count normalization is performed so that every cell has the same total count and cells become comparable.
The data are then log-transformed, and highly variable genes are identified for use in the downstream PCA.
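As a rough illustration of what total-count normalization followed by a log transform computes, here is a NumPy sketch. The target of 10,000 counts per cell is an assumption (the text does not state the value used), and the function names are hypothetical; the corresponding Scanpy calls are given in the comments.

```python
import numpy as np

# Scanpy equivalents (not run here):
#   sc.pp.normalize_total(adata, target_sum=1e4)
#   sc.pp.log1p(adata)

def normalize_total(counts, target_sum=1e4):
    """Scale each cell (row) so that its counts sum to target_sum."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # cells with no counts stay all-zero
    return counts / totals * target_sum

def log_normalize(counts, target_sum=1e4):
    """Total-count normalization followed by log(1 + x)."""
    return np.log1p(normalize_total(counts, target_sum))

toy = np.array([[1, 3], [10, 10], [0, 0]])
norm = normalize_total(toy)
logged = log_normalize(toy)
```

After this step the two non-empty toy cells have the same total even though their raw depths differ by a factor of five.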
The above figure shows the distribution of the highly variable genes before and after data normalization.
This plot shows the variance contributed by each PC in the dataset, which is used to decide how many PCs to consider in further analysis.
PCA is performed to reduce data dimensionality while retaining dataset variability.
Principal components (PCs) contributing to dataset variance are selected.
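The idea behind PCA and the per-component variance can be sketched with a singular value decomposition in NumPy. This is a minimal illustration under the assumption of a dense, already scaled matrix; the helper name `pca` is hypothetical, and the analysis itself would rely on Scanpy (see the comments).

```python
import numpy as np

# Scanpy equivalents (not run here):
#   sc.tl.pca(adata, svd_solver="arpack")
#   sc.pl.pca_variance_ratio(adata, log=True)

def pca(X, n_comps):
    """Return PC scores, loadings, and explained-variance ratios via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S**2 / (X.shape[0] - 1)          # variance along each PC
    ratio = variances / variances.sum()
    scores = U[:, :n_comps] * S[:n_comps]        # cell coordinates in PC space
    return scores, Vt[:n_comps], ratio[:n_comps]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10.0                                  # one direction dominates the variance
scores, loadings, ratio = pca(X, n_comps=3)
```

Plotting `ratio` against the component index gives the kind of curve used to choose how many PCs to keep.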
The graph embedding is visualized in two dimensions using UMAP and colored by the highly expressed genes from the first three PCs.
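A neighborhood graph of the cells is computed and embedded in two dimensions using UMAP.
UMAP is preferred over t-SNE here because it tends to preserve more of the global structure of the data, which keeps similar cell types close together in the plot.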
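To make the neighborhood graph concrete, the sketch below finds each cell's k nearest neighbors by brute force in NumPy. It is a simplification: Scanpy's `sc.pp.neighbors` uses an approximate search and additionally computes UMAP-style connectivity weights, which are not reproduced here. The function name `knn_graph` and the toy coordinates are hypothetical.

```python
import numpy as np

# Scanpy equivalents (not run here):
#   sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
#   sc.tl.umap(adata)

def knn_graph(X, k):
    """Indices and Euclidean distances of the k nearest neighbors of each row."""
    X = np.asarray(X, dtype=float)
    sq = (X**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                     # a cell is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]
    dist = np.sqrt(np.maximum(np.take_along_axis(d2, idx, axis=1), 0.0))
    return idx, dist

# Four toy cells in a 2-D "PC space": two tight pairs far from each other.
cells = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
idx, dist = knn_graph(cells, k=1)
```

Both clustering and UMAP operate on a graph of this kind rather than on the raw expression matrix.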
Clustering is done using the Leiden algorithm on the neighborhood graph to identify cell clusters.
The figure shows the graph embeddings of the scaled and normalized data.
Statistical tests (t-test, Wilcoxon rank-sum test, and logistic regression) are performed to identify marker genes.
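As an illustration of how a t-test ranks candidate marker genes, the following NumPy sketch computes a one-versus-rest Welch t-statistic per gene. It is a simplified stand-in, not the project's code: the function name `rank_genes_welch` and the toy data are hypothetical, and p-values and multiple-testing correction are omitted. The Scanpy call that covers all three tests is shown in the comments.

```python
import numpy as np

# Scanpy equivalent (not run here):
#   sc.tl.rank_genes_groups(adata, "leiden", method="t-test")
#   (method can also be "wilcoxon" or "logreg")

def rank_genes_welch(X, labels, group):
    """Rank genes by Welch t-statistic for one cluster versus all other cells."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    in_group = labels == group
    a, b = X[in_group], X[~in_group]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(se > 0, mean_diff / se, 0.0)    # constant genes get t = 0
    order = np.argsort(-t)                           # highest t-statistic first
    return order, t

# Toy expression matrix: 6 cells x 3 genes; gene 1 is high in cluster "0".
X = np.array([[1, 5, 0], [2, 6, 0], [1, 7, 0],
              [1, 1, 0], [2, 2, 0], [1, 0, 0]])
labels = np.array(["0", "0", "0", "1", "1", "1"])
order, t = rank_genes_welch(X, labels, group="0")
```

The top-ranked genes for each cluster are the candidates that are then checked against the literature.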
Marker genes and literature are used to determine cell types and annotate clusters.
Analysis of the most differentially expressed genes in all 11 clusters using the t-test.
Various visualization methods are employed to represent genes and clusters.
The Leiden clustering produced 11 clusters, which were characterized by their marker genes.
The clusters are annotated by cell type. The 11 clusters were merged into 9 annotated groups because two of the clusters belonged to the B cells and germ cells.
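Merging clusters during annotation amounts to mapping several cluster IDs to the same label. The sketch below shows the mechanics only: the labels `type_A` to `type_I` and the choice of which clusters share a label are placeholders, not the annotation obtained in this analysis. With an AnnData object, the same mapping could be applied with pandas, for example `adata.obs["leiden"].map(cluster_to_type)`.

```python
# Placeholder mapping from 11 Leiden cluster IDs to 9 labels (hypothetical).
cluster_to_type = {
    "0": "type_A", "1": "type_B", "2": "type_C", "3": "type_D",
    "4": "type_E", "5": "type_F", "6": "type_G",
    "7": "type_H", "8": "type_H",    # two clusters merged into one label
    "9": "type_I", "10": "type_I",   # another merged pair
}

def annotate(cluster_ids, mapping):
    """Replace each cluster ID with its cell-type label."""
    return [mapping[c] for c in cluster_ids]

cell_clusters = ["0", "7", "8", "10", "9"]
cell_types = annotate(cell_clusters, cluster_to_type)
n_groups = len(set(cluster_to_type.values()))
```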
The dot plot shows the expression of the marker genes in each cluster.
The dataset was loaded as an AnnData matrix with 11,984 observations (cells) and 36,601 genes. During preprocessing, 32 cells with fewer than 200 detected genes and 9,238 genes detected in fewer than 3 cells were filtered out. After data scaling, further analysis was done with 964 observations and 3,553 genes. The loadings of the first three PCs were visualized to see the genes contributing to each PC, and the embeddings were visualized using UMAP. Applying the Leiden algorithm yielded 11 clusters. Further analysis identified the following marker genes: FHIT, FOXP1, CD8B, PITPNC1, ITGB1, NKG7, CD74, ANXA1, S100A9, RTKN2, and PPBP.