CellO (Cell Ontology-based classification) is a Python package for performing cell type classification of human RNA-seq data. CellO makes hierarchical predictions against the Cell Ontology. These classifiers were trained on nearly all of the human primary cell, bulk RNA-seq data in the Sequence Read Archive.
For more details regarding the underlying method, see the paper: Bernstein, M.N., Ma, J., Gleicher, M., Dewey, C.N. (2020). CellO: Comprehensive and hierarchical cell type classification of human cells with the Cell Ontology. iScience, 24(1), 101913.
There are two modes in which one can use CellO: within Python in conjunction with Scanpy, or with the command line.
To install CellO using pip, run the following command:
pip install cello-classify
CellO's API interfaces with the Scanpy Python library and can be integrated into a more general single-cell analysis pipeline. For an example of how to use CellO with Scanpy, please see the tutorial.
This tutorial can also be executed from a Google Colab notebook in the cloud: https://colab.research.google.com/drive/1lNvzrP4bFDkEe1XXKLnO8PZ83StuvyWW?usp=sharing.
CellO takes as input a gene expression matrix. CellO accepts data in multiple formats:
- TSV: tab-separated value
- CSV: comma-separated value
- HDF5: a database in HDF5 format that includes three datasets: a dataset storing the expression matrix, a dataset storing the list of gene names (i.e. rows), and a dataset storing the list of cell IDs (i.e. columns)
- 10x formatted directory: a directory in the 10x format including three files: matrix.mtx, genes.tsv, and barcodes.tsv
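As a rough illustration of the TSV input format, the sketch below writes a tiny GENES x CELLS matrix using only the Python standard library. The exact layout (a header row of cell IDs and a first column of gene names), the gene names, the cell IDs, and the file name `toy_expression.tsv` are assumptions made for this example and are not taken from CellO's documentation.

```python
import csv
import os
import tempfile

# Hypothetical toy data: rows are genes, columns are cells (GENES x CELLS).
cell_ids = ["cell_1", "cell_2", "cell_3"]
counts_by_gene = {
    "GENE_A": [0, 5, 2],
    "GENE_B": [1, 0, 0],
}


def write_expression_tsv(path, cell_ids, counts_by_gene):
    """Write a GENES x CELLS matrix as a tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Assumed layout: empty corner cell, then one column per cell ID.
        writer.writerow([""] + list(cell_ids))
        for gene, counts in counts_by_gene.items():
            writer.writerow([gene] + list(counts))


out_path = os.path.join(tempfile.mkdtemp(), "toy_expression.tsv")
write_expression_tsv(out_path, cell_ids, counts_by_gene)

# Read the file back to show what was written.
with open(out_path) as handle:
    lines = handle.read().splitlines()
print(lines)
```

If a matrix is instead organized as CELLS x GENES, the `-r` flag listed in the options is the documented way to tell CellO so.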
Given an output-prefix provided to CellO (this can include the path to the output), CellO outputs three tables formatted as tab-separated-value files:
- <output_prefix>.probability.tsv: a NxM classification probability table of N cells and M cell types where element (i,j) is a probability value that describes CellO's confidence that cell i is of cell type j
- <output_prefix>.binary.tsv: a NxM binary-decision matrix where element (i,j) is 1 if CellO predicts cell i to be of cell type j and is 0 otherwise
- <output_prefix>.most_specific.tsv: a table mapping each cell to its most-specific predicted cell type
- <output_prefix>.log: a directory that stores log files recording details of CellO's execution
- <output_prefix>.log/genes_absent_from_training_set.tsv: if a new model is trained using the -t option, then this file will store the genes in CellO's training set that were not found in the input dataset
- <output_prefix>.log/clustering.tsv: a TSV file mapping each cell to its assigned cluster. Note that if pre-computed clusters are provided via the -p option, then this file will not be written.
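To give a feel for working with the probability table, here is a minimal sketch that parses a cells-by-cell-types TSV and reports the highest-probability column for each cell. The toy table contents, the assumption that the first column holds cell IDs and the header row holds cell type names, and the helper name `top_cell_type_per_cell` are all illustrative and not part of CellO.

```python
import csv
import io

# Hypothetical stand-in for the contents of <output_prefix>.probability.tsv.
# Assumed layout: header row of cell types, first column of cell IDs.
toy_probability_tsv = (
    "\tT cell\tB cell\tmonocyte\n"
    "cell_1\t0.91\t0.03\t0.10\n"
    "cell_2\t0.05\t0.88\t0.02\n"
)


def top_cell_type_per_cell(handle):
    """Return {cell ID: (cell type, probability)} for the largest value in each row."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cell_types = header[1:]
    best = {}
    for row in reader:
        cell_id = row[0]
        probs = [float(value) for value in row[1:]]
        # Index of the largest probability in this row.
        index = max(range(len(probs)), key=probs.__getitem__)
        best[cell_id] = (cell_types[index], probs[index])
    return best


result = top_cell_type_per_cell(io.StringIO(toy_probability_tsv))
print(result)
```

Because CellO's predictions are hierarchical, the highest-probability column in a real table may well be a broad ancestor term, so this is not a substitute for the most_specific table.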
Usage:
cello_predict [options] input_file
Options:
-h, --help show this help message and exit
-a ALGO, --algo=ALGO Hierarchical classification algorithm to apply
(default='IR'). Must be one of: 'IR' - Isotonic
regression, 'CLR' - cascaded logistic regression
-d DATA_TYPE, --data_type=DATA_TYPE
Data type (required). Must be one of: 'TSV', 'CSV',
'10x', or 'HDF5'. Note: if 'HDF5' is used, then
arguments must be provided to the h5_cell_key,
h5_gene_key, and h5_expression_key parameters.
-c H5_CELL_KEY, --h5_cell_key=H5_CELL_KEY
The key of the dataset within the input HDF5 file
specifying which dataset stores the cell ID's. This
argument is only applicable if '-d HDF5' is used
-g H5_GENE_KEY, --h5_gene_key=H5_GENE_KEY
The key of the dataset within the input HDF5 file
specifying which dataset stores the gene names/ID's.
This argument is only applicable if '-d HDF5' is used
-e H5_EXPRESSION_KEY, --h5_expression_key=H5_EXPRESSION_KEY
The key of the dataset within the input HDF5 file
specifying which dataset stores the expression matrix.
This argument is only applicable if '-d HDF5' is used
-r, --rows_cells Use this flag if expression matrix is organized as
CELLS x GENES rather than GENES x CELLS. Not
applicable when '-d 10x' is used.
-u UNITS, --units=UNITS
Units of expression. Must be one of: 'COUNTS', 'CPM',
'LOG1_CPM', 'TPM', 'LOG1_TPM'
-s ASSAY, --assay=ASSAY
Sequencing assay. Must be one of: '3_PRIME',
'FULL_LENGTH'
-t, --train_model If the genes in the input matrix don't match what is
expected by the classifier, then train a classifier on
the input genes. The model will be saved to
<output_prefix>.model.dill
-m MODEL, --model=MODEL
Path to pretrained model file.
-l REMOVE_ANATOMICAL, --remove_anatomical=REMOVE_ANATOMICAL
A comma-separated list of terms ID's from the Uberon
Ontology specifying which tissues to use to filter
results. All cell types known to be resident to the
input tissues will be filtered from the results.
-p PRE_CLUSTERING, --pre_clustering=PRE_CLUSTERING
A TSV file with pre-clustered cells. The first column
stores the cell names/ID's (i.e. the column names of
the input expression matrix) and the second column
stores integers referring to each cluster. The TSV
file should not have column names.
-b, --ontology_term_ids
Use the less readable, but more rigorous Cell Ontology
term id's in output
-o OUTPUT_PREFIX, --output_prefix=OUTPUT_PREFIX
Prefix for all output files. This prefix may contain a
path.
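The pre-clustering file accepted by the -p option is described above as a two-column TSV without column names: cell ID first, integer cluster second. A minimal sketch of producing such a file follows; the cell IDs, cluster assignments, and the helper name `write_pre_clustering` are hypothetical.

```python
import csv
import io


def write_pre_clustering(handle, cluster_by_cell):
    """Write 'cell ID<TAB>cluster integer' rows with no header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for cell_id, cluster in cluster_by_cell.items():
        if not isinstance(cluster, int):
            raise TypeError(f"cluster for {cell_id!r} must be an integer")
        writer.writerow([cell_id, cluster])


# Hypothetical cell IDs; in practice these would be the column names
# of the input expression matrix.
clusters = {"cell_1": 0, "cell_2": 0, "cell_3": 1}

buffer = io.StringIO()
write_pre_clustering(buffer, clusters)
pre_clustering_text = buffer.getvalue()
print(pre_clustering_text)
```

Writing to an open file handle instead of the in-memory buffer would give a file that can be passed via -p.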
Notably, the input expression data's genes must match the genes expected by the trained classifier. If the genes match, then CellO will use a pre-trained classifier to classify the expression profiles (i.e. cells) in the input dataset.
To provide an example, here is how you would run CellO on a toy dataset stored in example_input/Zheng_PBMC_10x. This dataset is a set of 1,000 cells subsampled from the Zheng et al. (2017) dataset. To run CellO on this dataset, run this command:
cello_predict -d 10x -u COUNTS -s 3_PRIME example_input/Zheng_PBMC_10x -o test
Note that -o test specifies that all output files will have the prefix "test". The -d option specifies the input format, -u specifies the units of the expression matrix, and -s specifies the assay type. For a full list of available formats, units, and assay types, run:
cello_predict -h
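When CellO is driven from a script, it can help to validate the option values before launching the command. The sketch below assembles the argument list for the example above and checks each value against the choices listed in the help text. The helper `build_cello_command` is hypothetical and not part of CellO; it only builds the command and does not run it.

```python
import shlex

# Allowed values copied from the option descriptions in the help text.
DATA_TYPES = ("TSV", "CSV", "10x", "HDF5")
UNITS = ("COUNTS", "CPM", "LOG1_CPM", "TPM", "LOG1_TPM")
ASSAYS = ("3_PRIME", "FULL_LENGTH")


def build_cello_command(input_file, output_prefix, data_type, units, assay):
    """Return an argv list for cello_predict (hypothetical helper)."""
    checks = (
        ("data_type", data_type, DATA_TYPES),
        ("units", units, UNITS),
        ("assay", assay, ASSAYS),
    )
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    return [
        "cello_predict",
        "-d", data_type,
        "-u", units,
        "-s", assay,
        input_file,
        "-o", output_prefix,
    ]


command = build_cello_command(
    "example_input/Zheng_PBMC_10x", "test", "10x", "COUNTS", "3_PRIME"
)
print(shlex.join(command))
```

The resulting list could then be handed to `subprocess.run(command, check=True)` on a machine where CellO is installed.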
If the genes in the input file do not match the genes on which the model was trained, CellO can be told to train a classifier with only those genes included in the given input dataset by using the -t flag. The trained model will be saved to a file named <output_prefix>.model.dill where <output_prefix> is the output-prefix argument provided via the -o option. Training CellO usually takes under an hour.
For example, to train a model and run CellO on the file example_input/LX653_tumor.tsv, run the command:
cello_predict -u COUNTS -s 3_PRIME -t -o test example_input/LX653_tumor.tsv
Along with the classification results, this command will output a file test.model.dill.
Training a model on a new gene set needs to be done only once (see previous section). For example, to run CellO on example_input/LX653_tumor.tsv using a specific model stored in a file, run:
cello_predict -u COUNTS -s 3_PRIME -m test.model.dill -o test example_input/LX653_tumor.tsv
Note that -m test.model.dill tells CellO to use the model computed in the previous example.
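Since training needs to happen only once, a wrapper script might pick between -t and -m depending on whether a model file already exists. The following is a minimal sketch of that decision, assuming the input dataset is one that needs a custom model. The helper `model_arguments` is hypothetical, and the empty file created here is only a stand-in for a real trained model.

```python
import os
import tempfile


def model_arguments(output_prefix):
    """Return ['-m', path] if a saved model exists for this prefix, else ['-t']."""
    model_path = output_prefix + ".model.dill"
    if os.path.exists(model_path):
        return ["-m", model_path]
    return ["-t"]


work_dir = tempfile.mkdtemp()
prefix = os.path.join(work_dir, "test")

# No model file yet, so the sketch asks for training.
first_run = model_arguments(prefix)

# Stand-in for the file that a real training run would write.
open(prefix + ".model.dill", "wb").close()

# The model file now exists, so the sketch reuses it.
second_run = model_arguments(prefix)
print(first_run, second_run)
```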
We provide a command-line tool for quantifying raw reads with Kallisto. Note that to run this script, Kallisto must be installed and available in your PATH environment variable. This script will output an expression profile that includes all of the genes that CellO is expecting; thus, expression profiles created with this script are automatically compatible with CellO.
This script requires a preprocessed Kallisto reference. To download the pre-built Kallisto reference that is compatible with CellO, run the command:
bash download_kallisto_reference.sh
This command will download a directory called kallisto_reference into the current directory. To run Kallisto on a set of FASTQ files, run the command:
cello_quantify_sample <comma_delimited_fastq_files> <tmp_dir> -o <kallisto_output_file>
where <comma_delimited_fastq_files> is a comma-delimited set of FASTQ files containing all of the reads for a single RNA-seq sample and <tmp_dir> is the location where Kallisto will store its output files. The file <kallisto_output_file> is a tab-separated-value table of the log(TPM+1) values that can be fed directly to CellO. To run CellO on this output file, run:
cello_predict -u LOG1_TPM -s FULL_LENGTH <kallisto_output_file> -o <cell_output_prefix>
Note that the above command assumes that the assay is a full-length assay (meaning reads can originate from the full length of the transcript). If this is a 3-prime assay (reads originate from only the 3'-end of the transcript), then -s FULL_LENGTH should be replaced with -s 3_PRIME in the above command.
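For reference, the log(TPM+1) transform behind the LOG1_TPM unit can be sketched in a few lines. This example assumes the natural logarithm, which is an assumption on our part rather than something stated in this document, and the helper `log1_tpm` and the toy values are hypothetical.

```python
import math


def log1_tpm(tpm_values):
    """Apply log(TPM + 1) to each value, assuming the natural logarithm."""
    transformed = []
    for value in tpm_values:
        if value < 0:
            raise ValueError("TPM values must be non-negative")
        # log1p(x) computes log(1 + x) accurately for small x.
        transformed.append(math.log1p(value))
    return transformed


toy_tpm = [0.0, 1.0, 99.0]
transformed = log1_tpm(toy_tpm)
print(transformed)
```

A TPM of zero maps to zero under this transform, so genes with no expression stay at zero.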
If, upon running pip install cello-classify, you receive an error installing Cython that looks like:
ERROR: Command errored out with exit status 1:
command: /scratch/cdewey/test_cello/CellO-master/cello_env/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-wo2dj5q7/quadprog/setup.py'"'"'; __file__='"'"'/tmp/pip-install-wo2dj5q7/quadprog/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
cwd: /tmp/pip-install-wo2dj5q7/quadprog/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-wo2dj5q7/quadprog/setup.py", line 17, in <module>
from Cython.Build import cythonize
ModuleNotFoundError: No module named 'Cython'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
then you may try upgrading to the latest version of pip and Cython by running:
python -m pip install --upgrade pip
pip install --upgrade cython