Higashi Usage
All customizable parameters are stored in a JSON config file. The path to this JSON config file will be needed in Step 3.
For all parameters below, a parameter marked as Optional can be left out when it is not applicable.
params | Type | Required/Optional | description | example |
---|---|---|---|---|
config_name | str | Required if you will be using Higashi-vis, otherwise Optional | Name of this configuration, will be used in visualization tool | "sn-m3C-seq-with_meth" |
data_dir | str | Required | Directory where the data are stored | "/sn-m3C-seq" |
input_format | str | Optional | How the data are stored. Can either be "higashi_v1" or "higashi_v2". "higashi_v1" stands for storing the scHi-C dataset as one big table named as data.txt. "higashi_v2" stands for storing contact pairs as individual tables for each cell, and list the path to these files in the filelist.txt | "higashi_v1" |
header_included | bool | Required when input_format = "higashi_v2" | Whether each table includes a header row | true |
contact_header | list | Required when input_format = "higashi_v2" and header_included is false | The header of the contact pairs. Must include ["chrom1", "pos1", "chrom2", "pos2"]. When "count" is not included, the program assumes count=1 for all contact pairs | ["chrom1", "pos1", "chrom2", "pos2", "count"] |
structured | bool | Required | Whether the data.txt file is structured, i.e. the interaction pairs of each cell occupy consecutive rows in the dataframe rather than being randomly placed. Organizing data.txt this way beforehand can save a lot of memory and processing time | true |
temp_dir | str | Required | Directory where the temporary files will be stored. An empty folder will be created if it doesn't exist. | "../Temp/sn-m3C_1Mb" |
genome_reference_path | str | Required | Path of the genome reference file from the UCSC Genome Browser, will be used to generate bin nodes | "../hg19.chrom.sizes.txt" |
cytoband_path | str | Required | Path of the cytoband reference file from the UCSC Genome Browser, will be used to remove centromere regions | "../cytoBand_hg19.txt" |
coassay | bool | Optional | Using co-assayed signals or not | true |
coassay_signal | str | Optional | Name of the co-assayed signals in the hdf5 file to use (can be empty) | "meth_cg-100kb-cg_rate" |
batch_id | str | Optional | Name of the batch id information stored in label_info.pickle. The corresponding information is used to remove batch effects | "batch id" |
library_id | str | Optional | Similar to batch_id. The difference is that batch_id assumes the cell type composition of different batches is similar, while library_id does not make that assumption (e.g. Ramani et al. and 4DN sci-Hi-C) | "batch id" |
bulk_path | str | Optional | Path of the bulk Hi-C file (mcool format), can be used when calculating the projection matrix for scA/B | "/bulkHiC/4DNFIYGPDLKF_C28.mcool" |
Note: It is recommended to check if there are strong batch effects in the dataset in the first place before using the batch effects removal function of Higashi.
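As an illustration, the input-related part of the config could be assembled and written from Python as in the sketch below. The values are the examples from the table above. The helper names (`missing_required`, `write_config`), the choice of which keys to check, and the output file name are hypothetical and not part of Higashi.

```python
import json
import tempfile
from pathlib import Path

# Example values copied from the parameter table; replace with your own paths.
config = {
    "config_name": "sn-m3C-seq-with_meth",
    "data_dir": "/sn-m3C-seq",
    "input_format": "higashi_v1",
    "structured": True,  # serialized as JSON `true`
    "temp_dir": "../Temp/sn-m3C_1Mb",
    "genome_reference_path": "../hg19.chrom.sizes.txt",
    "cytoband_path": "../cytoBand_hg19.txt",
    "coassay": True,
    "coassay_signal": "meth_cg-100kb-cg_rate",
}

# Parameters marked "Required" without conditions in the table above.
REQUIRED_INPUT_KEYS = [
    "data_dir",
    "structured",
    "temp_dir",
    "genome_reference_path",
    "cytoband_path",
]


def missing_required(cfg, required=REQUIRED_INPUT_KEYS):
    """Return the required keys that are absent from cfg (hypothetical helper)."""
    return [key for key in required if key not in cfg]


def write_config(cfg, path):
    """Write cfg as indented JSON and return the path (hypothetical helper)."""
    path = Path(path)
    path.write_text(json.dumps(cfg, indent=2))
    return path


# Write to a temporary directory so the sketch has no lasting side effects.
out_path = write_config(config, Path(tempfile.mkdtemp()) / "config_example.json")
print(missing_required(config), out_path)
```

The remaining parameter groups described below would go into the same JSON object before the file is passed to the scripts with `-c`.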
params | Type | Required/Optional | description | example |
---|---|---|---|---|
chrom_list | list | Required | List of chromosomes to train the model on. The naming convention should be the same as in data.txt and the genome_reference file | ["chr1", "chr2", "chr3", "chr4", "chr5"] |
resolution | int | Required | Resolution for imputation. | 1000000 |
resolution_cell | int | Required | Resolution for generating attributes of the cell nodes. We recommend 1Mb (data with lower coverage per cell) or 500Kb (data with higher coverage per cell). | 1000000 |
local_transfer_range | int | Required | Number of neighboring bins in 1D genomic distance to consider during imputation (similar to the window size of linear convolution) | 1 |
dimensions | int | Required | Embedding dimensions | 64 |
loss_mode | str | Required | Training objective. Can be classification, rank, or zinb (zero-inflated negative binomial, recommended) | "zinb" |
rank_thres | int | Required | Pairs whose ground truth values differ by more than rank_thres are considered to have a stable order. | 1 |
embedding_epoch | int | Optional | Number of epochs to train to generate embeddings. When this parameter is not included, Higashi trains for 60 epochs in this stage by default. | 80 |
no_nbr_epoch | int | Optional | Number of epochs to train Higashi without neighbor information. When this parameter is not included, Higashi trains for 45 epochs in this stage by default. | 80 |
with_nbr_epoch | int | Optional | Number of epochs to train Higashi with neighbor information. When this parameter is not included, Higashi trains for 30 epochs in this stage by default. | 60 |
Note: Higashi needs a different number of epochs to converge on different datasets. All datasets we tested in the paper took fewer than 60 epochs. Also, Higashi saves trained embeddings every epoch (the location can be found here). When the embeddings give satisfying results, feel free to stop the Higashi program and then start it again with the option -s 2 (see the detailed explanation of this option in Step 3). Higashi will load the trained model from last time and continue training, which saves time.
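The default epoch counts and the allowed loss_mode values from the table can be summarized in a small sketch. The function `resolve_training_params` is a hypothetical helper for checking a config before a run; it is not how Higashi itself parses the file.

```python
# Defaults stated in the table for the three optional epoch parameters.
DEFAULT_EPOCHS = {
    "embedding_epoch": 60,
    "no_nbr_epoch": 45,
    "with_nbr_epoch": 30,
}

# Values of loss_mode listed in the table.
VALID_LOSS_MODES = {"classification", "rank", "zinb"}


def resolve_training_params(cfg):
    """Return a copy of cfg with missing epoch parameters set to the documented defaults.

    Hypothetical helper: raises ValueError when loss_mode is not one of the
    three documented values.
    """
    if cfg.get("loss_mode") not in VALID_LOSS_MODES:
        raise ValueError(f"loss_mode must be one of {sorted(VALID_LOSS_MODES)}")
    resolved = dict(cfg)
    for key, default in DEFAULT_EPOCHS.items():
        resolved.setdefault(key, default)
    return resolved


params = resolve_training_params({"loss_mode": "zinb", "embedding_epoch": 80})
print(params["embedding_epoch"], params["no_nbr_epoch"], params["with_nbr_epoch"])
```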
params | Type | Required/Optional | description | example |
---|---|---|---|---|
embedding_name | str | Required | Name of embedding vectors to store | "exp1" |
impute_list | list | Required | List of chromosomes to impute (must appear in the chrom_list above) | ["chr1"] |
minimum_distance | int | Required | Minimum genomic distance between a pair of genomic bins to impute (bp) | 1000000 |
maximum_distance | int | Required | Maximum genomic distance between a pair of genomic bins to impute (bp, -1 represents no constraint) | -1 |
neighbor_num | int | Required | Number of neighboring cells to incorporate when making imputation, the hyperparameter k in the manuscript | 5 |
correct_be_impute | bool | Optional | Whether to take batch effects into account and try to remove them when imputing. When set as true, the batch_id parameter must be included. | false |
impute_verbose | int | Optional | Verbosity level of imputation process. When set as a positive int | 10 |
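The constraints stated in this table (impute_list must be a subset of chrom_list, and correct_be_impute requires batch_id) can be checked before launching a run. The sketch below is a hypothetical pre-flight check, not part of Higashi. The comparison between minimum_distance and maximum_distance is an extra sanity check assumed here and is not stated in the table.

```python
def check_imputation_params(cfg):
    """Return a list of human-readable problems found in cfg (hypothetical helper)."""
    problems = []

    # impute_list must only contain chromosomes that also appear in chrom_list.
    unknown = sorted(set(cfg["impute_list"]) - set(cfg["chrom_list"]))
    if unknown:
        problems.append(f"impute_list contains chromosomes not in chrom_list: {unknown}")

    # Assumed sanity check: -1 means no upper constraint, otherwise max >= min.
    max_dist = cfg["maximum_distance"]
    if max_dist != -1 and max_dist < cfg["minimum_distance"]:
        problems.append("maximum_distance is smaller than minimum_distance")

    # correct_be_impute = true requires batch_id to be present.
    if cfg.get("correct_be_impute", False) and "batch_id" not in cfg:
        problems.append("correct_be_impute is true but batch_id is missing")

    return problems


example = {
    "chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5"],
    "impute_list": ["chr1"],
    "minimum_distance": 1000000,
    "maximum_distance": -1,
    "neighbor_num": 5,
    "correct_be_impute": False,
}
print(check_imputation_params(example))
```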
params | Type | Required/Optional | description | example |
---|---|---|---|---|
cpu_num | int | Required | Higashi is optimized for multiprocessing. Use this parameter to limit the number of cores to use. -1 means use all available CPUs. | -1 |
gpu_num | int | Required | Higashi is optimized to utilize multiple GPUs for computational efficiency. Higashi won't use all of these GPUs all the time. For co-assayed data, it uses multiple GPUs in the processing step. For all data, Higashi trains and imputes scHi-C on different GPUs for computational efficiency. This parameter should be non-negative. | 8 |
Note: cpu_num and gpu_num do not necessarily correspond to the physical number of CPU cores or GPU cards. They refer to how many parallel threads are used.
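To make the meaning of cpu_num = -1 concrete, the sketch below resolves the setting to a worker count. Using `os.cpu_count()` for "all available CPUs" is an assumption made for illustration; how Higashi resolves -1 internally is not described here, and `resolve_cpu_num` is a hypothetical helper.

```python
import os


def resolve_cpu_num(cpu_num):
    """Translate the cpu_num setting into a concrete number of parallel workers.

    Assumption: -1 maps to os.cpu_count(); any other value must be positive.
    """
    if cpu_num == -1:
        # os.cpu_count() may return None on unusual platforms; fall back to 1.
        return os.cpu_count() or 1
    if cpu_num < 1:
        raise ValueError("cpu_num must be -1 or a positive integer")
    return cpu_num


print(resolve_cpu_num(-1), resolve_cpu_num(4))
```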
params | Type | Required/Optional | description | example |
---|---|---|---|---|
UMAP_params | dict | Optional | Parameters that will be passed to Higashi-vis. Higashi-vis will use these parameters when calculating the UMAP visualization. Follow the naming convention of the umap package | {"n_neighbors": 30, "min_dist": 0.3} |
TSNE_params | dict | Optional | Similar to UMAP_params. Follow the naming convention of tsne in sklearn | {"n_neighbors": 15} |
random_walk | bool | Optional | Whether to run linear_convolution and randomwalk-with-restart during processing for visualization. Code adapted from scHiCluster. Not recommended when the resolution goes higher than 100Kb. When not included, it defaults to false. | false |
vis_palette | dict | Optional | Custom palette for a specific label_info. | {"cluster label": {"L23": "#e51f4e", "L4": "#45af4b", "L5": "#ffe011", "L6": "#0081cc", "Ndnf": "#ff7f35", "Vip": "#951eb7", "Pvalb": "#4febee"}} |
Run the following commands to process the input data.
cd higashi
python Process.py [-c CONFIG]
'
required arguments:
-c CONFIG The path to the configuration JSON file that you created in step 2
'
This script will finish the following tasks:
- generate a dictionary that'll map genomic bin loci to the node id.
- extract data from the data.txt and turn that into the format of hyperedges (triplets)
- create contact maps based on sparse scHi-C for visualization, baseline model, and generate node attributes
- run linear convolution + random-walk-with-restart (scHiCluster) to impute the contact maps as baseline and visualization
- generate node attributes
- (Optional) process co-assayed signals
Before each step is executed, a message is printed indicating the progress, which helps with debugging.
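The first two tasks in the list above (mapping genomic bin loci to node ids and turning contact pairs into hyperedges) can be illustrated with a simplified sketch. This is not the code in Process.py. The input layout (a cell id followed by the chrom1, pos1, chrom2, pos2, count columns), the id numbering, and both function names are assumptions for illustration only.

```python
import math


def build_bin_index(chrom_sizes, chrom_list, resolution):
    """Assign each chromosome a starting node id so that bins get consecutive ids.

    Returns (offsets, total_bins), where offsets[chrom] is the id of the
    first bin of that chromosome.
    """
    offsets = {}
    next_id = 0
    for chrom in chrom_list:
        offsets[chrom] = next_id
        next_id += math.ceil(chrom_sizes[chrom] / resolution)
    return offsets, next_id


def contacts_to_hyperedges(contacts, offsets, resolution):
    """Convert contact rows into (cell, bin_a, bin_b, count) tuples with bin_a <= bin_b.

    Rows on chromosomes that are not in the index are skipped.
    """
    hyperedges = []
    for cell_id, chrom1, pos1, chrom2, pos2, count in contacts:
        if chrom1 not in offsets or chrom2 not in offsets:
            continue
        bin1 = offsets[chrom1] + pos1 // resolution
        bin2 = offsets[chrom2] + pos2 // resolution
        hyperedges.append((cell_id, min(bin1, bin2), max(bin1, bin2), count))
    return hyperedges


# Toy data: chromosome sizes and contacts are made up for the example.
sizes = {"chr1": 2_500_000, "chr2": 1_000_000}
offsets, total_bins = build_bin_index(sizes, ["chr1", "chr2"], 1_000_000)
rows = [
    (0, "chr1", 1_200_000, "chr1", 2_400_000, 1),
    (1, "chr2", 500_000, "chr1", 100_000, 2),
    (1, "chrM", 10, "chr1", 20, 1),  # skipped: chrM is not in the index
]
print(contacts_to_hyperedges(rows, offsets, 1_000_000))
```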
python main_cell.py [-c CONFIG] [-s START]
'
optional arguments:
-s {1,2,3} The start step of Higashi program. Can be used to continue Higashi
training if interrupted before. 1,2,3 stands for the following steps:
1. Train Higashi without cell-dependent GNN to force self-attention layers
to capture the heterogeneity of chromatin structures
2. Train Higashi with cell-dependent GNN, but with k=0
3. Train Higashi with cell-dependent GNN, but with k=`neighbor_num` in the
config JSON. When set as 1, the program would execute step 1,2,3 sequentially.
When set as 2, the program would execute step 2,3 sequentially. (default: 1)
required arguments:
-c CONFIG The path to the configuration JSON file that you created in the step 2
'
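The effect of the -s option can be sketched as follows. The help text states the behavior for -s 1 and -s 2. That -s 3 runs only step 3 is an extrapolation assumed here. Both helpers are hypothetical and only assemble the command line shown above.

```python
def steps_to_run(start):
    """Return the training steps executed for a given -s value.

    Assumption: -s 3 runs step 3 alone, by analogy with -s 1 and -s 2.
    """
    if start not in (1, 2, 3):
        raise ValueError("start must be 1, 2, or 3")
    return list(range(start, 4))


def build_command(config_path, start=1):
    """Build the argument list for main_cell.py (hypothetical helper)."""
    steps_to_run(start)  # validates the start value
    return ["python", "main_cell.py", "-c", str(config_path), "-s", str(start)]


print(steps_to_run(2))
print(build_command("config.json", start=2))
```

The returned list could be passed to `subprocess.run` from inside the higashi directory to resume an interrupted run.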
**Extra notes:**
Higashi saves the model parameters and embeddings every 5 epochs, so the user can check whether the embeddings look good during training. For instance, suppose the user is not sure how many epochs Higashi needs to converge on a new dataset and sets embedding_epoch to 120 to be on the safe side. During training, the user finds that the embeddings converge at around epoch 58. Instead of waiting for all 120 epochs to finish, one can wait until the model finishes epoch 60 (as the model saves parameters every 5 epochs) and then interrupt the Higashi program. The user can then restart Higashi with the option -s 2 to load the pre-trained model and skip the first embedding generation training stage.
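The arithmetic in this example (convergence at epoch 58, checkpoint every 5 epochs, so wait until epoch 60) amounts to rounding up to the next multiple of the save interval. A minimal sketch, with a hypothetical function name:

```python
import math


def next_checkpoint(epoch, save_every=5):
    """Return the first epoch at or after `epoch` at which a checkpoint is saved."""
    return math.ceil(epoch / save_every) * save_every


print(next_checkpoint(58))
```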
The runtime analysis was carried out on a Linux machine with 8 NVIDIA RTX 2080 Ti GPU cards, a 16-core Intel Xeon Silver 4110 CPU, and 252GB memory. The batch size is set as 192. Since the number of hyperedges varies across different datasets, we use the operation time per 1000 batches as the unit for measuring the runtime. For simplicity, we refer to 1000 batches as one epoch, which is different from the conventional definition where one iteration over the whole training dataset is one epoch. For the reported runtime, the Higashi program is set to use all available CPU cores and one GPU card during training. It is also set to not use parallel imputation although for smaller datasets one GPU card could fit multiple Higashi models. The runtime of the core operations of Higashi is reported in the table here.
Operation | Average runtime |
---|---|
Training without cd-GNN | 61.3s / epoch |
Training with cd-GNN ( | 95.2s / epoch |
Training with cd-GNN ( | 109.6s / epoch |
Training with cd-GNN ( | 221.4s / epoch |
Imputation (1Mb resolution, hg38, autosomal chromosomes) | 0.2s / cell |
Imputation (500Kb resolution, hg38, autosomal chromosomes) | 0.8s / cell |
Imputation (100Kb resolution, hg38, autosomal chromosomes) | 21.3s / cell |
Imputation (50Kb resolution, hg38, autosomal chromosomes) | 76.8s / cell |
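The table can be used for a rough runtime estimate on comparable hardware. The sketch below multiplies the reported averages by an epoch count and a cell count. The 4000-cell dataset is a made-up example, and the function names are hypothetical. Note that "epoch" here means 1000 batches, as defined above.

```python
# Averages copied from the runtime table above.
SECONDS_PER_EPOCH_NO_CDGNN = 61.3
IMPUTE_SECONDS_PER_CELL = {
    1_000_000: 0.2,
    500_000: 0.8,
    100_000: 21.3,
    50_000: 76.8,
}


def estimate_training_seconds(n_epochs, seconds_per_epoch):
    """Estimated training time, with one epoch defined as 1000 batches."""
    return n_epochs * seconds_per_epoch


def estimate_imputation_seconds(n_cells, resolution):
    """Estimated imputation time at one of the resolutions listed in the table."""
    return n_cells * IMPUTE_SECONDS_PER_CELL[resolution]


# 60 epochs without cd-GNN, then imputing a hypothetical 4000-cell dataset at 100Kb.
train_s = estimate_training_seconds(60, SECONDS_PER_EPOCH_NO_CDGNN)
impute_s = estimate_imputation_seconds(4000, 100_000)
print(f"training: {train_s / 3600:.2f} h, imputation: {impute_s / 3600:.2f} h")
```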