
Higashi Usage


Step 1: Configure the parameters

All customizable parameters are stored in a JSON config file. An example config file can be found in config_dir/example.JSON. The path to this JSON config file will be needed in Steps 2 and 3.

Input data related parameters

| params | Type | Description | Example |
| --- | --- | --- | --- |
| config_name | str | Name of this configuration; will be used in the visualization tool | "sn-m3C-seq-with_meth" |
| data_dir | str | Directory where the data are stored | "/sn-m3C-seq" |
| structured | bool | Whether the data.txt file is structured (interaction pairs of a cell are consecutive in the dataframe rather than randomly placed). If data.txt is organized this way beforehand, it saves a lot of memory and processing time | true |
| temp_dir | str | Directory where temporary files will be stored. An empty folder will be created if it doesn't exist | "../Temp/sn-m3C_1Mb" |
| genome_reference_path | str | Path of the genome reference file from the UCSC Genome Browser; will be used to generate bin nodes | "../hg19.chrom.sizes.txt" |
| cytoband_path | str | Path of the cytoband reference file from the UCSC Genome Browser; will be used to remove centromere regions | "../cytoBand_hg19.txt" |
| coassay | bool | Whether to use co-assayed signals | true |
| coassay_signal | str | Name of the co-assayed signal in the hdf5 file to use (can be empty) | "meth_cg-100kb-cg_rate" |
| batch_id | str | Optional. Name of the batch id information stored in label_info.pickle. The corresponding information will be used to remove batch effects | "batch id" |
| library_id | str | Optional. Similar to batch_id, except that batch_id assumes the cell type composition of different batches is similar, while library_id makes no such assumption (e.g., Ramani et al. and 4DN sci-Hi-C) | "batch id" |

Note: It is recommended to check whether there are strong batch effects in the dataset before using Higashi's batch effect removal function.
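
For illustration, batch_id (and library_id) refer to a key in the label dictionary saved as label_info.pickle. A minimal sketch of writing such a file is shown below; the dictionary layout (one per-cell list per label name) and the concrete labels are assumptions for this example, so adapt them to how your label file is actually prepared.

```python
import pickle

# Hypothetical label dictionary: each key maps to a per-cell list of labels,
# ordered consistently with the cell ids used in data.txt.
label_info = {
    "cell type": ["GM12878", "GM12878", "K562"],
    "batch id": ["batch1", "batch1", "batch2"],  # referenced by "batch_id" in the config
}

# data_dir from the table above ("/sn-m3C-seq") is used here only as an example location.
with open("/sn-m3C-seq/label_info.pickle", "wb") as f:
    pickle.dump(label_info, f)
```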

Training process related parameters

| params | Type | Description | Example |
| --- | --- | --- | --- |
| chrom_list | list | List of chromosomes to train the model on. The naming convention should match data.txt and the genome reference file | ["chr1", "chr2", "chr3", "chr4", "chr5"] |
| resolution | int | Resolution for imputation (bp) | 1000000 |
| resolution_cell | int | Resolution for generating attributes of the cell nodes. Recommended: 1Mb (data with lower coverage per cell) or 500Kb (data with higher coverage per cell) | 1000000 |
| local_transfer_range | int | Number of neighboring bins in 1D genomic distance to consider during imputation (similar to the window size of a linear convolution) | 1 |
| dimensions | int | Embedding dimensions | 64 |
| loss_mode | str | Whether to train the model with a classification or a ranking loss (can be either "classification" or "rank") | "rank" |
| rank_thres | int | Differences of ground truth values larger than rank_thres are considered a stable order | 1 |
| embedding_epoch | int | Number of epochs to train to generate embeddings. Defaults to 60 when this parameter is not included | 60 |
| no_nbr_epoch | int | Number of epochs to train Higashi without neighbor information. Defaults to 45 when this parameter is not included | 45 |
| with_nbr_epoch | int | Number of epochs to train Higashi with neighbor information. Defaults to 30 when this parameter is not included | 30 |

Note: The number of epochs Higashi needs to converge varies across datasets. All datasets we tested in the paper take fewer than 60 epochs. Also, Higashi saves the trained embeddings every epoch (the location can be found here). When you see that the embeddings give satisfying results, feel free to stop the Higashi program, and then start it again with the option -s 2 (see the detailed explanation of this option in Step 3). Higashi will load the previously trained model and continue training to save time.

Output related parameters

| params | Type | Description | Example |
| --- | --- | --- | --- |
| embedding_name | str | Name under which the embedding vectors will be stored | "exp1" |
| impute_list | list | List of chromosomes to impute (must appear in the chrom_list above) | ["chr1"] |
| minimum_distance | int | Minimum genomic distance between a pair of genomic bins to impute (bp) | 1000000 |
| maximum_distance | int | Maximum genomic distance between a pair of genomic bins to impute (bp; -1 represents no constraint) | -1 |
| neighbor_num | int | Number of neighboring cells to incorporate when making the imputation | 5 |
| impute_no_nbr | bool | Whether to impute the contact maps without borrowing neighbor information | true |
| impute_with_nbr | bool | Whether to impute the contact maps with neighbor information borrowed | true |

Computational resources related parameters

| params | Type | Description | Example |
| --- | --- | --- | --- |
| cpu_num | int | Higashi is optimized for multiprocessing. Limit the number of cores to use with this parameter; -1 means use all available CPUs | -1 |
| gpu_num | int | Higashi is optimized to utilize multiple GPUs for computational efficiency. It will not use all of these GPUs the whole time: for co-assayed data, multiple GPUs are used in the processing step, and for all data Higashi trains and imputes scHi-C on different GPUs. This parameter should be non-negative | 8 |

Visualization related parameters

| params | Type | Description | Example |
| --- | --- | --- | --- |
| UMAP_params | dict | Parameters passed to Higashi-vis for calculating the UMAP visualization. Follow the naming convention of the umap package | {"n_neighbors": 30, "min_dist": 0.3} |
| TSNE_params | dict | Similar to UMAP_params. Follow the naming convention of t-SNE in sklearn | {"n_neighbors": 15} |
| random_walk | bool | Whether to run linear convolution and random-walk-with-restart in the processing step for visualization. Code adapted from scHiCluster. Not recommended when the resolution is finer than 100Kb | false |
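
Putting the tables together, a minimal configuration might look like the sketch below. All values are taken from the example columns above; the exact set of required keys may differ, so treat config_dir/example.JSON as the authoritative template.

```json
{
    "config_name": "sn-m3C-seq-with_meth",
    "data_dir": "/sn-m3C-seq",
    "structured": true,
    "temp_dir": "../Temp/sn-m3C_1Mb",
    "genome_reference_path": "../hg19.chrom.sizes.txt",
    "cytoband_path": "../cytoBand_hg19.txt",
    "coassay": true,
    "coassay_signal": "meth_cg-100kb-cg_rate",
    "chrom_list": ["chr1", "chr2", "chr3", "chr4", "chr5"],
    "resolution": 1000000,
    "resolution_cell": 1000000,
    "local_transfer_range": 1,
    "dimensions": 64,
    "loss_mode": "rank",
    "rank_thres": 1,
    "embedding_name": "exp1",
    "impute_list": ["chr1"],
    "minimum_distance": 1000000,
    "maximum_distance": -1,
    "neighbor_num": 5,
    "impute_no_nbr": true,
    "impute_with_nbr": true,
    "cpu_num": -1,
    "gpu_num": 8,
    "UMAP_params": {"n_neighbors": 30, "min_dist": 0.3},
    "TSNE_params": {"n_neighbors": 15},
    "random_walk": false
}
```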

Step 2: Data processing

Run the following commands to process the input data.

cd Code
python Process.py -c {CONFIG_PATH}

Fill in {CONFIG_PATH} with the path to the configuration JSON file that you created in Step 1. This script will perform the following tasks:

  • generate a dictionary that maps genomic bin loci to node ids
  • extract data from data.txt and turn it into the hyperedge (triplet) format (see the sketch at the end of this step)
  • create contact maps based on the sparse scHi-C data for visualization, the baseline model, and generating node attributes
  • run linear convolution + random-walk-with-restart (scHiCluster) to impute the contact maps as a baseline and for visualization
  • generate node attributes
  • (Optional) process co-assayed signals

Before each step is executed, a message is printed indicating the progress, which helps with debugging.
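
Purely as an illustration of the hyperedge (triplet) format mentioned above, the sketch below maps one contact record onto a (cell, bin i, bin j) triplet. The column names follow the data.txt format, while the bin-to-node mapping and the format Higashi actually writes to temp_dir are simplifications assumed here.

```python
# Illustrative only: map one scHi-C contact record to a hyperedge triplet.
resolution = 1000000  # should match "resolution" in the config

# One row of data.txt: cell_id, chrom1, pos1, chrom2, pos2, count
record = {"cell_id": 0, "chrom1": "chr1", "pos1": 1203456,
          "chrom2": "chr1", "pos2": 5407890, "count": 2}

# Simplified bin-to-node lookup; Higashi builds a genome-wide
# bin-loci-to-node-id dictionary instead (the first task in the list above).
def bin_node(chrom, pos):
    return (chrom, pos // resolution)

hyperedge = (record["cell_id"],
             bin_node(record["chrom1"], record["pos1"]),
             bin_node(record["chrom2"], record["pos2"]))
print(hyperedge, "weight:", record["count"])
# -> (0, ('chr1', 1), ('chr1', 5)) weight: 2
```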

Step 3: Train the Higashi model

python main_cell.py -c {CONFIG_PATH} -s {START_STEP}

Fill in {CONFIG_PATH} with the path to the configuration JSON file that you created in Step 1. Fill in {START_STEP} with 1, 2, or 3, which correspond to the following steps:

  1. Train Higashi without the cell-dependent GNN to force the self-attention layers to capture the heterogeneity of chromatin structures
  2. Train Higashi with the cell-dependent GNN, but with k=0
  3. Train Higashi with the cell-dependent GNN, with k equal to the value specified in the configuration JSON

When {START_STEP} is 1, the program executes steps 1, 2, and 3 sequentially.
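
For example, if you stopped training once the saved embeddings already looked satisfying (see the note under the training parameters), you could resume from the saved model and skip the first stage with something like the following (the config path here is illustrative):

cd Code
python main_cell.py -c ../config_dir/example.JSON -s 2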