Update docs and tutorials for hictk
[no ci]
robomics committed Sep 27, 2024
1 parent cedf5a9 commit fde7770
Showing 14 changed files with 435 additions and 469 deletions.
28 changes: 15 additions & 13 deletions docs/balancing_matrices.rst
@@ -27,20 +27,22 @@ The following is an example showing how to balance a .cool file using ICE.
user@dev:/tmp$ hictk balance ice 4DNFIZ1ZVXC8.mcool::/resolutions/1000
[2023-10-01 13:18:02.119] [info]: Running hictk v0.0.2-f83f93e
[2023-10-01 13:18:02.130] [info]: Writing interactions to temporary file /tmp/4DNFIZ1ZVXC8.tmp0...
[2023-10-01 13:18:05.098] [info]: Initializing bias vector...
[2023-10-01 13:18:05.099] [info]: Masking rows with fewer than 10 nnz entries...
[2023-10-01 13:18:06.298] [info]: Masking rows using mad_max=5...
[2023-10-01 13:18:06.971] [info]: Iteration 1: 36874560.192587376
[2023-10-01 13:18:07.634] [info]: Iteration 2: 21347543.04950776
[2023-10-01 13:18:08.307] [info]: Iteration 3: 7819314.542541969
[2024-09-26 16:02:19.731] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-26 16:02:19.731] [info]: balancing using ICE (GW_ICE)
[2024-09-26 16:02:19.734] [info]: Writing interactions to temporary file /tmp/hictk-tmp-XXXX1ZC9FF/4DNFIZ1ZVXC8.mcool.tmp...
[2024-09-26 16:02:22.480] [info]: Initializing bias vector...
[2024-09-26 16:02:22.482] [info]: Masking rows with fewer than 10 nnz entries...
[2024-09-26 16:02:23.392] [info]: Masking rows using mad_max=5...
[2024-09-26 16:02:23.860] [info]: Iteration 1: 36452362.243888594
[2024-09-26 16:02:24.327] [info]: Iteration 2: 21649057.88060747
[2024-09-26 16:02:24.792] [info]: Iteration 3: 7890065.688497526
...
[2023-10-01 13:19:20.365] [info]: Iteration 105: 2.1397932757529552e-05
[2023-10-01 13:19:21.146] [info]: Iteration 106: 1.6604770462001875e-05
[2023-10-01 13:19:21.870] [info]: Iteration 107: 1.2885285040054778e-05
[2023-10-01 13:19:22.608] [info]: Iteration 108: 9.99900768769869e-06
[2023-10-01 13:19:22.619] [info]: Writing weights to 4DNFIZ1ZVXC8.mcool::/resolutions/1000/bins/weight...
[2024-09-26 16:03:12.285] [info]: Iteration 107: 2.0533518142916073e-05
[2024-09-26 16:03:12.752] [info]: Iteration 108: 1.601698258037195e-05
[2024-09-26 16:03:13.216] [info]: Iteration 109: 1.2493901433163442e-05
[2024-09-26 16:03:13.681] [info]: Iteration 110: 9.745791018854495e-06
[2024-09-26 16:03:13.707] [info]: Writing weights to 4DNFIZ1ZVXC8.mcool::/resolutions/1000/bins/GW_ICE...
[2024-09-26 16:03:13.708] [info]: Linking weights to 4DNFIZ1ZVXC8.mcool::/resolutions/1000/bins/weight...
When balancing files in .mcool or .hic formats, all resolutions are balanced.
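
The log above prints one number per iteration, and balancing stops once that number falls below about 1e-5. As a rough illustration only, here is a minimal Python sketch of iterative correction on a tiny dense symmetric matrix. It assumes the logged value is the variance of the row sums, which is how ICE is commonly described and is not taken from hictk's source. The names ``ice_balance``, ``tol``, and ``max_iters`` are hypothetical.

```python
def ice_balance(matrix, tol=1.0e-5, max_iters=200):
    """Return (bias, variance) for a small symmetric matrix given as a list of lists.

    Illustrative sketch only: hictk works on sparse, masked data and may
    differ in the details.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]  # copy, so the input is left untouched
    bias = [1.0] * n
    variance = float("inf")
    for _ in range(max_iters):
        marg = [sum(row) for row in work]
        nonzero = [x for x in marg if x != 0]
        if not nonzero:
            break  # nothing to balance
        mean = sum(nonzero) / len(nonzero)
        # Assumed convergence metric: variance of the non-empty row sums.
        variance = sum((x - mean) ** 2 for x in nonzero) / len(nonzero)
        # Rows with no signal get a neutral scaling factor of 1.
        scale = [x / mean if x != 0 else 1.0 for x in marg]
        for i in range(n):
            for j in range(n):
                work[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
        if variance < tol:
            break
    return bias, variance


if __name__ == "__main__":
    m = [[4.0, 2.0, 1.0], [2.0, 3.0, 2.0], [1.0, 2.0, 5.0]]
    bias, variance = ice_balance(m)
    print("bias:", bias)
    print("final variance:", variance)
```

Dividing ``m[i][j]`` by ``bias[i] * bias[j]`` yields a matrix whose rows all sum to approximately the same value, which is the property the balancing weights are meant to provide.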

104 changes: 42 additions & 62 deletions docs/creating_cool_and_hic_files.rst
@@ -17,44 +17,42 @@ File requirements:
* ``dm6.chrom.sizes`` - `download <https://hgdownload.cse.ucsc.edu/goldenpath/dm6/bigZips/dm6.chrom.sizes>`__
* ``4DNFIKNWM36K.pairs.gz`` - `download <https://4dn-open-data-public.s3.amazonaws.com/fourfront-webprod/wfoutput/930ba072-05ac-4382-9a92-369517184ec7/4DNFIKNWM36K.pairs.gz>`__


Ingesting pairwise interactions into a 10kbp .cool file
-------------------------------------------------------

Loading interactions in pairs (4DN-DCIC) format into a .cool/hic file is straightforward:

.. code-block:: console
# Create a 10kbp .cool file using dm6 as reference
user@dev:/tmp$ zcat 4DNFIKNWM36K.pairs.gz | hictk load --format 4dn --assembly dm6 --bin-size 10000 dm6.chrom.sizes 4DNFIKNWM36K.10000.cool
[2024-01-23 15:15:00.520] [info]: Running hictk v0.0.6-45c36af-dirty
[2024-01-23 15:15:00.531] [info]: writing chunk #1 to intermediate file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp"...
[2024-01-23 15:15:23.762] [info]: done writing chunk #1 to tmp file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp".
[2024-01-23 15:15:23.762] [info]: writing chunk #2 to intermediate file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp"...
[2024-01-23 15:15:49.042] [info]: done writing chunk #2 to tmp file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp".
[2024-01-23 15:15:49.042] [info]: writing chunk #3 to intermediate file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp"...
[2024-01-23 15:15:49.834] [info]: done writing chunk #3 to tmp file "/tmp/4DNFIKNWM36K.10000.cool.tmp/4DNFIKNWM36K.10000.cool.tmp".
[2024-01-23 15:15:49.836] [info]: merging 3 chunks into "4DNFIKNWM36K.10000.cool"...
[2024-01-23 15:15:55.118] [info]: processing chr3L:15100000-15110000 chr3L:16230000-16240000 at 4789272 pixels/s...
[2024-01-23 15:15:59.718] [info]: ingested 119208613 interactions (18122865 nnz) in 59.197723453s!
# Create a 10kbp .hic file using dm6 as reference
user@dev:/tmp$ zcat 4DNFIKNWM36K.pairs.gz | hictk load --format 4dn --assembly dm6 --bin-size 10000 dm6.chrom.sizes 4DNFIKNWM36K.10000.hic
[2024-01-23 15:45:19.969] [info]: Running hictk v0.0.6-570037c-dirty
[2024-01-23 15:45:42.439] [info]: preprocessing chunk #1 at 452919 pixels/s...
[2024-01-23 15:46:09.182] [info]: preprocessing chunk #2 at 303750 pixels/s...
[2024-01-23 15:46:11.184] [info]: writing header at offset 0
[2024-01-23 15:46:11.184] [info]: begin writing interaction blocks to file "4DNFIKNWM36K.10000.hic"...
[2024-01-23 15:46:11.184] [info]: [10000 bp] writing pixels for chr3R:chr3R matrix at offset 50632...
[2024-01-23 15:46:13.295] [info]: [10000 bp] written 2264963 pixels for chr3R:chr3R matrix
[2024-01-23 15:46:13.295] [info]: [10000 bp] writing pixels for chr3R:chr3L matrix at offset 4235718...
[2024-01-23 15:46:14.611] [info]: [10000 bp] written 1610264 pixels for chr3R:chr3L matrix
...
[2024-01-23 15:46:44.065] [info]: [10000 bp] initializing expected value vector
[2024-01-23 15:46:50.531] [info]: [10000 bp] computing expected vector density
[2024-01-23 15:46:51.157] [info]: writing 1 expected value vectors at offset 32065110...
[2024-01-23 15:46:51.158] [info]: writing 0 normalized expected value vectors at offset 32078017...
[2024-01-23 15:46:51.194] [info]: ingested 119208613 interactions (18122865 nnz) in 91.225341628s!
user@dev:/tmp$ hictk load --format 4dn --bin-size 10000 4DNFIKNWM36K.pairs.gz 4DNFIKNWM36K.10000.cool
[2024-09-26 16:51:28.059] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-26 16:51:28.068] [info]: begin loading pairwise interactions into a .cool file...
[2024-09-26 16:51:28.137] [info]: writing chunk #1 to intermediate file "/tmp/hictk-tmp-XXXXQPdOSn/4DNFIKNWM36K.10000.cool.tmp"...
[2024-09-26 16:51:45.281] [info]: done writing chunk #1 to tmp file "/tmp/hictk-tmp-XXXXQPdOSn/4DNFIKNWM36K.10000.cool.tmp".
[2024-09-26 16:51:45.281] [info]: writing chunk #2 to intermediate file "/tmp/hictk-tmp-XXXXQPdOSn/4DNFIKNWM36K.10000.cool.tmp"...
[2024-09-26 16:52:04.969] [info]: done writing chunk #2 to tmp file "/tmp/hictk-tmp-XXXXQPdOSn/4DNFIKNWM36K.10000.cool.tmp".
[2024-09-26 16:52:04.970] [info]: merging 2 chunks into "4DNFIKNWM36K.10000.cool"...
[2024-09-26 16:52:06.430] [info]: processing chr3L:1030000-1040000 chr3R:30240000-30250000 at 6882312 pixels/s...
[2024-09-26 16:52:08.478] [info]: ingested 119208613 interactions (18122865 nnz) in 40.418916003s!
To ingest interactions into a .hic file, simply change the extension of the output file (or use the ``--output-fmt`` option).

By default, the list of chromosomes is read from the file header.
The reference genome used to build the .cool or .hic file can be provided explicitly using the ``--chrom-sizes`` option.
Note that ``--chrom-sizes`` is a mandatory option when ingesting interactions in formats other than ``--format=4dn``.
If the input file contains interactions mapping to chromosomes that are missing from the reference genome provided through ``--chrom-sizes``, the ``--drop-unknown-chroms`` flag can be used to instruct hictk to ignore those interactions.

When loading interactions using ``--format=pairs`` or ``--format=validPairs`` into a .cool file, tables of variable bins are supported.
To load interactions into a .cool file with a variable bin size, provide the table of bins using the ``--bin-table`` option.

**Tips:**

* When creating large .hic files, ``hictk`` needs to create potentially large temporary files. When this is the case, use option ``--tmpdir`` to set the temporary folder to a path with sufficient space.
* When creating large .cool/hic files, ``hictk`` needs to create potentially large temporary files. When this is the case, use option ``--tmpdir`` to set the temporary folder to a path with sufficient space.
* When loading interactions into .hic files, some of the steps can be run in parallel by increasing the number of processing threads using the ``--threads`` option.
* When loading pre-binned interactions into a .cool file, if the interactions are already sorted by genomic coordinates, the ``--assume-sorted`` option can be used to load all interactions at once, without using temporary files.
* Interaction loading performance can be improved by processing interactions in larger chunks. This can be controlled using the ``--chunk-size`` option. In fact, when ``--chunk-size`` is greater than the number of interactions to be loaded, .hic and .cool files can be created without the use of temporary files.
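
To make the relationship between "interactions" and "nnz" in the log above more concrete, the following sketch bins pairwise interactions at a fixed bin size and counts how many land in each pixel. It illustrates the general idea and is not hictk's implementation. The helper names (``build_bin_offsets``, ``bin_pairs``) are hypothetical, positions are assumed to be 1-based, and the ``drop_unknown_chroms`` parameter only mimics the behaviour described for ``--drop-unknown-chroms``.

```python
from collections import Counter


def build_bin_offsets(chrom_sizes, bin_size):
    """Map each chromosome name to the genome-wide ID of its first bin."""
    offsets = {}
    next_id = 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = next_id
        next_id += (size + bin_size - 1) // bin_size  # ceiling division
    return offsets


def bin_pairs(pairs, chrom_sizes, bin_size, drop_unknown_chroms=False):
    """Aggregate (chrom1, pos1, chrom2, pos2) records into upper-triangular pixels."""
    offsets = build_bin_offsets(chrom_sizes, bin_size)
    pixels = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 not in offsets or chrom2 not in offsets:
            if drop_unknown_chroms:
                continue  # silently skip pairs on unknown chromosomes
            raise KeyError(f"unknown chromosome in pair: {chrom1}, {chrom2}")
        # Positions are assumed to be 1-based, hence the "- 1".
        bin1 = offsets[chrom1] + (pos1 - 1) // bin_size
        bin2 = offsets[chrom2] + (pos2 - 1) // bin_size
        if bin1 > bin2:
            bin1, bin2 = bin2, bin1  # keep only the upper triangle
        pixels[(bin1, bin2)] += 1
    return dict(sorted(pixels.items()))


if __name__ == "__main__":
    sizes = {"chrA": 25000, "chrB": 12000}  # made-up toy genome
    pairs = [
        ("chrA", 1, "chrA", 10000),
        ("chrA", 10001, "chrA", 5),
        ("chrB", 11000, "chrA", 20001),
        ("chrA", 500, "chrA", 9000),
    ]
    pixels = bin_pairs(pairs, sizes, 10000)
    print(f"{len(pairs)} interactions -> {len(pixels)} nnz pixels")
    print(pixels)
```

In this toy example four interactions collapse into three non-zero pixels, in the same way that the 119208613 interactions in the log collapse into 18122865 non-zero pixels.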


Merging multiple files
@@ -66,35 +64,17 @@ Multiple .cool and .hic files using the same reference genome and resolution can
# Merge multiple cooler files
user@dev:/tmp$ hictk merge data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 -o 4DNFIZ1ZVXC8.merged.cool
[2023-09-29 19:24:49.479] [info]: Running hictk v0.0.2
[2023-09-29 19:24:49.479] [info]: begin merging 2 coolers...
[2023-09-29 19:24:52.032] [info]: processing chr2R:11267000-11268000 chr4:1052000-1053000 at 3976143 pixels/s...
[2023-09-29 19:24:55.157] [info]: processing chr3R:5812000-5813000 chr3R:23422000-23423000 at 3201024 pixels/s...
[2023-09-29 19:24:57.992] [info]: DONE! Merging 2 coolers took 8.51s!
[2023-09-29 19:24:57.992] [info]: 4DNFIZ1ZVXC8.merged.cool size: 36.23 MB
# Merge multiple .hic files
user@dev:/tmp$ hictk merge data/4DNFIZ1ZVXC8.hic9 data/4DNFIZ1ZVXC8.hic9 -o 4DNFIZ1ZVXC8.10000.merged.hic --resolution 10000
[2024-01-23 15:49:23.248] [info]: Running hictk v0.0.6-570037c-dirty
[2024-01-23 15:49:23.248] [info]: begin merging 2 .hic files...
[2024-01-23 15:49:31.101] [info]: ingesting pixels at 1352814 pixels/s...
[2024-01-23 15:49:37.777] [info]: writing header at offset 0
[2024-01-23 15:49:37.777] [info]: begin writing interaction blocks to file "4DNFIZ1ZVXC8.10000.merged.hic"...
[2024-01-23 15:49:37.777] [info]: [10000 bp] writing pixels for chr2L:chr2L matrix at offset 212...
[2024-01-23 15:49:39.060] [info]: [10000 bp] written 1433133 pixels for chr2L:chr2L matrix
[2024-01-23 15:49:39.060] [info]: [10000 bp] writing pixels for chr2L:chr2R matrix at offset 2619165...
...
[2024-01-23 15:49:58.624] [info]: [10000 bp] initializing expected value vector
[2024-01-23 15:50:05.276] [info]: [10000 bp] computing expected vector density
[2024-01-23 15:50:05.276] [info]: writing 1 expected value vectors at offset 31936601...
[2024-01-23 15:50:05.276] [info]: writing 0 normalized expected value vectors at offset 31949508...
[2024-01-23 15:50:05.299] [info]: DONE! Merging 2 files took 42.05s!
[2024-01-23 15:50:05.299] [info]: 4DNFIZ1ZVXC8.10000.merged.hic size: 31.95 MB
user@dev:/tmp$ hictk merge 4DNFIZ1ZVXC8.mcool::/resolutions/10000 4DNFIZ1ZVXC8.mcool::/resolutions/10000 -o 4DNFIZ1ZVXC8.merged.10000.cool
[2024-09-26 17:07:57.101] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-26 17:07:57.101] [info]: begin merging 2 files into one .cool file...
[2024-09-26 17:07:58.978] [info]: processing chr3L:1030000-1040000 chr3R:29720000-29730000 at 5571031 pixels/s...
[2024-09-26 17:08:01.224] [info]: DONE! Merging 2 files took 4.12s!
[2024-09-26 17:08:01.224] [info]: 4DNFIZ1ZVXC8.merged.10000.cool size: 19.64 MB
Merging .hic files as well as a mix of .hic and .cool files is also supported (as long as all files have the same resolution and reference genome).
When one or more of the input files are in .hic format, the ``--resolution`` option is mandatory.
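
Conceptually, merging files that share a reference genome and a resolution amounts to summing the counts of pixels with identical coordinates. The sketch below shows this with a k-way merge over pixel streams that are already sorted by ``(bin1_id, bin2_id)``. The function name ``merge_pixels`` and the tuple layout are assumptions made for illustration and do not reflect hictk's internal API.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_pixels(*streams):
    """Merge sorted streams of (bin1_id, bin2_id, count) tuples, summing duplicates."""
    coords = itemgetter(0, 1)
    # heapq.merge keeps the combined stream sorted without loading it in memory.
    merged = heapq.merge(*streams, key=coords)
    for (bin1, bin2), group in groupby(merged, key=coords):
        yield (bin1, bin2, sum(count for _, _, count in group))


if __name__ == "__main__":
    a = [(0, 0, 2), (0, 1, 1), (2, 4, 1)]
    b = [(0, 1, 3), (1, 1, 5)]
    for pixel in merge_pixels(a, b):
        print(pixel)
```

Because every input is consumed lazily and in sorted order, memory usage in this sketch depends on the number of inputs rather than on the number of pixels.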

**Tips:**

* When merging many, large .hic files, ``hictk`` needs to create potentially large temporary files. When this is the case, use option ``--tmpdir`` to set the temporary folder to a path with sufficient space.
See the list of Tips for hictk load.
70 changes: 21 additions & 49 deletions docs/creating_multires_files.rst
@@ -14,28 +14,30 @@ Interactions from a single-resolution Cooler file (.cool) can be used to generat
user@dev:/tmp$ hictk zoomify data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 out.mcool
[2023-09-29 19:28:39.926] [info]: Running hictk v0.0.2
[2023-09-29 19:28:39.929] [info]: coarsening cooler at data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 13 times (1000 -> 1000 -> 2000 -> 5000 -> 10000 -> 20000 -> 50000 -> 100000 -> 200000 -> 500000 -> 1000000 -> 2000000 -> 5000000 -> 10000000)
[2023-09-29 19:28:39.929] [info]: copying 1000 resolution from data/4DNFIZ1ZVXC8.mcool::/resolutions/1000
[2023-09-29 19:28:40.119] [info]: generating 2000 resolution from 1000 (2x)
[2023-09-29 19:28:40.343] [info]: [1000 -> 2000] processing chr2L:1996000-1998000 at 4484305 pixels/s...
[2023-09-29 19:28:40.663] [info]: [1000 -> 2000] processing chr2L:4932000-4934000 at 3125000 pixels/s...
[2023-09-29 19:28:40.973] [info]: [1000 -> 2000] processing chr2L:7986000-7988000 at 3236246 pixels/s...
...
[2023-09-29 19:29:12.513] [info]: generating 10000000 resolution from 5000000 (2x)
[2023-09-29 19:29:12.519] [info]: DONE! Processed 13 resolution(s) in 32.59s!
[2024-09-26 17:21:21.792] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-26 17:21:21.795] [info]: coarsening cooler at 4DNFIZ1ZVXC8.mcool::/resolutions/1000 13 times (1000 -> 1000 -> 2000 -> 5000 -> 10000 -> 20000 -> 50000 -> 100000 -> 200000 -> 500000 -> 1000000 -> 2000000 -> 5000000 -> 10000000)
[2024-09-26 17:21:21.795] [info]: copying 1000 resolution from 4DNFIZ1ZVXC8.mcool::/resolutions/1000
[2024-09-26 17:21:21.959] [info]: generating 2000 resolution from 1000 (2x)
[2024-09-26 17:21:22.134] [info]: [1000 -> 2000] processing chr2L:1996000-1998000 at 5747126 pixels/s...
[2024-09-26 17:21:22.355] [info]: [1000 -> 2000] processing chr2L:4932000-4934000 at 4545455 pixels/s...
[2024-09-26 17:21:22.563] [info]: [1000 -> 2000] processing chr2L:7986000-7988000 at 4830918 pixels/s...
...
[2024-09-26 17:21:42.886] [info]: generating 2000000 resolution from 1000000 (2x)
[2024-09-26 17:21:42.892] [info]: generating 5000000 resolution from 1000000 (5x)
[2024-09-26 17:21:42.898] [info]: generating 10000000 resolution from 5000000 (2x)
[2024-09-26 17:21:42.902] [info]: DONE! Processed 13 resolution(s) in 21.11s!
# Coarsen a single resolution
user@dev:/tmp$ hictk zoomify data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 out.cool --resolutions 50000
[2023-09-29 19:30:52.476] [info]: Running hictk v0.0.2
[2023-09-29 19:30:52.482] [info]: coarsening cooler at data/4DNFIZ1ZVXC8.mcool::/resolutions/1000 2 times (1000 -> 1000 -> 50000)
[2023-09-29 19:30:52.482] [info]: copying 1000 resolution from data/4DNFIZ1ZVXC8.mcool::/resolutions/1000
[2023-09-29 19:30:52.668] [info]: generating 50000 resolution from 1000 (50x)
[2023-09-29 19:30:53.789] [info]: [1000 -> 50000] processing chr2L:23000000-23050000 at 896057 pixels/s...
[2023-09-29 19:30:55.005] [info]: [1000 -> 50000] processing chr3L:4600000-4650000 at 822368 pixels/s...
[2023-09-29 19:30:56.440] [info]: [1000 -> 50000] processing chr3R:32050000-32079331 at 696864 pixels/s...
[2023-09-29 19:30:56.863] [info]: DONE! Processed 2 resolution(s) in 4.39s!
[2024-09-26 17:22:22.203] [info]: Running hictk v1.0.0-fbdcb591
[2024-09-26 17:22:22.206] [info]: coarsening cooler at 4DNFIZ1ZVXC8.mcool::/resolutions/1000 2 times (1000 -> 1000 -> 50000)
[2024-09-26 17:22:22.206] [info]: copying 1000 resolution from 4DNFIZ1ZVXC8.mcool::/resolutions/1000
[2024-09-26 17:22:22.364] [info]: generating 50000 resolution from 1000 (50x)
[2024-09-26 17:22:23.165] [info]: [1000 -> 50000] processing chr2L:23000000-23050000 at 1253133 pixels/s...
[2024-09-26 17:22:23.939] [info]: [1000 -> 50000] processing chr3L:4600000-4650000 at 1293661 pixels/s...
[2024-09-26 17:22:24.878] [info]: [1000 -> 50000] processing chr3R:32050000-32079331 at 1064963 pixels/s...
[2024-09-26 17:22:25.151] [info]: DONE! Processed 2 resolution(s) in 2.95s!
Converting a single-resolution .hic to a multi-resolution .hic
______________________________________________________________
@@ -44,36 +46,6 @@ Interactions from a .hic file (like the one generated by ``hictk load``) can be
hictk will copy interactions for resolutions that are available in the input file.
Interactions at resolutions missing from the input file will be generated by iterative coarsening.
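
The zoomify logs suggest that each missing resolution is generated from the largest resolution already available that divides it evenly (for example 5000 from 1000, then 10000 from 5000). The following sketch reproduces that plan and shows what coarsening a set of pixels by an integer factor looks like. The selection rule is inferred from the log output rather than from hictk's source, and ``pick_base_resolution``, ``plan_coarsening``, and ``coarsen`` are hypothetical names. For simplicity, bin IDs are taken to be relative to a single chromosome.

```python
from collections import Counter


def pick_base_resolution(target, available):
    """Return the largest available resolution that evenly divides target."""
    candidates = [res for res in available if res <= target and target % res == 0]
    if not candidates:
        raise ValueError(f"no available resolution divides {target}")
    return max(candidates)


def plan_coarsening(available, targets):
    """Return (target, base) steps; each generated resolution becomes available."""
    available = set(available)
    plan = []
    for target in sorted(targets):
        if target in available:
            continue  # already present, would be copied as-is
        base = pick_base_resolution(target, available)
        plan.append((target, base))
        available.add(target)
    return plan


def coarsen(pixels, factor):
    """Sum counts of {(bin1, bin2): count} pixels that fall into the same coarser pixel."""
    out = Counter()
    for (bin1, bin2), count in pixels.items():
        out[(bin1 // factor, bin2 // factor)] += count
    return dict(sorted(out.items()))


if __name__ == "__main__":
    for target, base in plan_coarsening([1000], [1000, 2000, 5000, 10000]):
        print(f"generating {target} resolution from {base} ({target // base}x)")
    print(coarsen({(0, 0): 2, (0, 1): 1, (2, 4): 1}, 2))
```

Generating each resolution from the coarsest compatible base keeps the number of pixels processed at every step as small as possible, which is a plausible reason for the pattern seen in the logs.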

.. code-block:: console
user@dev:/tmp$ hictk zoomify 4DNFIZ1ZVXC8.hic9 4DNFIZ1ZVXC8.zoomified.hic --threads 8
[2024-01-23 16:59:57.369] [info]: Running hictk v0.0.6-570037c-dirty
[2024-01-23 16:59:57.369] [info]: copying resolution 1000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: generating 2000 resolution from 1000 (2x)
[2024-01-23 16:59:57.369] [info]: copying resolution 5000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: copying resolution 10000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: generating 20000 resolution from 10000 (2x)
[2024-01-23 16:59:57.369] [info]: copying resolution 50000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: copying resolution 100000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: generating 200000 resolution from 100000 (2x)
[2024-01-23 16:59:57.369] [info]: copying resolution 500000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: copying resolution 1000000 from "4DNFIZ1ZVXC8.hic9"
[2024-01-23 16:59:57.369] [info]: generating 2000000 resolution from 1000000 (2x)
[2024-01-23 16:59:57.369] [info]: generating 5000000 resolution from 1000000 (5x)
[2024-01-23 16:59:57.369] [info]: generating 10000000 resolution from 5000000 (2x)
[2024-01-23 16:59:57.379] [info]: [1000 bp] ingesting interactions...
[2024-01-23 17:00:02.183] [info]: ingesting pixels at 2157032 pixels/s...
[2024-01-23 17:00:07.271] [info]: ingesting pixels at 1965795 pixels/s...
...
[2024-01-23 17:02:04.842] [info]: [1000 bp] computing expected vector density
[2024-01-23 17:02:05.325] [info]: [2000 bp] computing expected vector density
[2024-01-23 17:02:06.291] [info]: [5000 bp] computing expected vector density
[2024-01-23 17:02:06.292] [info]: writing 13 expected value vectors at offset 193918320...
[2024-01-23 17:02:06.293] [info]: writing 0 normalized expected value vectors at offset 194161639...
[2024-01-23 17:02:06.318] [info]: DONE! Processed 13 resolution(s) in 128.95s!
**Tips:**

* When zoomifying large .hic files, ``hictk`` may need to create large temporary files. When this is the case, use option ``--tmpdir`` to set the temporary folder to a path with sufficient space.
See the list of Tips for hictk load.
2 changes: 1 addition & 1 deletion docs/downloading_test_datasets.rst
@@ -13,7 +13,7 @@ After downloading the data, move to a folder with at least ~1 GB of free space a
:class: no-copybutton
user@dev:/tmp$ mkdir data/
user@dev:/tmp$ tar -xf hictk_test_data.tar.xz \
user@dev:/tmp$ tar -xf hictk_test_data.tar.zst \
-C data --strip-components=3 \
test/data/hic/4DNFIZ1ZVXC8.hic9 \
test/data/integration_tests/4DNFIZ1ZVXC8.mcool \