update README
AndreaGuarracino committed Jan 3, 2025
1 parent 4cc6a98 commit ff15b0c
Showing 1 changed file with 58 additions and 15 deletions.
73 changes: 58 additions & 15 deletions README.md
@@ -6,19 +6,6 @@ Pangenome graphs and whole genome multiple alignments are powerful tools, but th
Often, we would like to be able to break a small piece out of a pangenome without constructing the whole thing.
`impg` lets us do this by projecting sequence ranges through many-way (e.g. all-vs-all) pairwise alignments built by tools like `wfmash` and `minimap2`.

## What does `impg` do?

At its core, `impg` lifts over ranges from a target sequence (used as the reference) onto the queries (the other sequences aligned to that reference) described in the alignments.
In effect, it lets us pick up homologous loci from all genomes mapped onto our specific target region.
This is particularly useful when you're interested in comparing a specific genomic region across different individuals, strains, or species in a pangenomic or comparative genomic setting.
The output is provided in BED, BEDPE, and PAF formats, which makes it straightforward to extract FASTA sequences for downstream multiple sequence alignment (e.g., `mafft`) or pangenome graph building (e.g., `pggb` or `minigraph-cactus`).

## How does it work?

`impg` uses coitrees (implicit interval trees) to provide efficient range lookup over the input alignments.
CIGAR strings are converted to a compact delta encoding.
This approach allows for fast and memory-efficient projection of sequence ranges through alignments.

## Using `impg`

Getting started with `impg` is straightforward. Here's a basic example of how to use the command-line utility:
@@ -45,7 +32,7 @@ In this example, `-p` specifies the path to the PAF file, `-r` defines the targe
That is, for each collected range, we then find what sequence ranges are aligned onto it.
This is done progressively until we've closed the set of alignments connected to the initial target range.
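
The transitive search described above can be pictured as a breadth-first traversal over alignments. The following is a minimal Python sketch of that idea, not `impg`'s implementation: the `alignments` tuples of `(target, t_start, t_end, query, q_start, q_end)`, the sequence names, and the `transitive_ranges` function are all hypothetical, and each hit is reported as the whole aligned query range rather than being trimmed through the CIGAR.

```python
from collections import deque


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def transitive_ranges(alignments, name, start, end):
    """Collect every range reachable from the initial target range.

    Hypothetical illustration: `alignments` holds tuples of
    (target, t_start, t_end, query, q_start, q_end).
    """
    seen = {(name, start, end)}
    queue = deque(seen)
    while queue:
        t_name, t_start, t_end = queue.popleft()
        for target, ts, te, query, qs, qe in alignments:
            if target == t_name and overlaps(t_start, t_end, ts, te):
                hit = (query, qs, qe)
                # Each newly collected range becomes a target in a later round.
                if hit not in seen:
                    seen.add(hit)
                    queue.append(hit)
    return sorted(seen)


# Made-up alignments: chr1 <- sampleA <- sampleB, plus an unrelated chr2 pair.
alignments = [
    ("chr1", 0, 5000, "sampleA#chr1", 100, 5100),
    ("sampleA#chr1", 0, 6000, "sampleB#chr1", 200, 6200),
    ("chr2", 0, 1000, "sampleC#chr2", 0, 1000),
]
result = transitive_ranges(alignments, "chr1", 1000, 2000)
```

The loop stops once no new range is discovered, which corresponds to the set of connected alignments being closed.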

### Installation
## Installation

To compile and install `impg` from source, you'll need a recent Rust toolchain, including Cargo.

@@ -57,10 +44,66 @@ To compile and install `impg` from source, you'll need a recent rust build toolc
```bash
cd impg
```
3. Compile the tool (requires rust build tools):
3. Compile the tool:
```bash
cargo install --force --path .
```
## Commands

`impg` provides three main commands:

### Query
Query overlaps in the alignment:
```bash
# Query a single region
impg query -p alignments.paf -r chr1:1000-2000

# Query multiple regions from a BED file
impg query -p alignments.paf -b regions.bed

# Enable transitive overlap search
impg query -p alignments.paf -r chr1:1000-2000 -x

# Output in PAF format
impg query -p alignments.paf -r chr1:1000-2000 -P
```
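
When many regions need to be queried, it can be convenient to generate the BED file passed to `-b` from region strings. The sketch below is a hypothetical helper, not part of `impg`: it assumes a tab-separated `name`, `start`, `end` layout and copies the coordinates verbatim, since this README does not state whether `-r` coordinates are 0- or 1-based.

```python
def region_to_bed_line(region):
    """Turn 'name:start-end' into a tab-separated BED-style line.

    Hypothetical helper: coordinates are copied as-is, with no
    0-based/1-based conversion.
    """
    # rpartition keeps any ':' that is part of the sequence name itself.
    name, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    if not name or not start.isdigit() or not end.isdigit():
        raise ValueError(f"cannot parse region: {region!r}")
    if int(start) >= int(end):
        raise ValueError(f"empty or inverted region: {region!r}")
    return f"{name}\t{int(start)}\t{int(end)}"


regions = ["chr1:1000-2000", "chr2:500-900"]
bed_text = "\n".join(region_to_bed_line(r) for r in regions) + "\n"
```

Writing `bed_text` to a file such as `regions.bed` would then give an input suitable for the `-b` example above, under the stated assumptions.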

### Partition
Partition the alignment into smaller pieces:
```bash
impg partition -p alignments.paf -w 1000000 -s chr1 -d 10000 -l 5000
```
- `-w`: Window size for partitioning
- `-s`: Prefix of sequence names to start partitioning from
- `-d`: Maximum distance to merge intervals in each partition
- `-l`: Minimum length for intervals in each partition (this can lead to overlapping partitions)
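
To make the idea behind `-d` concrete, here is a generic Python sketch of merging intervals that lie within a maximum distance of each other. It illustrates the general technique only; the `merge_within` function is hypothetical, and how `impg partition` actually applies `-d` (and `-l`) internally is not described here.

```python
def merge_within(intervals, max_gap):
    """Merge half-open (start, end) intervals separated by at most max_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# With a 10000 bp threshold the first two intervals merge; the third stays apart.
merged = merge_within([(0, 100), (150, 300), (20000, 21000)], 10000)
```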

### Stats
Print alignment statistics:
```bash
impg stats -p alignments.paf
```

### Common Options

All commands support these options:
- `-p, --paf-file`: Path to PAF file (gzipped or uncompressed)
- `-t, --num-threads`: Number of threads (default: 1)
- `-I, --force-reindex`: Force regeneration of index
- `-v, --verbose`: Verbosity level (0=error, 1=info, 2=debug)

## What does `impg` do?

At its core, `impg` lifts over ranges from a target sequence (used as the reference) onto the queries (the other sequences aligned to that reference) described in the alignments.
In effect, it lets us pick up homologous loci from all genomes mapped onto our specific target region.
This is particularly useful when you're interested in comparing a specific genomic region across different individuals, strains, or species in a pangenomic or comparative genomic setting.
The output is provided in BED, BEDPE, and PAF formats, which makes it straightforward to extract FASTA sequences for downstream multiple sequence alignment (e.g., `mafft`) or pangenome graph building (e.g., `pggb` or `minigraph-cactus`).
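
The liftover idea can be sketched by walking a CIGAR string and tracking target and query positions side by side. This is a simplified Python illustration, not `impg`'s code: the `project_range` function and its parameters are hypothetical, it handles forward-strand alignments only, and it assumes standard CIGAR semantics (`M`, `=`, `X` consume both sequences; `D`, `N` consume the target; `I` consumes the query).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNX=])")


def project_range(cigar, t_start, q_start, r_start, r_end):
    """Project target interval [r_start, r_end) onto query coordinates.

    Hypothetical sketch: forward strand only; returns None when no
    aligned base of the alignment falls inside the requested range.
    """
    t_pos, q_pos = t_start, q_start
    q_lo = q_hi = None
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        t_step = n if op in "MDN=X" else 0
        q_step = n if op in "MI=X" else 0
        if t_step and q_step:
            # Aligned run: clip it to the requested target range.
            lo = max(t_pos, r_start)
            hi = min(t_pos + n, r_end)
            if lo < hi:
                if q_lo is None:
                    q_lo = q_pos + (lo - t_pos)
                q_hi = q_pos + (hi - t_pos)
        t_pos += t_step
        q_pos += q_step
        if t_pos >= r_end:
            break
    return None if q_lo is None else (q_lo, q_hi)


# Alignment starts at target 100 / query 50; project target range [105, 120).
projected = project_range("10=2I5=3D10=", 100, 50, 105, 120)
```

In this made-up example the projected query interval spans the 2 bp insertion and skips the 3 bp deletion, which is why its length differs from that of the target range.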

## How does it work?

`impg` uses coitrees (implicit interval trees) to provide efficient range lookup over the input alignments.
CIGAR strings are converted to a compact delta encoding.
This approach allows for fast and memory-efficient projection of sequence ranges through alignments.
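
As a rough illustration of what a delta encoding of a CIGAR string could look like, the Python sketch below collapses each operation into a `(target_delta, query_delta)` pair and merges adjacent aligned runs. This is an assumption made for illustration: the `cigar_to_deltas` function is hypothetical, and the actual in-memory layout used by `impg` is not described in this README.

```python
import re


def cigar_to_deltas(cigar):
    """Convert a CIGAR string into (target_delta, query_delta) pairs.

    Hypothetical encoding: match and mismatch runs are merged, so the
    =/X distinction is lost, but coordinate projection stays possible.
    """
    deltas = []
    for length, op in re.findall(r"(\d+)([MIDNX=])", cigar):
        n = int(length)
        t_delta = n if op in "MDN=X" else 0
        q_delta = n if op in "MI=X" else 0
        prev = deltas[-1] if deltas else None
        if prev and t_delta and q_delta and prev[0] > 0 and prev[0] == prev[1]:
            # Extend the previous aligned run instead of storing a new pair.
            deltas[-1] = (prev[0] + t_delta, prev[1] + q_delta)
        else:
            deltas.append((t_delta, q_delta))
    return deltas


deltas = cigar_to_deltas("5=1X4=2I3D10=")
target_span = sum(t for t, _ in deltas)
query_span = sum(q for _, q in deltas)
```

Six CIGAR operations shrink to four pairs here, and the summed deltas still give the aligned span on each sequence, which is the information needed to project coordinates.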

## Authors
