Breaking changes for v1 #35

Open
wants to merge 33 commits into
base: master
Commits
efd1d8b
Use JuliaFormatter
jakobnissen Jun 23, 2023
4abc5e9
Random cleanup
jakobnissen Jun 25, 2023
45a9f38
More fixups - squash
jakobnissen Jul 20, 2023
cfaacb3
More stuff
jakobnissen Jul 21, 2023
2c64877
Add revtrans and setindex
jakobnissen Jul 21, 2023
1ecb76c
Add translation
jakobnissen Jul 21, 2023
7385d9b
Make EveryKmer iterator
jakobnissen Jul 22, 2023
1293e69
Fixup README.md
jakobnissen Jul 22, 2023
82b4767
More stuff
jakobnissen Jul 24, 2023
603c591
Fixup README
jakobnissen Sep 25, 2023
a3ecda4
Begin EveryCanonicalKmer
jakobnissen Sep 25, 2023
6c1535e
Rename: EveryKmer to FwKmers
jakobnissen Sep 26, 2023
ba774f1
Start SpacedKmers
jakobnissen Sep 30, 2023
6584e2e
Some refactoring
jakobnissen Oct 2, 2023
550dd6e
Refactor to trait objects
jakobnissen Oct 2, 2023
726021b
Extensive refactoring
jakobnissen Oct 2, 2023
db1f16c
More refactoring
jakobnissen Oct 3, 2023
c7e5e80
Start tests
jakobnissen Dec 28, 2023
2912114
More tests
jakobnissen Dec 29, 2023
648c4bd
Add Canonical and UnambiguousKmers
jakobnissen Dec 29, 2023
1192b72
Update some docstrings
jakobnissen Dec 29, 2023
30952eb
Add some docs
jakobnissen Dec 30, 2023
c92db18
Add rest of the docs
jakobnissen Dec 30, 2023
d916616
Remove old iterators
jakobnissen Dec 30, 2023
47b4e03
Add FwRvIterator
jakobnissen Dec 31, 2023
c280f48
Push doc preview
jakobnissen Dec 31, 2023
78fe5dd
Misc cleanup
jakobnissen Dec 31, 2023
320596b
Add FxHash
jakobnissen Jan 2, 2024
9c79f6a
Misc changes
jakobnissen Jan 2, 2024
825bc73
Typos and fixes
jakobnissen Jan 2, 2024
8801e32
Make minhash example smaller
jakobnissen Jan 2, 2024
2e7553f
Optimise slicing
jakobnissen Jan 2, 2024
2dd7acb
Fix bug in slicing
jakobnissen Jan 2, 2024
8 changes: 8 additions & 0 deletions .JuliaFormatter.toml
@@ -0,0 +1,8 @@
always_for_in = true
whitespace_typedefs = true
whitespace_ops_in_indices = true
remove_extra_newlines = true
import_to_using = true
normalize_line_endings = "unix"
separate_kwargs_with_semicolon = true
whitespace_in_kwargs = false
1 change: 0 additions & 1 deletion .github/workflows/Documentation.yml
@@ -21,6 +21,5 @@ jobs:
run: julia --color=yes --project=docs/ -e 'using Pkg; Pkg.develop(PackageSpec(path=pwd())); Pkg.instantiate()'
- name: Build and deploy
env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # For authentication with GitHub Actions token
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }} # For authentication with SSH deploy key
run: julia --color=yes --project=docs/ docs/make.jl
18 changes: 18 additions & 0 deletions .github/workflows/UnitTests.yml
@@ -42,3 +42,21 @@ jobs:
name: codecov-umbrella
fail_ci_if_error: false
token: ${{ secrets.CODECOV_TOKEN }}

docs:
name: Documentation
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: julia-actions/setup-julia@latest
with:
version: '1'
Review comment (Member):
Do we only want to test on release?

- run: |
julia --project=docs -e '
using Pkg
Pkg.develop(PackageSpec(path=pwd()))
Pkg.instantiate()'
- run: julia --project=docs docs/make.jl
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
DOCUMENTER_KEY: ${{ secrets.DOCUMENTER_KEY }}
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,4 +2,6 @@
*.jl.*.cov
*.jl.mem
.DS_Store
Manifest.toml
Manifest.toml
TODO.md
docs/build
22 changes: 18 additions & 4 deletions Project.toml
@@ -1,17 +1,31 @@
name = "Kmers"
uuid = "445028e4-d31f-4f27-89ad-17affd83fc22"
authors = ["Sabrina Jaye Ward <[email protected]>"]
version = "0.1.0"
authors = [
"Jakob Nybo Nissen <[email protected]>",
"Sabrina Jaye Ward <[email protected]>"
]
version = "1.0.0"

[weakdeps]
StringViews = "354b36f9-a18e-4713-926e-db85100087ba"

[deps]
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9"
StringViews = "354b36f9-a18e-4713-926e-db85100087ba"
Comment on lines +10 to +15
Review comment (Member):
Does it make sense that StringViews is both a dep AND a weak dep? Is there any advantage to that?


[extensions]
StringViewsExt = "StringViews"

[compat]
BioSequences = "3.1.3"
julia = "1.5"
Random = "1.10"
julia = "1.8"
StringViews = "1"

[extras]
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"

[targets]
test = ["Test"]
test = ["Test", "Random"]
58 changes: 29 additions & 29 deletions README.md
@@ -3,57 +3,57 @@
[![Latest Release](https://img.shields.io/github/release/BioJulia/Kmers.jl.svg)](https://github.com/BioJulia/Kmers.jl/releases/latest)
[![MIT license](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/BioJulia/Kmers.jl/blob/master/LICENSE)
[![Documentation](https://img.shields.io/badge/docs-stable-blue.svg)](https://biojulia.github.io/Kmers.jl/stable)
[![Pkg Status](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)


## Description
Kmers.jl provides the `Kmer <: BioSequence` type, which implements the concept of a
[k-mer](https://en.wikipedia.org/wiki/K-mer), a biological sequence of exactly length `k`.

Kmers provides a specialised concrete `BioSequence` subtype, optimised for
representing short immutable sequences called kmers: contiguous sub-strings of k
nucleotides of some reference sequence.

They are used extensively in bioinformatic analyses as an informational unit.
This concept was popularised by short read assemblers.
Analyses within the kmer space benefit from a simple formulation of the sampling
problem and direct in-hash comparisons.
K-mers are used frequently in bioinformatics because, when k is small and known at
compile time, these sequences can be efficiently represented as integers and stored
directly in CPU registers, allowing for much more efficient computation than arbitrary-length sequences.
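
The integer representation mentioned above can be illustrated with a minimal sketch in plain Julia. It assumes a 2-bit encoding (A=0, C=1, G=2, T=3) and uses hypothetical helper names (`encode_base`, `pack_kmer`, `unpack_kmer`); it is not the Kmers.jl implementation, which stores its data in an `NTuple`.

```julia
# Hypothetical helpers for illustration only; not part of the Kmers.jl API.
# Map each unambiguous nucleotide to a 2-bit code: A=0, C=1, G=2, T=3.
function encode_base(c::Char)::UInt64
    c == 'A' ? UInt64(0) :
    c == 'C' ? UInt64(1) :
    c == 'G' ? UInt64(2) :
    c == 'T' ? UInt64(3) :
    throw(ArgumentError("not an unambiguous DNA base: $c"))
end

# Pack up to 32 bases into one UInt64, with the first base in the highest used bits.
function pack_kmer(s::AbstractString)::UInt64
    length(s) <= 32 || throw(ArgumentError("at most 32 bases fit in 64 bits"))
    x = UInt64(0)
    for c in s
        x = (x << 2) | encode_base(c)
    end
    return x
end

# Recover the k bases from the packed integer.
function unpack_kmer(x::UInt64, k::Integer)::String
    bases = ('A', 'C', 'G', 'T')
    return join(bases[((x >> (2 * (k - i))) & 0x03) + 1] for i in 1:k)
end
```

Note that under this encoding the packed integer alone does not determine the sequence: `pack_kmer("TAG")` and `pack_kmer("AAAAAAATAG")` are the same integer, so the length `k` has to be tracked separately (in Kmers.jl, as a type parameter).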

Kmers provides the type representing kmers as well as the implementations of
the APIs specified by the
[`BioSequences.jl`](https://github.com/BioJulia/BioSequences.jl) package.
In Kmers.jl, the `Kmer` type is psrameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.
Review comment (Member):
Suggested change
In Kmers.jl, the `Kmer` type is psrameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.
In Kmers.jl, the `Kmer` type is parameterized by its length, and its data is stored in an `NTuple`. This makes `Kmers` bitstypes and highly efficient.


## Installation
Conceptually, one may use the following analogy:
* `BioSequence` is like `AbstractVector`
* `LongSequence` is like `Vector`
* `Kmer` is like [`SVector`](https://github.com/JuliaArrays/StaticArrays.jl) from `StaticArrays`

Kmers.jl is tightly coupled to the
[`BioSequences.jl`](https://github.com/BioJulia/BioSequences.jl) package,
and relies on its internals.
Hence, you should expect strict compat bounds on BioSequences.jl.

## Usage
### ⚠️ WARNING ⚠️
`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.
Comment on lines +28 to +30
Review comment (Member):
Suggested change
## Usage
### ⚠️ WARNING ⚠️
`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.
## Usage
### ⚠️ WARNING ⚠️
`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that change their length, such as `push`, `pop`, slicing, or masking (logical indexing) will be **type unstable** and hence slow and memory inefficient, unless you write your code in such as way that the compiler can use constant folding.

I've seen this blank-line-between-headers on markdown lint before. I don't really care, and this will be the only comment I leave about it. Feel free to disregard.
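
The type-instability warning in the README hunk above can be sketched in plain Julia. The `Fixed` type and both slicing functions are hypothetical stand-ins for a length-parameterized sequence, not Kmers.jl code; the sketch only assumes that the length lives in a type parameter, as the README states.

```julia
# Hypothetical stand-in for a length-parameterized sequence; not the Kmers.jl type.
struct Fixed{K}
    data::NTuple{K, UInt8}
end

# Runtime slice: the output length depends on the *value* of `r`,
# so the concrete return type Fixed{length(r)} cannot be inferred.
slice_dynamic(x::Fixed, r::UnitRange{Int}) = Fixed(x.data[r])

# Static slice: the bounds are lifted into the type domain with Val,
# so the output length is available to the compiler.
function slice_static(x::Fixed, ::Val{From}, ::Val{To}) where {From, To}
    return Fixed(ntuple(i -> x.data[From + i - 1], Val(To - From + 1)))
end

x = Fixed((0x01, 0x02, 0x03, 0x04))
a = slice_dynamic(x, 2:3)              # same value, length known only at runtime
b = slice_static(x, Val(2), Val(3))    # same value, length known at compile time
```

Both calls return the same value; the difference is whether the result type is known at compile time. Inspecting each call with `@code_warntype` in the REPL is one way to compare them.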


Further, as `Kmer`s are immutable and their operations are aggressively inlined and unrolled,
they become inefficent as they get longer.
For example, reverse-complementing a 32-mer takes 26 ns, compared to 102 ns for the equivalent `LongSequence`. However, for 512-mers, the `LongSequence` takes 126 ns, and the `Kmer` 16 μs!

Kmers.jl is intended for high-performance computing. If you do not need the extra performance that register-stored sequences provide, you might consider using `LongSequence` from BioSequences.jl instead
Comment on lines +32 to +36
Review comment (Member):
Suggested change
Further, as `Kmer`s are immutable and their operations are aggressively inlined and unrolled,
they become inefficent as they get longer.
For example, reverse-complementing a 32-mer takes 26 ns, compared to 102 ns for the equivalent `LongSequence`. However, for 512-mers, the `LongSequence` takes 126 ns, and the `Kmer` 16 μs!
Kmers.jl is intended for high-performance computing. If you do not need the extra performance that register-stored sequences provide, you might consider using `LongSequence` from BioSequences.jl instead
Further, as `Kmer`s are immutable and their operations are aggressively inlined and unrolled,
they become inefficent as they get longer.
For example, reverse-complementing a 32-mer takes 26 ns,
compared to 102 ns for the equivalent `LongSequence`.
However, for 512-mers, the `LongSequence` takes 126 ns, and the `Kmer` 16 μs!
Kmers.jl is intended for high-performance computing.
If you do not need the extra performance that register-stored sequences provide,
you might consider using `LongSequence` from BioSequences.jl instead

Seems like you're using semantic line breaks in some places and not others. Again, this is purely a style thing, and someone can go back and change this in future. Feel free to ignore.


## Installation
You can install Kmers from the julia
REPL. Press `]` to enter pkg mode, and enter the following:

```julia
add Kmers
pkg> add Kmers
```

If you are interested in the cutting edge of the development, please check out
If you are interested in the cutting edge of development, please check out
the master branch to try new features before release.


## Testing

Kmers is tested against Julia `1.X` on Linux, OS X, and Windows.

[![Unit tests](https://github.com/BioJulia/Kmers.jl/workflows/Unit%20tests/badge.svg?branch=master)](https://github.com/BioJulia/Kmers.jl/actions?query=workflow%3A%22Unit+tests%22+branch%3Amaster)
[![Documentation](https://github.com/BioJulia/Kmers.jl/workflows/Documentation/badge.svg?branch=master)](https://github.com/BioJulia/BioKmers.jl/actions?query=workflow%3ADocumentation+branch%3Amaster)
[![](https://codecov.io/gh/BioJulia/Kmers.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/BioJulia/Kmers.jl)


## Contributing

We appreciate contributions from users including reporting bugs, fixing
issues, improving performance and adding new features.

Take a look at the [contributing files](https://github.com/BioJulia/Contributing)
for detailed contributor and maintainer guidelines, and code of conduct.


## Questions?

If you have a question about contributing or using BioJulia software, come
on over and chat to us on [Gitter](https://gitter.im/BioJulia/General), or you can try the
on over and chat to us on [the Julia Slack workspace](https://julialang.org/slack/), or you can try the
[Bio category of the Julia discourse site](https://discourse.julialang.org/c/domain/bio).
6 changes: 5 additions & 1 deletion docs/Project.toml
@@ -1,5 +1,9 @@
[deps]
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12"
Kmers = "445028e4-d31f-4f27-89ad-17affd83fc22"
MinHash = "4b3c9753-2685-44e9-8a29-365b96c023ed"

[compat]
Documenter = "0.24"
Documenter = "1"
49 changes: 26 additions & 23 deletions docs/make.jl
@@ -1,29 +1,32 @@
using Documenter, Kmers

makedocs(
format = Documenter.HTML(),
sitename = "Kmers.jl",
pages = [
"Home" => "index.md",
"Kmer types" => "kmer_types.md",
"Constructing kmers" => "construction.md",
"Indexing & modifying kmers" => "transforms.md",
"Predicates" => "predicates.md",
"Random kmers" => "random.md",
"Iterating over Kmers" => "iteration.md",
"Translation" => "translate.md",
#"Pattern matching and searching" => "sequence_search.md",
#"Iteration" => "iteration.md",
#"Counting" => "counting.md",
#"I/O" => "io.md",
#"Interfaces" => "interfaces.md"
DocMeta.setdocmeta!(
Kmers,
:DocTestSetup,
:(using BioSequences, Kmers, Test);
recursive=true,
)
Comment on lines +3 to +8
Review comment (Member):
Is this new? Very convenient!


makedocs(;
modules=[Kmers],
format=Documenter.HTML(; prettyurls=get(ENV, "CI", nothing) == "true"),
sitename="Kmers.jl",
pages=[
"Home" => "index.md",
"The Kmer type" => "kmers.md",
"Iteration" => "iteration.md",
"Translation" => "translation.md",
"Hashing" => "hashing.md",
"FAQ" => "faq.md",
"Cookbook" => ["MinHash" => "minhash.md", "Kmer composition" => "composition.md"],
],
authors = "Ben J. Ward, The BioJulia Organisation and other contributors."
authors="Jakob Nybo Nissen, Sabrina J. Ward, The BioJulia Organisation and other contributors.",
checkdocs=:exports,
)

deploydocs(
repo = "github.com/BioJulia/Kmers.jl.git",
push_preview = true,
deps = nothing,
make = nothing
deploydocs(;
repo="github.com/BioJulia/Kmers.jl.git",
push_preview=true,
deps=nothing,
make=nothing,
)
Empty file added docs/src/composition.md
Review comment (Member):
Flagging the empty file here. OK to leave it as a placeholder, but maybe add "This page is a work in progress" or something?

Empty file.
12 changes: 0 additions & 12 deletions docs/src/construction.md

This file was deleted.

39 changes: 39 additions & 0 deletions docs/src/faq.md
@@ -0,0 +1,39 @@
```@meta
CurrentModule = Kmers
DocTestSetup = quote
using BioSequences
using Test
using Kmers
end
```
## FAQ
### Why can kmers not be compared to biosequences?
It may be surprising that kmers cannot be compared to other biosequences:

```jldoctest
julia> dna"TAG" == mer"TAG"d
ERROR: MethodError
[...]
```

In fact, this is implemented by a manually thrown `MethodError`, even though the generic method `Base.:==(::BioSequence, ::BioSequence)` is defined.

This is a consequence of the following constraints:
* `isequal(x, y)` implies `hash(x) == hash(y)`
* `isequal(x, y)` and `x == y` ought to be identical for well-defined elements (i.e. in the absence of `missing`s and `NaN`s etc.)
* `hash(::Kmer)` must be absolutely maximally efficient

If kmers were comparable to other `BioSequence`s, then the hashing of every `BioSequence` would have to be consistent with that of `Kmer`, which in practice would mean that all biosequences would need to be recoded to `Kmer`s before hashing.
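
The constraint between `isequal` and `hash` listed above is the same one Base Julia satisfies for numbers of different types. The following is a small illustration in plain Julia, not Kmers.jl code:

```julia
# 1 and 1.0 are isequal, so Base must hash them to the same value,
# which constrains how both Int and Float64 are hashed.
@assert isequal(1, 1.0)
@assert hash(1) == hash(1.0)

# Dict lookup depends on this contract: the key is hashed first, then
# candidates in the matching slot are compared with isequal.
d = Dict(1 => "one")
@assert d[1.0] == "one"
@assert haskey(d, 0x01)  # a UInt8 key with the same value is found as well
```

If two values compared equal but hashed differently, lookups like `d[1.0]` would silently miss. Making a kmer equal to a `LongSequence` would therefore tie the two hash implementations together.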

### Why isn't there an iterator of unambiguous, canonical kmers or spaced, canonical kmers?
Any iterator of nucleotide kmers can be made into a canonical kmer iterator by simply calling `canonical` on its output kmers.

The `CanonicalKmers` iterator is special-cased, because with a step size of 1, it is generally faster to build the next kmer by storing both the reverse and forward kmer, then creating the next kmer by prepending/appending the next symbol.

However, with a larger step size, it becomes more efficient to build the forward kmer, then reverse-complement the whole kmer.
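
The step-size-1 strategy described above can be sketched in plain Julia over 2-bit codes. This assumes the encoding A=0, C=1, G=2, T=3 (so the complement of code `b` is `3 - b`) and a hypothetical function name `canonical_kmers`; it illustrates the rolling update and is not the `CanonicalKmers` implementation.

```julia
# Hypothetical sketch; not Kmers.jl code.
# `codes` holds 2-bit nucleotide codes: A=0, C=1, G=2, T=3.
function canonical_kmers(codes::Vector{UInt8}, k::Int)
    1 <= k <= 32 || throw(ArgumentError("k must be in 1:32"))
    mask = k == 32 ? typemax(UInt64) : (UInt64(1) << (2k)) - UInt64(1)
    shift = 2 * (k - 1)
    fw = UInt64(0)  # forward kmer, first base in the highest used bits
    rv = UInt64(0)  # reverse complement of the forward kmer
    out = UInt64[]
    for (i, b) in enumerate(codes)
        # Append the new base to the forward kmer, dropping the oldest base ...
        fw = ((fw << 2) | b) & mask
        # ... and prepend its complement to the reverse-complement kmer.
        rv = (rv >> 2) | (UInt64(3 - b) << shift)
        # Once k bases have been seen, emit the smaller of the two encodings.
        i >= k && push!(out, min(fw, rv))
    end
    return out
end
```

Each step costs a couple of shifts and one comparison, independent of `k`. With a step size larger than 1, most of these updates would be spent on kmers that are never emitted, which is the trade-off the paragraph above describes.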

### Why isn't there an iterator of skipmers/minimizers/k-min-mers, etc?
The concept of kmers has turned out to be remarkably flexible and useful in bioinformatics, and has spawned a never-ending stream of variations.

We simply can't implement them all.
However, we hope to make it relatively easy to implement custom kmer iterators for downstream users.
Review comment (Member):
Suggested change
However, we hope to make it relatively easy to implement custom kmer iterators for downstream users.
However, we hope to make it relatively easy to [implement custom kmer iterators](@ref #Iteration) for downstream users.

I can't remember the exact syntax for this, but I think it's something like this

59 changes: 59 additions & 0 deletions docs/src/hashing.md
@@ -0,0 +1,59 @@
```@meta
CurrentModule = Kmers
DocTestSetup = quote
using BioSequences
using Test
using Kmers
end
```

!!! warning
    Hash values are guaranteed to be reproducible for a given version
    of Kmers.jl and Julia, but may __change__ in new minor versions of Julia
    or Kmers.jl.
Comment on lines +10 to +13
Review comment (Member):
Probably for MinHash.jl, but maybe worth revisiting hashing and compatibility with other implementations. See sourmash-bio/sourmash#2284 and links, esp BioJulia/BioSequences.jl#243 for prior discussion.

I don't think there's anything to do here - we don't want to slow down Kmers.jl for the sake of compat, but if there's anything that would help enabling this (eg fast conversion to String if that's possible), it could be worth considering.


## Hashing
Kmers.jl implements `Base.hash`, yielding a `UInt` value:

```jldoctest; filter = r"^0x[0-9a-fA-F]+$"
julia> hash(mer"UGCUGUAC"r)
0xe5057d38c8907b22
```

The implementation of `Base.hash` for kmers strikes a compromise between providing a high-quality (non-cryptographic) hash and being reasonably fast.
While hash collisions can easily be found, they are unlikely to occur at random.
When kmers are of the same (or compatible) alphabets, different kmers hash to different values (barring collisions), even when they have the same underlying bit pattern:

```jldoctest
julia> using BioSequences: encoded_data

julia> a = mer"TAG"d; b = mer"AAAAAAATAG"d;

julia> encoded_data(a) === encoded_data(b)
true

julia> hash(a) == hash(b)
false
```

When they are of compatible alphabets, and have the same content, they hash to the same value.
Currently, only DNA and RNA of the alphabets `DNAAlphabet` and `RNAAlphabet` are compatible:

```jldoctest
julia> a = mer"UUGU"r; b = mer"TTGT"d;

julia> a == b # equal
true

julia> a === b # not egal
Review comment (Member):
Suggested change
julia> a === b # not egal
julia> a === b # not equal

Is egal Danish for equal? That's a weird typo if not

false

julia> hash(a) === hash(b)
true
```

For some applications, fast hashing is absolutely crucial. For these cases, Kmers.jl provides [`fx_hash`](@ref), which trades off hash quality for speed:
Review comment (Member):
I wonder if it would be worth linking to docs or another website explaining fx hash here. It's in the docstring for the function, so someone could click through, but I didn't know what it was before seeing that.


```@docs
fx_hash
```
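
For readers unfamiliar with the name, the general Fx hashing scheme (as used in older versions of the Rust compiler's `rustc-hash` crate) can be sketched in a few lines of plain Julia. The names `FX_SEED`, `fx_round` and `fx_words` are hypothetical, and whether Kmers.jl's `fx_hash` follows exactly this round function is an assumption; see its docstring for the authoritative description.

```julia
# 64-bit multiplier used by the Fx hasher in older versions of rustc-hash.
const FX_SEED = 0x517cc1b727220a95

# One round: rotate the state left by 5 bits, xor in the next word,
# then multiply by the constant (UInt64 multiplication wraps on overflow).
fx_round(state::UInt64, word::UInt64) = (bitrotate(state, 5) ⊻ word) * FX_SEED

# Hash a collection of 64-bit words by folding the round function over it.
# This is a sketch of the scheme, not the Kmers.jl implementation.
fx_words(words) = foldl(fx_round, words; init=UInt64(0))
```

The appeal is that each word costs one rotate, one xor and one multiply. The price is weaker mixing than a general-purpose hash, which matches the quality-for-speed trade-off described above.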