Breaking changes for v1 #35
base: master
Changes from all commits
efd1d8b
4abc5e9
45a9f38
cfaacb3
2c64877
1ecb76c
7385d9b
1293e69
82b4767
603c591
a3ecda4
6c1535e
ba774f1
6584e2e
550dd6e
726021b
db1f16c
c7e5e80
2912114
648c4bd
1192b72
30952eb
c92db18
d916616
47b4e03
c280f48
78fe5dd
320596b
9c79f6a
825bc73
8801e32
2e7553f
2dd7acb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
always_for_in = true | ||
whitespace_typedefs = true | ||
whitespace_ops_in_indices = true | ||
remove_extra_newlines = true | ||
import_to_using = true | ||
normalize_line_endings = "unix" | ||
separate_kwargs_with_semicolon = true | ||
whitespace_in_kwargs = false |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,4 +2,6 @@ | |
*.jl.*.cov | ||
*.jl.mem | ||
.DS_Store | ||
Manifest.toml | ||
Manifest.toml | ||
TODO.md | ||
docs/build |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,31 @@ | ||
name = "Kmers" | ||
uuid = "445028e4-d31f-4f27-89ad-17affd83fc22" | ||
authors = ["Sabrina Jaye Ward <[email protected]>"] | ||
version = "0.1.0" | ||
authors = [ | ||
"Jakob Nybo Nissen <[email protected]>", | ||
"Sabrina Jaye Ward <[email protected]>" | ||
] | ||
version = "1.0.0" | ||
|
||
[weakdeps] | ||
StringViews = "354b36f9-a18e-4713-926e-db85100087ba" | ||
|
||
[deps] | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
BioSymbols = "3c28c6f8-a34d-59c4-9654-267d177fcfa9" | ||
StringViews = "354b36f9-a18e-4713-926e-db85100087ba" | ||
Comment on lines +10 to +15
Does it make sense that |
||
|
||
[extensions] | ||
StringViewsExt = "StringViews" | ||
|
||
[compat] | ||
BioSequences = "3.1.3" | ||
julia = "1.5" | ||
Random = "1.10" | ||
julia = "1.8" | ||
StringViews = "1" | ||
|
||
[extras] | ||
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" | ||
|
||
[targets] | ||
test = ["Test"] | ||
test = ["Test", "Random"] |
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -3,57 +3,57 @@ | |||||||||||||||||||||||||||||
[![Latest Release](https://img.shields.io/github/release/BioJulia/Kmers.jl.svg)](https://github.com/BioJulia/Kmers.jl/releases/latest) | ||||||||||||||||||||||||||||||
[![MIT license](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/BioJulia/Kmers.jl/blob/master/LICENSE) | ||||||||||||||||||||||||||||||
[![Documentation](https://img.shields.io/badge/docs-stable-blue.svg)](https://biojulia.github.io/Kmers.jl/stable) | ||||||||||||||||||||||||||||||
[![Pkg Status](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active) | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Description | ||||||||||||||||||||||||||||||
Kmers.jl provides the `Kmer <: BioSequence` type, which implements the concept of a | ||||||||||||||||||||||||||||||
[k-mer](https://en.wikipedia.org/wiki/K-mer), a biological sequence of exactly length `k`. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Kmers provides a specialised concrete `BioSequence` subtype, optimised for | ||||||||||||||||||||||||||||||
representing short immutable sequences called kmers: contiguous sub-strings of k | ||||||||||||||||||||||||||||||
nucleotides of some reference sequence. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
They are used extensively in bioinformatic analyses as an informational unit. | ||||||||||||||||||||||||||||||
This concept was popularised by short read assemblers. | ||||||||||||||||||||||||||||||
Analyses within the kmer space benefit from a simple formulation of the sampling | ||||||||||||||||||||||||||||||
problem and direct in-hash comparisons. | ||||||||||||||||||||||||||||||
K-mers are used frequently in bioinformatics because, when k is small and known at | ||||||||||||||||||||||||||||||
compile time, these sequences can be efficiently represented as integers and stored | ||||||||||||||||||||||||||||||
directly in CPU registers, allowing for much more efficient computation than arbitrary-length sequences. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Kmers provides the type representing kmers as well as the implementations of | ||||||||||||||||||||||||||||||
the APIs specified by the | ||||||||||||||||||||||||||||||
[`BioSequences.jl`](https://github.com/BioJulia/BioSequences.jl) package. | ||||||||||||||||||||||||||||||
In Kmers.jl, the `Kmer` type is parameterized by its length, and its data is stored in an `NTuple`. This makes `Kmer`s bitstypes and highly efficient. | ||||||||||||||||||||||||||||||
Suggested change
|
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Installation | ||||||||||||||||||||||||||||||
Conceptually, one may use the following analogy: | ||||||||||||||||||||||||||||||
* `BioSequence` is like `AbstractVector` | ||||||||||||||||||||||||||||||
* `LongSequence` is like `Vector` | ||||||||||||||||||||||||||||||
* `Kmer` is like [`SVector`](https://github.com/JuliaArrays/StaticArrays.jl) from `StaticArrays` | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Kmers.jl is tightly coupled to the | ||||||||||||||||||||||||||||||
[`BioSequences.jl`](https://github.com/BioJulia/BioSequences.jl) package, | ||||||||||||||||||||||||||||||
and relies on its internals. | ||||||||||||||||||||||||||||||
Hence, you should expect strict compat bounds on BioSequences.jl. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Usage | ||||||||||||||||||||||||||||||
### ⚠️ WARNING ⚠️ | ||||||||||||||||||||||||||||||
`Kmer`s are parameterized by their length. That means any operation on `Kmer`s that changes their length, such as `push`, `pop`, slicing, or masking (logical indexing), will be **type unstable** and hence slow and memory inefficient, unless you write your code in such a way that the compiler can use constant folding. | ||||||||||||||||||||||||||||||
Comment on lines +28 to +30
Suggested change
I've seen this blank-line-between-headers on markdown lint before. I don't really care, and this will be the only comment I leave about it. Feel free to disregard. |
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Further, as `Kmer`s are immutable and their operations are aggressively inlined and unrolled, | ||||||||||||||||||||||||||||||
they become inefficient as they get longer. | ||||||||||||||||||||||||||||||
For example, reverse-complementing a 32-mer takes 26 ns, compared to 102 ns for the equivalent `LongSequence`. However, for 512-mers, the `LongSequence` takes 126 ns, and the `Kmer` 16 μs! | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Kmers.jl is intended for high-performance computing. If you do not need the extra performance that register-stored sequences provide, you might consider using `LongSequence` from BioSequences.jl instead. | ||||||||||||||||||||||||||||||
Comment on lines +32 to +36
Suggested change
Seems like you're using semantic line breaks in some places and not others. Again, this is purely a style thing, and someone can go back and change this in future. Feel free to ignore. |
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Installation | ||||||||||||||||||||||||||||||
You can install Kmers from the Julia | ||||||||||||||||||||||||||||||
REPL. Press `]` to enter pkg mode, and enter the following: | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
```julia | ||||||||||||||||||||||||||||||
add Kmers | ||||||||||||||||||||||||||||||
pkg> add Kmers | ||||||||||||||||||||||||||||||
``` | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
If you are interested in the cutting edge of the development, please check out | ||||||||||||||||||||||||||||||
If you are interested in the cutting edge of development, please check out | ||||||||||||||||||||||||||||||
the master branch to try new features before release. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Testing | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Kmers is tested against Julia `1.X` on Linux, OS X, and Windows. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
[![Unit tests](https://github.com/BioJulia/Kmers.jl/workflows/Unit%20tests/badge.svg?branch=master)](https://github.com/BioJulia/Kmers.jl/actions?query=workflow%3A%22Unit+tests%22+branch%3Amaster) | ||||||||||||||||||||||||||||||
[![Documentation](https://github.com/BioJulia/Kmers.jl/workflows/Documentation/badge.svg?branch=master)](https://github.com/BioJulia/BioKmers.jl/actions?query=workflow%3ADocumentation+branch%3Amaster) | ||||||||||||||||||||||||||||||
[![](https://codecov.io/gh/BioJulia/Kmers.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/BioJulia/Kmers.jl) | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Contributing | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
We appreciate contributions from users including reporting bugs, fixing | ||||||||||||||||||||||||||||||
issues, improving performance and adding new features. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
Take a look at the [contributing files](https://github.com/BioJulia/Contributing) | ||||||||||||||||||||||||||||||
for detailed contributor and maintainer guidelines, and code of conduct. | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
## Questions? | ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
If you have a question about contributing or using BioJulia software, come | ||||||||||||||||||||||||||||||
on over and chat to us on [Gitter](https://gitter.im/BioJulia/General), or you can try the | ||||||||||||||||||||||||||||||
on over and chat to us on [the Julia Slack workspace](https://julialang.org/slack/), or you can try the | ||||||||||||||||||||||||||||||
[Bio category of the Julia discourse site](https://discourse.julialang.org/c/domain/bio). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,9 @@ | ||
[deps] | ||
BioSequences = "7e6ae17a-c86d-528c-b3b9-7f778a29fe59" | ||
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" | ||
FASTX = "c2308a5c-f048-11e8-3e8a-31650f418d12" | ||
Kmers = "445028e4-d31f-4f27-89ad-17affd83fc22" | ||
MinHash = "4b3c9753-2685-44e9-8a29-365b96c023ed" | ||
|
||
[compat] | ||
Documenter = "0.24" | ||
Documenter = "1" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,29 +1,32 @@ | ||
using Documenter, Kmers | ||
|
||
makedocs( | ||
format = Documenter.HTML(), | ||
sitename = "Kmers.jl", | ||
pages = [ | ||
"Home" => "index.md", | ||
"Kmer types" => "kmer_types.md", | ||
"Constructing kmers" => "construction.md", | ||
"Indexing & modifying kmers" => "transforms.md", | ||
"Predicates" => "predicates.md", | ||
"Random kmers" => "random.md", | ||
"Iterating over Kmers" => "iteration.md", | ||
"Translation" => "translate.md", | ||
#"Pattern matching and searching" => "sequence_search.md", | ||
#"Iteration" => "iteration.md", | ||
#"Counting" => "counting.md", | ||
#"I/O" => "io.md", | ||
#"Interfaces" => "interfaces.md" | ||
DocMeta.setdocmeta!( | ||
Kmers, | ||
:DocTestSetup, | ||
:(using BioSequences, Kmers, Test); | ||
recursive=true, | ||
) | ||
Comment on lines +3 to +8
Is this new? Very convenient! |
||
|
||
makedocs(; | ||
modules=[Kmers], | ||
format=Documenter.HTML(; prettyurls=get(ENV, "CI", nothing) == "true"), | ||
sitename="Kmers.jl", | ||
pages=[ | ||
"Home" => "index.md", | ||
"The Kmer type" => "kmers.md", | ||
"Iteration" => "iteration.md", | ||
"Translation" => "translation.md", | ||
"Hashing" => "hashing.md", | ||
"FAQ" => "faq.md", | ||
"Cookbook" => ["MinHash" => "minhash.md", "Kmer composition" => "composition.md"], | ||
], | ||
authors = "Ben J. Ward, The BioJulia Organisation and other contributors." | ||
authors="Jakob Nybo Nissen, Sabrina J. Ward, The BioJulia Organisation and other contributors.", | ||
checkdocs=:exports, | ||
) | ||
|
||
deploydocs( | ||
repo = "github.com/BioJulia/Kmers.jl.git", | ||
push_preview = true, | ||
deps = nothing, | ||
make = nothing | ||
deploydocs(; | ||
repo="github.com/BioJulia/Kmers.jl.git", | ||
push_preview=true, | ||
deps=nothing, | ||
make=nothing, | ||
) |
Flagging the empty here. OK to leave it as a placeholder, but maybe add "This page is a work in progress" or something? |
This file was deleted.
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,39 @@ | ||||||
```@meta | ||||||
CurrentModule = Kmers | ||||||
DocTestSetup = quote | ||||||
using BioSequences | ||||||
using Test | ||||||
using Kmers | ||||||
end | ||||||
``` | ||||||
## FAQ | ||||||
### Why can kmers not be compared to biosequences? | ||||||
It may be surprising that kmers cannot be compared to other biosequences: | ||||||
|
||||||
```jldoctest | ||||||
julia> dna"TAG" == mer"TAG"d | ||||||
ERROR: MethodError | ||||||
[...] | ||||||
``` | ||||||
|
||||||
In fact, this is implemented by a manually thrown `MethodError`; the generic case `Base.:==(::BioSequence, ::BioSequence)` is defined. | ||||||
|
||||||
This is a consequence of the following constraints: | ||||||
* `isequal(x, y)` implies `hash(x) == hash(y)` | ||||||
* `isequal(x, y)` and `x == y` ought to be identical for well-defined elements (i.e. in the absence of `missing`s and `NaN`s etc.) | ||||||
* `hash(::Kmer)` must be absolutely maximally efficient | ||||||
|
||||||
If kmers were to be comparable to `BioSequence`, then the hashing of `BioSequence` should follow `Kmer`, which practically speaking would mean that all biosequences would need to be recoded to `Kmer`s before hashing. | ||||||
|
||||||
### Why isn't there an iterator of unambiguous, canonical kmers or spaced, canonical kmers? | ||||||
Any iterator of nucleotide kmers can be made into a canonical kmer iterator by simply calling `canonical` on its output kmers. | ||||||
|
||||||
The `CanonicalKmers` iterator is special cased, because with a step size of 1, it is generally faster to build the next kmer by storing both the reverse and forward kmer, then creating the next kmer by prepending/appending the next symbol. | ||||||
|
||||||
However, with a larger step size, it becomes more efficient to build the forward kmer, then reverse-complement the whole kmer. | ||||||
|
||||||
### Why isn't there an iterator of skipmers/minimizers/k-min-mers, etc? | ||||||
The concept of kmers has turned out to be remarkably flexible and useful in bioinformatics, and has spawned a neverending stream of variations. | ||||||
|
||||||
We simply can't implement them all. | ||||||
However, we hope to make it relatively easy to implement custom kmer iterators for downstream users. | ||||||
Suggested change
I can't remember the exact syntax for this, but I think it's something like this |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,59 @@ | ||||||
```@meta | ||||||
CurrentModule = Kmers | ||||||
DocTestSetup = quote | ||||||
using BioSequences | ||||||
using Test | ||||||
using Kmers | ||||||
end | ||||||
``` | ||||||
|
||||||
!!! warning | ||||||
The values of hashes are guaranteed to be reproducible for a given version | ||||||
of Kmers.jl and Julia, but may __change__ in new minor versions of Julia | ||||||
or Kmers.jl. | ||||||
Comment on lines +10 to +13
Probably for MinHash.jl, but maybe worth revisiting hashing and compatibility with other implementations. See sourmash-bio/sourmash#2284 and links, esp BioJulia/BioSequences.jl#243 for prior discussion. I don't think there's anything to do here - we don't want to slow down Kmers.jl for the sake of compat, but if there's anything that would help enabling this (eg fast conversion to |
||||||
|
||||||
## Hashing | ||||||
Kmers.jl implements `Base.hash`, yielding a `UInt` value: | ||||||
|
||||||
```jldoctest; filter = r"^0x[0-9a-fA-F]+$" | ||||||
julia> hash(mer"UGCUGUAC"r) | ||||||
0xe5057d38c8907b22 | ||||||
``` | ||||||
|
||||||
The implementation of `Base.hash` for kmers strikes a compromise between providing a high-quality (non-cryptographic) hash and being reasonably fast. | ||||||
While hash collisions can easily be found, they are unlikely to occur at random. | ||||||
When kmers are of the same (or compatible) alphabets, different kmers hash to different values, even when they have the same underlying bitpattern: | ||||||
|
||||||
```jldoctest | ||||||
julia> using BioSequences: encoded_data | ||||||
|
||||||
julia> a = mer"TAG"d; b = mer"AAAAAAATAG"d; | ||||||
|
||||||
julia> encoded_data(a) === encoded_data(b) | ||||||
true | ||||||
|
||||||
julia> hash(a) == hash(b) | ||||||
false | ||||||
``` | ||||||
|
||||||
When they are of compatible alphabets, and have the same content, they hash to the same value. | ||||||
Currently, only DNA and RNA of the alphabets `DNAAlphabet` and `RNAAlphabet` are compatible: | ||||||
|
||||||
```jldoctest | ||||||
julia> a = mer"UUGU"r; b = mer"TTGT"d; | ||||||
|
||||||
julia> a == b # equal | ||||||
true | ||||||
|
||||||
julia> a === b # not egal | ||||||
Suggested change
Is |
||||||
false | ||||||
|
||||||
julia> hash(a) === hash(b) | ||||||
true | ||||||
``` | ||||||
|
||||||
For some applications, fast hashing is absolutely crucial. For these cases, Kmers.jl provides [`fx_hash`](@ref), which trades off hash quality for speed: | ||||||
I wonder if it would be worth linking to docs or another website explaining fx hash here. It's in the docstring for the function, so someone could click through, but I didn't know what it was before seeing that. |
||||||
|
||||||
```@docs | ||||||
fx_hash | ||||||
``` |
Do we only want to test on release?