Merge pull request #99 from dynatrace-oss/readme
Readme
oertl authored Apr 2, 2023
2 parents 11d6487 + 721361e commit 01cf0d5
Showing 5 changed files with 177 additions and 4 deletions.
70 changes: 67 additions & 3 deletions README.md
@@ -10,18 +10,27 @@

hash4j is a Java library by Dynatrace that includes various non-cryptographic hash algorithms and data structures that are based on high-quality hash functions.

## Adding hash4j to your build
## Content
- [First steps](#first-steps)
- [Hash algorithms](#hash-algorithms)
- [Similarity hashing](#similarity-hashing)
- [Approximate distinct counting](#approximate-distinct-counting)
- [File hashing](#file-hashing)
- [Consistent hashing](#consistent-hashing)
- [Contribution FAQ](#contribution-faq)

## First steps
To add a dependency on hash4j using Maven, use the following:
```xml
<dependency>
  <groupId>com.dynatrace.hash4j</groupId>
  <artifactId>hash4j</artifactId>
  <version>0.8.0</version>
  <version>0.9.0</version>
</dependency>
```
To add a dependency using Gradle:
```gradle
implementation 'com.dynatrace.hash4j:hash4j:0.8.0'
implementation 'com.dynatrace.hash4j:hash4j:0.9.0'
```

## Hash algorithms
@@ -153,6 +162,61 @@ HyperLogLog and UltraLogLog sketches can be reduced to corresponding sketches wi
HyperLogLog can be made compatible with implementations of other libraries which also use a single 64-bit hash value as input. The implementations usually differ only in which bits of the hash value are used for the register index and which bits are used to determine the number of leading (or trailing) zeros.
Therefore, if the bits of the hash value are permuted accordingly, compatibility can be achieved.
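
As a rough illustration of such a permutation, the following sketch assumes a hypothetical foreign implementation that reads the register index from the lowest `p` bits, while the hash at hand carries it in the highest `p` bits. Which bits hash4j or any other library actually uses is not stated here, and the class and method names are made up for this example.

```java
final class HashBitPermutationSketch {

  // Register index taken from the highest p bits (assumed layout A).
  static int indexFromHighBits(long hash, int p) {
    return (int) (hash >>> (64 - p));
  }

  // Register index taken from the lowest p bits (assumed layout B).
  static int indexFromLowBits(long hash, int p) {
    return (int) (hash & ((1L << p) - 1));
  }

  // Rotating left by p moves the highest p bits into the lowest p positions
  // and shifts the remaining 64 - p bits up without changing their order.
  static long highIndexToLowIndex(long hash, int p) {
    return Long.rotateLeft(hash, p);
  }
}
```

With `1 <= p <= 31`, a hash permuted by `highIndexToLowIndex` yields the same register index under layout B as the original hash does under layout A, and the remaining bits keep their relative order.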

## File hashing
This library contains an implementation of [ImoHash](https://github.com/kalafut/imohash) that
allows fast hashing of files.
It is based on the idea of hashing only the beginning,
a middle part, and the end of large files,
which is usually sufficient to distinguish files.
Unlike cryptographic hashing algorithms, this method is not suitable for verifying the integrity of files.
However, this algorithm can be useful for file indexes, for example, to find identical files.

### Usage
```java
// create some file in the given path
File file = path.resolve("test.txt").toFile();
try (FileWriter fileWriter = new FileWriter(file)) {
  fileWriter.write("this is the file content");
}

// use ImoHash to hash that file
HashValue128 hash = FileHashing.imohash1_0_2().hashFileTo128Bits(file);
// returns 0xd317f2dad6ea7ae56ff7fdb517e33918
```
See also [FileHashingDemo.java](src/test/java/com/dynatrace/hash4j/file/FileHashingDemo.java).
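
The sampling idea described above can be sketched as follows. This is not the ImoHash algorithm itself: `CRC32` stands in for the real hash function, and the class name as well as the `SAMPLE_SIZE` and `SAMPLE_THRESHOLD` values are assumptions chosen for illustration only.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.zip.CRC32;

final class SampledFileHashSketch {

  // Hypothetical parameters, not the defaults of ImoHash or hash4j.
  static final int SAMPLE_SIZE = 4096;
  static final long SAMPLE_THRESHOLD = 3L * SAMPLE_SIZE;

  static long sampledHash(Path file) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      long size = raf.length();
      CRC32 crc = new CRC32(); // stand-in checksum, NOT the hash used by ImoHash
      if (size < SAMPLE_THRESHOLD) {
        // small files are read completely
        feed(raf, crc, 0, (int) size);
      } else {
        // large files: only the beginning, a middle part, and the end are read
        feed(raf, crc, 0, SAMPLE_SIZE);
        feed(raf, crc, size / 2, SAMPLE_SIZE);
        feed(raf, crc, size - SAMPLE_SIZE, SAMPLE_SIZE);
      }
      // combine the file size with the checksum so files of different length differ
      return (size << 32) ^ crc.getValue();
    }
  }

  private static void feed(RandomAccessFile raf, CRC32 crc, long offset, int length)
      throws IOException {
    byte[] buffer = new byte[length];
    raf.seek(offset);
    raf.readFully(buffer);
    crc.update(buffer, 0, length);
  }
}
```

Two large files that differ only in a region outside the three samples get the same value under this sketch, which illustrates why a sampling-based hash cannot be used to verify file integrity.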

## Consistent hashing
This library contains an implementation of [JumpHash](https://arxiv.org/abs/1406.2294)
that can be used to achieve distributed agreement when assigning hash values to a given number of buckets.
The hash values are distributed uniformly over the buckets.
The algorithm also minimizes the number of reassignments needed for balancing when the number of buckets changes.

### Usage
```java
// create a consistent bucket hasher
ConsistentBucketHasher consistentBucketHasher =
    ConsistentHashing.jumpHash(PseudoRandomGeneratorProvider.splitMix64_V1());

long[] hashValues = {9184114998275508886L, 7090183756869893925L, -8795772374088297157L};

// determine assignment of hash value to 2 buckets
Map<Integer, List<Long>> assignment2Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 2)));
// gives {0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}

// determine assignment of hash value to 3 buckets
Map<Integer, List<Long>> assignment3Buckets =
    LongStream.of(hashValues)
        .boxed()
        .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 3)));
// gives {0=[-8795772374088297157], 1=[9184114998275508886], 2=[7090183756869893925]}
// hash value 7090183756869893925 got reassigned from bucket 0 to bucket 2
// probability of reassignment is equal to 1/3
```
See also [ConsistentHashingDemo.java](src/test/java/com/dynatrace/hash4j/consistent/ConsistentHashingDemo.java).
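
For intuition, the loop from the linked JumpHash paper can be ported to Java roughly as shown below. This sketch uses the linear congruential generator from the paper, whereas the example above configures hash4j with `PseudoRandomGeneratorProvider.splitMix64_V1()`, so the bucket numbers it produces will generally not match those of `ConsistentBucketHasher`. The class and method names are hypothetical.

```java
final class JumpHashSketch {

  // Maps a 64-bit key to a bucket in [0, numBuckets); numBuckets must be positive.
  static int getBucket(long key, int numBuckets) {
    long bucket = -1;
    long next = 0;
    while (next < numBuckets) {
      bucket = next;
      // advance the pseudo-random sequence that is seeded by the key
      key = key * 2862933555777941757L + 1;
      // jump forward; the result is always greater than the current bucket
      next = (long) ((bucket + 1) * ((double) (1L << 31) / (double) ((key >>> 33) + 1)));
    }
    return (int) bucket;
  }
}
```

Because the sequence of jumps depends only on the key, growing from `n` to `n + 1` buckets either leaves a key in its bucket or moves it to the new bucket `n`, which is what keeps the number of reassignments small.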

## Contribution FAQ

### Python
2 changes: 1 addition & 1 deletion build.gradle
@@ -68,7 +68,7 @@ java {
}

group = 'com.dynatrace.hash4j'
version = '0.8.0'
version = '0.9.0'

spotless {
ratchetFrom 'origin/main'
4 changes: 4 additions & 0 deletions src/main/java/com/dynatrace/hash4j/file/FileHashing.java
@@ -28,6 +28,8 @@ public interface FileHashing {
* <p>For a description of the algorithm see <a
* href="https://github.com/kalafut/imohash/blob/v1.0.2/algorithm.md">here</a>.
*
* <p>This algorithm does not return a uniformly distributed hash value.
*
* @return a file hasher instance
*/
static FileHasher128 imohash1_0_2() {
@@ -44,6 +46,8 @@ static FileHasher128 imohash1_0_2() {
* <p>For a description of the algorithm and the parameters see <a
* href="https://github.com/kalafut/imohash/blob/v1.0.2/algorithm.md">here</a>.
*
* <p>This algorithm does not return a uniformly distributed hash value.
*
* @param sampleSize the sample size
* @param sampleThreshold the sample threshold
* @return a file hasher instance
60 changes: 60 additions & 0 deletions src/test/java/com/dynatrace/hash4j/consistent/ConsistentHashingDemo.java
@@ -0,0 +1,60 @@
/*
* Copyright 2023 Dynatrace LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.dynatrace.hash4j.consistent;

import static java.util.stream.Collectors.groupingBy;
import static org.assertj.core.api.AssertionsForInterfaceTypes.assertThat;

import com.dynatrace.hash4j.random.PseudoRandomGeneratorProvider;
import java.util.List;
import java.util.Map;
import java.util.stream.LongStream;
import org.junit.jupiter.api.Test;

class ConsistentHashingDemo {

  @Test
  void demoJumphash() {

    // create a consistent bucket hasher
    ConsistentBucketHasher consistentBucketHasher =
        ConsistentHashing.jumpHash(PseudoRandomGeneratorProvider.splitMix64_V1());

    long[] hashValues = {9184114998275508886L, 7090183756869893925L, -8795772374088297157L};

    // determine assignment of hash value to 2 buckets
    Map<Integer, List<Long>> assignment2Buckets =
        LongStream.of(hashValues)
            .boxed()
            .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 2)));
    // gives {0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}

    // determine assignment of hash value to 3 buckets
    Map<Integer, List<Long>> assignment3Buckets =
        LongStream.of(hashValues)
            .boxed()
            .collect(groupingBy(hash -> consistentBucketHasher.getBucket(hash, 3)));
    // gives {0=[-8795772374088297157], 1=[9184114998275508886], 2=[7090183756869893925]}
    // hash value 7090183756869893925 got reassigned from bucket 0 to bucket 2
    // probability of reassignment is equal to 1/3

    assertThat(assignment2Buckets)
        .hasToString("{0=[7090183756869893925, -8795772374088297157], 1=[9184114998275508886]}");
    assertThat(assignment3Buckets)
        .hasToString(
            "{0=[-8795772374088297157], 1=[9184114998275508886], 2=[7090183756869893925]}");
  }
}
45 changes: 45 additions & 0 deletions src/test/java/com/dynatrace/hash4j/file/FileHashingDemo.java
@@ -0,0 +1,45 @@
/*
* Copyright 2023 Dynatrace LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.dynatrace.hash4j.file;

import static org.assertj.core.api.Assertions.assertThat;

import com.dynatrace.hash4j.hashing.HashValue128;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Path;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

class FileHashingDemo {

  @Test
  void demoImohash(@TempDir Path path) throws IOException {

    // create some file in the given path
    File file = path.resolve("test.txt").toFile();
    try (FileWriter fileWriter = new FileWriter(file)) {
      fileWriter.write("this is the file content");
    }

    // use ImoHash to hash that file
    HashValue128 hash = FileHashing.imohash1_0_2().hashFileTo128Bits(file);
    // returns 0xd317f2dad6ea7ae56ff7fdb517e33918

    assertThat(hash).hasToString("0xd317f2dad6ea7ae56ff7fdb517e33918");
  }
}