HashTable grows even though capacity is not exceeded #602

Open · e00E opened this issue Dec 18, 2024 · 6 comments

@e00E commented Dec 18, 2024

HashTable::with_capacity docs say:

The hash table will be able to hold at least capacity elements without reallocating.

The following test shows that this statement is incorrect:

use hashbrown::hash_table::{Entry, HashTable};

#[test]
fn foo() {
    let capacity = 100;
    let mut hash_table = HashTable::with_capacity(capacity);
    for i in 0u64..1_000 {
        dbg!(i);
        assert!(hash_table.len() < capacity);
        let hasher =
            |_: &_| unreachable!("hasher won't be called because there is capacity left");
        let eq = |j: &u64| *j == i;
        let Entry::Vacant(entry) = hash_table.entry(i, eq, hasher) else {
            unreachable!("actually unreachable");
        };
        entry.insert(i);
        if hash_table.len() == capacity {
            hash_table.find_entry(i, eq).unwrap().remove();
        }
    }
}

Expectation:
As the assert shows, at the start of every loop iteration there is at least one slot of spare capacity. The HashTable therefore does not need to grow, and it should not call our hasher in HashTable::entry.

Reality:
The HashTable attempts to grow and calls the hasher, hitting the unreachable!(). The debug prints tell us that this happens at i = 112.

This is a problem because it prevents you from writing code that relies on HashTable not growing when enough capacity has been allocated. You should be able to write such code.

@cuviper (Member) commented Dec 18, 2024

This is because remove() may leave a DELETED marker, which reduces the reported capacity(). Your assertion doesn't reflect this since you're comparing to a fixed capacity value.

I proposed a cleanup method in #255, but you would still need a proper hasher to use that.
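
For reference, here is a minimal sketch that makes the shrinking capacity() visible. It assumes hashbrown's HashTable API and uses an identity "hash" purely for illustration:

use hashbrown::hash_table::HashTable;

fn main() {
    let hasher = |v: &u64| *v; // identity "hash", for illustration only
    let mut table: HashTable<u64> = HashTable::with_capacity(100);
    let initial = table.capacity();
    // Fill the table to its reported capacity...
    for i in 0..initial as u64 {
        table.insert_unique(hasher(&i), i, hasher);
    }
    // ...then remove every element again.
    for i in 0..initial as u64 {
        table.find_entry(hasher(&i), |v| *v == i).unwrap().remove();
    }
    // Removals from full control groups leave DELETED markers; those slots
    // are not counted as free again, so the reported capacity() drops.
    println!("len = {}, capacity: {initial} -> {}", table.len(), table.capacity());
}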

@e00E (Author) commented Dec 18, 2024

There are two issues here.

  1. The documentation of with_capacity is wrong. The test's HashTable never holds more elements than capacity, yet it reallocates.
  2. Inserting when capacity is left might move elements around unexpectedly.

There should be a way to initialize HashTable that guarantees the table will not grow unless you exceed the initially requested size. This is an important feature of collections. Currently with_capacity is buggy/misleading because it implies that guarantee but does not actually uphold it. Every std collection's with_capacity method guarantees this; the sketch below illustrates the contract. If HashTable needs more space internally than the given capacity, it is free to allocate more in with_capacity. It is not free to allocate more later.
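
For comparison, a small sketch of the contract that std upholds, using Vec (the pointer comparison only serves to show that no reallocation happened):

fn main() {
    let mut v = Vec::with_capacity(100);
    let ptr = v.as_ptr();
    for i in 0..100 {
        v.push(i);
    }
    // Filling up to the requested capacity never reallocates,
    // so the buffer pointer is unchanged.
    assert_eq!(ptr, v.as_ptr());
}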

I'm unsure whether there should be a guarantee that elements do not move around while there is capacity left. I assumed HashMap worked this way, but I see now that it is not stated anywhere. You probably don't want to guarantee this, to keep more freedom in the implementation.

@hkBst commented Jan 19, 2025

Are deletions that leave unusable entries, which can only be reclaimed by rehashing the entire table, the only reason that capacity can go down?

@bobrik commented Jan 28, 2025

Inserting the same key twice leads to the same issue, but seemingly only when the length has reached the capacity:

use std::collections::HashMap;

const PROBE_SIZE: usize = 7;

fn main() {
    let mut map = HashMap::with_capacity(PROBE_SIZE);

    for i in 0..PROBE_SIZE {
        map.insert(i, 0);
        map.insert(i, 0);

        eprintln!(
            "i = {i}, len = {}, capacity = {}",
            map.len(),
            map.capacity()
        );
    }
}
Output:

i = 0, len = 1, capacity = 7
i = 1, len = 2, capacity = 7
i = 2, len = 3, capacity = 7
i = 3, len = 4, capacity = 7
i = 4, len = 5, capacity = 7
i = 5, len = 6, capacity = 7
i = 6, len = 7, capacity = 14

@hkBst commented Jan 28, 2025

I think the issue is that the hash table does not check whether the key is already present before deciding to grow. The following version only does the double insert on the last iteration and produces the same output as yours:

use std::collections::HashMap;

const PROBE_SIZE: usize = 7;

fn main() {
    let mut map = HashMap::with_capacity(PROBE_SIZE);

    for i in 0..PROBE_SIZE {
        map.insert(i, 0);
        if i == PROBE_SIZE - 1 {
            map.insert(i, 0);
        }

        eprintln!(
            "i = {i}, len = {}, capacity = {}",
            map.len(),
            map.capacity()
        );
    }
}

@bobrik commented Jan 29, 2025

Right, it happens here:

I'm not really sure how to make this work out of the box without incurring a performance penalty, but it seems like good behavior to uphold.
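
In the meantime, one possible workaround (a sketch, not an official recommendation) is to route updates of existing keys through get_mut instead of a blind re-insert; the lookup path does not reserve, so the table from the earlier example stays at capacity 7:

use std::collections::HashMap;

const PROBE_SIZE: usize = 7;

fn main() {
    let mut map = HashMap::with_capacity(PROBE_SIZE);

    for i in 0..PROBE_SIZE {
        map.insert(i, 0);
        // The second write goes through get_mut instead of insert, so no
        // spurious growth happens when len reaches capacity.
        if let Some(v) = map.get_mut(&i) {
            *v = 0;
        }

        eprintln!(
            "i = {i}, len = {}, capacity = {}",
            map.len(),
            map.capacity()
        );
    }
}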
