Skip to content
This repository has been archived by the owner on Sep 20, 2023. It is now read-only.

make Digest a strict datatype? #70

Open
joeyh opened this issue Feb 26, 2016 · 6 comments
Open

make Digest a strict datatype? #70

joeyh opened this issue Feb 26, 2016 · 6 comments

Comments

@joeyh
Copy link

joeyh commented Feb 26, 2016

Digest can be partially evaluated, which presents several gotchas.

I had some simple code using cryptonite to get a hash:

md5 :: L.ByteString -> Digest MD5
md5 = hashLazy

md5hasher :: L.ByteString -> String
md5hasher = show . md5

md5hasher <$> L.readFile foo

This worked ok, running in constant space. But, it kept the file handle open until some later point when the hash was used (or possibly open past use until garbage collected). That turned out to prevent eg, deleting the file under windows (which doesn't allow open files to be deleted), which broke my test suite. So, I made what seemed like an innocuous change to "fix" that:

!hash <- md5hasher <$> L.readFile foo
return hash

But this version sometimes buffers the whole file content in memory. The bang pattern doesn't fully force the String, and so the Digest isn't fully forced either, and the hash can end up partially evaluated. In one case, hashing a 30 gb file resulted in a 30 gb malloc!

So, I had to do this (or could have used the NFData instance for Digest for same effect):

!hash <- md5hasher <$> L.readFile foo
return (length hash `seq` hash)

Which I think will avoid all problems, but was not easy to arrive at.

From my brief review of the API, I don't see any benefits to letting Digest be partially evaluated. So why not make the data type internally strict, or have hashLazy fully force it.

@vincenthz
Copy link
Member

Yes, I think that's a sensible change. I wonder if Context should have the same treatment

@vincenthz
Copy link
Member

I'm a bit puzzled actually. Just was looking at the code, and trying couple of things to try to reproduce but just couldn't; is the problem not in the part that transform a Digest x -> String and lack of evaluation of String ?

@joeyh
Copy link
Author

joeyh commented Feb 28, 2016

Vincent Hanquez wrote:

I'm a bit puzzled actually. Just was looking at the code, and trying couple of
things to try to reproduce but just couldn't; is the problem not in the code
that transform a Digest x -> String ?

I had some difficulty reproducing it too, only saw the problem on some
systems and not others, and only in some use patterns (comparison with
known checksums).

I don't see how Digest x -> String could need 3 terabytes of memory, so
I assume the memory use is in a partially evaluated Digest keeping
references to parts of the lazy ByteString, so that the whole input data
remains in memory.

see shy jo

@vincenthz
Copy link
Member

sorry for being unclear, but when I say the problem is in the Digest x -> String part, I meant it's actually not forcing any evaluation of the Digest anyway unless you're actually looking at least at the first char which ! is not going to do.

i.e. the problem is:

 !hashAsString <- md5hasher <$> L.readline foo
 return hashAsString

Whereas if you write:

!hash <- md5 <$> L.readFile foo
return $ show hash

I think the ! is doing the right thing in the second case, and only the digest -> String would be done lazily. Whereas in the first case, the ! on a String would do nothing.

@vincenthz
Copy link
Member

although, it doesn't quite explain needing so much memory, maybe some context are not folded properly to the next one..

@joeyh
Copy link
Author

joeyh commented Feb 28, 2016

Yes, part of the problem was indeed that bang doesn't fully force String.
It forced it enough to fix my FD-remaining-open problem, but not enough
to avoid buffering the file in ram. While that was my mistake to make,
it wouldn't have mattered that I made that mistake if Digest was itself
strict.

Also, it's not really clear from the public API whether bang fully forces
a Digest. newtype Digest a = Digest Bytes, and the implementation of
Bytes is not exposed so it's difficult to reason about how to force it,
other than using the handy NFData instance of course.

see shy jo

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants