
Decoding long strings can lead to large heap footprint #19

Open · drivehappy opened this issue May 26, 2020 · 0 comments

drivehappy commented May 26, 2020

The issue occurs when …

  1. Decoding longer strings with Decode.string causes JavaScript engines (V8 in this case) to use an internal string concatenation optimization that carries significant per-character memory overhead.
  2. This was initially observed when using the eriktim/elm-protocol-buffers package, which relies on elm/bytes to decode strings.

The data we work with can contain many strings that are quite long. A heap snapshot in Chrome shows at least 20 bytes of overhead for each individual character concatenated during string construction; in our case this adds up to roughly 200 MB of heap consumed by concatenated strings.

Why does the issue occur?

V8 strings appear to internally have 2 possible representations (https://gist.github.com/mraleph/3397008):

  • flat strings are immutable arrays of characters
  • cons strings are pairs of strings, the result of a concatenation

If you concat a and b, you get a cons-string (a, b) that represents the result of the concatenation. If you later concat d to that, you get another cons-string ((a, b), d).

Indexing into such a "tree-like" string is not O(1), so to make it faster V8 flattens the string when you index into it: it copies all characters into a flat string. The sketch below illustrates this at the JS level.
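
A conceptual sketch of this behavior; the flat/cons representations are internal to the engine and cannot be observed from script, so the comments only describe what V8 is likely doing:

// Conceptual sketch only: the representations noted in the comments are
// engine-internal and not inspectable from JavaScript.
var a = 'This is a longer ';
var b = 'string for testing';

var ab = a + b;      // V8 may represent this as the cons pair (a, b)
var abd = ab + '!';  // ... and this as the nested pair ((a, b), '!')

// Indexing forces flattening: before answering abd[0], the engine copies
// all characters of the tree into a single flat string.
var first = abd[0];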

In this case, it appears the implementation of _Bytes_read_string likely ends up producing the "cons string" type in the JS engine, since it builds its result by repeated concatenation.

Could you show an example?

module Main exposing (main)

import Bytes exposing (..)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

main : Program () () ()
main =
    let
        encodedBytes =
            "This is a longer string for testing special JS concatenated strings"
            |> Encode.string
            |> Encode.encode

        decodedBytesStr =
            encodedBytes
            |> Decode.decode (Decode.string (Bytes.width encodedBytes))
    in
    Platform.worker
        { init = \_ -> ( (), Cmd.none )
        , update = \_ _ -> ( (), Cmd.none )
        , subscriptions = \_ -> Sub.none
        }

SSCCE

Once the above code runs:

  1. Open Chome devtools
  2. Memory tab
  3. Select "Heap snapshot" then press "Take snapshot"
  4. Filter snapshot by "(concatenated string)"
  5. Observe the many instances of the string "This is a longer string for ..."
    a. The shallow size reported for each individual character cons is 20 bytes
    b. The final retained size is 1348 bytes, consistent with roughly 20 bytes per character for the 67-character test string

How can we solve this?

According to this post (https://gist.github.com/mraleph/3397008), indexing into a cons string forces it to be flattened.

My approach would be to perform this index once, after the string has been completely decoded:

var _Bytes_read_string = F3(function(len, bytes, offset)
{
	var string = '';
	var end = offset + len;
	for (; offset < end;)
	{
		~snip~
	}

	// Force JS engines to flatten the string if it's internally represented by a "cons string"
	string[0];

	return __Utils_Tuple2(offset, string);
});

For very large strings, this index operation could instead be performed inside the loop itself (say, every N iterations), but evaluating that extra condition carries its own performance tradeoff; a sketch of this variant follows.
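
As a standalone sketch of that variant (hypothetical names, not the elm/bytes implementation), where FLATTEN_INTERVAL controls how often the intermediate string is indexed:

// Hypothetical sketch: append chunks by concatenation, indexing the
// intermediate string every FLATTEN_INTERVAL appends to encourage the
// engine to flatten it. Whether a bare index read has any effect is
// engine-dependent, as noted below.
var FLATTEN_INTERVAL = 1024;

function buildFlattened(chunks)
{
	var string = '';
	for (var i = 0; i < chunks.length; i++)
	{
		string += chunks[i];
		if ((i + 1) % FLATTEN_INTERVAL === 0)
		{
			string[0];
		}
	}
	return string;
}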

Note: the above index may not work, as some JS engines still seem to avoid actually flattening on it. Even the current approach of the npm flatstr package (https://www.npmjs.com/package/flatstr) doesn't seem to work:

string | 0

However, effectively reversing the string twice does seem to work around the cons string, as sketched below.
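
A sketch of that workaround as a hypothetical helper (not part of elm/bytes): the round trip through an array is an O(n) copy with a temporary array, but it always yields a freshly built flat string:

// Reversing the character array twice is a no-op on the contents, but
// split/join rebuilds the string flat. split('') works per UTF-16 code
// unit, and the double reverse restores surrogate pairs to their original
// order, so the result is identical to the input.
function flattenString(string)
{
	return string.split('').reverse().reverse().join('');
}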

Final notes

I struggled a bit with where the solution for this issue should live (in elm/bytes, in eriktim/elm-protocol-buffers, or in our own code). I decided to open it here, as I think a fix at the lowest level would be the most appropriate and would resolve this unexpected memory consumption for other libraries and applications that use string decoding. I don't see a use case for keeping an internal cons string representation once decoding has completed.
