
Decoding long strings can lead to large heap footprint #19

Open · drivehappy opened this issue May 26, 2020 · 0 comments

drivehappy commented May 26, 2020

The issue occurs when …

  1. Decoding longer strings with Decode.string causes JavaScript engines (V8 in this case) to use an internal string concatenation optimization that carries significant per-character memory overhead.
  2. This was initially observed when using the eriktim/elm-protocol-buffers package, which relies on elm/bytes to decode strings.

The data we work with can contain many strings that are quite long. A heap snapshot in Chrome shows at least 20 bytes of overhead for each individual character concatenated during string construction; in our case this adds up to roughly 200 MB of heap consumed by concatenated strings.

Why does the issue occur?

V8 strings appear to internally have 2 possible representations (https://gist.github.com/mraleph/3397008):

  • flat strings are immutable arrays of characters
  • cons strings are pairs of strings, the result of a concatenation

If you concat a and b, you get a cons-string (a, b) that represents the result of the concatenation. If you later concat d to that, you get another cons-string ((a, b), d).

Indexing into such a "tree-like" string is not O(1), so to make it faster V8 flattens the string when you index into it: it copies all characters into a flat string. The sketch below illustrates this at the JS level.
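
A conceptual sketch of this behavior; the flat/cons representations are internal to the engine and cannot be observed from script, so the comments only describe what V8 is likely doing:

// Conceptual sketch only: the representations noted in the comments are
// engine-internal and not inspectable from JavaScript.
var a = 'This is a longer ';
var b = 'string for testing';

var ab = a + b;      // V8 may represent this as the cons pair (a, b)
var abd = ab + '!';  // ... and this as the nested pair ((a, b), '!')

// Indexing forces flattening: before answering abd[0], the engine copies
// all characters of the tree into a single flat string.
var first = abd[0];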

In this case, it appears the implementation of _Bytes_read_string likely ends up producing the "cons string" type in the JS engine, since it builds its result by repeated concatenation.

Could you show an example?

module Main exposing (main)

import Bytes exposing (..)
import Bytes.Decode as Decode
import Bytes.Encode as Encode

main : Program () () ()
main =
    let
        encodedBytes =
            "This is a longer string for testing special JS concatenated strings"
            |> Encode.string
            |> Encode.encode

        decodedBytesStr =
            encodedBytes
            |> Decode.decode (Decode.string (Bytes.width encodedBytes))
    in
    Platform.worker
        { init = \_ -> ( (), Cmd.none )
        , update = \_ _ -> ( (), Cmd.none )
        , subscriptions = \_ -> Sub.none
        }

SSCCE

Once the above code runs:

  1. Open Chome devtools
  2. Memory tab
  3. Select "Heap snapshot" then press "Take snapshot"
  4. Filter snapshot by "(concatenated string)"
  5. Observe the many instances of the string "This is a longer string for ..."
    a. The shallow size reported for each individual character cons is 20 bytes
    b. The final retained size is 1348 bytes, consistent with roughly 20 bytes per character for the 67-character test string

How can we solve this?

According to this post (https://gist.github.com/mraleph/3397008), indexing into a cons string forces it to be flattened.

My approach would be to perform this index once, after the string has been completely decoded:

var _Bytes_read_string = F3(function(len, bytes, offset)
{
	var string = '';
	var end = offset + len;
	for (; offset < end;)
	{
		~snip~
	}

	// Force JS engines to flatten the string if it's internally represented by a "cons string"
	string[0];

	return __Utils_Tuple2(offset, string);
});

For very large strings, this index operation could instead be performed inside the loop itself (say, every N iterations), but evaluating that extra condition carries its own performance tradeoff; a sketch of this variant follows.
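
As a standalone sketch of that variant (hypothetical names, not the elm/bytes implementation), where FLATTEN_INTERVAL controls how often the intermediate string is indexed:

// Hypothetical sketch: append chunks by concatenation, indexing the
// intermediate string every FLATTEN_INTERVAL appends to encourage the
// engine to flatten it. Whether a bare index read has any effect is
// engine-dependent, as noted below.
var FLATTEN_INTERVAL = 1024;

function buildFlattened(chunks)
{
	var string = '';
	for (var i = 0; i < chunks.length; i++)
	{
		string += chunks[i];
		if ((i + 1) % FLATTEN_INTERVAL === 0)
		{
			string[0];
		}
	}
	return string;
}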

Note: the above index may not work, as some JS engines still seem to avoid actually flattening on it. Even the current approach of the npm flatstr package (https://www.npmjs.com/package/flatstr) doesn't seem to work:

string | 0

However, effectively reversing the string twice does seem to work around the cons string, as sketched below.
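
A sketch of that workaround as a hypothetical helper (not part of elm/bytes): the round trip through an array is an O(n) copy with a temporary array, but it always yields a freshly built flat string:

// Reversing the character array twice is a no-op on the contents, but
// split/join rebuilds the string flat. split('') works per UTF-16 code
// unit, and the double reverse restores surrogate pairs to their original
// order, so the result is identical to the input.
function flattenString(string)
{
	return string.split('').reverse().reverse().join('');
}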

Final notes

I struggled a bit with where the solution for this issue should live (in elm/bytes, in eriktim/elm-protocol-buffers, or in our own code). I decided to open it here, as I think a fix at the lowest level would be the most appropriate and would resolve this unexpected memory consumption for other libraries and applications that use string decoding. I don't see a use case for keeping an internal cons string representation once decoding has completed.
