The issue occurs when …

Using Decode.string on bytes to decode longer strings: this causes JavaScript engines (V8 in this case) to rely on an internal string-concatenation performance optimization.
This was initially observed with the eriktim/elm-protocol-buffers package, which uses elm/bytes to decode strings.
The data we work with can contain many strings that are quite long. A heap snapshot in Chrome indicates at least 20 bytes of overhead for each individual character that is concatenated during string construction. In our case this leads to ~200 MB consumed in the heap by concatenated strings.
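(For scale: assuming the reported ~20 bytes of cons overhead per concatenated character, ~200 MB corresponds to roughly 200,000,000 / 20 = 10 million concatenated characters across all decoded strings.)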
Why does the issue occur?

V8 strings appear to internally have 2 possible representations (https://gist.github.com/mraleph/3397008). Cons strings are pairs of strings, the result of concatenation: if you concat a and b, you get a cons-string (a, b) that represents the result of the concatenation. If you later concat d to that, you get another cons-string ((a, b), d). Indexing into such a "tree-like" string is not O(1), so to make it faster V8 flattens the string when you index into it: it copies all the characters into a flat string.
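For intuition, here is a minimal JavaScript sketch of the pattern that produces and then flattens a cons string; the names and sizes are illustrative, not V8 internals:

// Repeated += typically builds a tree of small "cons" nodes pairing the
// accumulated string with each new chunk, instead of copying characters.
var s = '';
for (var i = 0; i < 10000; i++)
{
	s += String.fromCharCode(65 + (i % 26));
}

// Indexing needs O(1) character access, so V8 flattens the tree here:
// all characters are copied once into a single contiguous string.
var first = s[0];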
In this case, it appears the implementation of _Bytes_read_string likely ends up using the "Cons string" type in the JS engine.
Could you show an example?

SSCCE:
module Main exposing (main)

import Bytes exposing (..)
import Bytes.Decode as Decode
import Bytes.Encode as Encode


main : Program () () ()
main =
    let
        encodedBytes =
            "This is a longer string for testing special JS concatenated strings"
                |> Encode.string
                |> Encode.encode

        -- Decoding reads the bytes back character by character;
        -- this is where the JS cons strings are built up.
        decodedBytesStr =
            encodedBytes
                |> Decode.decode (Decode.string (Bytes.width encodedBytes))
    in
    Platform.worker
        { init = \_ -> ( (), Cmd.none )
        , update = \_ _ -> ( (), Cmd.none )
        , subscriptions = \_ -> Sub.none
        }
Once the above code runs, take a heap snapshot in Chrome and observe the many instances of the string "This is a longer string for ...":

a. The shallow size reported for each individual character cons is 20 bytes
b. The final retained size is 1348 bytes
How can we solve this?

According to this post (https://gist.github.com/mraleph/3397008), indexing into a cons string forces it to be flattened. My approach would be to simply perform this once the string has been completely decoded:
var _Bytes_read_string = F3(function(len, bytes, offset)
{
	var string = '';
	var end = offset + len;
	for (; offset < end;)
	{
		~snip~
	}
	// Force JS engines to flatten the string if it's internally represented by a "cons string"
	string[0];
	return __Utils_Tuple2(offset, string);
});
Possibly, for very large strings, this index operation could be done inside the loop itself (perhaps every X iterations), but evaluating that condition on each iteration has its own performance tradeoffs.
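A minimal sketch of that in-loop variant, assuming a hypothetical ASCII-only read loop (readAsciiFlattening is illustrative, not elm/bytes code, and the interval is an arbitrary untuned value):

var FLATTEN_INTERVAL = 4096; // arbitrary illustrative interval, untuned

// Hypothetical helper, not part of elm/bytes: reads `len` ASCII bytes
// from a DataView, flattening the accumulator every FLATTEN_INTERVAL chars.
function readAsciiFlattening(view, offset, len)
{
	var string = '';
	var end = offset + len;
	for (var i = 0; offset < end; i++)
	{
		string += String.fromCharCode(view.getUint8(offset++));
		if (i % FLATTEN_INTERVAL === FLATTEN_INTERVAL - 1)
		{
			string[0]; // amortized flatten via index access
		}
	}
	return string;
}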
Note: the above index may not work, as some JS engines still seem to avoid flattening on it. Even the npm flatstr package's (https://www.npmjs.com/package/flatstr) current approach doesn't seem to work:

string | 0
However, effectively doing a string reverse-reverse does seem to work around the cons string.
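A sketch of that workaround; it assumes (as an observation, not a documented engine guarantee) that rebuilding the string via split/reverse/join leaves a flat representation behind:

// Round-trips the string through an array and back; reversing twice
// restores the original order, and join produces a freshly built string.
function flattenByDoubleReverse(string)
{
	return string.split('').reverse().reverse().join('');
}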
Final notes
I struggled a bit over where the solution for this issue should lie (in elm/bytes, in eriktim/elm-protocol-buffers, or in our own code). I decided to open it here, as I think a fix at the lowest level would be the most appropriate and would resolve this unexpected memory consumption for other libraries and applications that use the string decoding. I don't see a use case for keeping an internal cons string representation once decoding has completed.