
Slow and seemingly unreasonably large memory usage #69

Open
Goblinlordx opened this issue Apr 27, 2023 · 7 comments
Labels: bug

Comments

@Goblinlordx

Goblinlordx commented Apr 27, 2023

Have you tried the latest version of polars?

Yes

What version of polars are you using?

0.7.3

What operating system are you using polars on?

macOS 13.0

What node version are you using?

node 18.6.0

Describe your bug.

Passing data to nodejs-polars is slow, and memory usage seems unreasonably large.

With the example code below, it takes about 20 seconds and memory usage balloons to around 3.5GB.

When doing a simple copy of the data in JS only, the operation takes 10-20ms and peak memory usage is around 325MB.

What are the steps to reproduce the behavior?

const { DataFrame, Series } = require("nodejs-polars")

function runBuggyCode(entries) {
  // Build a DataFrame from the nested JS arrays.
  let df = new DataFrame(entries)
  // Add a constant timestamp column matching the frame height.
  const timestampSeries = new Series("created_at", new Array(df.height).fill(Date.now()))
  df = df.withColumn(timestampSeries)
  // Serialize the frame to an in-memory parquet buffer.
  df.writeParquet()
}

async function main() {
  const data = Array(50000)
    .fill(null)
    .map((_, i) =>
      Array(100)
        .fill(0)
        .map((_, ii) => i * 100 + ii)
    )

  let count = 0
  let peakMemUsage = 0
  while (true) {
    const start = Date.now()
    runBuggyCode(data)
    console.log("process time:", Date.now() - start, "ms")
    peakMemUsage = Math.max(process.memoryUsage().rss, peakMemUsage)
    console.log("peak rss mem usage:", Math.round(peakMemUsage / 1024 ** 2), "MB")
    console.log("run:", count++)
    await new Promise((res) => setTimeout(res, 100))
  }
}

if (require.main === module) {
  void main().catch((err) => console.error(err))
}

Example JS replacement for comparison:

function runBuggyCode(entries) {
  const copy = entries.map(row => row.slice())
}

What is the actual behavior?

The code runs as expected; it just uses what seems to be an unreasonable amount of memory.

What is the expected behavior?

Memory usage should be somewhat comparable to, maybe 2x, the usage for the data when not using the library.
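
For scale: the test data is 50,000 rows of 100 numbers, i.e. roughly 50,000 × 100 × 8 bytes ≈ 40 MB if stored as contiguous 64-bit values, so even a few multiples of that should stay well under the ~3.5GB observed.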

@Goblinlordx added the bug label Apr 27, 2023
@Goblinlordx changed the title from "Slow and seemingly unreasonable large memory usage" to "Slow and seemingly unreasonably large memory usage" Apr 28, 2023
@universalmind303
Collaborator

universalmind303 commented Apr 28, 2023

@Goblinlordx There is a significant amount of overhead when crossing between JS and Rust. Polars is best used when reading/writing directly from the filesystem, or using the lazy APIs as much as possible. Unfortunately, the upfront cost of converting JS data to Rust is quite high.

However, you can work around this by using TypedArrays instead of arrays when initializing your Series. TypedArrays have much lower overhead than other JS objects. The napi framework briefly explains this overhead here and here.

By simply using a TypedArray, we can reduce the footprint dramatically:

  const data = Array(50000)
    .fill(null)
    .map((_, i) =>
      Int32Array.from(Array(100)
        .fill(0)
        .map((_, ii) => i * 100 + ii))
    )

Without TypedArrays

process time: 21465 ms
peak rss mem usage: 3654 MB

With TypedArrays

process time: 101 ms
peak rss mem usage: 258 MB
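
The same trick should apply to the timestamp column in the original repro; a sketch, assuming the Series constructor accepts typed arrays in the same way:

function runFasterCode(entries) {
  let df = new DataFrame(entries)
  // A Float64Array avoids allocating 50,000 boxed JS numbers before crossing into Rust.
  const timestampSeries = new Series("created_at", new Float64Array(df.height).fill(Date.now()))
  df = df.withColumn(timestampSeries)
  df.writeParquet()
}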

@Goblinlordx
Author

Goblinlordx commented May 2, 2023

Ah, I can confirm that resolves this specific case. However, this doesn't really cover a more realistic case, like even the one given in the examples in the documentation:

>const df = pl.DataFrame(
...   {
...     A: [1, 2, 3, 4, 5],
...     fruits: ["banana", "banana", "apple", "apple", "banana"],
...     B: [5, 4, 3, 2, 1],
...     cars: ["beetle", "audi", "beetle", "beetle", "beetle"],
...   }
... )
> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars").eq(lit("beetle"))).sum(),
...     pl.col("A").filter(pl.col("B").gt(2)).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").flatten().alias("rev_A_by_fruits")
...   )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┘

Is there a way to accomplish this when the arrays contain strings? How could I recreate something like this efficiently when the large dataset is being created in JS?

Edit: I updated the code I provided to use a typed array as well as a string array:


function runBuggyCode(entries) {
  let df = new DataFrame(entries)
  df.writeParquet()
  // const copy = {A: entries.A.slice(), B: entries.B.slice() }
}

async function main() {
  const data = {
    A: new Int32Array(10000000).fill(0).map((_, i) => i),
    B: Array(10000000)
      .fill("")
      .map((__, i) => String(i)),
  }

  let count = 0
  let peakMemUsage = 0
  while (true) {
    const start = Date.now()
    runBuggyCode(data)
    console.log("process time:", Date.now() - start, "ms")
    peakMemUsage = Math.max(process.memoryUsage().rss, peakMemUsage)
    console.log("peak rss mem usage:", Math.round(peakMemUsage / 1024 ** 2), "MB")
    console.log("run:", count++)
    await new Promise((res) => setTimeout(res, 100))
  }
}

if (require.main === module) {
  void main().catch((err) => console.error(err))
}

Actually, the memory usage seems somewhat acceptable (roughly 2x). It is still relatively slow, but I assume that can't be helped. The memory usage does have somewhat odd behavior, though. The initial memory usage seems as expected, but it looks like memory is not being freed between operations, so usage slowly grows to about 3-4x over many iterations. I think it would be useful if there were a way to prevent this type of growth. If there is not, that is ok; I just feel it should probably be noted somewhere for users.

For me, I think this was the effect I was observing. Basically, the way I had this set up before, I was finding and fetching relevant data files (thousands of them, some GB+ in size). These were gzipped files which would be stream-gunzipped, transformed into dataframes for use with Polars, and run through some preliminary operations before finally being emitted in chunks as parquet files. In the meantime, I have removed polars from my setup, as the memory usage is prohibitive and results in my containers getting killed by the orchestrator after some amount of time. There seems to be some condition that prevents memory from being freed, as there were cases where I observed memory usage growing to over 68GB. It was hard to isolate the specific cause because memory seemed to just grow in an unbounded way when using actual data; I couldn't recreate this exact behavior with synthetic data sets.

@universalmind303
Collaborator

So there is not much we can do about the JS <--> Rust overhead in hot loops such as this, where dataframes are instantiated repeatedly.

The optimal solution would be to use Apache Arrow to create the values in pure JS, then pass the buffer off to polars for processing. This is notably faster than going through the node-api (napi). However, we currently can't interop with arrow-js because they don't yet support LargeUtf8. While not as fast, you can do something similar with JSON if you are working with row-oriented data.

const pl = require("nodejs-polars")

// Row-oriented data as an array of plain objects.
const data = [{}, {}, {}]
// Serialize once in JS, then hand polars a single buffer to parse.
const json_data = JSON.stringify(data)
const buff = Buffer.from(json_data)
const df = pl.readJSON(buff)

There seems to be some condition that might prevent memory from being freed as there were cases where I observed the memory usage growing to over 68GB.

This definitely seems like a potential memory leak, or the GC not triggering properly. I'll look into this further.

@Goblinlordx
Author

Goblinlordx commented May 3, 2023

As a note on the memory issue I had specifically: it was strictly an increase in RSS memory usage, while heap usage seemed to remain relatively stable (and low). So it definitely seemed like something involving allocation outside JS, but it seems like it could be pretty difficult to track down.

I really appreciate the info and feedback. We can close this if you like, or you can keep it open if you want to use it for tracking down the mysterious memory leak (😢). I will move forward with other options for the moment and will take your suggestions under advisement.

Again, I really appreciate the insight and feedback. 👍 Thanks a ton~

@universalmind303
Collaborator

Looks like the Node.js team has been looking into creating some fast paths for initializing objects via node-api:

nodejs/node#45905

@overlookmotel

Concerning memory for buffers not getting freed ("observed the memory usage growing to over 68GB"), the reason may be that the "finalize" callbacks which free the memory are scheduled to run with setImmediate(). So if your entire loop runs synchronously, memory usage will mount up endlessly until the next tick of the event loop.

If that is the problem, waiting a tick after each turn of the loop would allow the finalize callbacks to fire, and memory to be freed.
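
A minimal sketch of that workaround, using a hypothetical processChunk() to stand in for whatever synchronous work allocates the native buffers:

async function processAll(chunks) {
  for (const chunk of chunks) {
    processChunk(chunk) // hypothetical: the synchronous work that allocates native memory
    // Yield one tick so scheduled "finalize" callbacks can run and free that memory.
    await new Promise((res) => setImmediate(res))
  }
}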

I hit the same problem in a different context: napi-rs/napi-rs#1171 (comment)

For discussion of this problem, and the reasons behind it, see the conversation from this comment onwards:

nodejs/node-addon-api#917 (comment)

Hope that's helpful.

@drewbitt

The optimal solution would be to use Apache Arrow to create the values in pure JS, then pass the buffer off to polars for processing. This is notably faster than going through the node-api (napi). However, we currently can't interop with arrow-js because they don't yet support LargeUtf8. While not as fast, you can do something similar with JSON if you are working with row-oriented data.

FYI, Arrow added LargeUtf8 in version 15. apache/arrow#35780
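
If that unblocks the arrow-js route, a rough sketch of the idea (assuming apache-arrow's tableFromArrays/tableToIPC and that pl.readIPC accepts an Arrow IPC buffer):

const pl = require("nodejs-polars")
const { tableFromArrays, tableToIPC } = require("apache-arrow")

// Build the columns in pure JS via arrow-js, avoiding per-value napi conversion.
const table = tableFromArrays({
  A: Int32Array.from({ length: 5 }, (_, i) => i),
  fruits: ["banana", "banana", "apple", "apple", "banana"],
})

// Serialize to the Arrow IPC file format and hand polars a single buffer.
const buff = Buffer.from(tableToIPC(table, "file"))
const df = pl.readIPC(buff)
console.log(df.head())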
