
Slow and seemingly unreasonably large memory usage #69

Open
Goblinlordx opened this issue Apr 27, 2023 · 7 comments
Labels: bug

Comments

@Goblinlordx

Goblinlordx commented Apr 27, 2023

Have you tried the latest version of polars?

Yes

What version of polars are you using?

0.7.3

What operating system are you using polars on?

macOS 13.0

What node version are you using?

node 18.6.0

Describe your bug.

Passing data to nodejs-polars is slow, and memory usage seems unreasonably large.

With the example code below, it takes about 20 seconds and memory usage balloons to around 3.5GB.

When doing a simple copy of the data in JS only, the operation takes 10-20ms and peak memory usage is around 325MB.

What are the steps to reproduce the behavior?

const { DataFrame, Series } = require("nodejs-polars")

function runBuggyCode(entries) {
  // Build a DataFrame from the nested JS arrays.
  let df = new DataFrame(entries)
  // Add a constant timestamp column matching the frame height.
  const timestampSeries = new Series("created_at", new Array(df.height).fill(Date.now()))
  df = df.withColumn(timestampSeries)
  // Serialize the frame to an in-memory parquet buffer.
  df.writeParquet()
}

async function main() {
  const data = Array(50000)
    .fill(null)
    .map((_, i) =>
      Array(100)
        .fill(0)
        .map((_, ii) => i * 100 + ii)
    )

  let count = 0
  let peakMemUsage = 0
  while (true) {
    const start = Date.now()
    runBuggyCode(data)
    console.log("process time:", Date.now() - start, "ms")
    peakMemUsage = Math.max(process.memoryUsage().rss, peakMemUsage)
    console.log("peak rss mem usage:", Math.round(peakMemUsage / 1024 ** 2), "MB")
    console.log("run:", count++)
    await new Promise((res) => setTimeout(res, 100))
  }
}

if (require.main === module) {
  void main().catch((err) => console.error(err))
}

Example JS replacement for comparison:

function runBuggyCode(entries) {
  const copy = entries.map(row => row.slice())
}

What is the actual behavior?

The code runs as expected; it just uses what seems to be an unreasonable amount of memory.

What is the expected behavior?

Memory usage should be somewhat comparable to, maybe 2x, the usage for the data when not using the library.
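
For scale: the test data is 50,000 rows of 100 numbers, i.e. roughly 50,000 × 100 × 8 bytes ≈ 40 MB if stored as contiguous 64-bit values, so even a few multiples of that should stay well under the ~3.5GB observed.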

@Goblinlordx added the bug label Apr 27, 2023
@Goblinlordx changed the title from "Slow and seemingly unreasonable large memory usage" to "Slow and seemingly unreasonably large memory usage" Apr 28, 2023
@universalmind303
Collaborator

universalmind303 commented Apr 28, 2023

@Goblinlordx There is a significant amount of overhead when crossing between JS and Rust. Polars is best used when reading/writing directly from the filesystem, or using the lazy APIs as much as possible. Unfortunately, the upfront cost of converting JS data to Rust is quite high.

However, you can work around this by using TypedArrays instead of arrays when initializing your Series. TypedArrays have much lower overhead than other JS objects. The napi framework briefly explains this overhead here and here.

By simply using a TypedArray, we can reduce the footprint dramatically:

  const data = Array(50000)
    .fill(null)
    .map((_, i) =>
      Int32Array.from(Array(100)
        .fill(0)
        .map((_, ii) => i * 100 + ii))
    )

Without TypedArrays

process time: 21465 ms
peak rss mem usage: 3654 MB

With TypedArrays

process time: 101 ms
peak rss mem usage: 258 MB
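
The same trick should apply to the timestamp column in the original repro; a sketch, assuming the Series constructor accepts typed arrays in the same way:

function runFasterCode(entries) {
  let df = new DataFrame(entries)
  // A Float64Array avoids allocating 50,000 boxed JS numbers before crossing into Rust.
  const timestampSeries = new Series("created_at", new Float64Array(df.height).fill(Date.now()))
  df = df.withColumn(timestampSeries)
  df.writeParquet()
}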

@Goblinlordx
Author

Goblinlordx commented May 2, 2023

Ah, I can confirm that resolves this specific case. However, this doesn't really cover a more realistic case, like even the one given in the examples in the documentation:

>const df = pl.DataFrame(
...   {
...     A: [1, 2, 3, 4, 5],
...     fruits: ["banana", "banana", "apple", "apple", "banana"],
...     B: [5, 4, 3, 2, 1],
...     cars: ["beetle", "audi", "beetle", "beetle", "beetle"],
...   }
... )
> df.sort("fruits").select(
...     "fruits",
...     "cars",
...     pl.lit("fruits").alias("literal_string_fruits"),
...     pl.col("B").filter(pl.col("cars").eq(lit("beetle"))).sum(),
...     pl.col("A").filter(pl.col("B").gt(2)).sum().over("cars").alias("sum_A_by_cars"),
...     pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),
...     pl.col("A").reverse().over("fruits").flatten().alias("rev_A_by_fruits")
...   )
shape: (5, 8)
┌──────────┬──────────┬──────────────┬─────┬─────────────┬─────────────┬─────────────┐
│ fruits   ┆ cars     ┆ literal_stri ┆ B   ┆ sum_A_by_ca ┆ sum_A_by_fr ┆ rev_A_by_fr │
│ ---      ┆ ---      ┆ ng_fruits    ┆ --- ┆ rs          ┆ uits        ┆ uits        │
│ str      ┆ str      ┆ ---          ┆ i64 ┆ ---         ┆ ---         ┆ ---         │
│          ┆          ┆ str          ┆     ┆ i64         ┆ i64         ┆ i64         │
╞══════════╪══════════╪══════════════╪═════╪═════════════╪═════════════╪═════════════╡
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 4           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "apple"  ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 7           ┆ 3           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "audi"   ┆ "fruits"     ┆ 11  ┆ 2           ┆ 8           ┆ 2           │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "banana" ┆ "beetle" ┆ "fruits"     ┆ 11  ┆ 4           ┆ 8           ┆ 1           │
└──────────┴──────────┴──────────────┴─────┴─────────────┴─────────────┴─────────────┘

Is there a way to accomplish this when the arrays contain strings? How could I recreate something like this efficiently when the large dataset is being created in JS?

Edit: I updated the code I provided to use a typed array as well as a string array:


function runBuggyCode(entries) {
  let df = new DataFrame(entries)
  df.writeParquet()
  // const copy = {A: entries.A.slice(), B: entries.B.slice() }
}

async function main() {
  const data = {
    A: new Int32Array(10000000).fill(0).map((_, i) => i),
    B: Array(10000000)
      .fill("")
      .map((__, i) => String(i)),
  }

  let count = 0
  let peakMemUsage = 0
  while (true) {
    const start = Date.now()
    runBuggyCode(data)
    console.log("process time:", Date.now() - start, "ms")
    peakMemUsage = Math.max(process.memoryUsage().rss, peakMemUsage)
    console.log("peak rss mem usage:", Math.round(peakMemUsage / 1024 ** 2), "MB")
    console.log("run:", count++)
    await new Promise((res) => setTimeout(res, 100))
  }
}

if (require.main === module) {
  void main().catch((err) => console.error(err))
}

Actually, the memory usage seems somewhat acceptable (roughly 2x). It is still relatively slow, but I assume that can't be helped. The memory usage does have somewhat odd behavior, though. The initial memory usage seems as expected, but it looks like memory is not being freed between operations, so usage slowly grows to about 3-4x over many iterations. I think it would be useful if there were a way to prevent this type of growth. If there is not, that is ok; I just feel it should probably be noted somewhere for users.

For me, I think this was the effect I was observing. Basically, the way I had this set up before, I was finding and fetching relevant data files (thousands of them, some GB+ in size). These were gzipped files which would be stream-gunzipped, transformed into dataframes for use with Polars, and run through some preliminary operations before finally being emitted in chunks as parquet files. In the meantime, I have removed polars from my setup, as the memory usage is prohibitive and results in my containers getting killed by the orchestrator after some amount of time. There seems to be some condition that prevents memory from being freed, as there were cases where I observed memory usage growing to over 68GB. It was hard to isolate the specific cause because memory seemed to just grow in an unbounded way when using actual data; I couldn't recreate this exact behavior with synthetic data sets.

@universalmind303
Collaborator

So there is not much we can do about the JS <--> Rust overhead in hot loops such as this, where dataframes are instantiated repeatedly.

The optimal solution would be to use Apache Arrow to create the values in pure JS, then pass the buffer off to polars for processing. This is notably faster than going through the node-api (napi). However, we currently can't interop with arrow-js because they don't yet support LargeUtf8. While not as fast, you can do something similar with JSON if you are working with row-oriented data.

const pl = require("nodejs-polars")

// Row-oriented data as an array of plain objects.
const data = [{}, {}, {}]
// Serialize once in JS, then hand polars a single buffer to parse.
const json_data = JSON.stringify(data)
const buff = Buffer.from(json_data)
const df = pl.readJSON(buff)

There seems to be some condition that might prevent memory from being freed as there were cases where I observed the memory usage growing to over 68GB.

This definitely seems like a potential memory leak, or the GC not triggering properly. I'll look into this further.

@Goblinlordx
Author

Goblinlordx commented May 3, 2023

As a note on the memory issue I had specifically: it was strictly an increase in RSS memory usage, while heap usage seemed to remain relatively stable (and low). So it definitely seemed like something involving allocation outside JS, but it seems like it could be pretty difficult to track down.

I really appreciate the info and feedback. We can close this if you like, or you can keep it open if you want to use it for tracking down the mysterious memory leak (😢). I will move forward with other options for the moment and will take your suggestions under advisement.

Again, I really appreciate the insight and feedback. 👍 Thanks a ton~

@universalmind303
Collaborator

Looks like the Node.js team has been looking into creating some fast paths for initializing objects via node-api:

nodejs/node#45905

@overlookmotel

Concerning memory for buffers not getting freed ("observed the memory usage growing to over 68GB"), the reason may be that the "finalize" callbacks which free the memory are scheduled to run with setImmediate(). So if your entire loop runs synchronously, memory usage will mount up endlessly until the next tick of the event loop.

If that is the problem, waiting a tick after each turn of the loop would allow the finalize callbacks to fire, and memory to be freed.
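
A minimal sketch of that workaround, using a hypothetical processChunk() to stand in for whatever synchronous work allocates the native buffers:

async function processAll(chunks) {
  for (const chunk of chunks) {
    processChunk(chunk) // hypothetical: the synchronous work that allocates native memory
    // Yield one tick so scheduled "finalize" callbacks can run and free that memory.
    await new Promise((res) => setImmediate(res))
  }
}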

I hit the same problem in a different context: napi-rs/napi-rs#1171 (comment)

For discussion of this problem, and the reasons behind it, see the conversation from this comment onwards:

nodejs/node-addon-api#917 (comment)

Hope that's helpful.

@drewbitt

The optimal solution would be to use Apache Arrow to create the values in pure JS, then pass the buffer off to polars for processing. This is notably faster than going through the node-api (napi). However, we currently can't interop with arrow-js because they don't yet support LargeUtf8. While not as fast, you can do something similar with JSON if you are working with row-oriented data.

FYI, Arrow added LargeUtf8 in version 15. apache/arrow#35780
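
If that unblocks the arrow-js route, a rough sketch of the idea (assuming apache-arrow's tableFromArrays/tableToIPC and that pl.readIPC accepts an Arrow IPC buffer):

const pl = require("nodejs-polars")
const { tableFromArrays, tableToIPC } = require("apache-arrow")

// Build the columns in pure JS via arrow-js, avoiding per-value napi conversion.
const table = tableFromArrays({
  A: Int32Array.from({ length: 5 }, (_, i) => i),
  fruits: ["banana", "banana", "apple", "apple", "banana"],
})

// Serialize to the Arrow IPC file format and hand polars a single buffer.
const buff = Buffer.from(tableToIPC(table, "file"))
const df = pl.readIPC(buff)
console.log(df.head())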
