
Stream database file line by line to avoid string length limit in Node.js #5

Merged · 36 commits · Oct 19, 2021

Conversation

eliot-akira

eliot-akira commented Aug 11, 2021

This pull request implements streaming the database file line by line, based on louischatriot#463. Solves issue #3.

It follows the original PR closely, excluding unnecessary formatting changes. I adapted the syntax to the new codebase, such as the use of let and const instead of var, and some arrow functions.

It passes all server-side tests, and in the second commit I made the changes necessary to pass all browser tests. It turned out that localforage returns the entire database file content, so it was impractical to turn that into a stream; in the browser, the file is read the same way as before.

@eliot-akira
Author

@tex0l Just in case you might have missed this PR…

The original pull request mentions that streaming files line by line allows loading large databases "without running into the string length limit for node", which is reportedly 256MB. So, it seems like a useful feature to support larger file sizes. (Ideally it'd be best to generate a huge file for testing, but I have not done this.)

@tex0l

tex0l commented Sep 20, 2021

Hi @eliot-akira
Thank you for the ping, I'll try to give it a look this week or this weekend.

@tex0l left a comment


Hi @eliot-akira,
First of all, thank you for porting the PR to @seald-io/nedb, and sorry for the delay in reviewing it.
I've reviewed it thoroughly, I've explained everything directly line by line in the files changed, but the TLDR is:

  1. The PR is not properly linted. You can run npm run lint to check the linting status, and npx standard --fix to auto-fix small issues (it fixes anything that doesn't change runtime);
  2. I'd rather not import byline: the package is about 150 lines and hasn't been updated in 5 years (https://github.com/jahewson/node-byline/blob/master/lib/byline.js). I think we should either find a maintained replacement, or include the code and its tests in nedb and reformat them to this package's standards;
  3. I'm not comfortable with the behavior around corruptAlertThreshold. I'd rather strictly adhere to the Node convention that at least one of err or res is null. What I recommend is, when the corruptAlertThreshold is reached, to stop reading from the stream (and clean it up properly without leaving it hanging in the process) and throw an error. We can open a new issue to return the corruptItems count if that information is useful to anyone.
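Point 3 could be sketched roughly like this. This is a hypothetical shape only: the function name is made up, and it uses an absolute corrupt-item count for simplicity, whereas nedb's corruptAlertThreshold is a ratio:

```javascript
// Hypothetical sketch (not nedb's actual code) of the suggested behavior:
// abort the stream once too many corrupt lines are seen, and follow the
// Node convention that at least one of (err, res) is null/undefined.
// An absolute count is used here for simplicity; nedb's real
// corruptAlertThreshold is a ratio of corrupt items to total items.
function parseLineStream (lineStream, maxCorruptItems, callback) {
  const docs = []
  let corruptItems = 0
  let done = false
  const finish = (err, res) => { // guard so callback fires exactly once
    if (done) return
    done = true
    callback(err, res)
  }
  lineStream.on('data', line => {
    try {
      docs.push(JSON.parse(line))
    } catch (e) {
      corruptItems += 1
      if (corruptItems > maxCorruptItems) {
        lineStream.destroy() // stop reading; don't leave the stream hanging
        finish(new Error('Corrupt item threshold exceeded'))
      }
    }
  })
  lineStream.on('end', () => finish(null, docs))
  lineStream.on('error', err => finish(err))
}
```

The once-only guard matters because data events may still be buffered when the stream is destroyed, and the callback must not fire a second time on 'end'.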

@arantes555 do you share the same opinion?

I won't have time to implement the modifications though; can you make them, @eliot-akira, please? I'll try to be quicker in my next review.

Files with review comments (all resolved): browser-version/lib/byline.js, lib/persistence.js, test/persistence.test.js, webpack.config.js
@eliot-akira
Author

eliot-akira commented Oct 3, 2021

Thank you for a detailed review of the PR. I ran the linter, made suggested improvements, and resolved the associated comments.

There's still a question about the byline tests. I wonder if the forked byline deserves its own repo.

@tex0l

tex0l commented Oct 4, 2021

Thank you for implementing the fixes so quickly!

For byline, I agree with you, we'd be better off with a maintained fork. However, I won't have the time to maintain it; I'd rather consider it a helper function inside the package than a package in itself. Even if the resource files are big, I think we should include the test files in nedb: they won't be included in the npm package, they only add weight to the git repo ;).

For the behavior of treatRawStream when there are corruptedItems, I made a suggestion in the comments, tell me what you think.

Once both items are fixed, the next step will be publishing a pre-version we'll test internally at Seald to check it doesn't break everything.

@eliot-akira
Author

Sounds good. If I find the time, I'd like to actually create a database larger than 256MB, and test the "string length limit for node" that they were talking about. It would be nice to confirm that this PR really solves it.

@eliot-akira
Author

Done. I've improved the function treatRawStream and its test; then moved over and adapted the tests from byline.

@tex0l

tex0l commented Oct 5, 2021

Hi @eliot-akira
Thank you very much for the work.
We'll include it in our next test cycle to check that it works as intended in all environments.
In the meantime, I've released a beta version 2.1.0-0 (on the branch https://github.com/seald/nedb/tree/stream-files-by-line) that you can use from npm.

Once it passes this test cycle, I'll merge the branch, release a 2.1.0, and close the PR :).

By the way, I took the liberty of adding you to the contributors and naming you in the changelog, if that's ok with you.

@eliot-akira
Author

Nice, I didn't realize it was hacktoberfest, haha.

@tex0l

tex0l commented Oct 7, 2021

Hi @eliot-akira,
I merged the modifications you made after @arantes555's review into the branch https://github.com/seald/nedb/tree/stream-files-by-line, fixed one thing we missed (we needed to reference the browser version of byline.js in the "browser" field of package.json so that bundlers pick the stub byline.js instead of the full one in web apps), and released another pre-version, 2.1.0-1.
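For reference, a "browser" field mapping in package.json typically looks like this. The paths below are an illustrative sketch based on the files touched in this PR, not the exact published entry:

```json
{
  "browser": {
    "./lib/storage.js": "./browser-version/lib/storage.js",
    "./lib/byline.js": "./browser-version/lib/byline.js"
  }
}
```

Bundlers such as webpack consult this field and substitute the browser file whenever the Node-specific module is required.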

@eliot-akira
Author

eliot-akira commented Oct 9, 2021

I learned more about the string length limit in Node.js. In the V8 source (src/objects/string.h), there is the following comment:

// Maximal string length.
// The max length is different on 32 and 64 bit platforms. Max length for
// 32-bit platforms is ~268.4M chars. On 64-bit platforms, max length is
// ~536.8M chars.

I created a test script to generate a database filled with large strings, and was able to confirm that there is in fact such a limit.

With the original nedb, when the database is larger than half a gigabyte, it throws while loading:

Error: Cannot create a string longer than 0x1fffffe8 characters
code: 'ERR_STRING_TOO_LONG'

Using this PR branch, I found that it does load the database but still ends with an error.

~/github/nedb/lib/persistence.js:81
      toPersist += this.afterSerialization(model.serialize(doc)) + '\n'
                                                                 ^
RangeError: Invalid string length

This is in the function persistCachedDatabase(), which performs the compaction operation. It builds the entire content of the database file as a single string before passing it to storage.crashSafeWriteFile(). To practically support such large databases, this part needs to stream the file content line by line instead of writing it all at once.

So, it looks like the PR is not complete yet. When I find the time, I will see how to adapt these two functions.

eliot-akira changed the title from "Stream files line by line when parsing to avoid memory limits" to "Stream database file line by line to avoid string length limit in Node.js" on Oct 9, 2021
@eliot-akira
Author

OK, in commit fc1d0e0, I changed it so the file is written line by line when persisting the cached database.

Great, now I've confirmed that this PR does allow loading large databases, beyond the string length limit.

browser-version/lib/storage.js — resolved
lib/storage.js — resolved

From lib/storage.js (outdated):
stream.close(cb)
})
readable.on('error', () => cb)
stream.on('error', () => cb)
Author

Shouldn't this call cb with an error?

readable.on('error', err => cb(err))
stream.on('error', err => cb(err))

..or maybe simpler:

readable.on('error', cb)
stream.on('error', cb)


Good catch, that's what I wanted to write 🤦

@arantes555

arantes555 commented Oct 15, 2021

@tex0l I just noticed one last detail left to deal with: the test_lac/loadAndCrash.test.js file monkey-patches fs.writeFile to test what happens in case of a crash. This needs to be tweaked to trigger a crash with fs.createWriteStream.

I'll try doing that now, and then we should be good to publish :)

@arantes555

arantes555 commented Oct 15, 2021


My bad, you already took care of that ^^

@arantes555

Successfully published @seald-io/[email protected] 😄

It seems all good to me. We'll do some tests with this pre-version on our internal projects, and then release! Excellent job.

@arantes555

Actually, I'll push another commit with a minor detail: I'm extracting writeFileLines into its own function instead of inlining it into the crashSafe version.

@tex0l

tex0l commented Oct 15, 2021

Hi @arantes555 @eliot-akira,
I rechecked everything, and it seems OK to me.
We'll test this version in our internal packages and, if everything passes, we'll merge :).

@eliot-akira
Author

@tex0l @arantes555
Sounds good! Thank you both for a thorough review and improvement session.

tex0l merged commit 1d8c888 into seald:master on Oct 19, 2021
3 participants