WIP content-sqlite: preallocate space to avoid outside diskfull events #6217
base: master
Conversation
Codecov Report
Additional details and impacted files:
@@ Coverage Diff @@
## master #6217 +/- ##
==========================================
- Coverage 83.34% 83.31% -0.03%
==========================================
Files 521 521
Lines 85209 85346 +137
==========================================
+ Hits 71020 71110 +90
- Misses 14189 14236 +47
force-pushed from 10252d5 to 080a316
Nice progress! Want to break this up into multiple PRs before we start a review?
force-pushed from 080a316 to 72c369b
Yeah, lemme split out the cleanups, then the "setups", and stuff into new PRs.
force-pushed from 72c369b to b8d29f0
Problem: Some new features will require a slightly larger tmpfs for testing than the current 1m one. Solution: Add an additional 5m tmpfs mount for testing.
Problem: In the near future we may wish to open the sqlite db multiple times within the content-sqlite module. The current content_sqlite_opendb() takes a "truncate" parameter that truncates the db on every open, which is not what we want if we are closing and reopening the database multiple times. Solution: Split the truncate logic out into a new "setup" function. This setup function will be called once when the module is loaded.
Problem: In the near future we may need to close the sqlite db and reopen it. It would be wise to re-initialize context variables when closing the db to avoid re-using older configs. Solution: Initialize all appropriate variables in the content-sqlite context when closing the sqlite db.
Problem: In the near future, we may need to open the sqlite db with different settings during setup. Solution: Have the content_sqlite_opendb() function take journal_mode and synchronous as input parameters.
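The parameterized open described above can be sketched roughly as follows. This is a hypothetical illustration using Python's stdlib sqlite3 in place of the C API; the function name mirrors content_sqlite_opendb() but the signature is an assumption, not the actual patch:

```python
import sqlite3

def opendb(path, journal_mode="WAL", synchronous="NORMAL"):
    """Open the db, applying caller-chosen journal_mode and synchronous pragmas."""
    conn = sqlite3.connect(path)
    # these pragmas are what the commit proposes passing in as parameters
    conn.execute(f"PRAGMA journal_mode={journal_mode}")
    conn.execute(f"PRAGMA synchronous={synchronous}")
    return conn
```

Letting setup choose the pragmas matters because, as later commits note, preallocation only works when journaling is off.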
Problem: When a disk runs out of space, the content-sqlite module can no longer work. It would be convenient if space could be reserved ahead of time to prevent these types of problems. Solution: Support a new preallocate module and config option that takes a byte maximum as input. This option internally creates a new special database, writes data to it until the byte max is reached, then drops it. This reserves space for the sole use of sqlite. Note that this feature will not work if any type of journaling or write-ahead log is used, as that also requires disk space.
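The reservation trick above can be sketched in a few lines. This uses Python's stdlib sqlite3 rather than the C module, and the function and table names are illustrative, not taken from the patch:

```python
import os
import sqlite3

def preallocate(db_path, nbytes, chunk=1 << 16):
    conn = sqlite3.connect(db_path)
    # journaling must be disabled here: the journal itself needs disk
    # space, which would defeat the reservation
    conn.execute("PRAGMA journal_mode=OFF")
    conn.execute("CREATE TABLE prealloc (data BLOB)")
    filler = b"\x00" * chunk
    # grow the db file until the configured byte maximum is reached
    while os.path.getsize(db_path) < nbytes:
        conn.execute("INSERT INTO prealloc (data) VALUES (?)", (filler,))
        conn.commit()
    # dropping the table moves its pages to sqlite's freelist; the file
    # keeps its size, so the space stays reserved for sqlite's own reuse
    conn.execute("DROP TABLE prealloc")
    conn.commit()
    conn.close()
```

This relies on sqlite's default behavior (auto_vacuum off) of never shrinking the database file: freed pages go on the freelist and are reused by later writes.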
Problem: There is no coverage for the new content-sqlite preallocate config. Solution: Add tests in t0012-content-sqlite.t and t0090-content-enospc.t.
Problem: If a disk fills up, sqlite may no longer be able to operate because journal files can no longer be written to disk. However, if space was pre-allocated, sqlite can still be used if we turn off journaling. Solution: If ENOSPC is hit and pre-allocated space was configured, turn off journaling in an attempt to keep the content-sqlite module functional, although with slower performance.
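A minimal sketch of that fallback, again in Python's stdlib sqlite3 rather than C; the error-string matching and function name are assumptions for illustration:

```python
import sqlite3

def execute_with_enospc_fallback(conn, sql, params=(), preallocated=True):
    """Run a write; on disk-full, drop journaling and retry if space was reserved."""
    try:
        conn.execute(sql, params)
        conn.commit()
    except sqlite3.OperationalError as e:
        if not preallocated or "disk is full" not in str(e):
            raise
        # disk is full but space was preallocated: turn off journaling so
        # sqlite can keep operating within the reserved file, at the cost
        # of crash consistency and performance
        conn.execute("PRAGMA journal_mode=OFF")
        conn.execute(sql, params)
        conn.commit()
```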
Problem: There is no coverage to test if content-sqlite preallocate works if journaling was initially enabled. Solution: Add coverage in t/t0090-content-enospc.t.
force-pushed from b8d29f0 to a73a549
Problem: The new journal_mode, synchronous, and preallocate configs for the content-sqlite module are not documented. Solution: Add them in a new doc/man5/flux-config-content-sqlite.rst.
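For context, a config using the three keys named above might look something like this. The table name and exact key spellings are assumptions based on this PR, not the final documented format:

```toml
[content-sqlite]
journal_mode = "OFF"     # or "WAL", "DELETE", ...
synchronous = "OFF"      # or "NORMAL", "FULL"
preallocate = 104857600  # byte maximum to reserve (100M)
```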
force-pushed from a73a549 to 059d388
I know this is a WIP but I wanted to mention a couple thoughts.
We probably don't need to have a special table for the preallocated space. We have a blob table already - couldn't we just write some number of blobs then delete them? (Less code)
Do we want to fail if we cannot preallocate the requested amount? I would think not. It seems messed up if, say, we want to squat on 100G but are only using 10G and can't start.
`sqlite <https://www.sqlite.org/>`_,
`sqlite pragmas <https://www.sqlite.org/pragma.html>`_
Might want to put those URLs in a RESOURCES section like most of the other section 5 man pages.
The URLs are spelled out so you can cut and paste them from the man command output.
For example, in flux-config(5):
RESOURCES
=========
.. include:: common/resources.rst
Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html
TOML: Tom's Obvious Minimal Language: https://toml.io/en/
if (content_sqlite_setup (ctx, truncate) < 0)
    goto done;
Function name is not very descriptive. Maybe just do the
if (truncate)
(void)unlink (ctx->dbfile)
inline?
After consulting with @kkier a bit on this, we think it may be better to present admins with a choice of
That is a trade-off they are most qualified to make based on their judgement of the relative impacts. From their perspective I think it's more of a question of whether they want to reserve some amount of disk for flux or share it among multiple consumers, rather than whether or not they want to make a partition. If this PR successfully provided a mechanism to reserve space without a partition, it wouldn't change much in that calculus. IOW: I think we should invest some development effort in the second option for now rather than this space reservation scheme.
goes down hard, meaning the broker shall exit? Edit: and bring down the full instance?
Assuming the above, perhaps I can bring up something I proposed in #6169 before. It wouldn't be hard to reserve a small amount of space, say 500m (small and relatively low cost, regardless of the journaling/synchronous mechanisms in play), and in the event ENOSPC is hit, just delete it. That way we have a small amount of space to save/checkpoint all of that remaining data that is currently flying around before crashing. We may have to closedb/opendb yet again, but given the circumstances, I think it's a reasonable tradeoff to try and preserve some state.
I could be wrong but it seems like, for transactional consistency, the delete of the placeholder must go through the journal if the journal is active, but if the file system is full, that would fail. If you close the database, the close ought to try to flush the journal backlog, but of course that will fail if you can't delete the placeholder. Does that make sense?
It sounds like once the database is in WAL mode, this persists across a reopen. https://www3.sqlite.org/wal.html (see "Persistence of WAL Mode") That seems like it means if we like WAL mode, we won't be able to accomplish much by reserving space in the db. It likely also means that the WAL is not checkpointed on
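The persistence behavior described above is easy to demonstrate with Python's stdlib sqlite3 (an on-disk db is required, since WAL is not supported for in-memory databases):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "wal-test.sqlite")

# set WAL mode once, then close
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")
conn.close()

# reopen: the db file itself remembers WAL mode, no pragma needed
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)  # prints "wal"
conn.close()
```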
Well it does seem like
Doh! You're right. I forgot that we need space for the journal or WAL. |
Per discussion in #6169
Support a new "preallocate" config & module option to pre-allocate a specific amount of space to sqlite so that we can hopefully avoid ENOSPC issues when outside actors fill up the disk. Option name debatable ... "preallocate_size"?
The technique is basically to create a database table, write a bunch of data to it, then drop the table. This won't work with journaling because the journal needs space too. But code is added to disable journaling if ENOSPC is hit.
There are a lot of "setup" commits in here; we could probably split them out into another PR.
TODOs
Option takes meg/gig suffix? Ehh, scratch that; I don't want the config file inputs to always be strings.