WIP content-sqlite: preallocate space to avoid outside diskfull events #6217

chu11 · 2024-08-15T18:01:13Z

Per discussion in #6169

Support a new "preallocate" config & module option to pre-allocate a specific amount of space to sqlite so that we can hopefully avoid ENOSPC issues when outside actors fill up the disk. Option name debatable ... "preallocate_size"?

The technique is to basically create a database table, write a bunch of data to it, then drop the table. This won't work with journaling b/c the journal needs space too. But code is added to disable journaling if ENOSPC is hit.
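As a rough illustration of the create/fill/drop technique, here is a minimal sketch using Python's standard sqlite3 module rather than the module's C code. The function name `preallocate`, the table name `prealloc`, and the chunk size are assumptions made for this sketch, not names from the PR. It relies on SQLite keeping freed pages on the freelist when auto-vacuum is off, so the file keeps its size after the drop.

```python
import os
import sqlite3
import tempfile


def preallocate(path, nbytes, chunk=64 * 1024):
    """Grow the database file by about nbytes, then free the pages for reuse."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    try:
        # Without auto-vacuum, freed pages stay in the file on the freelist.
        conn.execute("PRAGMA auto_vacuum=NONE")
        # No rollback journal, so the filler writes need no extra disk space.
        conn.execute("PRAGMA journal_mode=OFF")
        # "prealloc" is a hypothetical table name used only in this sketch.
        conn.execute("CREATE TABLE prealloc (data BLOB)")
        written = 0
        while written < nbytes:
            conn.execute("INSERT INTO prealloc VALUES (zeroblob(?))", (chunk,))
            written += chunk
        # Dropping the table moves its pages to the freelist.
        conn.execute("DROP TABLE prealloc")
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
    return os.path.getsize(path), free_pages


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        size, free_pages = preallocate(os.path.join(tmp, "content.sqlite"), 1024 * 1024)
        print(f"file size: {size} bytes, free pages: {free_pages}")
```

Later inserts draw pages from the freelist before the file has to grow, which is what makes the reserved space usable by sqlite only.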

There's a lot of "setup" commits in here, we could probably split out into another PR.

TODOs

~~option take meg/gig suffix?~~ ehhh scratch that, i don't want the config file inputs to always be strings
test at bigger scale/size (i.e. not 5m mounts)
I probably should document this somewhere
BIG THING TO DISCUSS IN ISSUE - how to put database back into journaling mode once space issues are fixed. Disabling journaling does lead to a healthy slowdown in performance (job throughput maybe 15% down). This can probably be a different issue/PR, but needs to be discussed.

src/modules/content-sqlite/content-sqlite.c

codecov · 2024-08-15T21:51:51Z

Codecov Report

Attention: Patch coverage is 83.33333% with 25 lines in your changes missing coverage. Please review.

Project coverage is 83.31%. Comparing base (7af197f) to head (1098fa1).
Report is 4 commits behind head on master.

Files	Patch %	Lines
src/modules/content-sqlite/content-sqlite.c	83.33%	25 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6217      +/-   ##
==========================================
- Coverage   83.34%   83.31%   -0.03%     
==========================================
  Files         521      521              
  Lines       85209    85346     +137     
==========================================
+ Hits        71020    71110      +90     
- Misses      14189    14236      +47

Files	Coverage Δ
src/modules/content-sqlite/content-sqlite.c	`74.12% <83.33%> (+2.40%)`	⬆️

... and 4 files with indirect coverage changes

garlick · 2024-08-16T00:00:19Z

Nice progress!

Want to break this up into multiple PRs before we start a review?

chu11 · 2024-08-16T13:50:37Z

> Want to break this up into multiple PRs before we start a review?

Yeah, lemme split out the cleanups, then the "setups", and stuff into new PRs.

Problem: Some new features will require a slightly larger tmpfs for testing than the current 1m one. Add an additional 5m tmpfs mount for testing.

Problem: In the near future we may wish to open the sqlite multiple times within the content-sqlite module. The current content_sqlite_opendb() takes a "truncate" parameter that would truncate the db on every open. This is not what we want if we are closing/opening the database multiple times. Solution: Split out the truncate into a new "setup" function. This setup function will be called once when the module is loaded.

Problem: In the near future we may need to close the sqlite db and reopen it. It would be wise to re-initialize context variables when closing the db to avoid re-using older configs. Init all appropriate variables in the content-sqlite context when closing the sqlite db.

Problem: In the near future, we may need to open the sqlite db with different settings during setup. Solution: Have the content_sqlite_opendb() function take journal_mode and synchronous as input parameters.

Problem: When a disk runs out of space, the content-sqlite module can no longer work. It would be convenient if space could be reserved ahead of time to prevent these types of problems. Support a new preallocate module and config option that takes a byte maximum as input. This option will internally create a new special database and write data to that database until the byte max is reached. Then this database will be dropped. This internally reserves space for the sole use of sqlite. Note that this feature will not work if any type of journaling or write ahead log is used, as that will also require disk space.

Problem: There is no coverage for the new content-sqlite preallocate config. Add tests in t0012-content-sqlite.t and t0090-content-enospc.t.

Problem: If a disk fills up, sqlite may no longer be able to operate because journal files can no longer be written to disk. However, if space was pre-allocated, sqlite can still be used if we turn off journaling. If ENOSPC is hit and pre-allocated space was configured, turn off journaling in an attempt to keep the content-sqlite module functional, although with slower performance.
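The fallback described in this commit message could look roughly like the sketch below, written with Python's standard sqlite3 module rather than the module's C code. The `store` and `is_disk_full` helpers, the `preallocated` flag, and the `objects(hash, size, object)` schema are assumptions made for illustration. The connection is assumed to be in autocommit mode, since `PRAGMA journal_mode` has no effect inside an open transaction.

```python
import os
import sqlite3
import tempfile

SQLITE_FULL = 13  # primary result code for "database or disk is full"


def is_disk_full(exc):
    """Return True if a sqlite3 exception reports SQLITE_FULL."""
    code = getattr(exc, "sqlite_errorcode", None)  # set by Python 3.11+
    if code is not None:
        return (code & 0xFF) == SQLITE_FULL
    return "disk is full" in str(exc)


def store(conn, key, blob, preallocated):
    """Insert a blob; on a full disk, turn off the journal and retry once."""
    sql = "INSERT OR REPLACE INTO objects (hash, size, object) VALUES (?, ?, ?)"
    args = (key, len(blob), blob)
    try:
        conn.execute(sql, args)
    except sqlite3.OperationalError as exc:
        if not (preallocated and is_disk_full(exc)):
            raise
        # No journal means no journal file to write, at the cost of rollback.
        conn.execute("PRAGMA journal_mode=OFF")
        conn.execute(sql, args)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "content.sqlite"), isolation_level=None)
        conn.execute("CREATE TABLE objects (hash BLOB PRIMARY KEY, size INT, object BLOB)")
        store(conn, b"key", b"value", preallocated=True)
        conn.close()
```

The retry only makes sense when space was reserved earlier. Without free pages in the database file, the second insert would be expected to fail the same way.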

Problem: There is no coverage to test if content-sqlite preallocate works if journaling was initially enabled. Add coverage in t/t0090-content-enospc.t.

Problem: The new journal_mode, synchronous, and preallocate configs for the content-sqlite module are not documented. Add them in new doc/man5/flux-config-content-sqlite.rst.

garlick

I know this is a WIP but I wanted to mention a couple thoughts.

We probably don't need to have a special table for the preallocated space. We have a blob table already - couldn't we just write some number of blobs then delete them? (Less code)

Do we want to fail if we cannot preallocate the requested amount? I would think not? It seems messed up if say we want to squat on 100G but are using 10G and can't start.
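Both suggestions can be sketched together: write filler rows into the existing blob table, stop quietly if the database reports it is full, then delete the rows. This is a minimal sketch using Python's standard sqlite3 module, not the module's C code. The `reserve_best_effort` name, the filler key scheme, and the `objects(hash, size, object)` schema are assumptions. The demo uses `PRAGMA max_page_count` only as a stand-in for a full disk, and the connection is assumed to be in autocommit mode.

```python
import os
import sqlite3
import tempfile


def reserve_best_effort(conn, nbytes, chunk=16 * 1024):
    """Write filler rows into the blob table, stop early if full, then delete them."""
    keys = []
    try:
        while len(keys) * chunk < nbytes:
            key = b"prealloc-%d" % len(keys)  # hypothetical filler key scheme
            conn.execute(
                "INSERT INTO objects (hash, size, object) VALUES (?, ?, zeroblob(?))",
                (key, chunk, chunk),
            )
            keys.append(key)
    except sqlite3.OperationalError:
        # e.g. "database or disk is full": keep whatever was reserved so far.
        pass
    # Deleting the filler rows leaves their pages on the freelist for reuse.
    conn.executemany("DELETE FROM objects WHERE hash = ?", [(k,) for k in keys])
    return len(keys) * chunk


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "content.sqlite"), isolation_level=None)
        conn.execute("PRAGMA auto_vacuum=NONE")
        conn.execute("CREATE TABLE objects (hash BLOB PRIMARY KEY, size INT, object BLOB)")
        conn.execute("PRAGMA max_page_count=64")  # simulate running out of space
        print("reserved bytes:", reserve_best_effort(conn, 1024 * 1024))
        conn.close()
```

Returning the amount actually reserved lets the caller log a shortfall instead of refusing to start.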

garlick · 2024-08-20T23:10:47Z

doc/man5/flux-config-content-sqlite.rst

+`sqlite <https://www.sqlite.org/>`_,
+`sqlite pragmas <https://www.sqlite.org/pragma.html>`_


Might want to put those URLs in a RESOURCES section like most of the other section 5 man pages.
The URLs are spelled out so you can cut and paste them from the man command output.
For example, in flux-config(5):

RESOURCES
=========

.. include:: common/resources.rst

Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html

TOML: Tom's Obvious Minimal Language: https://toml.io/en/

garlick · 2024-08-20T23:14:52Z

src/modules/content-sqlite/content-sqlite.c

+    if (content_sqlite_setup (ctx, truncate) < 0)
+        goto done;


Function name is not very descriptive. Maybe just do the

if (truncate) (void)unlink (ctx->dbfile)

inline?

garlick · 2024-08-22T16:47:10Z

After consulting with @kkier a bit on this, we think it may be better to present admins with a choice of

ensure flux never runs out of space on rank 0 (whether by partition or whatever), or
flux goes down hard on ENOSPC and, when manually restarted, recovers what it can but doesn't stop for manual intervention

That is a trade-off they are most qualified to make based on their judgement of the relative impacts. From their perspective I think it's more of a question of whether they want to reserve some amount of disk for flux or share it among multiple consumers, rather than whether or not they want to make a partition. If this PR successfully provided a mechanism to reserve space without a partition, it wouldn't change much in that calculus.

IOW: I think we should invest some development effort in the second option for now rather than this space reservation scheme.

chu11 · 2024-08-22T17:28:59Z

> flux goes down hard on ENOSPC and

goes down hard meaning broker shall exit? Edit: and bring down full instance?

> when manually restarted, recovers what it can but doesn't stop for manual intervention

Assuming the above, perhaps I can bring up something I proposed in #6169 before. It wouldn't be hard to reserve a small amount of space, say 500m (small and relatively low cost, regardless of the journaling/synchronous mechanisms in play), and in the event ENOSPC is hit, just delete it. That way we have a small amount of space to save/checkpoint all of that remaining data that is currently flying around before crashing.

we may have to closedb/opendb yet again, but given the circumstances, I think it's a reasonable tradeoff to try and preserve some state.

garlick · 2024-08-22T18:05:36Z

I could be wrong but it seems like, for transactional consistency, the delete of the placeholder must go through the journal if the journal is active, but if the file system is full, that would fail. If you close the database, the close ought to try to flush the journal backlog, but of course that will fail if you can't delete the placeholder.

Does that make sense?

garlick · 2024-08-22T18:37:46Z

It sounds like once the database is in WAL mode, this persists across a reopen.

https://www3.sqlite.org/wal.html

(see "Persistence of WAL Mode")

That seems like it means if we like WAL mode, we won't be able to accomplish much by reserving space in the db. It likely also means that the WAL is not checkpointed on sqlite3_close() like I thought.
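The persistence described in that section of the SQLite documentation can be checked with a short sketch. It uses Python's standard sqlite3 module rather than the module's C code, and the function, file, and table names are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def journal_mode_after_reopen(path):
    """Set WAL mode, close, reopen, and report the journal mode both times."""
    conn = sqlite3.connect(path)
    first = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("CREATE TABLE t (x)")  # hypothetical table, just to write something
    conn.commit()
    conn.close()

    # A fresh connection that never asks for WAL explicitly.
    conn = sqlite3.connect(path)
    second = conn.execute("PRAGMA journal_mode").fetchone()[0]
    conn.close()
    return first, second


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(journal_mode_after_reopen(os.path.join(tmp, "content.sqlite")))
```

The same SQLite page also says the last connection to close normally runs a final checkpoint and removes the WAL file, so the guess about sqlite3_close() may be worth checking separately.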

garlick · 2024-08-22T19:03:36Z

> goes down hard meaning broker shall exit? Edit: and bring down full instance?

Well it does seem like content-sqlite should just close the database after we get an ENOSPC. I think we probably need to talk through what can happen next. Ideally it minimizes the need to manually intervene.

chu11 · 2024-08-22T19:23:07Z

> I could be wrong but it seems like, for transactional consistency, the delete of the placeholder must go through the journal if the journal is active, but if the file system is full, that would fail. If you close the database, the close ought to try to flush the journal backlog, but of course that will fail if you can't delete the placeholder.

Doh! You're right. I forgot that we need space for the journal or WAL.

github-advanced-security bot found potential problems Aug 15, 2024

src/modules/content-sqlite/content-sqlite.c (dismissed)

chu11 mentioned this pull request Aug 15, 2024

content-sqlite: preallocate database space #6169

Open

chu11 force-pushed the issue6169_sqlite_preallocate branch 4 times, most recently from 10252d5 to 080a316 Compare August 15, 2024 22:28

chu11 force-pushed the issue6169_sqlite_preallocate branch from 080a316 to 72c369b Compare August 16, 2024 05:42

chu11 mentioned this pull request Aug 16, 2024

content-sqlite: misc cleanup #6220

Merged

chu11 force-pushed the issue6169_sqlite_preallocate branch from 72c369b to b8d29f0 Compare August 19, 2024 14:17

chu11 added 8 commits August 19, 2024 22:33

test: add 5m tmpfs

bc19543

Problem: Some new features will require a slightly larger tmpfs for testing than the current 1m one. Add an additional 5m tmpfs mount for testing.

content-sqlite: open db given input parameters

ffa16c3

Problem: In the near future, we may need to open the sqlite db with different settings during setup. Solution: Have the content_sqlite_opendb() function take journal_mode and synchronous as input parameters.

t: cover content-sqlite preallocate

f366ae4

Problem: There is no coverage for the new content-sqlite preallocate config. Add tests in t0012-content-sqlite.t and t0090-content-enospc.t.

t: cover preallocate with journaling

82eb65e

Problem: There is no coverage to test if content-sqlite preallocate works if journaling was initially enabled. Add coverage in t/t0090-content-enospc.t.

chu11 force-pushed the issue6169_sqlite_preallocate branch from b8d29f0 to a73a549 Compare August 20, 2024 07:00

chu11 added 3 commits August 20, 2024 13:35

doc: document new content-sqlite config

816fae5

Problem: The new journal_mode, synchronous, and preallocate configs for the content-sqlite module are not documented. Add them in new doc/man5/flux-config-content-sqlite.rst.

fixup! content-sqlite: support preallocate config

297057f

test

059d388

chu11 force-pushed the issue6169_sqlite_preallocate branch from a73a549 to 059d388 Compare August 20, 2024 22:45

garlick reviewed Aug 20, 2024

chu11 mentioned this pull request Aug 26, 2024

content/content-sqlite: what to do on ENOSPC #6236

Open
