Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP content-sqlite: preallocate space to avoid outside diskfull events #6217

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Aug 15, 2024

Per discussion in #6169

Support a new "preallocate" config & module option to pre-allocate a specific amount of space to sqlite so that we can hopefully avoid ENOSPC issues when outside actors fill up the disk. Option name debatable ... "preallocate_size"?

The technique is to basically create a database table, write a bunch a data to it, then drop the table. This won't work with journaling b/c the journal needs space too. But code is added to disable journaling if ENOSPC is hit.

There's a lot of "setup" commits in here, we could probably split out into another PR.

TODOs

  • option take meg/gig suffix? ehhh scratch that, i don't what the config file inputs to always be strings
  • test at bigger scale/size (i.e. not 5m mounts)
  • I probably should document this somewhere
  • BIG THING TO DISCUSS IN ISSUE - how to put database back into journaling mode when space issues fixed. Disabling journaling does lead to a healthy slowdown in performance (job throughupt maybe 15% down). This can probably be a different issue/PR, but needs to be discussed.

src/modules/content-sqlite/content-sqlite.c Dismissed Show dismissed Hide dismissed
Copy link

codecov bot commented Aug 15, 2024

Codecov Report

Attention: Patch coverage is 83.33333% with 25 lines in your changes missing coverage. Please review.

Project coverage is 83.31%. Comparing base (7af197f) to head (1098fa1).
Report is 4 commits behind head on master.

Files Patch % Lines
src/modules/content-sqlite/content-sqlite.c 83.33% 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6217      +/-   ##
==========================================
- Coverage   83.34%   83.31%   -0.03%     
==========================================
  Files         521      521              
  Lines       85209    85346     +137     
==========================================
+ Hits        71020    71110      +90     
- Misses      14189    14236      +47     
Files Coverage Δ
src/modules/content-sqlite/content-sqlite.c 74.12% <83.33%> (+2.40%) ⬆️

... and 4 files with indirect coverage changes

@chu11 chu11 force-pushed the issue6169_sqlite_preallocate branch 4 times, most recently from 10252d5 to 080a316 Compare August 15, 2024 22:28
@garlick
Copy link
Member

garlick commented Aug 16, 2024

Nice progress!

Want to break this up into multiple PRs before we start a review?

@chu11 chu11 force-pushed the issue6169_sqlite_preallocate branch from 080a316 to 72c369b Compare August 16, 2024 05:42
@chu11
Copy link
Member Author

chu11 commented Aug 16, 2024

Want to break this up into multiple PRs before we start a review?

Yeah, lemme split out the cleanups, then the "setups", and stuff into new PRs.

Problem: Some new features will require a slightly larger tmpfs for
testing than the current 1m one.

Add an additionl 5m tmpfs mount for testing.
Problem: In the near future we may wish to open the sqlite multiple
times within the content-sqlite module.  The current
content_sqlite_opendb() takes a "truncate" parameter that would truncate
the db on every open.  This is not what we want if we are
closing/opening the database multiple times.

Solution: Split out the truncate into a new "setup" function.  This
setup function will be called once when the module is loaded.
Problem: In the near future we may need to close the sqlite db and
reopen it.  It would be wise to re-initialize context variables when
closing the db to avoid re-using older configs.

Init all appropriate variables in the content-sqlite context when closing the
sqlite db.
Problem: In the near future, we may need to open the sqlite db with
different settings during setup.

Solution: Have the content_sqlite_opendb() function take journal_mode
and synchronous as input parameters.
Problem: When a disk runs out of space, the content-sqlite module
can no longer work.  It would be convenient if space could be
reserved ahead of time to prevent these types of problems.

Support a new preallocate module and config option that takes a byte maximum
as input.  This option will internally create a new special database and
write data to that database until the byte max is reached.
Then this database will be dropped.  This internally reserves space for the sole
use of sqlite.

Note that this feature will not work if any type of journaling or
write ahead log is used, as that will also require disk space.
Problem: There is no coverage for the new content-sqlite preallocate
config.

Add tests in t0012-content-sqlite.t and t0090-content-enospc.t.
Problem: If a disk fills up, sqlite may no longer be able to operate
because journal files can no longer be written to disk.  However,
if space was pre-allocated, sqlite can still be used if we turn off
journaling.

If ENOSPC is hit and pre-allocated space was configured, turn off
journaling in an attempt to keep the content-sqlite module functional,
although with slower performance.
Problem: There is no coverage to test if content-sqlite preallocate
works if journaling was initially enabled.

Add coverage in t/t0090-content-enospc.t.
@chu11 chu11 force-pushed the issue6169_sqlite_preallocate branch from b8d29f0 to a73a549 Compare August 20, 2024 07:00
Problem: The new journal_mode, synchronous, and preallocate configs
for the content-sqlite module are not documented.

Add them in new doc/man5/flux-config-content-sqlite.rst.
@chu11 chu11 force-pushed the issue6169_sqlite_preallocate branch from a73a549 to 059d388 Compare August 20, 2024 22:45
Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a WIP but I wanted to mention a couple thoughts.

We probably don't need to have a special table for the preallocated space. We have a blob table already - couldn't we just write some number of blobs then delete them? (Less code)

Do we want to fail if we cannot preallocate the requested amount? I would think not? It seems messed up if say we want to squat on 100G but are using 10G and can't start.

Comment on lines +52 to +53
`sqlite <https://www.sqlite.org/>`_,
`sqlite pragmas <https://www.sqlite.org/pragma.html>`_
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to put those URLs in a RESOURCES section like most of the other section 5 man pages.
The URLs are spelled out so you can cut and paste them from the man command output.
For example, in flux-config(5):

RESOURCES
=========

.. include:: common/resources.rst

Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html

TOML: Tom's Obvious Minimal Language: https://toml.io/en/

Comment on lines +943 to +944
if (content_sqlite_setup (ctx, truncate) < 0)
goto done;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name is not very descriptive. Maybe just do the

if (truncate)
    (void)unlink (ctx->dbfile)

inline?

@garlick
Copy link
Member

garlick commented Aug 22, 2024

After consulting with @kkier a bit on this, we think it may be better to present admins with a choice of

  • ensure flux never runs out of space on rank 0 (whether by partition or whatever), or
  • flux goes down hard on ENOSPC and, when manually restarted, recovers what it can but doesn't stop for manual intervention

That is a trade-off they are most qualified to make based on their judgement of the relative impacts. From their perspective I think it's more of a question of whether they want to reserve some amount of disk for flux or share it among multiple consumers, rather than whether or not they want to make a partition. If this PR successfully provided a mechanism to reserve space without a partition, it wouldn't change much in that calculus.

IOW: I think we should invest some development effort in the second option for now rather than this space reservation scheme.

@chu11
Copy link
Member Author

chu11 commented Aug 22, 2024

flux goes down hard on ENOSPC and

goes down hard meaning broker shall exit? Edit: and bring down full instance?

when manually restarted, recovers what it can but doesn't stop for manual intervention

Assuming the above, perhaps I can bring up something I proposed in #6169 before. It wouldn't be hard to reserve a small amount of space, say 500m (small and relatively low cost, regardless of the journaling/synchronous mechanisms in play), and in the event ENOSPC is hit, just delete it. That way we have a small amount of space to save/checkpoint all of that remaining data that is currently flying around before crashing.

we may have to closedb/opendb yet again, but given the circumstances, I think it's a reasonable tradeoff to try and preserve some state.

@garlick
Copy link
Member

garlick commented Aug 22, 2024

I could be wrong but it seems like, for transactional consistency, the delete of the placeholder must go through the journal if the journal is active, but if the file system is full, that would fail. If you close the database, the close ought to try to flush the journal backlog, but of course that will fail if you can't delete the placeholder.

Does that make sense?

@garlick
Copy link
Member

garlick commented Aug 22, 2024

It sounds like once the database is in WAL mode, this persists across a reopen.

https://www3.sqlite.org/wal.html

(see "Persistence of WAL Mode")

That seems like it means if we like WAL mode, we won't be able to accomplish much by reserving space in the db. It likely also means that the WAL is not checkpointed on sqlite3_close() like I thought.

@garlick
Copy link
Member

garlick commented Aug 22, 2024

goes down hard meaning broker shall exit? Edit: and bring down full instance?

Well it does seem like content-sqlite should just close the database after we get an ENOSPC. I think we probably need to talk through what can happen next. Ideally it minimizes the need to manually intervene.

@chu11
Copy link
Member Author

chu11 commented Aug 22, 2024

I could be wrong but it seems like, for transactional consistency, the delete of the placeholder must go through the journal if the journal is active, but if the file system is full, that would fail. If you close the database, the close ought to try to flush the journal backlog, but of course that will fail if you can't delete the placeholder.

Doh! You're right. I forgot that we need space for the journal or WAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants