
Reducing disk usage with online_delete


When running rippled (whether as a validator or not), disk usage grows without bound unless the server is configured to use online_delete.

For example, after a number of years, disk use could reach more than 3.9 TB.

In rippled.cfg (note: this is not a complete rippled.cfg):

[server]
port_rpc
port_rpc_admin_local
port_peer
port_ws_admin_local
port_ws_public
#ssl_key = /etc/ssl/private/server.key
#ssl_cert = /etc/ssl/certs/server.crt

[port_rpc]
port = 5007
ip = 0.0.0.0
admin = 127.0.0.1
protocol = http

[port_rpc_admin_local]
port = 5005
ip = 0.0.0.0
admin = 127.0.0.1
protocol = http

[port_peer]
port = 51235
ip = 0.0.0.0
protocol = peer

[port_ws_admin_local]
port = 6006
ip = 127.0.0.1
admin = 127.0.0.1
protocol = ws

[port_ws_public]
port = 5006
ip = 0.0.0.0
protocol = ws

[node_db]
type=NuDB
path=/rippled2/rippled/db/nudb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=8
file_size_mult=2
online_delete=1600000
advisory_delete=1
delete_batch=10000
backOff=10000

[ledger_history]
1600000

[database_path]
/rippled2/rippled/db

[debug_logfile]
/var/log/rippled/debug.log

[sntp_servers]
time.windows.com
time.apple.com
time.nist.gov
pool.ntp.org

[path_search_max]
0

[ips]
r.ripple.com 51235

[validators_file]
validators.txt

[rpc_startup]
{ "command": "log_level", "severity": "warning" }
{ "command": "log_level", "severity": "debug", "partition": "SHAMapStore" }

[ssl_verify]
1

[insight]
server=statsd
address=127.0.0.1:8125
prefix=rippled

Contents of /rippled2/rippled/db:

# ls -alh
total 887G
-rw-r--r-- 1 root root 882G Feb  9 22:07 transaction.db
...

Total size of the db:

# du -ch -d 1 | sort -hr
3.9T	total
3.9T	.
3.1T	./nudb

Note: running rippled can_delete with no argument returns the current value of can_delete but has no other effect.

  1. What’s the best way to reduce disk consumption?
  2. What’s a better rippled.cfg configuration to not have to need so much disk space?

rippled can_delete now should trigger online deletion, per the docs for can_delete.
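For example, from the command line on the server (the response below is illustrative; the exact fields follow the can_delete docs and may vary by version):

# trigger online deletion immediately (requires an admin connection)
rippled can_delete now

{
   "result" : {
      "can_delete" : 12345678,
      "status" : "success"
   }
}

The can_delete value in the response is the latest ledger sequence eligible for deletion.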

It is normal for online deletion to take a while to run.

To reduce disk consumption:

  1. How much storage is consumed by logs? Rotating/compressing logs (and perhaps deleting old ones) could be a quick way to get some space.
  2. Can you tolerate some downtime? If you're OK with the validator being down ~30 min, you could shut it down gracefully, delete the database, and restart it (a sketch follows this list). The validator should download the latest ledger from its peers and re-sync within about 30 min.
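
A minimal sketch of that reset, assuming rippled runs as a systemd service named rippled (adjust for your init system) and using the database_path from the config above:

# stop rippled gracefully, remove the databases, and restart
systemctl stop rippled
rm -rf /rippled2/rippled/db/*
systemctl start rippled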

For the long term: Your validator is configured with advisory_delete=1. This means you need to call can_delete periodically. Often this is done with cron and scheduled to run during times of low load.
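
For instance, a crontab entry along these lines (the binary and config paths are assumptions; adjust the schedule to your low-load window):

# run online deletion daily at 03:00
0 3 * * * /usr/local/bin/rippled --conf /etc/opt/ripple/rippled.cfg can_delete now >> /var/log/rippled/can_delete.log 2>&1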

I suggest you determine how long you want to keep data, and then configure accordingly. It appears you have never deleted data, but the configuration allows you to do that at will by executing the can_delete command described above. First, make the following 2 config changes (a sketch of the resulting [node_db] section follows the list), then restart rippled:

  1. remove the ledger_history section
  2. modify the online_delete setting from 1600000 to 256
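
Applied to the [node_db] stanza shown earlier, the result looks like this (all other settings unchanged):

[node_db]
type=NuDB
path=/rippled2/rippled/db/nudb
open_files=2000
filter_bits=12
cache_mb=256
file_size_mb=8
file_size_mult=2
online_delete=256
advisory_delete=1
delete_batch=10000
backOff=10000

# the [ledger_history] section is removed entirely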

After restarting, configure cron on the machine to delete data at whatever interval you desire. This document gives sufficient detail: https://xrpl.org/configure-advisory-deletion.html

Initially, you have to run can_delete fully twice before disk consumption drops. After that, consumption will decrease after each subsequent can_delete.

It will eventually finish. And after doing it twice, it will be much faster in the future as long as it’s on a regular schedule.

Here's what to expect after 2 full online deletion intervals. If you didn't change the settings described above (removing ledger_history and reducing online_delete from 1600000 to 256), then roughly 90 days would have to pass (1,600,000 ledgers at about 17,280 ledgers per day) before the next deletion. Assuming you did change them, after 2 can_delete runs complete, the nudb subdirectory will be much smaller. It generally grows roughly 10GB a day.

So, you'll free up about 3TB. However, the *.db files will not get any smaller. This is because of how SQLite works: the files don't shrink, even after deleting data, unless a very time-consuming operation called VACUUM is performed. VACUUM requires rippled to be offline, and it would probably take at least a full day at the existing size. So, if you want to reclaim all of that disk space (~800GB), the best option is to wipe out the data and start fresh. Otherwise, expect to free up 3TB.
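
If you do want to try VACUUM, a sketch (rippled must be stopped first; VACUUM rewrites the file, so it needs free disk roughly equal to the database's current size):

# with rippled stopped; this can take many hours on a ~900GB file
sqlite3 /rippled2/rippled/db/transaction.db 'VACUUM;'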

Freeing up ~3TB should be enough.

If you run:

# grep SHAMapStore /var/log/rippled/debug.log

all of the online deletion log messages will appear, including one marking completion, which looks like: finished rotation <sequence #>

It looks like you found the code in question, but here it is if you haven't and want to know all the gory details: https://github.com/XRPLF/rippled/blob/develop/src/ripple/app/misc/SHAMapStoreImp.cpp#L351-L423

        // will delete up to (not including) lastRotated
        if (readyToRotate && !waitForImport)
        {
            JLOG(journal_.warn())
                << "rotating  validatedSeq " << validatedSeq << " lastRotated "

If the first run takes 10+ hours with no effect on disk space, that is expected the first time through. To recap the steps:

  • remove the ledger_history section
  • modify the online_delete setting from 1600000 to 256
  • restart rippled
  • run can_delete now when it syncs back up

Is there any particular amount of ledger history that you want your server to store? Ledger history is usually only necessary to handle queries for historical data, and validators usually do not do that. Most operators run 2 or more rippled servers, with 1 configured as a validator, and at least 1 configured as a stock server, as described here: https://xrpl.org/configure-rippled.html
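
As an illustrative sketch (the paths and the 30-day retention figure are assumptions, not from this page): the validator keeps the minimum history, while the stock server retains enough to answer historical queries:

# validator rippled.cfg: keep the minimum (256 ledgers)
[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
online_delete=256
advisory_delete=1

# stock server rippled.cfg: keep ~30 days (30 * 17280 ledgers)
[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
online_delete=518400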

It makes sense to run a fallback server for redundancy as well.

The recommendation for validator operators is the settings above: the minimum ledger history to keep is 256 ledgers, and a validator needs no more than that. We run “can_delete now” on our validator every day, launched by cron. 256 ledgers is about 20 minutes of history, but you don’t want to run online deletion every 20 minutes, and it likely takes more time than that anyway. Keeping a minimum of a day’s worth of data works well for us.

If you take out advisory_delete, it will rotate every 256 ledgers, essentially non-stop, which likely has performance implications: you'll see much more CPU and I/O utilization all the time. If you instead only want to keep a day's worth of data, either re-enable advisory_delete and run it out of cron, or estimate roughly a day's worth of ledgers: 86400 seconds in a day / 5 seconds per ledger = 17280.

If you want the automatic approach, set online_delete=17280 and remove advisory_delete (a sketch follows). In either case, remove ledger_history.
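
A sketch of the automatic variant (only the relevant [node_db] lines shown):

[node_db]
type=NuDB
path=/rippled2/rippled/db/nudb
online_delete=17280
# advisory_delete removed: rotation now runs automatically about once a day
# the [ledger_history] section is removed as well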

Basically, you probably don't want or need online deletion running non-stop, and a full day's worth of ledgers is still a small footprint.

We’ll go back to online_delete=17280 once we get some disk space back.

If you're deleting about 6,000,000 ledgers rather than about 1,500,000 or so, it will take much longer to run. Most of that time goes into the Transactions and AccountTransactions tables, which I'm sure you're observing.
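
One way to watch the SQLite side of the deletion (the -readonly flag avoids taking a write lock; note that counting rows in a file this large can itself be slow):

# count remaining transaction rows, using the path from [database_path]
sqlite3 -readonly /rippled2/rippled/db/transaction.db 'SELECT COUNT(*) FROM Transactions;'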

After this finishes, however, 3TB or so will be freed up under the nudb subdirectory. From that point, you'll be clear to configure regular intervals.

To monitor progress:

# grep SHAMapStore /var/log/rippled/debug.log

One last thing: set the deletion interval to something like a day. That should get rid of those "Missing node" errors. Something like 19000 or 20000 should do it.


Regarding SHAMapStore-related errors in debug.log:

2023-Sep-16 23:20:09.776223114 UTC SHAMapStore:WRN Waiting 5s for node to stabilize. state: connected. age 2s

If you're seeing many of these messages (as opposed to a few every once in a while), it means your server is having trouble keeping up with the network under the extra load of the deletion process. That almost always means your hardware is not powerful enough, frequently because of insufficient IOPS on your storage.
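
To check whether storage is the bottleneck, you can watch device utilization while deletion runs (iostat is part of the sysstat package):

# extended I/O stats every 5 seconds; sustained high %util on the db device points to IOPS
iostat -x 5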