BP-47: Support for writing ledger entries with O_DIRECT #2943

mauricebarnum · 2021-12-15T17:30:30Z

BP

Follow the instructions at http://bookkeeper.apache.org/community/bookkeeper_proposals/ to create a proposal.

This is the master ticket for tracking BP-47:

Add support for writing ledger entries, bypassing the file system's buffering. On supported systems (currently Linux and macOS), files are opened with O_DIRECT and page-aligned buffers are maintained by the BookKeeper application. Access to the operating system's IO API is via a thin JNI binding.

Proposal PR - #2932

Vanlightly · 2021-12-15T17:50:28Z

This will need to be BP-47. BP-44 - 46 are not yet merged but do exist.

hangc0276 · 2022-02-08T02:19:05Z

@mauricebarnum Would you please give more details about the motivation and design for this proposal?

hangc0276 · 2022-04-02T01:10:01Z

Motivation

Ledger read/write logic

When the BookKeeper server receives a write entry request, it writes the entry into the memory table (memtable), which is a bookie-level cache. When the memtable is full, its contents are sorted and flushed into the operating system PageCache, which buffers the data a second time. The data reaches disk only when a PageCache flush is triggered.

When the BookKeeper server receives a read entry request, it checks whether the memtable or the read cache contains the target entry. If both caches miss, it queries RocksDB for the entry's location in the entry log file and then reads the target entry from that file. After reading the entry, the read cache pre-reads more entries from the entry log file so that subsequent reads keep a high cache hit rate. From the operating system's perspective, a read from a specific position in a log file first checks whether the target data is cached in PageCache. On a PageCache hit, the data is returned directly from PageCache. Otherwise, the OS reads the block from the log file and caches it in PageCache. The block may contain multiple entries located near the target read position.

Drawbacks

For ledger writes, this design limits write throughput for the following reasons.

The memtable and the OS PageCache double-buffer entry data, which consumes more memory.
PageCache flushing is controlled by the kernel and is hard to tune from the application, even though flush behavior is very important for IO-intensive applications.
The number of kernel sync threads is limited by the number of disks, which works poorly for a RAID composed of multiple disks.

For ledger reads, this design also limits read throughput for the following reasons.

When reading entry data from a log file, the OS prefetches data and stores it in PageCache. When many Pulsar topics fetch historical cold data, reads hit many log files at the same time and a lot of data is prefetched into PageCache. Because an entry log file holds sorted entries from many ledgers, the prefetched data may not belong to the target ledger, which wastes memory and reduces the PageCache hit rate.
After the OS has prefetched a lot of data into PageCache, eviction also becomes a problem. The default PageCache eviction policy is LRU, and the application cannot change it short of recompiling the kernel. We cannot control which entries are evicted or when.

Proposal

Based on the above issues, we introduce optional support for bypassing the operating system PageCache on supported systems (currently Linux and macOS) by using the open(2) (https://man7.org/linux/man-pages/man2/open.2.html) flag O_DIRECT. fallocate(2) (https://man7.org/linux/man-pages/man2/fallocate.2.html) will be used, if available, to request that the filesystem allocate the required space before data is written.

The implementation uses JNI to do direct I/O to files via POSIX syscalls. fallocate is used when running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).

There are two calls to write, writeAt and writeDelimited. I expect writeAt to be used for the entry log headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure the data is actually written.
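
To make the "may return before the syscalls occur" contract concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the class name, the method signatures, and the 4-byte length prefix used by writeDelimited are assumptions for illustration, and a ByteBuffer stands in for the file. It is not the proposed implementation.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and framing are illustrative, not BookKeeper's code.
final class BufferedLogWriterSketch {
    private final ByteBuffer file;                     // stands in for the on-disk entry log
    private final List<Runnable> pending = new ArrayList<>();
    private int nextOffset;                            // where the next delimited entry goes

    BufferedLogWriterSketch(int fileSize, int headerSize) {
        this.file = ByteBuffer.allocate(fileSize);
        this.nextOffset = headerSize;                  // entries start after the header
    }

    /** Queues a write at an explicit offset; nothing reaches the "file" yet. */
    void writeAt(int offset, byte[] data) {
        byte[] copy = data.clone();                    // the caller may reuse its array
        pending.add(() -> {
            for (int i = 0; i < copy.length; i++) {
                file.put(offset + i, copy[i]);
            }
        });
    }

    /** Queues an entry prefixed with its length (assumed 4-byte prefix); returns its offset. */
    int writeDelimited(byte[] entry) {
        int offset = nextOffset;
        ByteBuffer framed = ByteBuffer.allocate(Integer.BYTES + entry.length);
        framed.putInt(entry.length).put(entry);
        writeAt(offset, framed.array());
        nextOffset += framed.capacity();
        return offset;
    }

    /** Applies all queued writes in order; only now is the data "written". */
    void flush() {
        pending.forEach(Runnable::run);
        pending.clear();
    }

    /** Reads an int straight from the "file", ignoring queued writes. */
    int readIntAt(int offset) {
        return file.getInt(offset);
    }
}
```

In this sketch a caller that reads the file between writeDelimited and flush() still sees the preallocated zeros, which is the reason the flush call matters.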

The entry log format isn't much changed from what is used by the existing entrylogger. The biggest difference is the padding. Direct I/O must be written in aligned blocks. The size of the alignment varies by machine configuration, but 4K is a safe bet on most. As it is unlikely that entry data will land exactly on the alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size will always be positive) and also from preallocated space (which is always 0).
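
The padding arithmetic can be sketched as follows. This is an illustration under assumptions, not the code from the PRs: the class name is made up, the 4096-byte alignment is the "safe bet" mentioned above rather than a queried device value, and the padding byte 0xF0 is a placeholder chosen only because any byte with the high bit set parses as negative when read as a signed byte, int, or long.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: constants and names are assumptions, not BookKeeper's code.
final class AlignmentSketch {
    // Assumed block size; must be a power of two for the mask below to work.
    static final int ALIGNMENT = 4096;
    // Assumed padding byte; the high bit is set so signed reads are negative.
    static final byte PADDING_BYTE = (byte) 0xF0;

    /** Rounds offset up to the next multiple of ALIGNMENT. */
    static long alignUp(long offset) {
        return (offset + ALIGNMENT - 1) & ~((long) ALIGNMENT - 1);
    }

    /** Number of padding bytes needed to reach the next boundary from offset. */
    static int paddingFor(long offset) {
        return (int) (alignUp(offset) - offset);
    }

    /** Fills buf with padding bytes from its current position up to the next boundary. */
    static void padToAlignment(ByteBuffer buf) {
        int padding = paddingFor(buf.position());
        for (int i = 0; i < padding; i++) {
            buf.put(PADDING_BYTE);
        }
    }
}
```

A reader scanning the log can then tell the three cases apart by the sign of the value at the next position: positive for an entry size, zero for preallocated space, negative for padding.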

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary to allow the existing entry logger to deal with the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.

We have designed a write buffer pool to hold entries being written and flush them to disk when a buffer is full. For entry reading, each entry log file has its own reader. The readers are managed by a cache with an eviction policy. Each reader has a read buffer of a specific size to hold the data it reads.
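
As a rough picture of the per-file reader cache, the sketch below keeps at most a fixed number of readers open and closes whichever one is evicted. The eviction policy is not specified above, so LRU is an assumption here, as are the class name, the LogReader interface, and the opener callback.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical sketch: LRU and all names are assumptions, not BookKeeper's code.
final class ReaderCacheSketch {
    /** Stand-in for a per-entry-log reader that holds a file descriptor and a read buffer. */
    interface LogReader {
        void close();
    }

    private final int maxOpen;
    private final LinkedHashMap<Integer, LogReader> readers;

    ReaderCacheSketch(int maxOpen) {
        this.maxOpen = maxOpen;
        // accessOrder=true makes iteration order least-recently-used first.
        this.readers = new LinkedHashMap<Integer, LogReader>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, LogReader> eldest) {
                if (size() > ReaderCacheSketch.this.maxOpen) {
                    eldest.getValue().close();         // release the evicted reader
                    return true;
                }
                return false;
            }
        };
    }

    /** Returns the cached reader for logId, opening one through opener on a miss. */
    LogReader get(int logId, IntFunction<LogReader> opener) {
        LogReader reader = readers.get(logId);
        if (reader == null) {
            reader = opener.apply(logId);
            readers.put(logId, reader);                // may evict and close the eldest
        }
        return reader;
    }

    int size() {
        return readers.size();
    }
}
```

This sketch is single-threaded; a real bookie would need synchronization and would have to avoid closing a reader that is still in use.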

To enable this, set dbStorage_directIOEntryLogger=true in the configuration.

Changes

Add a bookkeeper-slogger module to provide support for structured logging with a pluggable logging backend. Provide an implementation using SLF4J.
Add a native-io package to provide JNI bindings to the operating system I/O API.
Introduce an entry logger interface to support multiple entry logger implementations. Currently there are two: a PageCache-based implementation and a direct-io based implementation.
Add the direct-io based implementation DirectEntryLogger, which is enabled by the flag dbStorage_directIOEntryLogger.
Refactor garbage collection and compaction to allow the entry logger to control which files are available to be garbage collected.

Implementation

For parts 1, 2, 3, and 5, we will push individual PRs. For part 4, we plan to split the work into two PRs, one for writing and another for reading.

Compatibility, Deprecation, and Migration Plan

We only modified the read and write logic for the entry log file and did not change its organization, so there are no compatibility concerns at this time.

Test Plan

We will add tests for the following modules.

BookKeeper-slogger
Native-io
Direct-io based implementation DirectEntryLogger
Garbage collection and compaction based on DirectEntryLogger

Others

I’m doing performance testing for the direct-io based implementation.

hangc0276 · 2022-04-02T01:10:57Z

@eolivelli @dlg99 @mauricebarnum I added more details about this proposal. Please take a look, thanks a lot.

mauricebarnum added the type/proposal label Dec 15, 2021

mauricebarnum changed the title ~~BP-44: Support for writing ledger entries with O_DIRECT~~ BP-47: Support for writing ledger entries with O_DIRECT Dec 15, 2021

hangc0276 mentioned this issue Apr 8, 2022

Fix compaction throttle imprecise #3192

Closed

hangc0276 mentioned this issue Apr 18, 2022

BP-47 (task3): Abstract interface for entrylogger #3197

Merged

This was referenced Apr 30, 2022

BP-47 (task4): Aligned native buffer wrapper #3253

Merged

BP-47 (task5): Garbage collection support direct IO entrylogger #3256

Merged

hangc0276 mentioned this issue May 9, 2022

BP-47 (task6): Direct I/O entrylogger support #3263

Merged

hangc0276 mentioned this issue Jun 26, 2022

BP-47 (task7): DbLedgerStorage add direct entry logger support #3366

Merged
