BP-47: Support for writing ledger entries with O_DIRECT #2943
Comments
This will need to be BP-47. BP-44 through BP-46 are not yet merged but do exist.
@mauricebarnum Would you please give more details about the motivation and design for this proposal?
Motivation

Ledger read/write logic

When the BookKeeper server receives a write-entry request, it writes the entry into the memory table, which is a bookie-level cache. Once the memory table is full, the bookie sorts its entries and flushes them into the operating system PageCache, which buffers the data a second time. The data reaches the disk only when a PageCache flush is triggered.

When the BookKeeper server receives a read-entry request, it checks whether the memory table or the read cache contains the target entry. If both caches miss, it queries RocksDB for the entry's location in the entry log file and then reads the target entry from that file. After reading the entry, the read cache reads ahead more entries from the entry log file so that subsequent reads keep a high cache hit rate.

From the operating system's perspective, a read from a specific position in a log file first checks whether the target data is already cached in PageCache. On a PageCache hit, the data is returned directly from PageCache. Otherwise, the operating system reads the block from the log file and caches it in PageCache. The block may contain multiple entries located near the target read position.

Drawbacks

For ledger writes, this limits write throughput for the following reasons.
For ledger reads, this also limits read throughput for the following reasons.
Proposal

To address the issues above, we introduce optional support for bypassing the operating system PageCache on supported systems (currently Linux and MacOS) by using the open(2) (https://man7.org/linux/man-pages/man2/open.2.html) flag O_DIRECT. fallocate(2) (https://man7.org/linux/man-pages/man2/fallocate.2.html) will be used, if available, to request that the filesystem allocate the required space before data is written.

The implementation uses JNI to do direct I/O to files via POSIX syscalls. fallocate is used when running on Linux; otherwise it is skipped (at the cost of more filesystem operations during writing).

There are two write calls, writeAt and writeDelimited. I expect writeAt to be used for the entry log headers, while entries will go through writeDelimited. In both cases, the calls may return before the syscalls occur. #flush() needs to be called to ensure things are persisted.

The entry log format is largely unchanged from the one used by the existing entry logger. The biggest difference is the padding. Direct I/O must be written in aligned blocks. The alignment size varies by machine configuration, but 4K is a safe bet on most. Because entry data is unlikely to end exactly on an alignment boundary, we need to add padding to writes. The existing entry logger has been changed to take this padding into account. When read as a signed int/long/byte, the padding will always parse to a negative value, which distinguishes it from valid entry data (the entry size is always positive) and from preallocated space (which is always 0).

Another difference in the format is that the header is now 4K rather than 1K. Again, this is to allow aligned writes. No changes are necessary for the existing entry logger to handle the header change, as we create a dummy entry in the extra header space that the existing entry logger already knows to ignore.
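The padding rule described above can be illustrated with a small sketch. This is not the BookKeeper implementation (which is Java plus JNI); it is a Python illustration under stated assumptions: a 4K alignment, a hypothetical padding byte of 0xF0 (any byte with the high bit set parses as negative), and a 4-byte big-endian signed size prefix in front of each entry. The helper names are invented for this example.

```python
import struct

ALIGNMENT = 4096     # assumed block size; the real value depends on the machine
PADDING_BYTE = 0xF0  # hypothetical padding byte; high bit set, so it reads as negative


def padded_length(length: int, alignment: int = ALIGNMENT) -> int:
    """Round length up to the next multiple of alignment."""
    return (length + alignment - 1) // alignment * alignment


def pad_to_alignment(data: bytes, alignment: int = ALIGNMENT) -> bytes:
    """Append padding bytes so the result is a whole number of aligned blocks."""
    missing = padded_length(len(data), alignment) - len(data)
    return data + bytes([PADDING_BYTE]) * missing


def classify(block: bytes, offset: int) -> str:
    """Interpret the 4-byte signed size field found at offset."""
    (size,) = struct.unpack_from(">i", block, offset)
    if size < 0:
        return "padding"       # padding always parses to a negative value
    if size == 0:
        return "preallocated"  # preallocated space is all zeros
    return "entry"             # valid entries have a positive size


# One 5-byte entry with its size prefix, padded out to a full block.
entry = struct.pack(">i", 5) + b"hello"
block = pad_to_alignment(entry)
```

Here `classify(block, 0)` sees the positive size prefix, while `classify(block, len(entry))` lands in the padding and sees a negative value, which is how a reader can skip to the next aligned block.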
We have designed a writeBuffer pool to hold the entries being written and flush them to disk when the buffer is full. For entry reading, each entry log file has a reader that handles its reads. The readers are managed by a cache with an eviction policy. Each read has a read buffer of a specific size to hold the data read.

To enable this, set dbStorage_directIOEntryLogger=true in the configuration.

Changes
Implementation

For parts 1, 2, 3, and 5, we will push individual PRs. For part 4, we are trying to split the work into two PRs, one for writing and another for reading.

Compatibility, Deprecation, and Migration Plan

We only modified the read and write logic of the entry log file and did not change its organization, so there are no compatibility concerns at this moment.

Test Plan

We will add tests for the following modules.

Others

I’m doing performance testing for the direct-I/O-based implementation.
@eolivelli @dlg99 @mauricebarnum I added more details about this proposal. Please help take a look, thanks a lot.
BP
This is the master ticket for tracking BP-47:
Add support for writing ledger entries, bypassing the file system's buffering. On supported systems (currently Linux and MacOS), files are opened with O_DIRECT and page-aligned buffers are maintained by the BookKeeper application. Access to the operating system's IO API is via a thin JNI binding.
Proposal PR - #2932
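As a rough illustration of what "opened with O_DIRECT and page-aligned buffers" means in practice, here is a minimal Python sketch. It is not the JNI binding from the proposal: it swaps in Python's os and mmap modules, assumes a 4K block size, uses an invented function name, and falls back to an ordinary buffered write where O_DIRECT is missing or the filesystem rejects it. The unused tail of the block is left zero-filled for simplicity.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; real devices may require a different value


def write_block_direct(path: str, payload: bytes) -> bool:
    """Write one zero-padded block to path; return True if O_DIRECT was used."""
    if len(payload) > BLOCK:
        raise ValueError("payload larger than one block")
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings start on a page boundary
    try:
        buf[: len(payload)] = payload
        flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
        if direct:
            try:
                fd = os.open(path, flags | direct, 0o644)
                try:
                    os.write(fd, buf)  # aligned buffer, aligned length, offset 0
                    return True
                finally:
                    os.close(fd)
            except OSError:
                pass  # direct I/O rejected here; fall back to a buffered write
        fd = os.open(path, flags, 0o644)
        try:
            os.write(fd, buf)
            os.fsync(fd)  # buffered path: ask the OS to flush its cache
            return False
        finally:
            os.close(fd)
    finally:
        buf.close()
```

Calling `write_block_direct("entry.log", b"hello")` produces a 4096-byte file either way; the return value only reports whether the page cache was bypassed.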