mv cache object warming

Matthew Von-Maszewski edited this page Jul 31, 2015 · 10 revisions

Status

  • merged to master -
  • code complete - July 30, 2015
  • development started - July 24, 2015

History / Context

leveldb's performance is highly dependent upon its caches: file cache, block cache, and operating system block cache. A restart of leveldb flushes at least the file cache and block cache. A server restart flushes all three. An application that depends heavily upon leveldb for fast access is weakened right after leveldb starts because two or three of its caches lack good content.

The file cache is the most performance critical cache. A miss against the file cache requires an .sst table file to be opened, index read, index decompressed, bloom filter read, and bloom filter decompressed. These five steps happen before leveldb attempts to read the requested data. Large datasets can easily open 500 or more .sst table files per second while trying to properly populate the file cache. Each file open is not only slow, but also interferes with other leveldb file operations such as reads for an already open file.

The user application can suffer for minutes during a server restart. With Riak, the performance of an entire server cluster is impacted by the restart of just one server. This makes the problem even more critical since individual servers regularly need software and hardware restarts for software updates and security patches.

This branch adds logic to write the file names of open .sst table files to a temporary file. leveldb then reads the temporary file upon next startup and processes each file into the file cache. This branch does not attempt to preload the block cache or manipulate the operating system block cache.

This branch is enabled / disabled via the leveldb::Options structure, which has a new member variable "cache_object_warming". It defaults to true.

Feature limitations

This branch represents a minimum feature set to satisfy a customer's critical need. Here are its design limitations. Some may be worthy of future development work.

  • The temporary file exists only from shutdown to startup. Once leveldb reads the file's content, the file is erased. Future work might rename this file as *.old for debug reference.

  • A server crash will not have any cache information at next startup. leveldb could rewrite the temporary file periodically, for example at 5 minute intervals or after X number of compactions.

  • The temporary file is intentionally in the format of a MANIFEST record. The MANIFEST might be a better place for the cache object data.

  • This branch must already be running on the server to capture the cache objects during the next shutdown. This implies that the update to put this branch in place will NOT have cache preloading upon first restart. An external tool to build the first temporary file, before stopping the server for this upgrade, would be beneficial.

Branch Description

db/db_impl.cc

The DBImpl object controls the creation and subsequent use of the cache object temporary file. DBImpl::~DBImpl calls SaveOpenFileList() during the database close (destruction). Similarly, DB::Open calls PreloadTableCache() during the database open operations. Both calls are controlled by the new option cache_object_warming.
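The gating described above can be sketched in miniature. This is only an illustration: Options here is a stand-in that models just the cache_object_warming member, and MiniDB and WarmingTrace are hypothetical names, not leveldb classes. The real preload happens inside DB::Open rather than in a constructor; the sketch folds it into the constructor to stay short.

```cpp
#include <string>
#include <vector>

// Stand-in for leveldb::Options; only the member named on this page is modeled.
struct Options {
  bool cache_object_warming;
  Options() : cache_object_warming(true) {}  // defaults to true
};

// Records which warming steps ran, so the gating is observable.
struct WarmingTrace {
  std::vector<std::string> events;
};

// Hypothetical miniature of DBImpl: preload on open, save on destruction.
class MiniDB {
 public:
  MiniDB(const Options& options, WarmingTrace* trace)
      : options_(options), trace_(trace) {
    if (options_.cache_object_warming) {
      // Stands in for the PreloadTableCache() call made during open.
      trace_->events.push_back("PreloadTableCache");
    }
  }

  ~MiniDB() {
    if (options_.cache_object_warming) {
      // Stands in for the SaveOpenFileList() call made during close.
      trace_->events.push_back("SaveOpenFileList");
    }
  }

 private:
  Options options_;
  WarmingTrace* trace_;
};
```

With the option left at its default, both steps are recorded in order; with cache_object_warming set to false, neither runs.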

db/filename.cc, db/filename.h

leveldb generates all its filenames via routines in filename.cc. The cache object temporary file's name generation routines are added here for consistency with Google's methodology. The file's name is based upon the three letter acronym of Cache Object Warming.

db/table_cache.cc, db/table_cache.h

leveldb's TableCache contains the open file objects. Therefore it is the correct class to drive the creation and use of the temporary file when initiated by db_impl.cc.

SaveOpenFileList() opens the temporary file and creates a leveldb "log" object to format and write the data. The log object writes "records". The records are simply encoded versions of internal objects. SaveOpenFileList() asks the new DoubleCache::WriteCacheObjectWarming() routine to encode the file number, file level, and file size of every open file into one std::string object. That string becomes the single record written to the "log". leveldb's "log" format includes CRC validation for safety.
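As a rough illustration of packing every open file into one string, the sketch below encodes the three fields as base-128 varints and decodes them again. The field order, the absence of a per-entry tag, and the names OpenFile, AppendVarint, ReadVarint, EncodeOpenFiles, and DecodeOpenFiles are all assumptions made for this example; the branch's actual record layout may differ. The CRC added by the "log" writer is not modeled.

```cpp
#include <cstddef>
#include <stdint.h>
#include <string>
#include <vector>

// One open table file, with the three fields named above.
struct OpenFile {
  uint64_t number;
  uint32_t level;
  uint64_t size;
};

// Appends v as a base-128 varint: 7 data bits per byte, high bit means "more".
void AppendVarint(std::string* dst, uint64_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>((v & 127) | 128));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Reads one varint starting at *pos; returns false on truncated or overlong input.
bool ReadVarint(const std::string& src, size_t* pos, uint64_t* v) {
  uint64_t result = 0;
  for (unsigned shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 127) << shift;
    if ((byte & 128) == 0) {
      *v = result;
      return true;
    }
  }
  return false;
}

// Packs every open file into a single string: the one record to be written.
std::string EncodeOpenFiles(const std::vector<OpenFile>& files) {
  std::string record;
  for (size_t i = 0; i < files.size(); ++i) {
    AppendVarint(&record, files[i].level);
    AppendVarint(&record, files[i].number);
    AppendVarint(&record, files[i].size);
  }
  return record;
}

// Reverses EncodeOpenFiles; returns false if the record is malformed.
bool DecodeOpenFiles(const std::string& record, std::vector<OpenFile>* files) {
  size_t pos = 0;
  while (pos < record.size()) {
    uint64_t level, number, size;
    if (!ReadVarint(record, &pos, &level) ||
        !ReadVarint(record, &pos, &number) ||
        !ReadVarint(record, &pos, &size)) {
      return false;
    }
    OpenFile f;
    f.number = number;
    f.level = static_cast<uint32_t>(level);
    f.size = size;
    files->push_back(f);
  }
  return true;
}
```

A varint encoding keeps the record small when most file numbers and levels are small values, which is one plausible reason to choose it here.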

PreloadTableCache() performs the reverse of SaveOpenFileList(). It processes the temporary file using the "log" reader object. It is expected that some files listed in the temporary file may not exist by the time this routine executes. leveldb will have processed recovery logs, then potentially executed one or more compactions by the time this routine executes. Some open files at the time of shutdown may be open due to long held iterators. Those old open files from iterator operations get deleted early upon database open. Therefore the temporary file is already out of date, and its listed files may not exist due to normal operations.
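The tolerance for missing files can be sketched as a loop that skips any entry whose file is no longer live, rather than treating it as an error. WarmEntry, PreloadResult, and PreloadSketch are hypothetical names, and the set of live file numbers stands in for whatever existence check the real routine performs; the actual table cache load is reduced to recording the file number.

```cpp
#include <cstddef>
#include <stdint.h>
#include <set>
#include <vector>

// One decoded entry from the warming file.
struct WarmEntry {
  uint64_t number;
  uint32_t level;
  uint64_t size;
};

// Outcome of a preload pass: which files were warmed and how many were skipped.
struct PreloadResult {
  std::vector<uint64_t> loaded;
  size_t skipped;
  PreloadResult() : skipped(0) {}
};

// Walks the saved entries and "loads" only files that still exist.
// A missing file is expected (compaction or iterator cleanup may have
// removed it), so it is counted and skipped, never reported as a failure.
PreloadResult PreloadSketch(const std::vector<WarmEntry>& entries,
                            const std::set<uint64_t>& live_files) {
  PreloadResult result;
  for (size_t i = 0; i < entries.size(); ++i) {
    if (live_files.count(entries[i].number) == 0) {
      ++result.skipped;
      continue;
    }
    // Stand-in for opening the table file and inserting it into the cache.
    result.loaded.push_back(entries[i].number);
  }
  return result;
}
```

Counting the skipped entries separately makes it easy to log both outcomes.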

Neither routine contains significant error handling. An error represents a missed opportunity for better performance, but not data loss. Both errors and successes during processing get logged.

Previous internal agreements stated that std::auto_ptr would be appropriate for variables like cow_file and cow_log (C++11 is not available on all leveldb platforms and therefore its even better object types are not a consideration). That got lost in this initial implementation. Expect a future update.

db/version_edit.cc, db/version_edit.h

It is highly likely that the temporary file produced here is going to transition into the existing MANIFEST file. The VersionEdit object currently generates the MANIFEST file's contents. The changes to these two files simply move an enum from the .cc to the .h file. This makes the enum visible to other source files while keeping it related to VersionEdit. The branch then adds an enum value specifically for file cache objects. This allows the future addition of block cache objects if proven useful.

util/cache2.cc, util/cache2.h

util/options.cc, include/leveldb/options.h
