Branch | Linux / Windows / MacOS |
---|---|
master |
Incremental asset delivery format - closely related to the casync project by Lennart Poettering (https://github.com/systemd/casync). When I started tinkering with this I did not know of that project but has since learned from it but choosen different approaches to a few things. If casync does what you need there is no point in diving into this besides curiousity. If there are aspects of casync that does not work for you (you need in-place updating of folders, or you need all the performance using threading) then it might be interesting.
Discord chat Discord chat at https://discord.com/invite/4VmZK9QBnv
Stable and have been tested on large data sets. Used by the go command line front end https://github.com/DanEngelbrecht/golongtail.git and the C# LongtailLib https://github.com/DanEngelbrecht/LongtailLib.git both of which are used on a daily basis in CI system and an internal Windows launcher.
It is very fast though, most functions that takes time are very multithreaded and fairly efficient and care has been taken to handle really large files (such as multi-gigabyte PAK files for games) efficiently.
It is a bit undocumented still.
git clone https://github.com/DanEngelbrecht/longtail.git
The target platforms are Windows, Linux and MacOS and it should build and run on all of them with MacOS being the least tested.
To build unit tests, cd to test
, call ./build.sh
for debug and ./build.sh release
for release on OSX/Linux, .\build.bat
for debug and .\build.bat release
for release on Windows.
Run test with output/test_debug
for debug and output/test
for release on OSX/Linux, output\test_debug.exe
for debug and output\test.exe
for release on Windows.
The preferred way of testing with the command line is to use https://github.com/DanEngelbrecht/golongtail.git which provides a nicer Go front end to longtail.
You can build the C command line but it is not as maintained and up to date.
To build the command line tool, cd to cmd
, call ./build.sh
for debug and ../build.sh release
for release on OSX/Linux, .\build.bat
for debug and .\build.bat release
for release on Windows.
Run the command line tool with output/longtail_debug
for debug and output/longtail
for release on OSX/Linux, output\longtail_debug.exe
for debug and output\longtail.exe
for release on Windows.
There are targets for dynamic libaries (.so and .dll) in shared_lib
and static libraries (.a) in static_lib
. Call ./build.sh
for debug and ./build.sh release
for release on OSX/Linux, .\build.bat
for debug and .\build.bat release
for release on Windows.
The format for the data is stored in a "zero-parse" format - as long as the CPU architecture is little-endian, writing or reading one of the data formats - VersionIndex, StoreIndex or StoredBlock is just a matter of reading it into a chunk of memory and can be used in place. There are no pointers or complex data types (unless you count compression of the StoredBlock chunk data).
The core of the library is in the src
folder and it defines a set of APIs that it need to operate. Each API is a structure with a number of C style function callbacks.
- HashAPI - To hash a block of data
- StorageAPI - To read/write files to a storage medium
- CompressionAPI - To compress/decompress blocks of data
- JobAPI - To execute tasks in paralell (with dependencies)
It also has hooks for ASSERTs and logging.
Besides a very limited number of standard C library includes it also depends on the excellent stb_ds.h
by Sean Barrett (http://nothings.org/stb_ds) for dynamic arrays and hash maps.
Although the core does not have any platform dependent code there are pre-built helpers in the lib
folder that implements a cross-platform base library which is used by various api implementations.
Currently there are:
- BLAKE2 - by BLAKE2 https://github.com/BLAKE2/BLAKE2
- BLAKE3 - by BLAKE3 team https://github.com/BLAKE2/BLAKE2
- MeowHash - by Mollyrocket https://mollyrocket.com/meowhash
- In-memory storage - used for test etc
- Native file system storage - basic file io using the platform layer in the
lib
folder
- Brotli - by Google https://github.com/google/brotli
- ZStandard - by Facebook https://github.com/facebook/zstd
- LZ4 - by Cyan4973 https://github.com/lz4/lz4
- Bikeshed - by Dan Engelbrecht https://github.com/DanEngelbrecht/bikeshed
Longtail also borrows the chunking algorithm used to split up assets into chunks from the casync project by Lennart Poettering (https://github.com/systemd/casync).
An asset inside a version is composed by a series of chunk, each chunk is identified by the hash of the content of the chunk. Each asset is identified by a hash-of-hashes of the chunks that composes the asset. Each chunk is in turn stored inside blocks and the block is identified by a hash-of-hashes of the chunks inside the block.
A version index is a collection of assets (or files if you will) indexed into chunks, each asset is associated with a size and zero to many chunks and a content hash which is a combined hash of all the assets individual chunk hashes. Each asset also get a (relative) path and hash of the path for the asset. The version index in itself does not contain any of the data of the assets, it merely points out which chunks (via hashes) it needs to be reconstructed.
A version index is a particular version of a collection of assets, either a complete set of assets or a subset. Version indexes can be combined or updated/overriden.
This is the "manifest" of a part of a game (or other) content- Small, efficient and easy to diff and make patches from, as well as easy to extend or replace with new asset data.
A version index can easily be constructed from a set of files/folders.
A stored block contains one or more chunks
of data with minimal meta data such as compress/uncompressed size and the hash of the uncompressed data of each chunk. Then stored block is identified by a combined hash of all the chunk hashes inside the block.
A store index is a collection of Content Blocks indexes. This store index is used via the Version Index to reconstruct assets to a path.
A store index can easily be constructed from a set of content blocks or you can create it based on a Version Index.
Represents the difference between two Version Indexes - it contains which asset paths was created, modfied or deleted.
It can scan a folder on the file system, hash it and chunk it up, create content blocks and store index for the data.
It can create a diff between one folder and another, find out which chunks it needs to go between said version and apply the change to a folder. It can write one version to a new folder or it can update an existing folder.
It is very fast, it is significantly faster att all the stages than desync (by Frank Olbricht https://github.com/folbricht/desync) which itself does threading of tasks to speed it up compared to casync. This is when compared on the use cases that I care about, it might not be faster in your use cases.
Modifying an existing folder is something desync can't (currently) do and is crucial to achive reasonable speed when working with large amount of data (multi-gigabyte folders).
The C99 version in this repo has intentionally been restricted to not add any big complicated dependencies but instead provide interfaces where you can extend it, for example - it has two forms of storage - disk and memory, it does not have http, S3, GCS or any other fancy stuff in it, for that kind of storge, use the golang version https://github.com/DanEngelbrecht/golongtail.git.
This has been an active choice, as the library is writting in C99 for ultimate portability some sacrifices had to be done. It has minimal dependencies and no complicated build system but it is written so adding other storage mechanisms or exchanging hashing or other parts are reasonably easy.