performance: what can image-spec do to improve handling of large images? #1190
Comments
kubernetes/enhancements#4642 is relevant to this too |
https://docs.google.com/document/d/1Bs4fnP8rhPMaoPoLSYVvuRq-z9vkGPQ0rKbmfH4I7js/edit#heading=h.xw1gqgyqs5b |
https://github.com/project-machine/puzzlefs was made to solve the problems my OCIv2 proposal discussed quite a few years ago. I haven't looked into it very deeply unfortunately, and I don't think it will help much with large artefact-filled images. (My view has slowly moved to thinking that CDC and other compression methods make more sense on the distribution side. If we did that, it would be possible to make large images with any content equally deduplicated. There are downsides to this approach too, but embedding CDC parameters into the image-spec seems like a repeat of the nightmare we've had with compression algorithm settings, now with the added issue that changing the settings would cause you to waste cross-image deduplication.) |
I recently happened to notice that @gregkh had already mentioned EROFS in the OCI community many years ago :-) https://groups.google.com/a/opencontainers.org/g/dev/c/icXssT3zQxE/m/N4YZsbZcAwAJ Let me restate what EROFS is for: rather than reinventing a wheel for Android only, the original goal was to address the runtime performance issues of Squashfs, which does not meet the needs of high-performance use cases like smartphones. Users will not accept high dynamic app latencies (and most Android vendors have already switched to EROFS, since they hit the same issue when applying compression). The Squashfs on-disk format hasn't been updated for a decade (it still has no filesystem UUID), and various earlier attempts to improve it (at least at the time when I decided to redesign a high-performance image filesystem format) were ignored [1][2][3]. The goal of the EROFS filesystem was to launch a general image filesystem project for various high-performance use cases: system firmware, container images, app sandboxes, even AI model data, etc. For example, people could use the same image for system firmware on raw block devices (like Container OS use cases) and for container images. If a new on-disk feature could benefit most image use cases, we will consider adding it after discussion, and new contributors are always welcome. From my own perspective, although the OCI tar format has many flaws, the format is at least quite simple, and various operating systems can parse tar without any barrier. Besides, the docker image format has existed for almost a decade too, and many base layers already exist in tar layer format. For a public cloud vendor (like my current employer, Alibaba Cloud), image compatibility is quite important for our customers, and I guess other cloud vendors have the same concern, since there are enough old OCI-compatible runtimes to be considered.
If people would like on-demand fetching, there are already technologies for that, like SOCI, stargz, etc. If people want to mount a filesystem directly in-kernel (although I'm not sure why such a requirement is really important compared with performance and osboot concerns, unlike system firmware use cases), they could use a Squashfs or EROFS index with OCI tar data instead. I'd be very happy if the OCI community had a chance to consider using EROFS in some form, but my opinion is that we may need to improve the current OCI format to overcome some current high-priority OCI image concerns first. But if specific areas like AI models need a specific filesystem blob, I think EROFS layer blobs for such use cases are fine too. By the way, EROFS already has an IANA-registered media type, "vnd.erofs". My own experience is that EROFS has only slowly come into use recently because many server users are still on 3.10 or 4.18 kernels. That doesn't matter for system image use cases like our original Android system images (because users will upgrade the whole system if they decide to use EROFS), but it may take many years before actual users adopt a new in-kernel feature for something like container images.
Actually EROFS has had a CDC variant since Linux 6.1; it is unlike traditional CDC, but the result is almost the same. My experience is that CDC is good for text materials but has little benefit for executable binaries (which I guess is what we care about more in terms of image sizes and runtime performance), because jump and data load instructions kill any possibility of such data deduplication, like the following code snippets of two minor versions of libc: In reality, the end result for executable binaries will eventually look like page-unaligned block-based deduplication (like reflink) or file-based deduplication (like ostree). IMHO, a CDC-like approach without compression is suitable for archive and transfer uses (like casync or likewise), but as a kernel filesystem developer I would have certain reservations about runtime uses due to its block/page-unaligned chunks. CDC is unfriendly to page cache sharing (or FSDAX secure container memory sharing), and data movement is almost always needed. Extra data movement also hurts performance compared to reflink approaches unless compression is also considered, yet EROFS has already had a compressed data deduplication feature since 2022. I think the only way to deduplicate these executable binaries is "delta compression", but I'm not sure it is really a new on-disk feature for the Linux kernel anyway. I guess most users are already happy with ostree or likewise; it needs careful evaluation though. [1] |
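To make the "block/page-unaligned chunks" point concrete, here is a minimal Go sketch of content-defined chunking with a gear-style rolling hash. It is an illustration only, not the EROFS implementation or any tool's actual algorithm: the gear table, the demo data, and the 13-bit cut condition are all assumptions, and minimum/maximum chunk sizes are left out. Chunk ends follow the content, so nothing ties them to 4 KiB page boundaries.

```go
package main

import "fmt"

// xorshift is a tiny deterministic PRNG, used only to build demo inputs.
type xorshift uint64

func (x *xorshift) next() uint64 {
	v := uint64(*x)
	v ^= v << 13
	v ^= v >> 7
	v ^= v << 17
	*x = xorshift(v)
	return v
}

// gearTable maps every byte value to a pseudo-random 64-bit word.
func gearTable(seed uint64) [256]uint64 {
	var t [256]uint64
	r := xorshift(seed)
	for i := range t {
		t[i] = r.next()
	}
	return t
}

// demoData returns n pseudo-random bytes (a stand-in for real file content).
func demoData(n int, seed uint64) []byte {
	r := xorshift(seed)
	out := make([]byte, n)
	for i := range out {
		out[i] = byte(r.next() >> 32)
	}
	return out
}

// cutOffsets returns the end offset of each chunk. A chunk ends wherever the
// top `bits` bits of the rolling gear hash are all zero, so a boundary depends
// only on the last 64 bytes of content, not on the absolute file position.
// Real chunkers also enforce minimum and maximum chunk sizes; omitted here.
func cutOffsets(data []byte, gear *[256]uint64, bits uint) []int {
	var cuts []int
	var h uint64
	for i, b := range data {
		h = (h << 1) + gear[b]
		if h>>(64-bits) == 0 {
			cuts = append(cuts, i+1)
		}
	}
	// The final chunk always ends at the end of the data.
	if len(cuts) == 0 || cuts[len(cuts)-1] != len(data) {
		cuts = append(cuts, len(data))
	}
	return cuts
}

// countAligned reports how many offsets are multiples of pageSize.
func countAligned(offsets []int, pageSize int) int {
	n := 0
	for _, o := range offsets {
		if o%pageSize == 0 {
			n++
		}
	}
	return n
}

func main() {
	gear := gearTable(1)
	data := demoData(1<<20, 2)
	cuts := cutOffsets(data, &gear, 13) // 13 bits: roughly 8 KiB average chunks
	fmt.Printf("chunks: %d, ending on a 4 KiB page boundary: %d\n",
		len(cuts), countAligned(cuts, 4096))
}
```

Because the cut decision only looks at recent content, prepending a byte shifts every later boundary by one rather than changing the chunks. That is good for deduplicating transfers, but it also means chunks sit at arbitrary offsets relative to pages, which is the runtime concern raised above.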
@hsiangkao thanks for calling out all the salient points. Just curious, how well is in-kernel erofs supported? in terms of community size and history etc? is there a recommended minimum Linux kernel version? is there a recommended userspace erofs implementation? |
On Thu, May 30, 2024 at 11:57:17AM -0700, Ramkumar Chinchani wrote:
> @hsiangkao thanks for calling out all the salient points.
> Just curious, how well is in-kernel erofs supported? in terms of community size and history etc? is there a recommended minimum Linux kernel version?
It is very well supported and used in a few hundred million, if not over
a billion, devices everyday (i.e. it is one of the very very few file
systems that Android allows to be used in their systems.)
Highly recommended.
As for "minimum Linux kernel version", please always just use the latest
stable Linux kernel version for any kernel feature. To use an older one
is never recommended :)
|
I fully agree with Greg's point: always use the latest stable kernel. Anyway, to your question, it depends on the feature requirement. If the intention is just to use the EROFS format as an index (like a stargz-like TOC) that refers to tar data (for lazy pulling), I think Linux 5.4+ is enough. Current distro configs can be checked at https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=EROFS_FS |
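For tooling that wants to check the "Linux 5.4+" baseline mentioned above before relying on an EROFS index, a small Go sketch follows. The 5.4 threshold comes from the comment above; the function names are made up for illustration, and a version check alone does not prove that EROFS is enabled in the kernel config (CONFIG_EROFS_FS).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseRelease extracts major and minor from kernel release strings such as
// "5.4.0-150-generic" or "6.1-rc3".
func parseRelease(rel string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimSpace(rel), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", rel)
	}
	if major, err = strconv.Atoi(parts[0]); err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", rel, err)
	}
	// The minor field may carry a suffix ("1-rc3"); keep only leading digits.
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	if minor, err = strconv.Atoi(digits); err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", rel, err)
	}
	return major, minor, nil
}

// atLeast reports whether major.minor is >= wantMajor.wantMinor.
func atLeast(major, minor, wantMajor, wantMinor int) bool {
	return major > wantMajor || (major == wantMajor && minor >= wantMinor)
}

func main() {
	// Linux exposes the running kernel release here; other systems will not.
	raw, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	major, minor, err := parseRelease(string(raw))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("kernel %d.%d, meets the 5.4 baseline: %v\n",
		major, minor, atLeast(major, minor, 5, 4))
}
```

The 3.10 and 4.18 kernels mentioned earlier in the thread would both fail this check, which is the adoption problem described there.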
Additional considerations ... overlayfs (Linux kernel version 4.x but also supported by various *BSD), squashfs (Linux kernel version 2.6.x, also supported by various *BSD at least recently) and erofs (Linux kernel 5.x, not supported on *BSD?). MS Win support is another matter altogether. |
I'm quite open to that, since EROFS is not designed for one specific use case. If the OCI community considers EROFS in some form (or as an alternative), that would be quite awesome. If not, EROFS will still keep adding new features to fulfill generic image use cases. Feature development of EROFS is always active, driven by Android vendors, some cloud vendors, etc. |
Adding more notes ... OCI artifacts may package "many small-ish files" such as container image rootfs or "a few very large files" such as AI models. |
Some thoughts here. A large model file is really large; for example, Llama 3 70B in fp16 is about 141GB. One way to handle such a huge file is to use the same storage for the image registry and the compute node, i.e., the model file can be stored in raw format in a distributed file system shared between the registry and the compute node, so no data transfer is needed between the registry backend and the compute node. When the client on the compute node pulls the model blob, the registry returns the location of the model file; the client finds that it is located in a file system the compute node can access, so no blob download is required. |
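The pull flow described above could look roughly like the following Go sketch. Everything here is hypothetical: the `BlobLocation` type, its fields, and the example path are invented for illustration, and no existing registry API returns such a response today.

```go
package main

import (
	"fmt"
	"os"
)

// BlobLocation is a hypothetical registry response that names a blob and,
// optionally, where it lives on storage shared with the compute node.
type BlobLocation struct {
	Digest     string
	SharedPath string // empty when the registry offers no shared-storage path
}

// resolve decides whether the client can use the blob in place. It returns
// the local path and download=false when the shared path is reachable, or an
// empty path and download=true when the blob must be fetched as usual.
func resolve(loc BlobLocation, exists func(string) bool) (path string, download bool) {
	if loc.SharedPath != "" && exists(loc.SharedPath) {
		return loc.SharedPath, false
	}
	return "", true
}

// fileExists reports whether p is a regular file visible from this node.
func fileExists(p string) bool {
	info, err := os.Stat(p)
	return err == nil && info.Mode().IsRegular()
}

func main() {
	// Placeholder digest and path; both are assumptions for the demo.
	loc := BlobLocation{Digest: "sha256:<digest>", SharedPath: "/mnt/models/model.bin"}
	if path, download := resolve(loc, fileExists); download {
		fmt.Println("shared path not reachable, falling back to a normal blob download")
	} else {
		fmt.Println("using model in place at", path)
	}
}
```

A real design would also have to verify that the shared file actually matches the digest, which this sketch skips.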
Anyway, you could also treat OCI artifacts (like a kind of object storage) as shared immutable storage (like a read-only mini- gfs2, ocfs2), in which way you also don't need to download any blob locally in advance (like hundreds of GiB), just virtual block device clients with nbd/tcmu/ublk or (if you really need some local caching) caching framework like fscache. |
^ how compressible is this model file? |
Is there interest in porting erofs-utils to golang? since most utilities in this world are golang-based? |
On Thu, Jun 06, 2024 at 01:10:01PM -0700, Ramkumar Chinchani wrote:
> Is there interest in porting erofs-utils to golang? since most utilities in this world are golang-based?
The language the code is in should not matter, as you end up with a
binary in the end. So this should not be an issue at all.
> Mainly interested in *creating* a erofs layer/image (so that it is compatible with overlayfs).
Great, but the language of the tools does not prevent this :)
good luck!
greg k-h
|
If the goal is to produce and consume erofs layers - so that they can just be copied over and mounted, then there are two touch points, which may or may not be ok with binary invocations.
Maybe as an initial PoC, go bindings (cgo) instead? |
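For comparison with a cgo binding, the "binary invocation" route for producing and consuming an erofs layer can be sketched in Go with os/exec. This assumes erofs-utils is installed so that `mkfs.erofs` is on PATH and that the kernel has EROFS support; the paths are placeholders, and the sketch only builds and prints the commands rather than running them.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// mkfsCommand builds (but does not run) `mkfs.erofs <image> <srcDir>`, which
// packs the directory tree at srcDir into an EROFS image file.
func mkfsCommand(image, srcDir string) *exec.Cmd {
	return exec.Command("mkfs.erofs", image, srcDir)
}

// mountCommand builds (but does not run) a read-only loop mount of the image.
// Actually running it requires root privileges.
func mountCommand(image, target string) *exec.Cmd {
	return exec.Command("mount", "-t", "erofs", "-o", "loop,ro", image, target)
}

func main() {
	// Placeholder paths for the demo.
	cmds := []*exec.Cmd{
		mkfsCommand("layer.erofs", "./rootfs"),
		mountCommand("layer.erofs", "/mnt/layer"),
	}
	for _, c := range cmds {
		fmt.Println(strings.Join(c.Args, " "))
	}
	// To execute for real: out, err := c.CombinedOutput()
}
```

The trade-off is the usual one: shelling out keeps the Go side trivial and tracks upstream erofs-utils automatically, while cgo or a native port avoids a runtime dependency on external binaries.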
I recall reading (perhaps incorrectly! 🙈❤️) that many kernel filesystems are not designed to be hardened against attacker-controlled raw input, but given the use cases for erofs, I'm guessing that its implementation is hardened against malicious inputs? 👀😇 |
This is really a best-effort thing. Unlike generic filesystems with complex metadata and journalling (where consistency between different kinds of metadata is always challenging), the EROFS core on-disk format is quite simple [1]. The EROFS project addresses any new syzkaller fuzzing reports, and we also have our own fuzzer to find potential bugs. However, EROFS is not a completely frozen filesystem project, so useful new on-disk/runtime features will be added over time according to new scenarios/inputs, and some new issues may arise as a result (we are all human and not bug-free). Unlike some other filesystems, EROFS will address newly found or reported issues in time, and that is all the guarantee I can give. So yes, in brief, the implementation is hardened against malicious inputs on a best-effort basis. Or we could find some way to let users use only the stable core features, but that looks like a non-technical issue anyway (again, latest stable kernels are always preferred to address all kernel issues). [1] https://erofs.docs.kernel.org/en/latest/core_ondisk.html |
Some runtimes like gVisor [1] have already landed core on-disk EROFS support in their own Go implementation to enable efficient image passthrough to sandboxes. But an alternative way (like cgo) is helpful, since EROFS is still actively under development and keeping implementations in various languages up to date is somewhat challenging due to limited time and engineering resources (although we may later have an experimental Rust implementation developed by students). Anyway, C is still quite a portable language across all architectures / platforms / distributions. |
Fixes opencontainers/image-spec#1190 Signed-off-by: Ramkumar Chinchani <[email protected]>
@hsiangkao what is a good forum to coordinate the cgo changes? and where should they land? is the project receptive to any cgo-related refactoring? At the very least the main primitives that are needed (as golang APIs) are:
What we are trying to achieve is substitute tar blob workflows with erofs blob ones with minimal impact to ecosystem tooling. |
Hi @rchincha, Thanks for your reply!
OCI meeting time is too late for me to attend regularly (It seems 0:00~1:00AM on my side)...sigh...
A erofs-go project or just integrate into erofs-utils? which one you prefer?
Yeah, definitely!
Agreed.
Yeah, it seems containerd could also use it too.
What's the purpose of this API?
Yeah, I'm very happy to coordinate and work, thanks! |
@hsiangkao coordination doesn't have to be via a OCI meeting. It can just be over PR reviews and maybe more convenient. PopulateErofs() is just a placeholder API to indicate entries are copied into the newly created erofs layout. Maybe we just mount and copy things in/out instead and don't actually need this. |
Sounds good, though erofs-utils development has worked via a mailing list for years, like most kernel projects.
Okay, got it. |
You may also be interested in https://github.com/containers/composefs which uses EROFS for metadata, but splits out shared files into a backing store which gives various advantages like dedup on disk and in the page cache too. |
Yes, ostree+composefs is also a great way to distribute images, and composefs is more widely used now. In my spare time, I will try to keep working on containerd erofs snapshotter support if possible, and hopefully this snapshotter could support:
|
To be clear, the composefs core is agnostic to higher level tooling. Yes, it can be used with ostree, but we are also looking at a strong OCI native integration with composefs and some of that already exists in the containers/storage composefs backend: https://github.com/containers/storage/blob/main/docs/containers-storage-composefs.md Connecting with that but backing up to a higher level though...some of the discussion in this thread seems to be basically floating to "switch to EROFS instead of tar in an OCI v2"...I'm not really in favor of that...I think it'd be too traumatic for the ecosystem, and there's other ways to gain key desirable properties. Especially this thread started around handling large images in an incremental way, and I think it makes sense to try to standardize work on estargz and https://github.com/containers/storage/blob/main/docs/containers-storage-zstd-chunked.md |
I would be considerably in favor of using a filesystem as the medium instead of tar for a next-generation OCI format. We are definitely running into problems (as many others have outlined) with the tarball-based format, and the band-aid solutions to work around it have similar portability problems with not enough benefit over just adopting something like EROFS for this. It has always bothered me that we called OCI an image format when it in fact isn't one, and with things like Fedora's bootc attempting to glue OCI to operating systems, having an OCI format that's actually image based would permit those things to work with some kind of reasonable performance and scalability. |
Just my personal opinion, I think the ways to enhance tar but make tar compatible like (e)stargz, zstd::chunked or SOCI are all good. Yet it'd be better to give another option of their TOC format in addition to JSON. |
There are known issues with the tar format. Folks are already moving toward layers as full filesystems - soci etc being a smart tarfs (via fuse) while still staying with tar etc. Also you don't have to use a newer format if you don't want to. This is still a consensus-building exercise until working prototypes are demonstrated I suppose. |
Hi @rchincha, JFYI, file-backed mount has been merged upstream: |
Now that OCI artifacts have landed and are getting mindshare and use cases, some issues are popping up. Best to standardize them.
Perhaps time to resurrect this?
https://groups.google.com/a/opencontainers.org/g/dev/c/Zk3yf45HIdA