tarball: Streaming image parser PoC #1429
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1429 +/- ##
=======================================
Coverage 73.40% 73.40%
=======================================
Files 115 115
Lines 8757 8804 +47
=======================================
+ Hits 6428 6463 +35
- Misses 1688 1696 +8
- Partials 641 645 +4
pkg/v1/tarball/image.go
Outdated
@@ -26,8 +26,10 @@ import (
	"path"
	"path/filepath"
	"sync"
	"unsafe"
Is there a library to track the size of all data referenced by this struct? unsafe.Sizeof returns the value 40 regardless of how many megabytes are already stored in the content and headers maps.
type TarBuffered struct {
tf *tar.Reader
pos int
headers map[int]*tar.Header
content map[int][]byte
EOF bool
}
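That 40 bytes is just the fixed struct header (pointer, int, two map headers, padded bool); unsafe.Sizeof never follows references. One way to answer the question is to count the buffered payloads manually. A minimal sketch, assuming a hypothetical BufferedBytes helper that is not part of the PR:

```go
package main

import "fmt"

// TarBuffered mirrors the struct from the PR; only the fields needed
// for size accounting are shown here.
type TarBuffered struct {
	pos     int
	content map[int][]byte
	EOF     bool
}

// BufferedBytes is a hypothetical helper (not in the PR): it walks the
// content map and sums the payload lengths, which is exactly the part
// unsafe.Sizeof cannot see (Sizeof only measures the map headers).
func (t *TarBuffered) BufferedBytes() int {
	total := 0
	for _, b := range t.content {
		total += len(b)
	}
	return total
}

func main() {
	t := &TarBuffered{content: map[int][]byte{
		0: make([]byte, 1<<20), // a 1 MiB buffered section
		1: make([]byte, 512),
	}}
	fmt.Println(t.BufferedBytes()) // prints 1049088 (1048576 + 512)
}
```

A precise deep-size library would also have to account for the tar.Header values; for a soft limit, summing the content slices is usually the dominant term.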
Given this code structure, I don't see an efficient way to avoid high memory usage without restructuring the image format itself; see opencontainers/image-spec#936.
This doesn't pass export from a stdin stream on my machine. Need to write that test.
Added #1436 that tests
TarBuffered scans the stream (`io.Reader`) once for a filename and saves unused sections in memory for later access. This should speed up parsing a bit, because right now the tarball is scanned several times, and it should save resources and time when parsing well-formed images from the network. See google#1339.
Go subclassing with partials is rather hardcore here
@@ -220,6 +221,7 @@
// If a caller doesn't read the full contents, they should Close it to free up
// resources used during extraction.
func Extract(img v1.Image) io.ReadCloser {
	logs.Debug.Printf("mutate: extract %T", img)
Check failure
Code scanning / CodeQL
Clear-text logging of sensitive information
Not sure how dumping the type of v1.Image gives access to AuthConfig.
I'm not sure how that could happen either, but I can tell you this debug logging will probably be very noisy and likely not very useful. If it's helpful while working on this change that's fine, but we should remove it before reviewing and merging.
Do you have any tools to replace Debug prints when figuring out how Go code works? I am doing a "pen and paper" reconstruction of the calls, which is not very efficient.
I typically just rely on logging and pen and paper debugging, but I've heard good things about delve for interactive debugging, which is also integrated into things like VSCode.
Yeah, this may just be a limitation of the format, and that format is unfortunately unlikely to change any time soon. Attempts have been made; the most successful is estargz, which abuses the tar format to allow random-access reads. That wouldn't help in your case, though, since you can't guarantee the input will be in the estargz format. Would it be reasonable to enforce some soft memory limit and fail if asked to buffer more than that during extraction? This could be configured with an env var, maybe.
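The suggested soft limit could look something like the sketch below. The env var name, the default cap, and the bufferSection helper are all assumptions for illustration; none of this exists in go-containerregistry:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxBufferBytes reads a soft cap from a hypothetical env var
// (the variable name is an assumption, not an existing knob).
func maxBufferBytes() int64 {
	const defaultCap = 64 << 20 // 64 MiB; arbitrary default for the sketch
	v := os.Getenv("GGCR_TARBALL_BUFFER_LIMIT")
	if v == "" {
		return defaultCap
	}
	n, err := strconv.ParseInt(v, 10, 64)
	if err != nil || n <= 0 {
		return defaultCap
	}
	return n
}

// bufferSection accounts for one buffered tar section and fails fast
// instead of growing past the cap, as the comment above suggests.
func bufferSection(buffered int64, section []byte, limit int64) (int64, error) {
	if buffered+int64(len(section)) > limit {
		return buffered, fmt.Errorf("tarball buffer limit of %d bytes exceeded", limit)
	}
	return buffered + int64(len(section)), nil
}

func main() {
	limit := maxBufferBytes()
	used, err := bufferSection(0, make([]byte, 1024), limit)
	fmt.Println(used, err) // prints: 1024 <nil>
}
```

Failing fast keeps the streaming path predictable: callers hit a clear error instead of unbounded memory growth on adversarial or badly ordered tarballs.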
This Pull Request is stale because it has been open for 90 days with |
The thought that I badly need the money that comes with a job keeps me from completing this PR. It's somewhat hard to process the code in a stressed state of mind, but I will try to finish this to put on a resume or something. In the meantime I found another contender for a static network-sync file format, https://github.com/zchunk/zchunk, which unlike