For more efficient Git packing of ZIP based files.
Many popular applications, such as Microsoft Office and OpenOffice, save their documents as XML in compressed ZIP containers. Small changes to these documents' contents may result in big changes to their compressed binary container file. When compressed files are stored in a Git repository, these big differences make delta compression inefficient or impossible, and the repository size is roughly the sum of the sizes of all its revisions.
This small program acts as a Git clean filter driver. It reads a ZIP file from stdin and outputs the same ZIP content to stdout, but without compression.
Advantages:
- human readable/plain-text diffs of (ZIP based) archives (if they contain plain-text files)
- smaller overall repository size if the archive contents change frequently
Disadvantages:
- slower git add/git commit process
- (optional) slower checkout process
On every git add operation, the files assigned to the ZIP based file type in .gitattributes are piped through this filter to remove their compression.
Git internally uses zlib compression to store the resulting blob,
so the final size of the loose object in the repository is usually comparable
to the size of the original compressed ZIP document.
The advantage of passing uncompressed data to Git is that during garbage collection, when Git merges loose objects into packfiles, its delta compression can pack the common data it finds among these uncompressed revisions much more efficiently. This can reduce the repository size by up to 50%, depending on the data.
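To check the effect on your own data, you can measure the packed size of a repository after garbage collection and compare it between a clone that uses the filter and one that does not. The following is only a minimal sketch in a scratch repository: the file name data.txt, its contents, and the demo commit identity are placeholders, not part of this project.

```shell
# Scratch repository so nothing in a real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Placeholder content; in practice this would be your tracked ZIP documents.
seq 1 20000 > data.txt
git add data.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m "first revision"

# Merge loose objects into a packfile, as garbage collection does.
git gc --quiet

# size-pack is the disk space used by packfiles, in KiB.
pack_kib=$(git count-objects -v | awk '/^size-pack:/ {print $2}')
echo "packed size: ${pack_kib} KiB"
```

In a real repository only the git gc and git count-objects steps are needed. Run them once with the filter in place and once without to get comparable numbers.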
The smudge filter will re-compress the ZIP documents when they are checked out. The rezipped file may differ in size from the original, because of the compression level used by the filter. Using this filter at checkout saves disk space in the working directory, at the expense of checkout performance. I have not yet found an application that refuses to read an uncompressed ZIP document, so the smudge filter is optional. This also means that repositories may be downloaded and used immediately, without any special burden on the recipients to install this filter driver.
If other contributors add compressed ZIP documents to the repository without using the clean filter (the one applied during add/commit), the only harm will be the usual loss of packing efficiency for compressed documents during garbage collection, and non-verbose diffs.
The idea to commit ZIP documents to the repository in uncompressed form was based on concepts demonstrated in the Mercurial Zipdoc extension by Andreas Gobell.
OoXmlUnpack is a similar program for Mercurial, written in C#, which also pretty-prints the XML files and adds some file handling features specific to Excel.
callegar/Rezip should be compatible with this Git filter, but is written as a bash script to drive Info-ZIP zip/unzip executables.
Zippey is a similar method available for Git, written in Python, but it stores uncompressed data as custom records within the Git repository. This format is not directly usable without the smudge filter, so it is a less portable option.
This filter is only concerned with the efficient storage of ZIP data within Git. For human readable diffs between revisions, you will need to add a Git textconv program that can convert your format into text. Direct merges are not possible, since they would corrupt the ZIP CRC checksums. If the data within the ZIP is plain text, then you could visualize differences with a textconv program like zipdoc.
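As an illustration of how a textconv program is wired up, here is a minimal sketch in a scratch repository. It does not use zipdoc. Instead it assumes that Info-ZIP's unzip is on your PATH and uses "unzip -c -a", which prints the archive members as text. The diff driver name "zip" and the file name report.docx are arbitrary choices for this example.

```shell
# Scratch repository so the example does not change a real project.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Define a diff driver named "zip" (name chosen for this example).
# Assumption: Info-ZIP unzip is installed; it is only invoked when a diff runs.
git config diff.zip.textconv "unzip -c -a"

# Route a ZIP based file type through that driver.
printf '*.docx diff=zip\n' >> .gitattributes

# Confirm which diff driver Git would use for a given path.
driver=$(git check-attr diff -- report.docx)
echo "$driver"
# report.docx: diff: zip
```

In an existing repository only the git config line and the .gitattributes entry are needed. The diff attribute can sit on the same line as the filter attribute, for example "*.docx filter=rezip diff=zip".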
For more complex documents, there are domain specific options, for example for word processing, Excel, and Simulink.
This program requires Java JRE 8 or newer.
Store ReZip.class somewhere in your home directory, for example ~/bin, or in your repository.
Define the filter drivers in ~/.gitconfig:
git config --global --replace-all filter.rezip.clean "java -cp ~/bin ReZip --store"
# optionally add smudge filter:
git config --global --add filter.rezip.smudge "java -cp ~/bin ReZip"
Assign filter attributes to paths in <repo-root>/.gitattributes:
# MS Office
*.docx filter=rezip
*.xlsx filter=rezip
*.pptx filter=rezip
# OpenOffice
*.odt filter=rezip
*.ods filter=rezip
*.odp filter=rezip
# Misc
*.mcdx filter=rezip
*.slx filter=rezip
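To confirm that the patterns match the files you expect, git check-attr reports the filter attribute for any path. The sketch below uses a scratch repository and made-up file names (report.docx, model.slx, notes.txt). The paths do not need to exist for the check to work.

```shell
# Scratch repository with a reduced .gitattributes for illustration.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
printf '*.docx filter=rezip\n*.slx filter=rezip\n' > .gitattributes

# Ask Git which filter attribute applies to each (hypothetical) path.
attrs=$(git check-attr filter -- report.docx model.slx notes.txt)
echo "$attrs"
# report.docx: filter: rezip
# model.slx: filter: rezip
# notes.txt: filter: unspecified
```

In your own repository, run the git check-attr line from the repository root against real paths. A result of "unspecified" means the file will bypass the filter.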
As described in gitattributes, you may see unnecessary merge conflicts when you add attributes to a file that cause the repository format for that file to change. To prevent this, Git can be told to run a virtual check-out and check-in of all three stages of a file when resolving a three-way merge:
git config --add --bool merge.renormalize true
The following are based on my experience in real-world cases. Use at your own risk. Your mileage may vary.
- One packed repository with rezip was 54% of the size of the packed repository storing compressed ZIPs.
- Another repository with 280 *.slx files and over 3000 commits was originally 281 MB and was reduced to 156 MB using this technique (55% of baseline).
In a further case, a repository tracking a *.pptx presentation, I found that the loose objects stored without this filter were about 5% smaller than the original file size (zlib on top of ZIP compression). When using the rezip filter, the loose objects were about 10% smaller than the original files, since zlib could work more efficiently on uncompressed data. The packed repository with rezip was only 10% smaller than the packed repository storing compressed ZIPs. I think this unremarkable efficiency improvement is due to a large number of *.png files in the presentation, which were already stored without compression in the original *.pptx.