Blosc2 LZ4 is 2.8x slower than Blosc LZ4 #270
Hi, I would not expect such a difference.
From a quick look, both plugins are compiled with the same compiler options, both libs contain AVX2 instructions, and both use the same LZ4 lib. attn @FrancescAlted |
Having a second look at this, it looks to be related to HDF5 chunking.
Using larger chunks, e.g.,
|
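For a sense of scale, a quick back-of-envelope sketch of the chunk sizes discussed in this thread (figures taken from the comments: ~5250 int64 elements per HDF5 chunk vs. one chunk covering the whole 20 x 500 x 1000 int64 array):

```python
# Back-of-envelope chunk sizes for the figures discussed in this thread:
# ~5250 int64 elements per HDF5 chunk vs. one chunk covering the whole
# 20 x 500 x 1000 int64 array.
itemsize = 8  # bytes per int64 element

small_chunk = 5250 * itemsize
big_chunk = 20 * 500 * 1000 * itemsize

print(small_chunk)         # 42000 bytes (~41 KiB): little room for Blosc2 to scale
print(big_chunk // 2**20)  # 76 (MiB): comfortably large for a multithreaded codec
```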
Well seen @t20100 ! Indeed, 5250 elements is very little for Blosc2, which requires much larger chunksizes to scale well. FWIW, to eliminate any doubt about the hdf5plugin implementation, here is the output of a script using h5py (via hdf5plugin), PyTables (via the internal blosc2 plugin) and Python-Blosc2 (a native wrapper for blosc2):

$ python compare-blosc-blosc2.py
time blosc (h5py): 2.201
time blosc2 (h5py): 2.122
time blosc (tables): 2.116
time blosc2 (tables): 2.165
time blosc2 (blosc2): 0.439

And the sizes:

$ ls -l *.h5 *.b2nd
-rw-r--r-- 1 faltet blosc 20301232 sep 4 17:28 blosc-h5py.h5
-rw-r--r-- 1 faltet blosc 20301592 sep 4 17:28 blosc-tables.h5
-rw-r--r-- 1 faltet blosc 20015938 sep 4 17:28 blosc2-h5py.h5
-rw-r--r-- 1 faltet blosc 20016298 sep 4 17:28 blosc2-tables.h5
-rw-r--r-- 1 faltet blosc 20097086 sep 4 17:28 blosc2nd.b2nd

The script is here (I have changed the random distribution to allow actual compression to happen, and increased the chunk to 80 MB, for better reproducibility):

import tables
import hdf5plugin
import h5py
import numpy
from time import time
import blosc2

_ = hdf5plugin.get_config()
#print(_)

def testh5(fname, x, **kwargs):
    with h5py.File(fname, 'w') as h5file:
        h5file.create_dataset('/x', data=x, chunks=x.shape, **kwargs)

def testh5_tables(fname, x, filters):
    with tables.open_file(fname, "w") as h5file:
        h5file.create_carray('/', 'x', filters=filters, obj=x, chunkshape=x.shape)

def test_blosc2(fname, x, cparams):
    blosc2.asarray(x, urlpath=fname, mode="w", cparams=cparams, chunks=x.shape)

#x_abc = numpy.random.normal(size=(20, 500, 1000))
rng = numpy.random.default_rng()
x_abc = rng.integers(low=0, high=10000, size=(20, 500, 1000), dtype=numpy.int64)

cname = "lz4"
codec = blosc2.Codec.LZ4
clevel = 1
shuffle = True
chunks = x_abc.shape

### h5py ###
t0 = time()
for i in range(10):
    testh5('blosc-h5py.h5', x_abc, **hdf5plugin.Blosc(cname=cname, clevel=clevel, shuffle=shuffle))
print(f"time blosc (h5py): {time() - t0:.3f}")

t0 = time()
for i in range(10):
    testh5('blosc2-h5py.h5', x_abc, **hdf5plugin.Blosc2(cname=cname, clevel=clevel, filters=shuffle))
print(f"time blosc2 (h5py): {time() - t0:.3f}")

### pytables ###
t0 = time()
for i in range(10):
    filters = tables.Filters(complevel=clevel, complib="blosc:%s" % cname, shuffle=True)
    testh5_tables('blosc-tables.h5', x_abc, filters)
print(f"time blosc (tables): {time() - t0:.3f}")

t0 = time()
for i in range(10):
    filters = tables.Filters(complevel=clevel, complib="blosc2:%s" % cname, shuffle=True)
    testh5_tables('blosc2-tables.h5', x_abc, filters)
print(f"time blosc2 (tables): {time() - t0:.3f}")

### blosc2 NDim ###
t0 = time()
for i in range(10):
    cparams = {"codec": codec, "clevel": clevel, "filters": [blosc2.Filter.SHUFFLE]}
    test_blosc2('blosc2nd.b2nd', x_abc, cparams)
print(f"time blosc2 (blosc2): {time() - t0:.3f}") |
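As a sanity check on the reported file sizes: the raw array is 80 MB of int64 data, so the ~20 MB files correspond to roughly 4x compression (using the `blosc2-h5py.h5` size from the listing above):

```python
raw_bytes = 20 * 500 * 1000 * 8   # int64 array generated by the script
blosc2_h5py = 20_015_938          # size of blosc2-h5py.h5 from the listing above

ratio = raw_bytes / blosc2_h5py
print(f"{ratio:.2f}x")            # -> 4.00x
```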
FWIW, if one still wants small chunks, it is better to use a single thread with Blosc/Blosc2. With the original array (8 MB):
|
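One convenient way to force a single thread without editing the script is the `BLOSC_NTHREADS` environment variable, which (as documented by both C libraries) is read by Blosc and Blosc2; a sketch, with the caveat that it must be set before the filter is first used:

```python
import os

# Ask Blosc/Blosc2 for a single compression/decompression thread.
# This must be set before the library initializes, e.g. before the
# first HDF5 read/write that goes through the filter.
os.environ["BLOSC_NTHREADS"] = "1"

print(os.environ["BLOSC_NTHREADS"])  # -> 1
```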
Thank you for all the suggestions.
For some reason Questions about
|
Yes, performance for such 'small' datasets tends to be quite dependent on the CPU. On my MacBook Air (M1 processor):
Regarding your questions:
|
Thanks. Two more questions:
|
There is support in
This is already done in BTW, |
Having said that, my personal view is that there is not much point in optimizing the compressor too much, unless you find a way to bypass the HDF5 pipeline and do direct chunking (see benchmarks in e.g. https://www.blosc.org/posts/blosc2-pytables-perf/). |
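For reference, "direct chunking" is exposed in h5py through the low-level `write_direct_chunk` API. A minimal sketch (the compression step is omitted for brevity; with a real Blosc2 filter registered, `data` would be an already-compressed chunk rather than raw bytes):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "direct.h5")
x = np.arange(1000, dtype=np.int64).reshape(10, 100)

with h5py.File(path, "w") as f:
    # One chunk covering the whole dataset, matching the benchmarks above.
    dset = f.create_dataset("x", shape=x.shape, chunks=x.shape, dtype=x.dtype)
    # Hand HDF5 a ready-made chunk buffer, bypassing the filter pipeline.
    # The offsets tuple selects which chunk to write; (0, 0) is the only one.
    dset.id.write_direct_chunk((0, 0), x.tobytes())

with h5py.File(path, "r") as f:
    assert (f["x"][:] == x).all()
print("roundtrip ok")
```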
I checked out hdf5plugin tag v4.1.3 and compiled it with all optimizations up to AVX2. The Python code below shows that Blosc2 is 2.8x slower than Blosc (I am running both compressors on a single thread). Is this expected? Am I doing something wrong?