
Blosc2 LZ4 is 2.8x slower than Blosc LZ4 #270

Open
dmbelov opened this issue Sep 2, 2023 · 9 comments

Comments

@dmbelov

dmbelov commented Sep 2, 2023

I checked out hdf5plugin tag v4.1.3 and compiled it with all optimizations up to AVX2. The Python code below shows that Blosc2 is 2.8x slower than Blosc (I am running both compressors on a single thread). Is this expected? Am I doing something wrong?

import hdf5plugin
import h5py
import numpy

In [2]: hdf5plugin.get_config()
Out[2]: HDF5PluginConfig(build_config=HDF5PluginBuildConfig(openmp=True, native=True, bmi2=True, sse2=True, avx2=True, avx512=False, cpp11=True, cpp14=True, ipp=False, filter_file_extension='.so', embedded_filters=('blosc', 'blosc2', 'bshuf', 'bzip2', 'fcidecomp', 'lz4', 'sz', 'sz3', 'zfp', 'zstd')), registered_filters={'bshuf': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5bshuf.so', 'blosc': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5blosc.so', 'blosc2': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5blosc2.so', 'bzip2': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5bzip2.so', 'fcidecomp': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5fcidecomp.so', 'lz4': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5lz4.so', 'sz': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5sz.so', 'sz3': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5sz3.so', 'zfp': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5zfp.so', 'zstd': '/home/dmitryb/soft/hdf5plugin/build/lib.linux-x86_64-3.10/hdf5plugin/plugins/libh5zstd.so'})

def testh5(fname, x, **kwargs):
    with h5py.File(fname, 'w') as h5file:
        h5file.create_dataset('/x', data=x, **kwargs)

In [9]: x_abc = numpy.random.normal(size=(20, 50, 1000))

In [10]: %timeit testh5('blosc.h5', x_abc, **hdf5plugin.Blosc(cname='lz4', clevel=9, shuffle=1))
9.74 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit testh5('blosc2.h5', x_abc, **hdf5plugin.Blosc2(cname='lz4', clevel=9, filters=1))
27.9 ms ± 570 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@t20100
Member

t20100 commented Sep 4, 2023

Hi,

I would not expect such a difference.
I ran the same code a few times and also got blosc (1) faster, though with less of a difference:

  • blosc (1):
    • From 62.6 ms ± 4.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    • Up to 79.3 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • blosc2:
    • From 106 ms ± 5.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    • Up to 112 ms ± 8.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

From a quick look, both plugins are compiled with the same compiler options, both libs contain AVX2 instructions, and both use the same LZ4 lib.

attn @FrancescAlted

@t20100
Member

t20100 commented Sep 4, 2023

Having a second look at this, it looks to be related to HDF5 chunking.
To get the chunking:

with h5py.File('blosc.h5', 'r') as h5file:
    print(h5file['x'].chunks)

Using larger chunks, e.g., chunks=x_abc.shape, instead of the default chunking ((3, 7, 250), i.e. 5250 elements per chunk) leads to similar performance for blosc and blosc2:

%timeit testh5('blosc.h5', x_abc, **hdf5plugin.Blosc(cname='lz4', clevel=9, shuffle=1), chunks=x_abc.shape)
75 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit testh5('blosc2.h5', x_abc, **hdf5plugin.Blosc2(cname='lz4', clevel=9, filters=1), chunks=x_abc.shape)
72.1 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
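
For scale, a quick check of how much data each filter call receives with the default chunking versus one big chunk (a small sketch based on the dataset above; the default (3, 7, 250) chunk of float64 values is only about 41 KiB, versus 8 MB for the whole array):

import numpy

x_abc = numpy.random.normal(size=(20, 50, 1000))

# Bytes handed to the compressor per chunk with the default HDF5 chunking,
# compared to a single chunk covering the whole array
default_chunk = (3, 7, 250)
chunk_bytes = int(numpy.prod(default_chunk)) * x_abc.dtype.itemsize
print(f"default chunk: {chunk_bytes} bytes (~{chunk_bytes / 1024:.0f} KiB)")
print(f"full array:    {x_abc.nbytes} bytes (~{x_abc.nbytes / 1e6:.0f} MB)")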

@FrancescAlted

Well seen @t20100! Indeed, 5250 elements is very little for Blosc2, which needs much larger chunk sizes to scale well. FWIW, to eliminate any doubt about the hdf5plugin implementation, here is the output of a script using h5py (via hdf5plugin), PyTables (via the internal blosc2 plugin) and Python-Blosc2 (a native wrapper for blosc2):

$ python compare-blosc-blosc2.py
time blosc (h5py): 2.201
time blosc2 (h5py): 2.122
time blosc (tables): 2.116
time blosc2 (tables): 2.165
time blosc2 (blosc2): 0.439

And the sizes:

$ ls -l *.h5 *.b2nd
-rw-r--r-- 1 faltet blosc 20301232 sep  4 17:28 blosc-h5py.h5
-rw-r--r-- 1 faltet blosc 20301592 sep  4 17:28 blosc-tables.h5
-rw-r--r-- 1 faltet blosc 20015938 sep  4 17:28 blosc2-h5py.h5
-rw-r--r-- 1 faltet blosc 20016298 sep  4 17:28 blosc2-tables.h5
-rw-r--r-- 1 faltet blosc 20097086 sep  4 17:28 blosc2nd.b2nd

The script is below (I have changed the random distribution so that actual compression happens, and increased the chunk size to 80 MB for better reproducibility):

import tables
import hdf5plugin
import h5py
import numpy
from time import time
import blosc2


_ = hdf5plugin.get_config()
#print(_)

def testh5(fname, x, **kwargs):
    with h5py.File(fname, 'w') as h5file:
        h5file.create_dataset('/x', data=x, chunks=x.shape, **kwargs)

def testh5_tables(fname, x, filters):
    with tables.open_file(fname, "w") as h5file:
        h5file.create_carray('/', 'x', filters=filters, obj=x, chunkshape=x.shape)

def test_blosc2(fname, x, cparams):
    blosc2.asarray(x, urlpath=fname, mode="w", cparams=cparams, chunks=x.shape)

#x_abc = numpy.random.normal(size=(20, 500, 1000))
rng = numpy.random.default_rng()
x_abc = rng.integers(low=0, high=10000, size=(20, 500, 1000), dtype=numpy.int64)

cname = "lz4"
codec = blosc2.Codec.LZ4
clevel = 1
shuffle = True
chunks = x_abc.shape

### h5py ###
t0 = time()
for i in range(10):
    testh5('blosc-h5py.h5', x_abc, **hdf5plugin.Blosc(cname=cname, clevel=clevel, shuffle=shuffle))
print(f"time blosc (h5py): {time() - t0:.3f}")

t0 = time()
for i in range(10):
    testh5('blosc2-h5py.h5', x_abc, **hdf5plugin.Blosc2(cname=cname, clevel=clevel, filters=shuffle))
print(f"time blosc2 (h5py): {time() - t0:.3f}")

### pytables ###
t0 = time()
for i in range(10):
    filters = tables.Filters(complevel=clevel, complib="blosc:%s" % cname, shuffle=True)
    testh5_tables('blosc-tables.h5', x_abc, filters)
print(f"time blosc (tables): {time() - t0:.3f}")

t0 = time()
for i in range(10):
    filters = tables.Filters(complevel=clevel, complib="blosc2:%s" % cname, shuffle=True)
    testh5_tables('blosc2-tables.h5', x_abc, filters)
print(f"time blosc2 (tables): {time() - t0:.3f}")

### blosc2 NDim ###
t0 = time()
for i in range(10):
    cparams = {"codec": codec, "clevel": clevel, "filters": [blosc2.Filter.SHUFFLE]}
    test_blosc2('blosc2nd.b2nd', x_abc, cparams)
print(f"time blosc2 (blosc2): {time() - t0:.3f}")

@FrancescAlted

FWIW, if one still wants small chunks, it is better to use a single thread with Blosc/Blosc2. With the original array (8 MB):

$ BLOSC_NTHREADS=1 python compare-blosc-blosc2.py
time blosc (h5py): 0.920
time blosc2 (h5py): 0.883
time blosc (tables): 0.800
time blosc2 (tables): 0.903
time blosc2 (blosc2): 0.060
$ ls -l *.h5 *.b2nd
-rw-r--r-- 1 faltet blosc 7522886 sep  4 17:58 blosc-h5py.h5
-rw-r--r-- 1 faltet blosc 7523246 sep  4 17:58 blosc-tables.h5
-rw-r--r-- 1 faltet blosc 7533880 sep  4 17:58 blosc2-h5py.h5
-rw-r--r-- 1 faltet blosc 7534240 sep  4 17:58 blosc2-tables.h5
-rw-r--r-- 1 faltet blosc 7515025 sep  4 17:58 blosc2nd.b2nd
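
If setting the variable on the command line is not convenient, the same effect can be obtained from inside the script, assuming the variable is set before the filters run (a minimal sketch, not part of the benchmark above):

import os

# Both blosc and blosc2 read BLOSC_NTHREADS from the environment, so it must
# be set before the compression actually runs.
os.environ["BLOSC_NTHREADS"] = "1"

import hdf5plugin
import h5py
import numpy

x_abc = numpy.random.normal(size=(20, 50, 1000))
with h5py.File('blosc2.h5', 'w') as h5file:
    h5file.create_dataset('/x', data=x_abc,
                          **hdf5plugin.Blosc2(cname='lz4', clevel=9, filters=1))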

@dmbelov
Author

dmbelov commented Sep 5, 2023

Thank you for all the suggestions. BLOSC_NTHREADS=1 works best for me; increasing the chunk shape only increases compression time for both BLOSC and BLOSC2. Here are the timings with auto and x.shape chunks:

time blosc (h5py, chunks=True): 0.284
time blosc2 (h5py, chunks=True): 0.316
time blosc (h5py, chunks=(20, 500, 1000)): 0.484
time blosc2 (h5py, chunks=(20, 500, 1000)): 0.597

For some reason blosc2 is slower than blosc on my CPU.

Questions about python-blosc2:

  1. Is there a way to install python-blosc2 using conda?
  2. Why is python-blosc2 so much faster than blosc2 in HDF5? Is there a way to make HDF5 as fast as python-blosc2?

@FrancescAlted

Yes, performance for such 'small' datasets tends to be quite dependent on the CPU. On my MacBook Air (M1 processor):

$ BLOSC_NTHREADS=1 python prova.py
time blosc (h5py, chunks=None): 0.494
time blosc2 (h5py, chunks=None): 0.364
time blosc (h5py, chunks=(20, 500, 1000)): 0.481
time blosc2 (h5py, chunks=(20, 500, 1000)): 0.285

Regarding your questions:

  1. The Blosc team is not responsible for producing conda binaries (maybe pinging someone on the conda teams would help?). However, we provide binary wheels that should be easily installable via pip install blosc2. I regularly install it like this in conda environments, and it works great.
  2. HDF5 uses a plugin subsystem that is known to be slow. There are solutions for making HDF5 faster, and we are making rapid progress on implementing this in PyTables; we have plans to port this effort to h5py too. Stay tuned! (A sketch of using python-blosc2 directly follows below.)
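
For reference, using python-blosc2 directly (bypassing HDF5 entirely) looks roughly like the test_blosc2 function in the script above; a minimal sketch, assuming blosc2.open is available for reading the file back:

import numpy
import blosc2

x = numpy.random.normal(size=(20, 500, 1000))

# Write: compress the array into a .b2nd file with LZ4 + shuffle
cparams = {"codec": blosc2.Codec.LZ4, "clevel": 9,
           "filters": [blosc2.Filter.SHUFFLE]}
blosc2.asarray(x, urlpath="x.b2nd", mode="w", cparams=cparams, chunks=x.shape)

# Read back: blosc2.open maps the file; slicing decompresses into NumPy
arr = blosc2.open("x.b2nd")
y = arr[:]
assert numpy.allclose(x, y)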

@dmbelov
Author

dmbelov commented Sep 5, 2023

Thanks. Two more questions:

  1. In the article below, Intel suggests that the IPP library improves the performance of the LZ4 algorithm. Have you tried it? If yes, how can I compile the lz4 used in hdf5plugin to use IPP?
     https://www.intel.com/content/www/us/en/developer/articles/technical/building-a-faster-lz4-with-intel-integrated-performance-primitives.html
  2. Currently, I can compile hdf5plugin to either use or not use AVX512 (and other optimizations). Do you have plans to write code that dynamically selects the right code path at run time, depending on what the CPU supports?

@t20100
Member

t20100 commented Sep 6, 2023

how can I compile lz4 used in hdf5plugin to use IPP?

There is support in hdf5plugin for using lz4 from IPP: set the env. var. HDF5PLUGIN_INTEL_IPP_DIR to the path of IPP (see http://www.silx.org/doc/hdf5plugin/latest/install.html#available-options).
When I tried it, the performance improvement was not worth it on my machine, but I expect this to depend on the CPU.
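
After rebuilding with that variable set, whether IPP was actually picked up can be checked from the build config (the same hdf5plugin.get_config() call as at the top of this issue; the ipp field is False in the output shown there):

import hdf5plugin

# Reports True only for a build made with HDF5PLUGIN_INTEL_IPP_DIR set
build = hdf5plugin.get_config().build_config
print("Built with Intel IPP:", build.ipp)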

choose correct code at run time depending on what is supported by CPU?

This is already done in blosc and blosc2 (see blosc_get_cpu_features in src/c-blosc2/blosc/shuffle.c) but not for the bitshuffle filter where this is chosen at compile time (see e.g., https://github.com/silx-kit/hdf5plugin/blob/b4f2914f75e178a0a0f5e5e0eb06f588f92554c9/src/bitshuffle/src/bitshuffle_core.c#L1798C9-L1814).

BTW, blosc(1|2) does not support AVX512; that optimization only applies to the bitshuffle filter.

@FrancescAlted

  1. Indeed. Blosc2 gained support for using IPP several years ago, but we have disabled it by default because there was not enough evidence that it leads to better speed / compression ratio. Moreover, @t20100 reports a similar experience above.
  2. No support for AVX512 for Blosc2 yet.

Having said that, my personal view is that there is not much point in optimizing the compressor too much unless you find a way to bypass the HDF5 pipeline and do direct chunking (see benchmarks in e.g. https://www.blosc.org/posts/blosc2-pytables-perf/).
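
To illustrate what bypassing the pipeline means on the h5py side, the low-level read_direct_chunk / write_direct_chunk calls (available in recent h5py versions) move raw, still-compressed chunk bytes without going through the filter plugins; a read-side sketch for the file written earlier (decoding the payload would additionally require knowing the exact frame format the Blosc2 filter writes, which is not shown here):

import h5py
import hdf5plugin  # not strictly needed for raw chunk access, but registers the filters

with h5py.File('blosc2.h5', 'r') as h5file:
    dset = h5file['x']
    print("chunk shape:", dset.chunks)
    # Raw compressed bytes of the first chunk; no filter pipeline involved
    filter_mask, raw = dset.id.read_direct_chunk((0, 0, 0))
    print(f"first chunk holds {len(raw)} compressed bytes (filter mask {filter_mask})")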
