Releases · neuralmagic/deepsparse
DeepSparse v0.12.2 Patch Release
This is a patch release for 0.12.0 that contains the following changes:
- Protobuf is restricted to versions < 4.0, as newer versions break ONNX.
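A quick runtime check that an environment honors the pin (a minimal sketch; only the standard `google.protobuf` version attribute is assumed):

```python
# Verify the installed protobuf is below 4.0, the range compatible with ONNX here.
import google.protobuf

major = int(google.protobuf.__version__.split(".")[0])
assert major < 4, f"protobuf {google.protobuf.__version__} is too new; install protobuf<4.0"
```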
DeepSparse v0.12.1 Patch Release
This is a patch release for 0.12.0 that contains the following changes:
- Improper label mapping no longer crashes for validation flows within DeepSparse transformers.
- DeepSparse Server now exposes proper routes for SageMaker.
- A dependency issue with DeepSparse Server is resolved: an old version of a library that caused crashes in some use cases is no longer installed.
DeepSparse v0.12.0
New Features:
Documentation:
- SparseServer.UI: a Streamlit app that deploys the DeepSparse Server for exploring the inference performance of BERT on the question answering task.
- DeepSparse Server README: `deepsparse.server` capabilities, including single-model and multi-model inferencing.
- Twitter NLP Inference Examples added.
Changes:
Performance:
- Speedup for large batch sizes when using sync mode on AMD EPYC processors.
- AVX2 improvements for:
  - Up to 40% speedup out of the box for dense quantized models.
  - Up to 20% speedup for pruned quantized BERT, ResNet-50, and MobileNet.
- Speedup from sparsity realized for ConvInteger operators.
- Model compilation time decreased on systems with many cores.
- Multi-stream Scheduler: certain computations that were executed during runtime are now precomputed.
- Hugging Face Transformers integration updated to latest state from upstream main branch.
Documentation:
- DeepSparse README: references to `deepsparse.server`, `deepsparse.benchmark`, and Transformer pipelines.
- DeepSparse Benchmark README: highlights of the `deepsparse.benchmark` CLI command.
- Transformers 🤗 Inference Pipelines: examples included on how to run inference via Python for several NLP tasks (see the sketch below).
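As a rough illustration of the pipeline usage these docs describe (a minimal sketch; the task string, keyword arguments, and the SparseZoo stub below are assumptions that may differ by version):

```python
# Question-answering inference through the DeepSparse transformers pipeline.
from deepsparse.transformers import pipeline

qa_pipeline = pipeline(
    task="question-answering",
    model_path="zoo:some/question_answering/stub",  # hypothetical SparseZoo stub
)

# Run a single QA query; the pipeline handles tokenization and engine execution.
print(qa_pipeline(question="What runs the model?", context="The DeepSparse Engine runs the model on CPUs."))
```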
Resolved Issues:
- When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine no longer disables optimizations and suffers very poor performance.
- Users executing `arch.bin` now receive a correct architecture profile of their system.
Known Issues:
- When running the DeepSparse Engine on a system with a nonuniform system topology, for example, an AMD EPYC processor where some cores per core-complex (CCX) have been disabled, model compilation will never terminate. A workaround is to set the environment variable `NM_SERIAL_UNIT_GENERATION=1`, as sketched below.
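Setting the variable before the import is the safe pattern, so the engine sees it at load time (a minimal sketch; the variable name comes from the note above, everything else is illustrative):

```python
# Workaround for non-terminating model compilation on nonuniform topologies:
# set NM_SERIAL_UNIT_GENERATION before deepsparse is imported.
import os

os.environ["NM_SERIAL_UNIT_GENERATION"] = "1"

from deepsparse import compile_model  # import only after setting the variable
```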
DeepSparse v0.11.2 Patch Release
This is a patch release for 0.11.0 that contains the following changes:
- Fixed an assertion error that would occur when using `deepsparse.benchmark` on AMD machines with the argument `-pin none`.
Known Issues:
- When running quantized BERT with a sequence length not divisible by 4, the DeepSparse Engine will disable optimizations and see very poor performance; a padding workaround is sketched below.
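Until the fix lands, one way to stay on the optimized path is to pad the tokenizer's sequence length up to a multiple of 4 (a minimal sketch; the helper name is ours):

```python
# Round a sequence length up to the next multiple of 4 so the quantized
# BERT optimizations stay enabled on this release.
def pad_to_multiple(n: int, multiple: int = 4) -> int:
    return ((n + multiple - 1) // multiple) * multiple

assert pad_to_multiple(126) == 128  # padded up
assert pad_to_multiple(128) == 128  # already aligned
```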
DeepSparse v0.11.1 Patch Release
This is a patch release for 0.11.0 that contains the following changes:
- When running NanoDet-Plus-m, the DeepSparse Engine will no longer fail with an assertion (See #279).
- The DeepSparse Engine now respects the CPU affinity set by the calling thread. This is essential for the new command-line (CLI) tool `multi-process-benchmark.py` to function correctly; the script allows users to measure performance using multiple separate processes in parallel (see the sketch after this list).
- Fixed a performance regression on BERT batch size 1, sequence length 128 models.
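Because the engine inherits the caller's affinity, a process can pin itself before compiling a model (a minimal sketch; Linux-only since it relies on `os.sched_setaffinity`, and the ONNX path is a placeholder):

```python
import os

# Pin this process to cores 0-3; the engine compiled afterwards respects
# that affinity instead of spreading work across all cores.
os.sched_setaffinity(0, {0, 1, 2, 3})  # Linux-only call

from deepsparse import compile_model

engine = compile_model("model.onnx", batch_size=1)  # placeholder model path
```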
DeepSparse v0.11.0
New Features:
- High-performance sparse quantized convolutional neural networks supported on AVX2 systems.
- CCX detection added to the DeepSparse Engine for AMD systems.
- `deepsparse.server` integration and CLIs added with Hugging Face transformers pipelines support.
Changes:
- Performance improvements made for:
  - FP32 sparse BERT models
  - batch size 1 networks
  - quantized sparse BERT models
  - pooling operations
Resolved Issues:
- When hyperthreads are disabled in the BIOS, core/socket information on certain systems can now be detected.
- Hugging Face transformers validation flows for QQP now giving correct accuracy metrics.
- PyTorch downloads for YOLO model stubs are now supported.
Known Issues:
- When running NanoDet-Plus-m, the DeepSparse Engine will fail with an assertion (See #279). A hotfix is being pursued.
DeepSparse v0.10.0
New Features:
- Quantization support enabled on AVX2 instruction set for GEMM and elementwise operations.
- `NM_SPOOF_ARCH` environment variable added for testing different architectural configurations.
- Elastic scheduler implemented as an alternative to the single-stream or multi-stream schedulers (see the sketch after this list).
- `deepsparse.benchmark` application is now usable from the command line after installing deepsparse, to simplify benchmarking.
- `deepsparse.server` CLI and API added with transformers support to make serving models like BERT with pipelines easy.
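As a rough sketch combining two of these features, an architecture can be spoofed for testing and the elastic scheduler selected at compile time (the `NM_SPOOF_ARCH` value, the `scheduler` keyword, and its accepted strings are assumptions; the model path is a placeholder):

```python
import os

# NM_SPOOF_ARCH is read when deepsparse initializes, so set it before the import.
os.environ["NM_SPOOF_ARCH"] = "avx2"  # illustrative value; consult the engine docs

from deepsparse import compile_model

# "elastic" is assumed to be selectable alongside the single-stream and
# multi-stream schedulers mentioned above.
engine = compile_model("model.onnx", batch_size=16, scheduler="elastic")
```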
Changes:
- More robust architecture detection added to help resolve CPU topology, such as when running inside a virtual machine.
- Tensor columns improved, leading to significant speedups from 5 to 20% in pruned YOLO (larger batch size), BERT (smaller batch size), MobileNet, and ResNet models.
- Sparse quantized network performance improved on machines that do not support VNNI instructions.
- Performance improved for dense BERT with large batch sizes.
Resolved Issues:
- Possible crashes eliminated for:
- Pooling operations with small image sizes
- Rarely, networks containing convolution or GEMM operations
- Some models with many residual connections
Known Issues:
- None
DeepSparse v0.9.1 Patch Release
This is a patch release for 0.9.0 that contains the following changes:
- YOLACT models and other models with constant outputs no longer fail with a mismatched shape error on multi-socket systems with batch sizes greater than 1. However, a corner case exists where a model with a constant output whose first dimension is equal to the (nonunit) batch size will fail.
- GEMM operations where the number of columns of the output matrix is not divisible by 16 will no longer fail with an assertion error.
- Broadcasted inputs to elementwise operators no longer fail with an assertion error.
- Int64 multiplications no longer fail with an illegal instruction on AVX2.
DeepSparse v0.9.0
New Features:
- Optimized support for Resize operators with the coordinate transformation modes `pytorch_half_pixel` and `align_corners` (see the sketch after this list).
- Up-to-date version check implemented for DeepSparse.
- YOLACT and DeepSparse integration added in examples/dbolya-yolact.
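For reference, the coordinate transformation is an attribute on the ONNX Resize node, so affected models are easy to spot (a minimal sketch using the standard `onnx` helper):

```python
# Build a Resize node that uses the pytorch_half_pixel coordinate
# transformation, one of the modes this release optimizes.
from onnx import helper

resize = helper.make_node(
    "Resize",
    inputs=["X", "roi", "scales"],
    outputs=["Y"],
    mode="linear",
    coordinate_transformation_mode="pytorch_half_pixel",
)
print(resize)
```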
Changes:
- The parameter for the number of sockets to use has been removed -- the Python interface now takes only the number of cores as a parameter (see the sketch after this list).
- Tensor columns have been optimized. Users will see performance improvements specifically for pruned quantized BERT models:
  - The softmax operator can now take advantage of tensor columns.
  - Inference batch sizes that are not divisible by 16 are now supported.
- Various performance improvements made to:
  - certain networks, such as YOLOv5, on AVX2 systems
  - dense convolutions on some AVX-512 systems
- API docs recompiled.
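As noted above, the core count is now the only placement parameter on the Python interface (a minimal sketch; the model path is a placeholder):

```python
# Sockets are no longer specified; only the number of cores is passed.
from deepsparse import compile_model

engine = compile_model("model.onnx", batch_size=1, num_cores=4)  # placeholder path
```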
Resolved Issues:
- In rare circumstances, users could have experienced an assertion error when executing networks with depthwise convolutions.
Known Issues:
- YOLACT models fail with a mismatched shape error on multi-socket systems with batch size greater than 1. This issue applies to any model with a constant output.
- In some circumstances, GEMM operations where the number of columns of the output matrix is not divisible by 16 may fail with an assertion error.
DeepSparse v0.8.0
New Features:
- Tensor columns have been optimized, improving the performance of some networks.
  - This includes but is not limited to pruned and quantized YOLOv5s and BERT.
  - Applies to networks with subgraphs composed of low-compute operations.
  - Batch size must be a multiple of 16 (see the sketch after this list).
- Reduce operators have been further optimized in the Engine.
- C++ API support is available for the DeepSparse Engine.
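Given the multiple-of-16 requirement, smaller batches can be padded up to a qualifying size and the extra rows discarded after inference (a minimal NumPy sketch; the helper name is ours):

```python
import numpy as np

# Pad a batch up to a multiple of 16 so tensor-column optimizations apply.
def pad_batch(x: np.ndarray, multiple: int = 16) -> np.ndarray:
    pad = (-x.shape[0]) % multiple
    if pad:
        x = np.concatenate([x, np.zeros((pad, *x.shape[1:]), dtype=x.dtype)])
    return x

batch = pad_batch(np.random.rand(10, 3, 224, 224).astype(np.float32))
assert batch.shape[0] == 16  # 10 real rows plus 6 padding rows
```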
Changes:
- Performance improvements made for low-precision (8- and 16-bit) datatypes on AVX2.
Resolved Issues:
- Rarely, when several data arrangement operators were in a row, e.g., Reshape, Transpose, or Slice, assertion errors occurred.
- When Pad operators were not followed by convolution or pooling, assertion errors occurred.
- CPU threads migrated between cores when running benchmarks.
Known Issues:
- None