Inference System

System for machine learning inference.

Benchmark

Wanling Gao, Fei Tang, Jianfeng Zhan, et al. "AIBench: A Datacenter AI Benchmark Suite." BenchCouncil. [Paper] [Website]
BaiduBench: Benchmarking Deep Learning operations on different hardware. [GitHub]
Reddi, Vijay Janapa, et al. "MLPerf Inference Benchmark." arXiv preprint arXiv:1911.02549 (2019). [Paper] [GitHub]
Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277. [Paper]
Almeida, Mario, et al. "EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices." The 3rd International Workshop on Deep Learning for Mobile Systems and Applications. 2019. [Paper]

Model Management

Model Card Toolkit. The Model Card Toolkit (MCT) streamlines and automates generation of Model Cards [1], machine learning documents that provide context and transparency into a model's development and performance. [Paper] [GitHub]
DLHub: Model and data serving for science. [Paper]
- Chard, R., Li, Z., Chard, K., Ward, L., Babuji, Y., Woodard, A., Tuecke, S., Blaiszik, B., Franklin, M. and Foster, I., 2019, May.
- In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 283-292). IEEE.
Publishing and Serving Machine Learning Models with DLHub. [Paper]
TRAINS - Auto-Magical Experiment Manager & Version Control for AI [GitHub]
ModelDB: A system to manage ML models [GitHub] [MIT short paper]
iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]

Model Serving

Announcing RedisAI 1.0: AI Serving Engine for Real-Time Applications [Blog]
Cloudburst: Stateful Functions-as-a-Service. [Paper] [GitHub]
- Vikram Sreekanti, Chenggang Wu, Xiayue Charles Lin, Johann Schleier-Smith, Joseph E. Gonzalez, Joseph M. Hellerstein, Alexey Tumanov
- VLDB 2020
- Summary: a stateful FaaS platform. (1) Demonstrates the feasibility of general-purpose stateful serverless computing. (2) Autoscales via logical disaggregation of storage and compute, and manages state via physical colocation of caches with compute services. (3) Introduces the LDPC (logical disaggregation with physical colocation) design pattern.
Optimizing Prediction Serving on Low-Latency Serverless Dataflow [Paper]
- Sreekanti, Vikram, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, and Joseph M. Hellerstein.
- arXiv preprint arXiv:2007.05832 (2020).
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. [Paper]
- Gujarati, A., Karimi, R., Alzayat, S., Kaufmann, A., Vigfusson, Y. and Mace, J., 2020.
- OSDI 2020
Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency [Paper]
- Gujarati, Arpan, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley, and Björn B. Brandenburg.
- In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pp. 109-120. 2017.
- Summary: a cloud autoscaler with (1) model-based autoscaling that takes into account SLAs and ML inference workload characteristics, (2) a distributed protocol that uses partial load information and prediction at frontends to provision new service instances, and (3) a backend self-decommissioning protocol for service instances.
Swift machine learning model serving scheduling: a region based reinforcement learning approach. [Paper] [GitHub]
- Qin, Heyang, Syed Zawad, Yanqi Zhou, Lei Yang, Dongfang Zhao, and Feng Yan.
- In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-23. 2019.
- Summary: because the configurations within a region are similar, system performance under all of them can be accurately estimated from the performance under one of them. A region-based DRL approach is designed for parallelism selection.
TorchServe is a flexible and easy to use tool for serving PyTorch models. [GitHub]
Seldon Core: Blazing Fast, Industry-Ready ML. An open source platform to deploy your machine learning models on Kubernetes at massive scale. [GitHub]
MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving [Paper] [GitHub]
- Zhang, C., Yu, M., Wang, W. and Yan, F., 2019.
- In 2019 USENIX Annual Technical Conference (USENIX ATC 19) (pp. 1049-1062).
- Summary: addresses scalability and cost minimization for model serving on the public cloud.
Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (SOSP 2019) [Paper] [GitHub]
Nexus: a scalable and efficient serving system for DNN applications on GPU clusters (SOSP 2019) [Paper] [GitHub]
Deep Learning Inference Service at Microsoft [Paper]
- Soifer, J., et al. (OpML 2019)
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. [Paper]
- Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
Brusta: PyTorch model serving project [GitHub]
Model Server for Apache MXNet: a tool for serving neural net models for inference [GitHub]
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
- Baylor, Denis, et al. (KDD 2017)
TensorFlow-Serving: Flexible, High-Performance ML Serving [Paper] [GitHub]
- Olston, Christopher, et al.
IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
- Crankshaw, Daniel, et al. (NSDI 2017)
- Summary: adaptive batching.
InferLine: ML Inference Pipeline Composition Framework [Paper] [GitHub]
- Crankshaw, Daniel, et al. (SoCC 2020)
- Summary: an updated version of Clipper.
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
- Dakkak, Abdul, et al. (Preprint)
- Summary: addresses the model cold-start problem.
Rafiki: machine learning as an analytics service system [Paper] [GitHub]
- Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
- Summary: covers both training and inference. Automatic hyperparameter search for training; ensemble models for inference; uses DRL to balance the trade-off between accuracy and latency.
GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
Orkhon: ML Inference Framework and Server Runtime [GitHub]
NVIDIA/tensorrt-inference-server: The TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. [GitHub] [Slides: DEEP INTO TRTIS]
torchpipe: Ensemble Pipeline Serving with PyTorch Frontend. Boosts DL service throughput 1.5-4x through ensemble pipeline serving with concurrent CUDA streams, for PyTorch/LibTorch frontends and TensorRT/CVCUDA (among other) backends. [GitHub]
INFaaS: Automated Model-less Inference Serving [GitHub] [Paper]
- Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis (ATC 2021)
Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
- Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, and Christos Kozyrakis (SoCC 2021)
Scrooge: A Cost-Effective Deep Learning Inference System [Paper]
- Yitao Hu, Rajrup Ghosh, Ramesh Govindan
Apache PredictionIO® is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task [Website]
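
Several entries in this section (Clipper in particular) are summarized as adaptive batching. As a rough illustration only, the sketch below grows the batch size additively while a latency objective is met and cuts it multiplicatively after a violation (an AIMD-style policy). The class name, parameters, step sizes, and the toy linear latency model are all hypothetical and are not taken from any listed system's code.

```python
from dataclasses import dataclass


@dataclass
class AimdBatchSizer:
    """Hypothetical AIMD controller for the maximum batch size."""
    slo_ms: float          # latency objective for one batch
    batch_size: int = 1    # current maximum batch size
    increase: int = 2      # additive step while under the objective (illustrative)
    backoff: float = 0.5   # multiplicative cut after a violation (illustrative)

    def observe(self, batch_latency_ms: float) -> int:
        """Update the batch size from the latency of the last batch."""
        if batch_latency_ms <= self.slo_ms:
            self.batch_size += self.increase
        else:
            self.batch_size = max(1, int(self.batch_size * self.backoff))
        return self.batch_size


def simulate(sizer, per_item_ms, fixed_ms, steps):
    """Drive the controller with a toy linear latency model (an assumption)."""
    history = []
    for _ in range(steps):
        latency = fixed_ms + per_item_ms * sizer.batch_size
        history.append(sizer.observe(latency))
    return history


sizer = AimdBatchSizer(slo_ms=20.0)
history = simulate(sizer, per_item_ms=1.0, fixed_ms=5.0, steps=12)
print(history)
```

In this toy run the batch size climbs until a batch exceeds the 20 ms objective, is halved, and then climbs again, so it hovers near the largest size that still meets the objective.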

Cache for Inference

Kumar, Adarsh, et al. "Accelerating deep learning inference via freezing." 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19). 2019. [Paper]
Xu, Mengwei, et al. "DeepCache: Principled cache for mobile deep vision." Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. 2018. [Paper]
Park, Keunyoung, and Doo-Hyun Kim. "Accelerating image classification using feature map similarity in convolutional neural networks." Applied Sciences 9.1 (2019): 108. [Paper]
Cavigelli, Lukas, and Luca Benini. "CBinfer: Exploiting frame-to-frame locality for faster convolutional network inference on video streams." IEEE Transactions on Circuits and Systems for Video Technology (2019). [Paper]

Inference Optimization

Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics [Paper]
- Daniel Kang, Ankit Mathur, Teja Veeramacheneni, Peter Bailis, Matei Zaharia
- VLDB 2021
Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference. [arXiv] [GitHub]
- Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia.
- arXiv Preprint. 2019.
TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs and deep learning accelerators. [GitHub]
Dynamic Space-Time Scheduling for GPU Inference [Paper] [GitHub]
- Jain, Paras, et al. (NeurIPS 2018, Systems for ML Workshop)
- Summary: optimization for GPU Multi-tenancy
Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
- Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (Ongoing)
Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
- D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
- Summary: the system, HiveMind, assumes that its input models are grouped into model batches amenable to co-optimization and co-execution. It consists of a compiler and a runtime.
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
- Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research (ATC 2018)

Cluster Management for Inference (currently covers only multi-tenancy)

Ease.ml: Towards multi-tenant resource sharing for machine learning workloads [Paper] [GitHub] [Demo]
- Li, Tian, et al.
- Proceedings of the VLDB Endowment 11.5 (2018): 607-620.
Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models [Paper]
- LeMay, Matthew, Shijian Li, and Tian Guo.
- arXiv preprint arXiv:1912.02322 (2019).

Machine Learning Compiler

Hummingbird: a library for compiling trained traditional ML models into tensor computations. It lets users leverage neural network frameworks (such as PyTorch) to accelerate traditional ML models. [GitHub]
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube] [Project Website]
- Chen, Tianqi, et al. (OSDI 2018)
- Summary: Automated optimization is very impressive: cost model (rank objective function) + schedule explorer (parallel simulated annealing)
Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
Tensorflow/mlir: "Multi-Level Intermediate Representation" Compiler Infrastructure [GitHub] [Video]
PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]
TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions [Paper] [GitHub]
- Jia, Zhihao, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. (SOSP 2019)
- Experiments tested on TVM and XLA
SGLang: manages the KV cache through RadixAttention [Paper] [GitHub]
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
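
The SGLang entry mentions managing the KV cache through radix attention. The sketch below illustrates only the general idea of prefix sharing: requests that begin with the same tokens can reuse cached entries for that shared prefix. It uses a plain per-token trie rather than a compressed radix tree, stores placeholder tuples instead of real key/value tensors, and has no eviction; every class and method name here is hypothetical and is not SGLang's API.

```python
class PrefixNode:
    """One token position in the hypothetical prefix trie."""

    def __init__(self):
        self.children = {}  # token id -> PrefixNode
        self.kv = None      # placeholder for this token's cached KV entry


class PrefixCache:
    """Toy prefix-sharing cache keyed by token sequences."""

    def __init__(self):
        self.root = PrefixNode()

    def match(self, tokens):
        """Return how many leading tokens already have cached entries."""
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Cache entries for tokens; return how many had to be newly created."""
        node, created = self.root, 0
        for token in tokens:
            if token not in node.children:
                child = PrefixNode()
                child.kv = ("kv", token)  # stand-in for real KV tensors
                node.children[token] = child
                created += 1
            node = node.children[token]
        return created


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]                       # shared prefix (made-up token ids)
first = cache.insert(system_prompt + [10, 11])     # nothing cached yet
reused = cache.match(system_prompt + [20, 21])     # shared prefix is found
second = cache.insert(system_prompt + [20, 21])    # only the new suffix is added
print(first, reused, second)
```

Here the second request reuses the four cached prefix tokens and only creates entries for its two new tokens.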
