System for machine learning inference.
- Wanling Gao, Fei Tang, Jianfeng Zhan, et al. "AIBench: A Datacenter AI Benchmark Suite, BenchCouncil". [Paper] [Website]
- BaiduBench: Benchmarking Deep Learning operations on different hardware. [GitHub]
- Reddi, Vijay Janapa, et al. "MLPerf inference benchmark." arXiv preprint arXiv:1911.02549 (2019). [Paper] [GitHub]
- Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277. [Paper]
- Almeida, Mario, et al. "EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices." The 3rd International Workshop on Deep Learning for Mobile Systems and Applications. 2019. [Paper]
- Model Card Toolkit. The Model Card Toolkit (MCT) streamlines and automates generation of Model Cards [1], machine learning documents that provide context and transparency into a model's development and performance. [Paper] [GitHub]
- DLHub: Model and data serving for science. [Paper]
- Chard, R., Li, Z., Chard, K., Ward, L., Babuji, Y., Woodard, A., Tuecke, S., Blaiszik, B., Franklin, M. and Foster, I., 2019, May.
- In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 283-292). IEEE.
- Publishing and Serving Machine Learning Models with DLHub. [Paper]
- TRAINS - Auto-Magical Experiment Manager & Version Control for AI [GitHub]
- ModelDB: A system to manage ML models [GitHub] [MIT short paper]
- iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]
- Announcing RedisAI 1.0: AI Serving Engine for Real-Time Applications [Blog]
- Cloudburst: Stateful Functions-as-a-Service. [Paper] [GitHub]
- Vikram Sreekanti, Chenggang Wu, Xiayue Charles Lin, Johann Schleier-Smith, Joseph E. Gonzalez, Joseph M. Hellerstein, Alexey Tumanov
- VLDB 2020
- Summary: a stateful FaaS platform. (1) Demonstrates the feasibility of general-purpose stateful serverless computing. (2) Autoscales via logical disaggregation of storage and compute, and manages state via physical colocation of caches with compute services. (3) Names this design pattern LDPC (logical disaggregation with physical colocation).
- Optimizing Prediction Serving on Low-Latency Serverless Dataflow [Paper]
- Sreekanti, Vikram, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, and Joseph M. Hellerstein.
- arXiv preprint arXiv:2007.05832 (2020).
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. [Paper]
- Gujarati, A., Karimi, R., Alzayat, S., Kaufmann, A., Vigfusson, Y. and Mace, J., 2020.
- OSDI 2020
- Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency [Paper]
- Gujarati, Arpan, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley, and Björn B. Brandenburg.
- In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pp. 109-120. 2017.
- Summary: a cloud autoscaler. (1) model-based autoscaling that takes into account SLAs and ML inference workload characteristics, (2) a distributed protocol that uses partial load information and prediction at frontends to provision new service instances, and (3) a backend self-decommissioning protocol for service instances
- Swift machine learning model serving scheduling: a region based reinforcement learning approach. [Paper] [GitHub]
- Qin, Heyang, Syed Zawad, Yanqi Zhou, Lei Yang, Dongfang Zhao, and Feng Yan.
- In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-23. 2019.
- Summary: The system performance under similar configurations within a region can be accurately estimated from the performance under one of those configurations, because the configurations are similar. Region-based DRL is designed for parallelism selection.
- TorchServe is a flexible and easy to use tool for serving PyTorch models. [GitHub]
- Seldon Core: Blazing Fast, Industry-Ready ML. An open source platform to deploy your machine learning models on Kubernetes at massive scale. [GitHub]
- MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving [Paper] [GitHub]
- Zhang, C., Yu, M., Wang, W. and Yan, F., 2019.
- In 2019 USENIX Annual Technical Conference (USENIX ATC 19) (pp. 1049-1062).
- Summary: addresses scalability and cost minimization for model serving on the public cloud.
- Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (SOSP 2019) [Paper] [GitHub]
- Nexus: a scalable and efficient serving system for DNN applications on GPU clusters (SOSP 2019) [Paper] [GitHub]
- Deep Learning Inference Service at Microsoft [Paper]
- Soifer, J., et al. (OpML 2019)
- PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. [Paper]
- Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
- Brusta: PyTorch model serving project [GitHub]
- Model Server for Apache MXNet: a tool for serving neural net models for inference [GitHub]
- TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
- Baylor, Denis, et al. (KDD 2017)
- TensorFlow-Serving: Flexible, high-performance ML serving [Paper] [GitHub]
- Olston, Christopher, et al.
- IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
- Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
- Crankshaw, Daniel, et al. (NSDI 2017)
- Summary: adaptive batching
- InferLine: ML Inference Pipeline Composition Framework [Paper] [GitHub]
- Crankshaw, Daniel, et al. (SoCC 2020)
- Summary: an updated version of Clipper
- TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
- Dakkak, Abdul, et al. (Preprint)
- Summary: addresses the model cold start problem
- Rafiki: machine learning as an analytics service system [Paper] [GitHub]
- Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
- Summary: Covers both training and inference. Automatic hyperparameter search for training. Ensemble models for inference. Uses DRL to balance the trade-off between accuracy and latency.
- GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
- Orkhon: ML Inference Framework and Server Runtime [GitHub]
- NVIDIA/tensorrt-inference-server: The TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. [GitHub] [Slides: DEEP INTO TRTIS]
- torchpipe: Ensemble Pipeline Serving with PyTorch Frontend. Boosting DL Service Throughput 1.5-4x by Ensemble Pipeline Serving with Concurrent CUDA Streams for PyTorch/LibTorch Frontend and TensorRT/CVCUDA, etc., Backends. [GitHub]
- INFaaS: Automated Model-less Inference Serving [GitHub], [Paper]
- Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis (ATC 2021)
- Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
- Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, and Christos Kozyrakis (SoCC 2021)
- Scrooge: A Cost-Effective Deep Learning Inference System [Paper]
- Yitao Hu, Rajrup Ghosh, Ramesh Govindan
- Apache PredictionIO® is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task [Website]
- Kumar, Adarsh, et al. "Accelerating deep learning inference via freezing." 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19). 2019. [Paper]
- Xu, Mengwei, et al. "DeepCache: Principled cache for mobile deep vision." Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. 2018. [Paper]
- Park, Keunyoung, and Doo-Hyun Kim. "Accelerating image classification using feature map similarity in convolutional neural networks." Applied Sciences 9.1 (2019): 108. [Paper]
- Cavigelli, Lukas, and Luca Benini. "CBinfer: Exploiting frame-to-frame locality for faster convolutional network inference on video streams." IEEE Transactions on Circuits and Systems for Video Technology (2019). [Paper]
- Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics [Paper]
- Daniel Kang, Ankit Mathur, Teja Veeramacheneni, Peter Bailis, Matei Zaharia
- VLDB 2021
- Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference. [arXiv] [GitHub]
- Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia.
- arXiv Preprint. 2019.
- TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs and deep learning accelerators. [GitHub]
- Dynamic Space-Time Scheduling for GPU Inference [Paper] [GitHub]
- Jain, Paras, et al. (NIPS 2018, Systems for ML Workshop)
- Summary: optimization for GPU multi-tenancy
- Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
- Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (Ongoing)
- Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
- D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
- Summary: The system, HiveMind, assumes that its input models are grouped into model batches that are amenable to co-optimization and co-execution. Its components include a compiler and a runtime.
- DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
- Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research (ATC 2018)
- Ease.ml: Towards multi-tenant resource sharing for machine learning workloads [Paper] [GitHub] [Demo]
- Li, Tian, et al.
- Proceedings of the VLDB Endowment 11.5 (2018): 607-620.
- Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models [Paper]
- LeMay, Matthew, Shijian Li, and Tian Guo.
- arXiv preprint arXiv:1912.02322 (2019).
- Hummingbird: a library for compiling trained traditional ML models into tensor computations. Hummingbird allows users to seamlessly leverage neural network frameworks (such as PyTorch) to accelerate traditional ML models. [GitHub]
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube] [Project Website]
- Chen, Tianqi, et al. (OSDI 2018)
- Summary: Automated optimization is very impressive: cost model (rank objective function) + schedule explorer (parallel simulated annealing)
- Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
- Tensorflow/mlir: "Multi-Level Intermediate Representation" Compiler Infrastructure [GitHub] [Video]
- PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]
- TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions [Paper] [GitHub]
- Jia, Zhihao, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. (SOSP 2019)
- Experiments tested on TVM and XLA
- SGLang: Manages the KV cache through RadixAttention [Paper] [GitHub]
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng