Inference System

System for machine learning inference.

Benchmark

  • Wanling Gao, Fei Tang, Jianfeng Zhan, et al. "AIBench: A Datacenter AI Benchmark Suite, BenchCouncil". [Paper] [Website]
  • BaiduBench: Benchmarking Deep Learning operations on different hardware. [GitHub]
  • Reddi, Vijay Janapa, et al. "Mlperf inference benchmark." arXiv preprint arXiv:1911.02549 (2019). [Paper] [GitHub]
  • Bianco, Simone, et al. "Benchmark analysis of representative deep neural network architectures." IEEE Access 6 (2018): 64270-64277. [Paper]
  • Almeida, Mario, et al. "EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices." The 3rd International Workshop on Deep Learning for Mobile Systems and Applications. 2019. [Paper]

Model Management

  • Model Card Toolkit. The Model Card Toolkit (MCT) streamlines and automates generation of Model Cards [1], machine learning documents that provide context and transparency into a model's development and performance. [Paper] [GitHub]
  • DLHub: Model and data serving for science. [Paper]
    • Chard, R., Li, Z., Chard, K., Ward, L., Babuji, Y., Woodard, A., Tuecke, S., Blaiszik, B., Franklin, M. and Foster, I., 2019, May.
    • In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 283-292). IEEE.
  • Publishing and Serving Machine Learning Models with DLHub. [Paper]
  • TRAINS - Auto-Magical Experiment Manager & Version Control for AI [GitHub]
  • ModelDB: A system to manage ML models [GitHub] [MIT short paper]
  • iterative/dvc: Data & models versioning for ML projects, make them shareable and reproducible [GitHub]

Model Serving

  • Announcing RedisAI 1.0: AI Serving Engine for Real-Time Applications [Blog]
  • Cloudburst: Stateful Functions-as-a-Service. [Paper] [GitHub]
    • Vikram Sreekanti, Chenggang Wu, Xiayue Charles Lin, Johann Schleier-Smith, Joseph E. Gonzalez, Joseph M. Hellerstein, Alexey Tumanov
    • VLDB 2020
    • A stateful FaaS platform. (1) feasibility of general-purpose stateful serverless computing. (2) Autoscaling via logical disaggregation of storage and compute, state management via physical colocation of caches with compute services. (3) LDPC design pattern
  • Optimizing Prediction Serving on Low-Latency Serverless Dataflow [Paper]
    • Sreekanti, Vikram, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, and Joseph M. Hellerstein.
    • arXiv preprint arXiv:2007.05832 (2020).
  • Serving DNNs like Clockwork: Performance Predictability from the Bottom Up. [Paper]
    • Gujarati, A., Karimi, R., Alzayat, S., Kaufmann, A., Vigfusson, Y. and Mace, J., 2020.
    • OSDI 2020
  • Swayam: distributed autoscaling to meet SLAs of machine learning inference services with resource efficiency [Paper]
    • Gujarati, Arpan, Sameh Elnikety, Yuxiong He, Kathryn S. McKinley, and Björn B. Brandenburg.
    • In Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, pp. 109-120. 2017.
    • Summary: a cloud autoscaler with (1) model-based autoscaling that takes into account SLAs and ML inference workload characteristics, (2) a distributed protocol that uses partial load information and prediction at frontends to provision new service instances, and (3) a backend self-decommissioning protocol for service instances.
  • Swift machine learning model serving scheduling: a region based reinforcement learning approach. [Paper] [GitHub]
    • Qin, Heyang, Syed Zawad, Yanqi Zhou, Lei Yang, Dongfang Zhao, and Feng Yan.
    • In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-23. 2019.
    • Summary: The system performance under similar configurations within a region can be accurately estimated from the performance under one of those configurations, due to their similarity. Region-based DRL is designed for parallelism selection.
  • TorchServe is a flexible and easy to use tool for serving PyTorch models. [GitHub]
  • Seldon Core: Blazing Fast, Industry-Ready ML. An open source platform to deploy your machine learning models on Kubernetes at massive scale. [GitHub]
  • MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving [Paper] [GitHub]
    • Zhang, C., Yu, M., Wang, W. and Yan, F., 2019.
    • In 2019 USENIX Annual Technical Conference (USENIX ATC 19) (pp. 1049-1062).
    • Summary: addresses scalability and cost minimization for model serving on the public cloud.
  • Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (SOSP 2019) [Paper] [GitHub]
  • Nexus: a scalable and efficient serving system for DNN applications on GPU clusters (SOSP 2019) [Paper] [GitHub]
  • Deep Learning Inference Service at Microsoft [Paper]
    • J Soifer, et al. (OptML2019)
  • PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. [Paper]
    • Lee, Y., Scolari, A., Chun, B.G., Santambrogio, M.D., Weimer, M. and Interlandi, M., 2018. (OSDI 2018)
  • Brusta: PyTorch model serving project [GitHub]
  • Model Server for Apache MXNet: Model Server for Apache MXNet is a tool for serving neural net models for inference [GitHub]
  • TFX: A TensorFlow-Based Production-Scale Machine Learning Platform [Paper] [Website] [GitHub]
    • Baylor, Denis, et al. (KDD 2017)
  • Tensorflow-serving: Flexible, high-performance ml serving [Paper] [GitHub]
    • Olston, Christopher, et al.
  • IntelAI/OpenVINO-model-server: Inference model server implementation with gRPC interface, compatible with TensorFlow serving API and OpenVINO™ as the execution backend. [GitHub]
  • Clipper: A Low-Latency Online Prediction Serving System [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (NSDI 2017)
    • Summary: adaptive batching
  • InferLine: ML Inference Pipeline Composition Framework [Paper] [GitHub]
    • Crankshaw, Daniel, et al. (SoCC 2020)
    • Summary: an updated version of Clipper
  • TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments [Paper]
    • Dakkak, Abdul, et al. (Preprint)
    • Summary: model cold start problem
  • Rafiki: machine learning as an analytics service system [Paper] [GitHub]
    • Wang, Wei, Jinyang Gao, Meihui Zhang, Sheng Wang, Gang Chen, Teck Khim Ng, Beng Chin Ooi, Jie Shao, and Moaz Reyad.
    • Summary: Covers both training and inference. Automatic hyperparameter search for training. Ensemble models for inference. Uses DRL to balance the trade-off between accuracy and latency.
  • GraphPipe: Machine Learning Model Deployment Made Simple [GitHub]
  • Orkhon: ML Inference Framework and Server Runtime [GitHub]
  • NVIDIA/tensorrt-inference-server: The TensorRT Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. [GitHub] [Slides: DEEP INTO TRTIS]
  • torchpipe: Ensemble Pipeline Serving with Pytorch Frontend. Boosting DL Service Throughput 1.5-4x by Ensemble Pipeline Serving with Concurrent CUDA Streams for PyTorch/LibTorch Frontend and TensorRT/CVCUDA, etc., Backends. [GitHub]
  • INFaaS: Automated Model-less Inference Serving [GitHub] [Paper]
    • Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis (ATC 2021)
  • Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
    • Francisco Romero, Mark Zhao, Neeraja J. Yadwadkar, and Christos Kozyrakis (SoCC 2021)
  • Scrooge: A Cost-Effective Deep Learning Inference System [Paper]
    • Yitao Hu, Rajrup Ghosh, Ramesh Govindan
  • Apache PredictionIO® is an open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task [Website]
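
The adaptive batching idea noted in the Clipper summary above can be illustrated with a small sketch. It assumes an additive-increase/multiplicative-decrease (AIMD) rule that grows the batch size while a latency objective is met and backs off when it is exceeded. The function names, the backoff factor, and the linear latency model are all hypothetical and chosen only for illustration; this is not Clipper's actual code.

```python
def aimd_batch_size(measure_latency, slo_ms, steps=50, start=1, add=1, backoff=0.9):
    """Probe batch sizes with an AIMD rule and return every size that was tried.

    measure_latency: callable mapping a batch size to an observed latency in ms
                     (hypothetical stand-in for timing a real model).
    slo_ms:          latency objective that a batch must not exceed.
    """
    size = start
    history = []
    for _ in range(steps):
        history.append(size)
        if measure_latency(size) <= slo_ms:
            size += add                          # additive increase while under the SLO
        else:
            size = max(1, int(size * backoff))   # multiplicative decrease on a miss
    return history


def fake_latency(batch_size):
    # Assumed synthetic cost model: 5 ms fixed overhead plus 2 ms per item.
    return 5.0 + 2.0 * batch_size


history = aimd_batch_size(fake_latency, slo_ms=25.0)
print(history[:15])
```

With this synthetic model the probe climbs until a batch misses the objective, then oscillates just below that point, which is the behavior an adaptive batcher relies on to track the largest batch that still meets the latency target.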

Cache for Inference

  • Kumar, Adarsh, et al. "Accelerating deep learning inference via freezing." 11th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 19). 2019. [Paper]
  • Xu, Mengwei, et al. "DeepCache: Principled cache for mobile deep vision." Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. 2018. [Paper]
  • Park, Keunyoung, and Doo-Hyun Kim. "Accelerating image classification using feature map similarity in convolutional neural networks." Applied Sciences 9.1 (2019): 108. [Paper]
  • Cavigelli, Lukas, and Luca Benini. "CBinfer: Exploiting frame-to-frame locality for faster convolutional network inference on video streams." IEEE Transactions on Circuits and Systems for Video Technology (2019). [Paper]

Inference Optimization

  • Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics [Paper]
    • Daniel Kang, Ankit Mathur, Teja Veeramacheneni, Peter Bailis, Matei Zaharia
    • VLDB 2021
  • Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference. [arXiv] [GitHub]
    • Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia.
    • arXiv Preprint. 2019.
  • TensorRT is a C++ library that facilitates high performance inference on NVIDIA GPUs and deep learning accelerators. [GitHub]
  • Dynamic Space-Time Scheduling for GPU Inference [Paper] [GitHub]
    • Jain, Paras, et al. (NeurIPS Systems for ML Workshop 2018)
    • Summary: optimization for GPU multi-tenancy
  • Dynamic Scheduling For Dynamic Control Flow in Deep Learning Systems [Paper]
    • Wei, Jinliang, Garth Gibson, Vijay Vasudevan, and Eric Xing. (Ongoing)
  • Accelerating Deep Learning Workloads through Efficient Multi-Model Execution. [Paper]
    • D. Narayanan, K. Santhanam, A. Phanishayee and M. Zaharia. (NeurIPS Systems for ML Workshop 2018)
    • Summary: The system, HiveMind, assumes its input is a set of models grouped into model batches that are amenable to co-optimization and co-execution. It comprises a compiler and a runtime.
  • DeepCPU: Serving RNN-based Deep Learning Models 10x Faster [Paper]
    • Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He, Microsoft AI and Research (ATC 2018)

Cluster Management for Inference (currently covers only multi-tenancy)

  • Ease.ml: Towards multi-tenant resource sharing for machine learning workloads [Paper] [GitHub] [Demo]
    • Li, Tian, et al.
    • Proceedings of the VLDB Endowment 11.5 (2018): 607-620.
  • Perseus: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models [Paper]
    • LeMay, Matthew, Shijian Li, and Tian Guo.
    • arXiv preprint arXiv:1912.02322 (2019).

Machine Learning Compiler

  • Hummingbird: Hummingbird is a library for compiling trained traditional ML models into tensor computations. Hummingbird allows users to seamlessly leverage neural network frameworks (such as PyTorch) to accelerate traditional ML models. [GitHub]
  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning [Paper] [YouTube] [Project Website]
    • Chen, Tianqi, et al. (OSDI 2018)
    • Summary: Automated optimization is very impressive: cost model (rank objective function) + schedule explorer (parallel simulated annealing)
  • Facebook TC: Tensor Comprehensions (TC) is a fully-functional C++ library to automatically synthesize high-performance machine learning kernels using Halide, ISL and NVRTC or LLVM. [GitHub]
  • Tensorflow/mlir: "Multi-Level Intermediate Representation" Compiler Infrastructure [GitHub] [Video]
  • PyTorch/glow: Compiler for Neural Network hardware accelerators [GitHub]
  • TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions [Paper] [GitHub]
    • Jia, Zhihao, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. (SOSP 2019)
    • Experiments tested on TVM and XLA
  • SGLang: Manages the KV cache through RadixAttention [Paper] [GitHub]
    • Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng
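
The prefix-reuse idea behind managing a KV cache with a radix structure, as in the SGLang entry above, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: it uses a plain per-token trie rather than a compressed radix tree, stores no real KV tensors, and omits eviction. The class and method names are hypothetical and are not SGLang's API.

```python
class PrefixCache:
    """Toy prefix index: tracks which token prefixes have already been processed."""

    def __init__(self):
        self.root = {}  # nested dicts: token -> child node

    def match_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record a processed token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                    # first request populates the cache
reused = cache.match_prefix([1, 2, 3, 9, 9])  # second request shares a 3-token prefix
print(reused)
```

In a real serving system the matched length would tell the engine how many tokens can skip prefill because their KV entries already exist; here it is only a count.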