LP - PyTorch model optimisations for mobile #1317

@@ -0,0 +1,42 @@
---
title: Create, train and optimise a PyTorch model for digit classification on Android

minutes_to_complete: 80

who_is_this_for: This is an introductory topic for software developers interested in learning how to optimise a PyTorch model for inference on Android.

learning_objectives:
- Optimise a neural network architecture using quantization and fusing.
- Use an optimised model in an Android app.

prerequisites:
- A computer that can run Python3, Visual Studio Code, and Android Studio. The OS can be Windows, Linux, or macOS.

author_primary: Dawid Borycki

### Tags
skilllevels: Introductory
subjects: ML
armips:
- Cortex-A
- Cortex-X
- Neoverse
operatingsystems:
- Windows
- Linux
- macOS
tools_software_languages:
- Android Studio
- Coding
shared_path: true
shared_between:
- servers-and-cloud-computing
- laptops-and-desktops
- smartphones-and-mobile

### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
layout: "learningpathall" # All files under learning paths have this same wrapper
learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
---
@@ -0,0 +1,42 @@
---
# ================================================================================
# Edit
# ================================================================================

next_step_guidance: >
    Proceed to Use Keras Core with TensorFlow, PyTorch, and JAX backends to continue exploring Machine Learning.

# 1-3 sentence recommendation outlining how the reader can generally keep learning about these topics, and a specific explanation of why the next step is being recommended.

recommended_path: "/learning-paths/servers-and-cloud-computing/keras-core/"

# Link to the next learning path being recommended(For example this could be /learning-paths/servers-and-cloud-computing/mongodb).


# further_reading links to references related to this path. Can be:
# Manuals for a tool / software mentioned (type: documentation)
# Blog about related topics (type: blog)
# General online references (type: website)

further_reading:
    - resource:
        title: PyTorch
        link: https://pytorch.org
        type: documentation
    - resource:
        title: MNIST
        link: https://en.wikipedia.org/wiki/MNIST_database
        type: website
    - resource:
        title: Visual Studio Code
        link: https://code.visualstudio.com
        type: website


# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
weight: 21 # set to always be larger than the content in this path, and one more than 'review'
title: "Next Steps" # Always the same
layout: "learningpathall" # All files under learning paths have this same wrapper
---
@@ -0,0 +1,54 @@
---
# ================================================================================
# Edit
# ================================================================================

# Always 3 questions. Should try to test the reader's knowledge, and reinforce the key points you want them to remember.
# question: A one sentence question
# answers: The correct answers (from 2-4 answer options only). Should be surrounded by quotes.
# correct_answer: An integer indicating what answer is correct (index starts from 1)
# explanation: A short (1-3 sentence) explanation of why the correct answer is correct. Can add additional context if desired


review:
    - questions:
        question: >
            What is the purpose of layer fusion in neural network optimization?
        answers:
            - To increase the size of the model
            - To combine multiple layers into one to reduce computational overhead
            - To add dropout to the model
            - To improve the training process
        correct_answer: 2
        explanation: >
            Layer fusion combines multiple operations, such as linear layers and activation functions, into one, reducing the computational overhead and improving inference performance.
    - questions:
        question: >
            How does quantization improve the performance of models on mobile devices?
        answers:
            - By increasing the precision of the model’s weights
            - By reducing the size and precision of weights and activations
            - By enabling the model to use more layers
            - By increasing the batch size
        correct_answer: 2
        explanation: >
            Quantization reduces the precision of the model’s weights and activations (e.g., from float32 to int8), making the model smaller and faster without significant loss in accuracy.
    - questions:
        question: >
            Why are dropout layers typically removed during inference?
        answers:
            - Dropout layers are only useful for improving memory usage
            - Dropout layers are not compatible with mobile devices
            - Dropout layers are used during training to prevent overfitting and are unnecessary during inference
            - Dropout layers increase the inference time
        correct_answer: 3
        explanation: >
            Dropout layers are used during training to prevent overfitting by randomly deactivating neurons, but they are not necessary during inference, where the entire model is used for prediction.

# ================================================================================
# FIXED, DO NOT MODIFY
# ================================================================================
title: "Review" # Always the same title
weight: 20 # Set to always be larger than the content in this path
layout: "learningpathall" # All files under learning paths have this same wrapper
---
@@ -0,0 +1,26 @@
---
# User change
title: "Optimising neural network models in PyTorch"

weight: 2

layout: "learningpathall"
---

In the realm of machine learning (ML) for edge and mobile inference, optimizing models is crucial to achieving efficient performance while minimizing resource consumption. As mobile and edge devices often have limited computational power, memory, and energy availability, various strategies are employed to ensure that ML models can run effectively in these constrained environments.

**Quantization** is one of the most widely used techniques. It reduces the precision of the model's weights and activations from floating-point to lower-bit representations, such as int8 or float16, which not only shrinks the model size but also accelerates inference on hardware that supports lower-precision arithmetic.
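
As a minimal sketch of what this looks like in PyTorch (the architecture below is illustrative, not the exact model used later in this Learning Path), dynamic quantization can be applied to a network's linear layers in a few lines:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a digit-classification network; the real
# model in this Learning Path may use different layer sizes.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 96),
    nn.ReLU(),
    nn.Linear(96, 10),
).eval()

# Dynamic quantization: weights of the listed module types are stored
# as int8, and activations are quantized on the fly during inference.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    logits = quantized_model(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```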

Another key optimization strategy is **layer fusion**, in which multiple operations, such as a linear layer and its subsequent activation function (like ReLU), are combined into a single layer. This reduces the number of operations that need to be executed during inference, minimizing latency and improving throughput.
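
The sketch below shows eager-mode fusion of a linear layer with its ReLU activation; the module names and sizes are again illustrative:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Illustrative network; the real model in this path may differ.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 96)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(96, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.relu1(self.fc1(x))
        return self.fc2(x)

model = TinyNet().eval()  # fusion for inference requires eval mode

# Fuse the linear layer and its ReLU into a single LinearReLU module;
# the original relu1 attribute is replaced by an Identity.
fused = torch.ao.quantization.fuse_modules(model, [["fc1", "relu1"]])
print(fused.fc1)
```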

In addition to these techniques, **pruning**, which removes less important weights or neurons from the model, can produce a leaner model that requires fewer resources without significantly affecting accuracy.
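
Pruning is not applied in this Learning Path, but as a brief illustration, PyTorch's `torch.nn.utils.prune` utilities can zero out low-magnitude weights like this:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(784, 96)

# Zero out the 30% of weights with the smallest L1 magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent by removing the re-parametrization hooks.
prune.remove(layer, "weight")
print(float((layer.weight == 0).float().mean()))  # roughly 0.3
```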

Finally, leveraging hardware-specific optimizations, such as the **Android Neural Networks API (NNAPI)**, allows developers to take full advantage of the hardware acceleration available on edge devices. By employing these strategies, developers can significantly enhance the efficiency of ML models for deployment on mobile and edge platforms, striking a balance between performance and resource utilization.

PyTorch offers robust support for all of these techniques. Its quantization toolkit provides a streamlined workflow for both static and dynamic quantization, allowing developers to reduce model size and improve inference speed without sacrificing accuracy. The `torch.quantization` module also performs layer fusion, for example fusing linear layers with their activation functions, which minimizes computational overhead. TorchScript lets you create serializable, optimizable models that can be deployed efficiently on mobile devices, and PyTorch's integration with hardware acceleration libraries, such as NNAPI on Android, makes it possible to target the specific capabilities of a device's architecture. Together, these features form a comprehensive ecosystem for optimizing models for mobile and edge deployment, enhancing both speed and efficiency.
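
For instance, a model can be scripted with TorchScript and passed through PyTorch's mobile optimizer before being bundled into an Android app; the model definition below is a stand-in for the trained network:

```python
import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Illustrative model; in practice this would be the trained
# digit-classification network.
model = nn.Sequential(
    nn.Flatten(), nn.Linear(784, 96), nn.ReLU(), nn.Linear(96, 10)
).eval()

# TorchScript makes the model serializable and optimizable.
scripted = torch.jit.script(model)

# optimize_for_mobile applies mobile-oriented graph passes (for example,
# dropout removal and operator fusion) before deployment.
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("model_optimized.ptl")
```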

In this Learning Path, we will delve into the techniques of **quantization** and **fusion** using our previously created neural network model for [digit classification](/learning-paths/cross-platform/pytorch-digit-classification-arch-training/). By applying quantization, we will reduce the model's weight precision, transitioning from floating-point representations to lower-bit formats, which not only minimizes the model size but also enhances inference speed. This process is crucial for optimizing our model for deployment on resource-constrained devices.

Additionally, we will explore layer fusion, which combines multiple operations within the model—such as fusing linear layers with their subsequent activation functions—into a single operation. This reduction in operational complexity further streamlines the model, leading to improved performance during inference. By implementing these optimizations, we aim to enhance the efficiency of our digit classification model, making it well-suited for deployment in mobile and edge environments.
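
To make the combined workflow concrete, here is a sketch of eager-mode static quantization with fusion. The module names, shapes, and calibration data are illustrative; the scripts developed in this Learning Path may differ in detail:

```python
import torch
import torch.nn as nn

# Select the quantized-kernel backend; qnnpack is the one used on Arm CPUs.
torch.backends.quantized.engine = "qnnpack"

class QuantTinyNet(nn.Module):
    # Illustrative model wrapped in quant/dequant stubs, as required by
    # eager-mode static quantization.
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(784, 96)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(96, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(torch.flatten(x, 1))
        x = self.relu1(self.fc1(x))
        return self.dequant(self.fc2(x))

model = QuantTinyNet().eval()

# 1. Fuse linear + ReLU so the pair is quantized as a single unit.
model = torch.ao.quantization.fuse_modules(model, [["fc1", "relu1"]])

# 2. Attach a quantization configuration and insert observers.
model.qconfig = torch.ao.quantization.get_default_qconfig("qnnpack")
prepared = torch.ao.quantization.prepare(model)

# 3. Calibrate with representative data, then convert weights to int8.
with torch.no_grad():
    prepared(torch.randn(8, 1, 28, 28))
quantized = torch.ao.quantization.convert(prepared)
```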

First, we will modify our previous Python scripts for [both training and inference](/learning-paths/cross-platform/pytorch-digit-classification-arch-training/) to incorporate model optimizations like quantization and fusion. After adjusting the training pipeline to produce an optimized version of the model, we will also update our inference script to handle both the original and optimized models. Once these changes are made, we will modify the [Android app](pytorch-digit-classification-inference-android-app) to load either the original or optimized model based on user input, allowing us to switch between them dynamically. This setup will enable us to directly compare the inference speed of both models on the device, providing valuable insights into the performance benefits of model optimization techniques in real-world scenarios.
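
As a rough desktop-side preview of the on-device comparison, inference latency can be measured for both variants; this assumes the `model` and `quantized` objects from the earlier sketches:

```python
import time
import torch

def average_latency_ms(m: torch.nn.Module, runs: int = 100) -> float:
    # Average single-image inference time in milliseconds.
    x = torch.randn(1, 1, 28, 28)
    with torch.no_grad():
        m(x)  # warm-up
        start = time.perf_counter()
        for _ in range(runs):
            m(x)
    return (time.perf_counter() - start) / runs * 1e3

# `model` and `quantized` are assumed to come from the earlier sketches.
print(f"original:  {average_latency_ms(model):.2f} ms")
print(f"optimized: {average_latency_ms(quantized):.2f} ms")
```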