
[Performance] Slowdown Caused by Gelu Fusion Removal #23491

Open · SuhwanSong opened this issue Jan 25, 2025 · 0 comments

Labels: performance (issues related to performance regressions)

SuhwanSong commented Jan 25, 2025

Describe the issue

Starting with commit 2cdc05f, ONNX Runtime (ORT) no longer performs Gelu fusion on this model, resulting in an approximately 4x end-to-end slowdown.

Bisect range: de7a02b .. 2cdc05f.
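
For context on what the fusion buys: ORT's optimizer can collapse a bias Add followed by a Gelu into the single com.microsoft.BiasGelu contrib op, so the Add's intermediate tensor is never written out and re-read. Below is a minimal NumPy sketch of the math the fused kernel corresponds to (a reference implementation of the exact, erf-based GELU; not ORT's actual kernel):

```python
import math
import numpy as np

# math.erf is scalar-only; vectorize it for array inputs.
_erf = np.vectorize(math.erf)

def gelu(x: np.ndarray) -> np.ndarray:
    # Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))

def bias_gelu(x: np.ndarray, bias: np.ndarray) -> np.ndarray:
    # What the fused BiasGelu op computes in one kernel: Gelu(x + bias),
    # instead of a standalone Add producing an intermediate tensor that a
    # separate Gelu kernel then has to re-read.
    return gelu(x + bias)
```

In the kernel timings below, de7a02b runs one BiasGelu kernel where 2cdc05f runs a separate Add and Gelu.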

Optimized model of de7a02b (graph contains a fused BiasGelu node):

[Image]

Optimized model of 2cdc05f (Add and Gelu remain as separate nodes):

[Image]
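
For anyone re-checking this, the optimized graphs above can be reproduced by asking ORT to serialize the model after graph optimizations; a sketch, assuming the paths used in the repro script below:

```python
import onnx
import onnxruntime

# Write out the graph as it looks after ORT's optimization passes.
sess_options = onnxruntime.SessionOptions()
sess_options.optimized_model_filepath = "model.optimized.onnx"
onnxruntime.InferenceSession("model.onnx", sess_options,
                             providers=["CPUExecutionProvider"])

# Inspect which op types survived optimization: a de7a02b build shows a
# fused BiasGelu, while 2cdc05f leaves separate Add and Gelu nodes.
optimized = onnx.load("model.optimized.onnx")
print(sorted({node.op_type for node in optimized.graph.node}))
```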

Performance Comparison

| Key                         | de7a02b | 2cdc05f | Ratio  |
|-----------------------------|--------:|--------:|-------:|
| model_loading_uri           |     611 |     603 | 0.9869 |
| session_initialization      |    4256 |    4236 | 0.9953 |
| /m4/MatMul_kernel_time      |  616211 |  531171 | 0.8623 |
| /m4/Add_kernel_time         |         | 4973509 |        |
| BiasGelu_kernel_time        |  513038 |         |        |
| Gelu_kernel_time            |         |  171279 |        |
| SequentialExecutor::Execute | 1193568 | 5778856 | 4.8418 |
| model_run                   | 1223691 | 5796766 | 4.7372 |

(Ratio = 2cdc05f / de7a02b; blank cells mark kernels that do not exist in that build's optimized graph.)
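
The kernel-level numbers in the table come from ORT's built-in profiler; a sketch of how such timings can be collected (the profiler writes a Chrome-trace JSON file whose durations are in microseconds):

```python
import json
import numpy as np
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.enable_profiling = True  # dump a Chrome-trace JSON profile

session = onnxruntime.InferenceSession("model.onnx", sess_options,
                                       providers=["CPUExecutionProvider"])
input_data = np.load("input.npy", allow_pickle=True).item()
session.run(None, input_data)

# end_profiling() stops the profiler and returns the trace file's path.
profile_path = session.end_profiling()
with open(profile_path) as f:
    for event in json.load(f):
        if event.get("cat") == "Node":  # per-kernel *_kernel_time entries
            print(event["name"], event["dur"])
```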

To reproduce

  1. Download and unzip "model.zip".
  2. Run the following script.
```python
import time
import onnxruntime
import numpy as np

# Set the random seed
np.random.seed(0)

onnx_model_path = 'model.onnx'

# Load the ONNX model with the CPUExecutionProvider
ort_session = onnxruntime.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
ort_session.get_modelmeta()
inputs = ort_session.get_inputs()

nth = 100000

# Warm-up inference to cache optimizations
input_data = np.load("input.npy", allow_pickle=True).item()
ort_session.run(None, input_data)

# Measure inference time excluding input creation
total_time_ns = 0
for _ in range(nth):
    start_ns = time.perf_counter_ns()
    ort_session.run(None, input_data)
    end_ns = time.perf_counter_ns()
    total_time_ns += end_ns - start_ns

avg_time_ns = total_time_ns / nth
avg_time_ms = avg_time_ns / 1e6

print(f'[{onnxruntime.__version__}] Average inference time: {avg_time_ms:.5f} ms')
```
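
A side note on measurement stability: when comparing the two commits, it may help to pin the intra-op thread pool and state the optimization level explicitly, so both builds are configured identically; a hedged variant of the session setup above:

```python
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = 1  # fixed thread count for stable timings
# ORT_ENABLE_ALL is the default; stating it rules out configuration drift.
sess_options.graph_optimization_level = \
    onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

ort_session = onnxruntime.InferenceSession(
    "model.onnx", sess_options, providers=["CPUExecutionProvider"])
```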

Urgency

No response

Platform

Linux

OS Version

6.8.0

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.20.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

model.zip

Is this a quantized model?

No
