Avoid unnecessary __len__ check by using is None for tokenizer #150

Open · wants to merge 1 commit into main

Conversation

pjs102793

This PR replaces if not tokenizer: with if tokenizer is None: to avoid unnecessary calls to the tokenizer's __len__ method, improving efficiency.

# PyTorch GPU
>>> sat_sm = SaT("sat-3l-sm")
>>> sat_sm.half().to("cuda") # optional: run in half precision on GPU
>>> text = "this is a test this is another test"
>>> %timeit list(sat_sm.split(text))
# 52.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR Patch PyTorch GPU
>>> %timeit list(sat_sm.split(text))
# 2.45 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# ONNX CPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(text))
# 59.3 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR Patch ONNX CPU
>>> %timeit list(model_ort.split(text))
# 14.7 ms ± 699 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
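The patch itself is a one-line condition change. A minimal sketch of the pattern follows; the surrounding splitting logic shown here is illustrative only, not the actual wtpsplit code:

# Before: truthiness check. Hugging Face tokenizers define __len__ but not
# __bool__, so Python calls tokenizer.__len__() just to evaluate the condition.
if not tokenizer:
    tokens = text.split()               # illustrative fallback path
else:
    tokens = tokenizer.tokenize(text)   # illustrative tokenized path

# After (this PR): identity check against None; no special method is invoked.
if tokenizer is None:
    tokens = text.split()
else:
    tokens = tokenizer.tokenize(text)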

When executing if not tokenizer:, Python evaluates the tokenizer's truthiness; because the tokenizer class defines __len__ but no __bool__, this falls back to the __len__ special method, incurring a fixed overhead of approximately 40 ms per call.

To avoid this overhead, we use a direct None check (if tokenizer is None:) to determine whether a tokenizer was provided, eliminating this bottleneck.
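For a self-contained illustration of the Python behaviour involved, here is a toy class (not the real tokenizer) whose only special method is __len__:

>>> class ToyTokenizer:
...     def __len__(self):
...         print("__len__ called")   # stands in for the expensive vocab-size computation
...         return 30000
...
>>> tok = ToyTokenizer()
>>> not tok        # no __bool__ defined, so truthiness falls back to __len__
__len__ called
False
>>> tok is None    # identity check: no special method is invoked
False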
