Avoid unnecessary __len__ check by using is None for tokenizer #150

Open · wants to merge 1 commit into main

Conversation

pjs102793

This PR replaces if not tokenizer: with if tokenizer is None: to avoid unnecessary calls to the tokenizer's __len__ method, improving efficiency.

# PyTorch GPU
>>> sat_sm = SaT("sat-3l-sm")
>>> sat_sm.half().to("cuda") # optional: run in half precision on GPU
>>> text = "this is a test this is another test"
>>> %timeit list(sat_sm.split(text))
# 52.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR Patch PyTorch GPU
>>> %timeit list(sat_sm.split(text))
# 2.45 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# ONNX CPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(text))
# 59.3 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR Patch ONNX CPU
>>> %timeit list(model_ort.split(text))
# 14.7 ms ± 699 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
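The patch itself is a one-line condition change. A minimal sketch of the pattern follows; the surrounding splitting logic shown here is illustrative only, not the actual wtpsplit code:

# Before: truthiness check. Hugging Face tokenizers define __len__ but not
# __bool__, so Python calls tokenizer.__len__() just to evaluate the condition.
if not tokenizer:
    tokens = text.split()               # illustrative fallback path
else:
    tokens = tokenizer.tokenize(text)   # illustrative tokenized path

# After (this PR): identity check against None; no special method is invoked.
if tokenizer is None:
    tokens = text.split()
else:
    tokens = tokenizer.tokenize(text)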

When executing if not tokenizer:, Python evaluates the tokenizer's truthiness; because the tokenizer class defines __len__ but no __bool__, this falls back to the __len__ special method, incurring a fixed overhead of approximately 40 ms per call.

To avoid this overhead, we use a direct None check (if tokenizer is None:) to determine whether a tokenizer was provided, eliminating this bottleneck.
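For a self-contained illustration of the Python behaviour involved, here is a toy class (not the real tokenizer) whose only special method is __len__:

>>> class ToyTokenizer:
...     def __len__(self):
...         print("__len__ called")   # stands in for the expensive vocab-size computation
...         return 30000
...
>>> tok = ToyTokenizer()
>>> not tok        # no __bool__ defined, so truthiness falls back to __len__
__len__ called
False
>>> tok is None    # identity check: no special method is invoked
False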
