[BUG] FIL: 4-5 times slow down for experiment fil compared to old implementation on GPU #6214
Comments
Thanks for the issue @Kaiyang-Chen, the model file would be extremely useful. @wphicks and @hcho3 might be the best people to help here, though I think they are not around until next week.
Yes, it will help us tremendously if you are able to share the model with us. Note: I am taking time off for the next two weeks, until Jan 25. I will be able to start troubleshooting the performance issue then.
model_20241106.txt
I'm happy to dig into this in more depth, but I'm almost certain I can give you an answer based on what we have here already. In addition to more fundamental changes, experimental FIL also updates the way we choose default hyperparameters. Original FIL selected those parameters based largely on implementation details, but experimental FIL defaults to hyperparameters that give the best throughput for large batches. At batch size 64, that's definitely going to give a significant performance degradation. As a quick test, experimental FIL offers the new [...] If you're still seeing a performance degradation, we can dig into this a lot more carefully. Thanks very much for the report! This is exactly the sort of thing we want to catch before promoting experimental FIL to stable.
Yes, I've tried tuning the parameters by hand, using a range of reasonable chunk_size values and both tree layouts (depth / width). Tuning does affect performance, but only within a 20-30% range. I cannot produce a result that is even close to the stable version. @wphicks
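For readers who want to repeat this kind of hand tuning, the sweep described above could be sketched roughly as follows. This is a minimal sketch, not the code used in this thread: `sweep` and `make_predict` are hypothetical names, the grid values are illustrative, and the stand-in model exists only so the example runs without a GPU. With FIL, `make_predict` would be replaced by something that loads or configures the model with the given chunk size and layout.

```python
from itertools import product
from time import perf_counter


def sweep(make_predict, batch, chunk_sizes, layouts, iterations=10, warmup=2):
    """Time every (chunk_size, layout) pair; return results fastest first.

    make_predict(chunk_size, layout) is a hypothetical factory returning a
    callable that runs inference on one batch with those settings.
    """
    timings = []
    for chunk_size, layout in product(chunk_sizes, layouts):
        predict = make_predict(chunk_size, layout)
        for _ in range(warmup):  # untimed warmup runs
            predict(batch)
        start = perf_counter()
        for _ in range(iterations):
            predict(batch)
        mean_elapsed = (perf_counter() - start) / iterations
        timings.append((mean_elapsed, chunk_size, layout))
    return sorted(timings)


if __name__ == '__main__':
    # Stand-in "model" (an assumption for illustration): it just sums the batch.
    def fake_factory(chunk_size, layout):
        return lambda batch: sum(batch)

    results = sweep(
        fake_factory,
        list(range(64)),
        chunk_sizes=[1, 2, 4, 8, 16, 32],
        layouts=['depth_first', 'breadth_first'],
    )
    for mean_elapsed, chunk_size, layout in results:
        print(f'{layout:>13} chunk_size={chunk_size:<2} {mean_elapsed:.2E}s')
```

Sorting by mean time makes it easy to see whether the best and worst settings differ by roughly the 20-30% reported here or by more.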
Okay, in that case, let's dig into it more systematically. Can you post your benchmarking code so I can try for an exact repro?
I was able to reproduce the regression with the code below. Very interesting! This is a domain (shallow trees, small batches, wide inputs) where experimental FIL has seen lower performance at times, but I haven't seen any other model where performance has suffered this much. I'll investigate further. Can you confirm that the code below at least generally matches how you performed your own benchmarks?

    import cupy as cp
    import logging
    import numpy as np
    import treelite
    from cuml import ForestInference as FIL
    from cuml.experimental.fil import ForestInference as FILEX
    from pandas import DataFrame
    from time import perf_counter

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)


    def load():
        # Load the same LightGBM model into stable FIL and experimental FIL
        tl_model = treelite.frontend.load_lightgbm_model('model.txt')
        return (
            FIL().load_from_treelite_model(
                tl_model,
                precision='float32'
            ),
            FILEX.load(
                'model.txt',
                precision='float32'
            )
        )


    def run(
        fil,
        filex,
        *,
        batch_size=None,
        min_batch_size=1,
        max_batch_size=131072,
        iterations=10,
        warmup_iterations=2,
        format='cupy',
        results=None
    ):
        if batch_size is None:
            batch_size = min_batch_size
        if results is None:
            results = {
                'batch_size': [],
                'FIL': [],
                'FILEX': []
            }
        results['batch_size'].append(batch_size)

        if format == 'cupy':
            xpy = cp
        elif format == 'numpy':
            xpy = np
        else:
            raise ValueError(f'Unsupported format: {format}')

        dtype = filex.forest.get_dtype()
        # TODO(wphicks): set range based on model.txt for each feature
        warmup_batches = xpy.random.uniform(
            xpy.finfo(dtype).min / 2,
            xpy.finfo(dtype).max / 2,
            size=(warmup_iterations, batch_size, filex.forest.num_features())
        )
        batches = xpy.random.uniform(
            xpy.finfo(dtype).min / 2,
            xpy.finfo(dtype).max / 2,
            size=(iterations, batch_size, filex.forest.num_features())
        )

        # Let experimental FIL tune its hyperparameters for this batch size
        filex.optimize(batch_size=batch_size)

        for name, model in (('FIL', fil), ('FILEX', filex)):
            for i in range(warmup_iterations):
                model.predict(warmup_batches[i])
            start = perf_counter()
            for i in range(iterations):
                model.predict(batches[i])
            elapsed = perf_counter() - start
            results[name].append(elapsed)
            logger.info(
                f'Run at batch size {batch_size} completed in'
                f' {elapsed:.2E}s with {name}'
            )

        # Bisect on batch size to find where FILEX starts to outperform FIL
        if results['FIL'][-1] < results['FILEX'][-1]:
            next_batch_size = batch_size + (
                (max_batch_size - batch_size) // 2
            )
            min_batch_size = batch_size
        else:
            logger.info(f'FILEX outperformed FIL at batch size {batch_size}')
            next_batch_size = batch_size - (
                (batch_size - min_batch_size) // 2
            )
            max_batch_size = batch_size

        if (
            next_batch_size < min_batch_size or
            next_batch_size >= max_batch_size or
            next_batch_size == batch_size
        ):
            return DataFrame.from_dict(results)
        else:
            return run(
                fil,
                filex,
                batch_size=next_batch_size,
                min_batch_size=min_batch_size,
                max_batch_size=max_batch_size,
                iterations=iterations,
                warmup_iterations=warmup_iterations,
                format=format,
                results=results
            )


    if __name__ == '__main__':
        fil, filex = load()
        df = run(fil, filex)
        # sort_values returns a new DataFrame, so keep the result
        df = df.sort_values(by='batch_size')
        print(df.to_csv(index=False))
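The batch-size search in the benchmark above is essentially a bisection for the point where experimental FIL overtakes stable FIL. Separated from the timing code, the idea might look like the sketch below. The function name `find_crossover` and the predicate `fil_is_faster` are hypothetical, and the sketch assumes the predicate is True below some threshold and False at or above it, which noisy timings will not always satisfy.

```python
def find_crossover(fil_is_faster, min_batch_size=1, max_batch_size=131072):
    """Return the smallest batch size at which fil_is_faster(b) is False.

    Assumes fil_is_faster is True below some threshold and False at or
    above it. Returns None if it is True across the whole range.
    """
    lo, hi = min_batch_size, max_batch_size
    if fil_is_faster(hi):
        return None  # no crossover inside the search range
    if not fil_is_faster(lo):
        return lo  # the other implementation already wins at the minimum
    # Invariant: fil_is_faster(lo) is True and fil_is_faster(hi) is False
    while hi - lo > 1:
        mid = lo + (hi - lo) // 2
        if fil_is_faster(mid):
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == '__main__':
    # Synthetic predicate for illustration only: pretend the crossover is at 1000.
    print(find_crossover(lambda batch_size: batch_size < 1000))
```

In a real benchmark the predicate would time both models at the given batch size, so each evaluation is expensive; bisection keeps the number of timed batch sizes to roughly log2 of the range.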
Yes, the procedure is similar. Two small differences: I only test up to batch size 500, and I use the C++ backend directly (which should not make a difference).
Another interesting thing: as you mentioned, the forest has relatively shallow trees, yet the width layout performs worse than the depth layout.
I'm not too surprised that breadth-first layout would perform worse for this depth. In general, we should get a slightly higher L2 cache hit rate starting around depth 4 for depth-first layout, though that is not always the determinant of performance for a whole model. The overall performance is still a puzzle to me though. I'm working on generating models with a range of parameters similar to the one you provided to help isolate where the issue is.
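One rough way to build intuition for the layout comment above is to compare where a node's children land in storage under each ordering. The sketch below is a toy model only: it assumes a complete binary tree stored one node per slot, and it does not reflect FIL's actual node storage or model any cache. All function names are hypothetical.

```python
from collections import deque


def breadth_first_order(depth):
    """Level-order node ids of a complete binary tree (children of i: 2i+1, 2i+2)."""
    n = 2 ** (depth + 1) - 1
    order, queue = [], deque([0])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child < n:
                queue.append(child)
    return order


def depth_first_order(depth):
    """Pre-order node ids of the same tree (left subtree stored before right)."""
    n = 2 ** (depth + 1) - 1
    order, stack = [], [0]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push the right child first so the left child is visited first
        for child in (2 * node + 2, 2 * node + 1):
            if child < n:
                stack.append(child)
    return order


def mean_parent_child_gap(order):
    """Average distance, in storage slots, between each node and its parent."""
    slot = {node: i for i, node in enumerate(order)}
    gaps = []
    for child in order:
        if child == 0:
            continue  # the root has no parent
        parent = (child - 1) // 2
        gaps.append(abs(slot[child] - slot[parent]))
    return sum(gaps) / len(gaps)


if __name__ == '__main__':
    for depth in range(1, 8):
        bf = mean_parent_child_gap(breadth_first_order(depth))
        df = mean_parent_child_gap(depth_first_order(depth))
        print(f'depth={depth} breadth_first={bf:.2f} depth_first={df:.2f}')
```

In this toy model, a depth-first ordering always stores the left child in the slot right after its parent, while in a breadth-first ordering the parent-to-child distance grows with each level. That is consistent with the layouts diverging more as trees get deeper, although real performance depends on much more than this.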
This issue is very much of interest to us, but we're a little bandwidth limited on investigating it for the 25.02 release. I'm going to continue to look into it as much as I can prior to 25.02 code freeze, but you should see more movement on it during the 25.04 development cycle. Just wanted you to know what to expect in terms of progress, @Kaiyang-Chen; we really appreciate the report.
Describe the bug
For a forest with 800 trees, num_leaves=256, and an input feature dimension of 210, GPU inference across multiple batch sizes (from 1 to 500 in steps of 10) is 4-5 times slower than with the old implementation.
For some performance stats:
The non-experimental GPU method took around 110 microseconds for inference batch < 64 samples
The experimental fil took around 450 microseconds for the same batch.
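As a quick sanity check on the two figures above, the slowdown can be computed directly. This is plain arithmetic on the reported numbers; the function name is hypothetical.

```python
def slowdown(stable_us, experimental_us):
    """Ratio of experimental to stable latency (values above 1 mean slower)."""
    return experimental_us / stable_us


if __name__ == '__main__':
    stable_us = 110.0        # reported latency of non-experimental FIL
    experimental_us = 450.0  # reported latency of experimental FIL
    print(f'slowdown: {slowdown(stable_us, experimental_us):.2f}x')
    print(f'extra latency per call: {experimental_us - stable_us:.0f} us')
```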
Is this performance degradation reasonable? I think it violates the first design goal of the fil experimental project ('Provide state-of-the-art runtime performance for forest models on GPU, especially for cases where CPU performance will not suffice (e.g. large batches, deep trees, many trees, etc.).')
Any hints on how to improve performance for the experimental version? If needed, I can provide the model file.
Expected behavior
Experimental FIL should run inference at a speed that is at least not much slower than the original version.
Environment details (please complete the following information):
cuml 25.02.00a42 cuda12_py312_250109_g225d0aaa0_42 rapidsai-nightly
libcuml 25.02.00a42 cuda12_250109_g225d0aaa0_42 rapidsai-nightly
libraft 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
libraft-headers 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
libraft-headers-only 25.02.00a32 cuda12_250109_g8fc988e1_32 rapidsai-nightly
pylibraft 25.02.00a32 cuda12_py312_250109_g8fc988e1_32 rapidsai-nightly
raft-dask 25.02.00a32 cuda12_py312_250109_g8fc988e1_32 rapidsai-nightly
treelite 4.3.0 py312h01abfbf_0 conda-forge
librmm 25.02.00a37 cuda12_250109_gc1ccdadb_37 rapidsai-nightly
rmm 25.02.00a37 cuda12_py312_250109_gc1ccdadb_37 rapidsai-nightly