I was trying to replicate the benchmark results shown in h2o_flexgen/benchmark/h2o. I was able to observe the decrease in peak GPU memory and latency and the increase in tokens/sec throughput, but I noticed that the actual text generated during the inference task was . I wanted to confirm whether or not this is expected behaviour. This is using the default implementation of flex_opt.py provided with facebook/opt-2.7b.
As an example, running inference with the provided example AI prompt produced this output:
Running inference again with the --hh-ratio 0.2 --hh-all flags enabling H2O produces the same output (though I do see higher prefill and decode throughput).
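For reference, a minimal sketch of the two invocations being compared. Only the --hh-ratio 0.2 --hh-all flags and facebook/opt-2.7b come from the description above; the script path and the --model flag are assumptions based on the standard FlexGen CLI, and all other benchmark defaults are left untouched:

```bash
# Baseline run (no H2O); --model is assumed from the standard FlexGen CLI.
python flex_opt.py --model facebook/opt-2.7b

# Same run with H2O enabled via the flags mentioned above.
python flex_opt.py --model facebook/opt-2.7b --hh-ratio 0.2 --hh-all
```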