
Common benchmark methodology #92

Open
fhoering opened this issue Dec 12, 2024 · 0 comments

Criteo ran benchmarks on the key-value service.

As a result, it seems there is a serious risk that TEE technology and/or privacy constraints would significantly increase ad techs' infrastructure cost.

Here are some examples of potential drivers of infrastructure cost:

  • TEE ML inference
  • Inter process communication
  • ROMA side effect implementation
  • TEE encryption

We agreed in the WICG call of 04/12/2024 that it is important to be able to assess new features in a common way and to compare the results later. This would make it possible to track improvements and/or regressions over time.

Ideally Chrome would:

  • define the hardware (some AWS/GCP instance type or both) and the required server-side setup (log level, TEST_MODE, B&A vs KV server)
  • provide a repo that contains the benchmarks (empty bidding script, simple getValue() lookup, large getValue() lookup, etc.)
  • provide some script or config that deploys and executes everything in a standardized way (see the sketch below)
  • define the way metrics are measured or calculated

This would allow the community to comment on the benchmarks and the protocol and to make proposals. As it is, everyone benchmarks on their own, and the results are chaotic and hard to compare.
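To illustrate the kind of standardized script meant above, here is a minimal sketch of a wrapper that deploys one benchmark scenario and runs a fixed load ramp. All file names (run_benchmark.sh, deploy_kv_server.sh, run_gatling.sh, summarize_results.sh), scenario names, and parameters are hypothetical placeholders, not existing tools:

#!/usr/bin/env bash
# Hypothetical benchmark harness sketch: deploy one scenario, then run a fixed load ramp.
# SCENARIO, deploy_kv_server.sh, run_gatling.sh and summarize_results.sh are placeholders.
set -euo pipefail

SCENARIO="${1:?usage: run_benchmark.sh <scenario>}"   # e.g. empty_udf, simple_getvalue, large_getvalue
INSTANCE_TYPE="c6i.2xlarge"                           # reference hardware pinned by the methodology
RESULTS_DIR="results/${SCENARIO}/$(date +%Y%m%d-%H%M%S)"

mkdir -p "${RESULTS_DIR}"

# 1. Deploy the KV server with the scenario's UDF and delta files (placeholder script).
./deploy_kv_server.sh --scenario "${SCENARIO}" --instance-type "${INSTANCE_TYPE}"

# 2. Ramp the load from 100 QPS up to 100k QPS and record one result file per step.
for qps in 100 500 1000 2000 5000 10000 50000 100000; do
  ./run_gatling.sh --target-qps "${qps}" --duration 120s \
    --output "${RESULTS_DIR}/qps_${qps}.csv"
done

# 3. Summarize latency percentiles and error rates per step (placeholder script).
./summarize_results.sh "${RESULTS_DIR}"

The point is not this particular script, but that the ramp levels, step duration, and output format are fixed up front so that numbers produced by different ad techs can be compared directly.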

As a starting point, here is the methodology we used in our benchmarks.

Setup

We deployed one instance on a remote server and injected Gatling load tests with a fixed set of ~1000 requests (replayed continuously), ramping from 100 QPS up to 100k QPS until the server started to fail. We are interested in latency because we need to reply within milliseconds, but we are mainly interested in QPS because it is a direct measure of how much hardware we need to pay for to handle the daily incoming RTB requests.

In the graph below one can see that at around ~5000 QPS the latency starts to increase significantly and the server starts to fail, so this is the figure we report in our final results. We then compare different server-side code deployments using the same methodology.

(Figure: latency vs. injected QPS; latency degrades sharply around ~5000 QPS.)
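To make the reported number reproducible, the failure point can be derived mechanically from the per-step results rather than read off the graph. A sketch, assuming each run produces a summary.csv with one header line followed by one line per step in the form qps,p99_ms,error_rate, ordered by increasing QPS (this layout is an assumption, not Gatling's native output):

#!/usr/bin/env bash
# Hypothetical sketch: report the highest QPS step that still meets the latency/error budget.
# Assumes summary.csv rows are ordered by increasing QPS: qps,p99_ms,error_rate.
set -euo pipefail

SUMMARY="${1:?usage: find_knee.sh <summary.csv>}"

awk -F, '
  NR > 1 && $2 <= 10 && $3 <= 0.001 { best = $1 }   # p99 under 10 ms, error rate under 0.1 %
  END { print "max sustainable QPS:", (best ? best : "none") }
' "${SUMMARY}"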

Hardware configuration

We used a KV service with 16 hyperthreaded cores and 16 GB of memory on our own on-premise container management platform. It would be preferable to agree on a reference configuration in the cloud with comparable AWS and GCP deployments (e.g. AWS c6i.2xlarge).

We deactivated all logs and set the number of ROMA workers to the number of available cores (=16). Storage is accessed on local disk to avoid an additional S3 dependency.

Here is an example CLI invocation showing how the server was launched:

./server --delta_directory=/tmp/deltas --realtime_directory=/tmp/realtime --port=$PORT1 --route_v1_to_v2=true \
--logging_verbosity_level=0 --stderrthreshold=0 --udf_update_timeout=120s --udf_timeout=120s --udf_num_workers=16
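To back the "CPU usage" and "Memory" metrics listed below, the resource usage of this process can be sampled during each load step, for example with pidstat from the sysstat package. A sketch, reusing the ${RESULTS_DIR} and ${qps} variables from the hypothetical harness above:

# Sample per-second CPU and memory usage of the running KV server during one load step.
# Requires the sysstat package; 120 one-second samples match the hypothetical step duration.
SERVER_PID=$(pgrep -f "delta_directory=/tmp/deltas" | head -n1)
pidstat -u -r -h -p "${SERVER_PID}" 1 120 > "${RESULTS_DIR}/pidstat_qps_${qps}.txt"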

Metrics

  • QPS (queries per second) => should be as high as possible
  • Latency in ms => should stay under 10 ms
  • CPU usage => we are mostly CPU bound; it lets us verify that, at the failure point, all CPUs have actually been used efficiently
  • Memory => it mostly depends on the runtime overhead but also on the payload being evaluated; in our case, since we executed a dummy payload, it did not make a significant difference
  • Request errors => we mostly operate at a level where this number should be negligible
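As a sketch of how these metrics could be derived in a standardized way, assume each load step produces a per-request log requests.csv with one line per request in the form epoch_ms,latency_ms,status (this format is an assumption; it is not what Gatling writes natively):

#!/usr/bin/env bash
# Hypothetical sketch: derive QPS, latency percentiles and error rate from a per-request log.
# Assumes requests.csv is chronologically ordered, one line per request: epoch_ms,latency_ms,status.
set -euo pipefail

LOG="${1:?usage: summarize.sh <requests.csv>}"

TOTAL=$(wc -l < "${LOG}")
ERRORS=$(awk -F, '$3 != "OK"' "${LOG}" | wc -l)
DURATION_S=$(awk -F, 'NR==1{first=$1} {last=$1} END{print (last-first)/1000}' "${LOG}")

# Achieved QPS and error rate over the whole step.
awk -v total="${TOTAL}" -v errors="${ERRORS}" -v dur="${DURATION_S}" \
  'BEGIN { printf "qps=%.0f error_rate=%.4f\n", total/dur, errors/total }'

# Latency percentiles (p50 / p99) from the sorted latency column.
sort -t, -k2,2n "${LOG}" | awk -F, -v total="${TOTAL}" '
  NR == int(total * 0.50) { p50 = $2 }
  NR == int(total * 0.99) { p99 = $2 }
  END { printf "p50_ms=%s p99_ms=%s\n", p50, p99 }'

Reporting p99 rather than the mean fits the latency target above: a server can look fine on average while already failing a meaningful share of requests near the knee point.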