Finding a solution to put all user data in KV store #93

Open
fhoering opened this issue Dec 12, 2024 · 2 comments

@fhoering
Contributor

For our production use cases we need to handle, in a cost-efficient way, datasets that don't easily fit into the memory of one machine.

We could easily distribute this data across many machines: the total dataset is large, but each individual first-party user record is relatively small. This would allow horizontal scaling of our user data across many KV instances with a reasonable memory/CPU footprint.

Currently the doc on sharding mentions:

  • for any given kv server read request, downstream requests are always made to all shards. That is necessary to not reveal extra information about the looked up keys, as AdTechs know which keys live on which shards.
  • for any given kv server read request, when data shards are queried, the payloads of corresponding requests are of the same size, for the same reason.
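
For illustration, here is a minimal sketch of what this fan-out-with-padding behavior implies for a single read request (names, sizes, and the key-to-shard mapping are hypothetical, not the actual KV server code):

```python
# Hypothetical sketch of the documented fan-out behavior: every read request
# triggers a downstream lookup to every shard, and all downstream payloads are
# padded to the same size so the target shard cannot be inferred from traffic.

NUM_SHARDS = 100              # illustrative shard count
PADDED_PAYLOAD_BYTES = 1024   # all downstream payloads share one fixed size


def shard_for_key(key: str) -> int:
    # Hypothetical key-to-shard mapping (the real server uses its own scheme).
    return hash(key) % NUM_SHARDS


def build_downstream_requests(key: str) -> list[bytes]:
    """One real lookup plus NUM_SHARDS - 1 chaff requests, all the same size."""
    target = shard_for_key(key)
    requests = []
    for shard in range(NUM_SHARDS):
        payload = key.encode() if shard == target else b""  # chaff: no real key
        # Pad every payload to the same length so shards are indistinguishable.
        requests.append(payload.ljust(PADDED_PAYLOAD_BYTES, b"\0"))
    return requests


reqs = build_downstream_requests("user-123")
print(len(reqs), "downstream requests,", sum(len(r) for r in reqs), "bytes on the wire")
```

Under this pattern, a single incoming read produces NUM_SHARDS equally sized downstream requests on the internal network.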

This does not seem scalable: if each shard receives every request, the infrastructure cost is multiplied by the number of shards.

The easiest option seems to be to officially deactivate this mechanism and define a timeline for working on a viable long-term solution.

The long-term solution should cover both the on-device and the B&A use cases.

@peiwenhu

peiwenhu commented Dec 12, 2024

For privacy reasons, some noise is necessary to prevent traffic analysis.

Note that the artificial requests to other shards won't be processed by the shards using real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead is mainly from the network noise and won't be a multiple of the number of shards.
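
To make that claim concrete, here is a rough sketch of a shard-side handler that short-circuits artificial requests (hypothetical code; how the server marks and detects chaff is an assumption here, not taken from the actual implementation):

```python
# Hypothetical shard-side handler illustrating the claim above: artificial
# (chaff) requests are answered with a padded dummy response and never reach
# the real lookup or UDF path, so their cost is dominated by network transfer.

RESPONSE_PADDED_BYTES = 1024


def lookup_in_local_store(key: str) -> bytes:
    # Placeholder for the shard's in-memory lookup.
    return b"value-for-" + key.encode()


def handle_downstream_request(payload: bytes, is_chaff: bool) -> bytes:
    if is_chaff:
        # Short-circuit: no storage lookup, no UDF execution.
        return b"\0" * RESPONSE_PADDED_BYTES
    key = payload.rstrip(b"\0").decode()
    value = lookup_in_local_store(key)  # real work happens only here
    return value.ljust(RESPONSE_PADDED_BYTES, b"\0")
```

Under this model the chaff adds bytes on the wire and request-handling overhead, but no additional storage lookups or UDF executions.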

There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off so we can get closer to a quantitative understanding?

@fhoering
Contributor Author

> Note that the artificial requests to other shards won't be processed by the shards using real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead is mainly from the network noise and won't be a multiple of the number of shards.

If we have, let's say, 100 shards, 99% of the incoming traffic will be overhead. We have also shown here that there is a real compute impact even with a dummy ROMA script (if the KV server executes logic).
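
A back-of-the-envelope sketch of where that figure comes from, assuming one real downstream lookup per incoming read and uniform chaffing across all shards:

```python
# With S shards, each incoming read fans out into 1 real downstream lookup and
# S - 1 chaff requests, so the fraction of chaff traffic is (S - 1) / S.
for num_shards in (10, 100, 1000):
    chaff_fraction = (num_shards - 1) / num_shards
    print(f"{num_shards:>5} shards -> {chaff_fraction:.1%} of downstream traffic is chaff")
```

This prints 90.0%, 99.0% and 99.9% for 10, 100 and 1,000 shards respectively; it only counts request amplification, not any compute impact.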

Is our provided code supposed to check for the right shard, or is there some middleware managed by Chrome that would handle this?

It seems unlikely that this will have no impact. We would welcome Chrome benchmarks with, say, 100 or 1,000 shards where one can really see that the expected impact would be negligible.

> There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off so we can get closer to a quantitative understanding?

In the meantime, this flag to deactivate the chaffing mechanism should also apply to PROD mode, while waiting for a proper solution.
