Finding a solution to put all user data in KV store #93

Open
fhoering opened this issue Dec 12, 2024 · 2 comments

@fhoering
Contributor

For our production use cases we need to handle, in a cost-efficient way, datasets that don't easily fit into the memory of one machine.

We could easily distribute this data across many machines: the total dataset is large, but each individual first-party user record is relatively small. This would allow horizontal scaling of our user data across many KV instances with a reasonable memory/CPU footprint.

Currently the doc on sharding mentions:

  • for any given kv server read request, downstream requests are always made to all shards. That is necessary to not reveal extra information about the looked up keys, as AdTechs know which keys live on which shards.
  • for any given kv server read request, when data shards are queried, the payloads of corresponding requests are of the same size, for the same reason.
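
For illustration, here is a minimal sketch of what this fan-out-with-padding behavior implies for a single read request (names, sizes, and the key-to-shard mapping are hypothetical, not the actual KV server code):

```python
# Hypothetical sketch of the documented fan-out behavior: every read request
# triggers a downstream lookup to every shard, and all downstream payloads are
# padded to the same size so the target shard cannot be inferred from traffic.

NUM_SHARDS = 100              # illustrative shard count
PADDED_PAYLOAD_BYTES = 1024   # all downstream payloads share one fixed size


def shard_for_key(key: str) -> int:
    # Hypothetical key-to-shard mapping (the real server uses its own scheme).
    return hash(key) % NUM_SHARDS


def build_downstream_requests(key: str) -> list[bytes]:
    """One real lookup plus NUM_SHARDS - 1 chaff requests, all the same size."""
    target = shard_for_key(key)
    requests = []
    for shard in range(NUM_SHARDS):
        payload = key.encode() if shard == target else b""  # chaff: no real key
        # Pad every payload to the same length so shards are indistinguishable.
        requests.append(payload.ljust(PADDED_PAYLOAD_BYTES, b"\0"))
    return requests


reqs = build_downstream_requests("user-123")
print(len(reqs), "downstream requests,", sum(len(r) for r in reqs), "bytes on the wire")
```

Under this pattern, a single incoming read produces NUM_SHARDS equally sized downstream requests on the internal network.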

This does not seem scalable: if each shard receives every request, the infrastructure cost is multiplied by the number of shards.

The easiest option seems to be to officially deactivate this mechanism and define a timeline for working on a viable long-term solution.

The long-term solution should cover both the on-device and the B&A use cases.

@peiwenhu

peiwenhu commented Dec 12, 2024

For privacy reasons, some noise is necessary to prevent traffic analysis.

Note that the artificial requests to other shards won't be processed by the shards using real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead is mainly from the network noise and won't be a multiple of the number of shards.
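
To make that claim concrete, here is a rough sketch of a shard-side handler that short-circuits artificial requests (hypothetical code; how the server marks and detects chaff is an assumption here, not taken from the actual implementation):

```python
# Hypothetical shard-side handler illustrating the claim above: artificial
# (chaff) requests are answered with a padded dummy response and never reach
# the real lookup or UDF path, so their cost is dominated by network transfer.

RESPONSE_PADDED_BYTES = 1024


def lookup_in_local_store(key: str) -> bytes:
    # Placeholder for the shard's in-memory lookup.
    return b"value-for-" + key.encode()


def handle_downstream_request(payload: bytes, is_chaff: bool) -> bytes:
    if is_chaff:
        # Short-circuit: no storage lookup, no UDF execution.
        return b"\0" * RESPONSE_PADDED_BYTES
    key = payload.rstrip(b"\0").decode()
    value = lookup_in_local_store(key)  # real work happens only here
    return value.ljust(RESPONSE_PADDED_BYTES, b"\0")
```

Under this model the chaff adds bytes on the wire and request-handling overhead, but no additional storage lookups or UDF executions.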

There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off so we can get closer to a quantitative understanding?

@fhoering
Contributor Author

> Note that the artificial requests to other shards won't be processed by the shards using real logic. The main processing cost still comes from the shards actually doing the lookups and from the server running the UDF processing. So the privacy protection overhead is mainly from the network noise and won't be a multiple of the number of shards.

If we have, let's say, 100 shards, 99% of the incoming traffic will be overhead. We have also shown here that there is a real compute impact even with a dummy ROMA script (if the KV server executes logic).
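
A back-of-the-envelope sketch of where that figure comes from, assuming one real downstream lookup per incoming read and uniform chaffing across all shards:

```python
# With S shards, each incoming read fans out into 1 real downstream lookup and
# S - 1 chaff requests, so the fraction of chaff traffic is (S - 1) / S.
for num_shards in (10, 100, 1000):
    chaff_fraction = (num_shards - 1) / num_shards
    print(f"{num_shards:>5} shards -> {chaff_fraction:.1%} of downstream traffic is chaff")
```

This prints 90.0%, 99.0% and 99.9% for 10, 100 and 1,000 shards respectively; it only counts request amplification, not any compute impact.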

Is our provided code supposed to check for the right shard, or is there some middleware managed by Chrome that would handle this?

It seems unlikely that this will have no impact. We would welcome Chrome benchmarks with, say, 100 or 1,000 shards where one can really see that the expected impact would be negligible.

> There is a flag to deactivate the chaffing mechanism (sending artificial downstream lookup requests to irrelevant shards) in non-prod environments. Could you evaluate the cost difference between switching it on and off so we can get closer to a quantitative understanding?

In the meantime, this flag to deactivate the chaffing mechanism should also apply to PROD mode, while waiting for a proper solution.
