Improve sample population selection for deterministic sampling #699

Open
idreeskhan opened this issue Feb 1, 2024 · 0 comments


idreeskhan commented Feb 1, 2024

Currently, BigSampler tends to undersample relative to the requested input ratio when performing a deterministic sample.

One avenue to explore is:

Once hashes are created, they are normalized into the [0.0, 1.0] range by boundLong. This function could potentially be updated or modified. One option is to use the upper and lower bounds of the input results instead, although this may be difficult to implement in practice. It could also be removed and replaced; I no longer remember the specifics of why it was implemented this way.
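To make the two normalization options concrete, here is a minimal Python sketch. It is not the actual boundLong implementation (which is not shown in this issue): `bound_fixed`, `bound_observed`, `stand_in_hash` and `sample` are hypothetical names, BLAKE2b stands in for farmhash, and the rule "keep a record when its normalized hash is below the ratio" is an assumption about how the deterministic sample is taken.

```python
import hashlib

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def stand_in_hash(key: str) -> int:
    """Signed 64-bit hash of key. BLAKE2b is a stand-in here, not farmhash."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def bound_fixed(h: int) -> float:
    """Map a signed 64-bit value onto [0.0, 1.0] using the full int64 range."""
    return (h - INT64_MIN) / (INT64_MAX - INT64_MIN)


def bound_observed(h: int, lo: int, hi: int) -> float:
    """Map h onto [0.0, 1.0] using the observed min/max of the input hashes."""
    if hi == lo:
        return 0.0  # degenerate input: every hash is identical
    return (h - lo) / (hi - lo)


def sample(keys, ratio, bound):
    """Keep a key when its normalized hash falls below ratio (assumed rule)."""
    return [k for k in keys if bound(stand_in_hash(k)) < ratio]


keys = [f"user-{i}" for i in range(50_000)]
hashes = [stand_in_hash(k) for k in keys]
lo, hi = min(hashes), max(hashes)

fixed = sample(keys, 0.1, bound_fixed)
observed = sample(keys, 0.1, lambda h: bound_observed(h, lo, hi))

# Compare the achieved sampling fraction against the requested ratio.
print(len(fixed) / len(keys), len(observed) / len(keys))
```

The observed-bounds variant needs a full pass over the input to find the minimum and maximum before any record can be selected, which is likely part of why it is harder to implement in practice.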

Another path, instead of or in addition to this, is:

We primarily use farmhash, which is not a cryptographic hash function. Is its output sufficiently uniform in its distribution? If not, now that additional hash functions are available within BigQuery, is there another function with a more appropriate output distribution?
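One way to approach the uniformity question is to bucket normalized hash values and compute a chi-square statistic against the uniform expectation. The sketch below is only an illustration: `chi_square_uniformity` and `stand_in_hash` are hypothetical names, BLAKE2b is a stand-in so the example runs with the standard library alone, and the threshold is a loose rule of thumb (mean plus five standard deviations of the chi-square distribution), not a formal test. Any candidate hash, including a farmhash binding, could be passed in as `hash_fn` as long as it returns a signed 64-bit integer.

```python
import hashlib
from math import sqrt


def stand_in_hash(key: str) -> int:
    """Signed 64-bit hash of key. BLAKE2b is a stand-in here, not farmhash."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def chi_square_uniformity(keys, hash_fn, buckets=20):
    """Return (chi-square statistic, per-bucket counts) for hash_fn over keys."""
    counts = [0] * buckets
    n = 0
    for key in keys:
        # Shift the signed 64-bit value to be non-negative, then scale to [0, 1].
        unit = (hash_fn(key) + 2**63) / 2**64
        # Clamp the top edge, since float rounding can yield exactly 1.0.
        counts[min(int(unit * buckets), buckets - 1)] += 1
        n += 1
    expected = n / buckets
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, counts


keys = [f"user-{i}" for i in range(50_000)]
buckets = 20
df = buckets - 1
# A chi-square variable with df degrees of freedom has mean df and variance 2*df.
threshold = df + 5 * sqrt(2 * df)

stat, counts = chi_square_uniformity(keys, stand_in_hash, buckets)
print(f"chi-square = {stat:.1f}, rule-of-thumb threshold = {threshold:.1f}")
print("looks uniform" if stat < threshold else "looks skewed")
```

Running this over real sample keys rather than synthetic ones would matter here, since a non-cryptographic hash may behave differently on structured inputs such as sequential IDs.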

idreeskhan changed the title from "Improve sample population selection for deterministic hashing" to "Improve sample population selection for deterministic sampling" on Feb 1, 2024