Do we need to SELECT 1 before each query? #198

Closed
max-sixty opened this issue Aug 21, 2018 · 5 comments

@max-sixty
Contributor

With the recent clustering release, we're looking at whether we can use BQ for queries that previously required an indexed DB.

For these, per-query overhead matters more than throughput, and the overhead is high. With the query cached (if not, add ~1s to each figure), for a 700-row query on a small cluster:

  • 73ms: query time reported by BQ
  • 1160ms: query time logged by pandas-gbq
  • 3100ms: wall time reported by %timeit -n 1 -r 1 gbq.read_gbq(query)

The latter two are particularly far apart. One of the reasons for this is that we issue a `SELECT 1` query before each query the user submits. This happens as part of `_try_credentials`: https://github.com/pydata/pandas-gbq/blob/e753cc4e61b0089dcea6330eb8c77e93a62984a7/pandas_gbq/auth.py#L281

Is this required? Could we instead attempt the query, and only attempt different auth methods if there's a failure?
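For reference, a simplified sketch of the pattern in the linked `_try_credentials` (not a verbatim copy of `pandas_gbq/auth.py`; error handling is reduced to a generic except):

```python
# Simplified sketch of the _try_credentials pattern linked above; the real
# function in pandas_gbq/auth.py has more specific error handling.
from google.cloud import bigquery


def _try_credentials(project_id, credentials):
    if not credentials:
        return None
    try:
        client = bigquery.Client(project=project_id, credentials=credentials)
        # Validate the credentials by running a trivial query -- this is the
        # extra SELECT 1 round-trip issued before every user query.
        client.query("SELECT 1").result()
        return credentials
    except Exception:
        return None
```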

@tswast
Collaborator

tswast commented Aug 21, 2018

> Is this required?

The whole auth chain adds quite a few request round-trips. One thing I'd like (#161) is to have a `credentials` argument (or possibly a `client` argument) which would bypass all of those requests.
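For illustration, a hedged sketch of how such a `credentials` argument might be used (the parameter name follows the #161 proposal; the project ID and key path below are placeholders):

```python
# Hypothetical usage of the credentials argument proposed in #161; this was
# not yet the released pandas-gbq API at the time of this comment.
import pandas_gbq
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "path/to/key.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/bigquery"],
)

# With explicit credentials, pandas-gbq could skip the auth chain entirely,
# including the SELECT 1 validation query.
df = pandas_gbq.read_gbq(
    "SELECT COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare`",
    project_id="my-project",  # placeholder project
    credentials=credentials,
)
```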

@tswast
Collaborator

tswast commented Aug 21, 2018

Also in #161, I'd like to cache the credentials after the first call in a global Context object to avoid subsequent unnecessary auth calls.

@max-sixty
Contributor Author

Auth takes a long time! googleapis/google-auth-library-python#285

At the moment, the latency overhead basically means we can't use BQ for our shorter queries (5-10k rows) which we've traditionally run off MSSQL.

How much work do you think caching the creds would be? Given that the non-query part is 75% of the query time, it would be impactful.

@tswast
Collaborator

tswast commented Aug 25, 2018

Not much work, honestly. I think we can follow the same model as google.cloud.bigquery.magics with a similar Context class to cache the project_id and credentials.
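A minimal sketch of that Context idea, modeled loosely on `google.cloud.bigquery.magics`; the class and attribute names here are illustrative, not the final pandas-gbq API:

```python
# Illustrative Context cache, patterned after google.cloud.bigquery.magics;
# names are hypothetical and the auth call (google.auth.default) is assumed.
import google.auth


class Context(object):
    """Cache credentials and project so auth round-trips run once per session."""

    def __init__(self):
        self._credentials = None
        self._project = None

    @property
    def credentials(self):
        if self._credentials is None:
            # First access pays the auth cost; subsequent queries reuse it.
            self._credentials, self._project = google.auth.default(
                scopes=["https://www.googleapis.com/auth/bigquery"]
            )
        return self._credentials

    @property
    def project(self):
        if self._project is None:
            self.credentials  # populating credentials also sets the project
        return self._project


# Module-level singleton, as in bigquery.magics.context.
context = Context()
```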

@max-sixty
Contributor Author

I think we can close this - now it's a single call per session?

@tswast tswast closed this as completed Sep 5, 2018
tswast added a commit to pydata/pydata-google-auth that referenced this issue Sep 7, 2018
Trim pydata-google-auth package and add tests

This is the initial version of the proposed pydata-google-auth package (to be used by pandas-gbq and ibis). It includes two methods:

* `pydata_google_auth.default()`
  * A function that does auth the same way pandas-gbq currently does: it tries `google.auth.default()` and then falls back to user credentials.
* `pydata_google_auth.get_user_credentials()`
  * A public `get_user_credentials()` function, as proposed in googleapis/python-bigquery-pandas#161. Missing in this implementation is a more configurable way to adjust credentials caching. I currently use the `reauth` logic from pandas-gbq.

I drop `try_credentials()`, as it makes less sense when this module might be used for other APIs besides BigQuery. Plus there were problems with `try_credentials()` even for pandas-gbq (googleapis/python-bigquery-pandas#202, googleapis/python-bigquery-pandas#198).
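A hedged usage sketch of the two functions described above (signatures assumed from the commit description; the released package may differ from this initial version):

```python
# Hedged sketch of the proposed pydata-google-auth API; signatures assumed
# from the commit description and may differ in the released package.
import pydata_google_auth

SCOPES = ["https://www.googleapis.com/auth/bigquery"]

# Tries google.auth.default() first, then falls back to user credentials,
# mirroring pandas-gbq's existing auth flow (minus try_credentials).
credentials, project = pydata_google_auth.default(SCOPES)

# Runs the user-credentials (OAuth) flow directly.
user_credentials = pydata_google_auth.get_user_credentials(SCOPES)
```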