Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing PerSubscriberAggregate #4590

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Exposing PerSubscriberAggregate #4590

wants to merge 11 commits into from

Conversation

Thingus
Copy link
Contributor

@Thingus Thingus commented Nov 22, 2021

Closes #4581

I have:

  • Formatted any Python files with black
  • Brought the branch up to date with master
  • Added any relevant Github labels
  • Added tests for any new additions
  • Added or updated any relevant documentation
  • Added an Architectural Decision Record (ADR), if appropriate
  • Added an MPLv2 License Header if appropriate
  • Updated the Changelog

Description

Exposing PerSubscriberAggregate

@cypress
Copy link

cypress bot commented Nov 22, 2021



Test summary

43 0 0 0Flakiness 1


Run details

Project FlowAuth
Status Passed
Commit b12e04c
Started Nov 24, 2021 11:57 AM
Ended Nov 24, 2021 12:02 PM
Duration 05:40 💡
OS Linux Debian - 10.5
Browser Electron 94

View run in Cypress Dashboard ➡️


Flakiness

cypress/integration/add_new_server.js Flakiness
1 Server management > Add server

This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard

@Thingus Thingus added enhancement New feature or request FlowClient Issues related to FlowClient FlowMachine Issues related to FlowMachine labels Nov 22, 2021
@Thingus Thingus marked this pull request as ready for review November 22, 2021 16:35
@codecov
Copy link

codecov bot commented Nov 22, 2021

Codecov Report

Merging #4590 (ef90f5f) into master (d869364) will decrease coverage by 0.17%.
The diff coverage is 5.00%.

❗ Current head ef90f5f differs from pull request most recent head 0ae659e. Consider uploading reports for the commit 0ae659e to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4590      +/-   ##
==========================================
- Coverage   93.42%   93.25%   -0.18%     
==========================================
  Files         262      263       +1     
  Lines       10356    10376      +20     
  Branches      895      896       +1     
==========================================
+ Hits         9675     9676       +1     
- Misses        549      568      +19     
  Partials      132      132              
Flag Coverage Δ
autoflow_unit_tests 93.03% <ø> (ø)
flowapi_unit_tests 72.76% <ø> (ø)
flowauth_unit_tests 78.41% <ø> (ø)
flowclient_unit_tests 74.65% <50.00%> (-0.08%) ⬇️
flowetl_unit_tests 95.49% <ø> (ø)
flowkit_jwt_generator_unit_tests 95.10% <ø> (ø)
flowmachine_unit_tests 90.53% <0.00%> (-0.24%) ⬇️
integration_tests 66.04% <5.00%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...e/server/query_schemas/per_subscriber_aggregate.py 0.00% <0.00%> (ø)
flowclient/flowclient/aggregates.py 94.89% <50.00%> (-0.94%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d869364...0ae659e. Read the comment docs.

Copy link
Member

@jc-harrison jc-harrison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments here

flowclient/flowclient/aggregates.py Outdated Show resolved Hide resolved
flowclient/flowclient/aggregates.py Outdated Show resolved Hide resolved
flowclient/flowclient/aggregates.py Outdated Show resolved Hide resolved
class PerSubscriberAggregateSchema(BaseSchema):
query_kind = fields.String(validate=OneOf(["per_subscriber_aggregate"]))
# TODO: Can we make this a list or a single query?
subscriber_query = fields.List(SubscriberSubset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubscriberSubset is not what you want here - that's a field that takes a query ID for a pre-defined query, specifically for use as the 'subscriber_subset' parameter. You'll want to define a new schema, something like NumericSubscriberMetricsSchema(OneOfSchema) (see histogram_aggregate.HistogrammableMetrics for an example - in fact, I think you want exactly that here, so it might be worth splitting HistogrammableMetrics into its own file and renaming it, similar to ReferenceLocationSchema.

You can then do subscriber_query = fields.Nested(NumericSubscriberMetricsSchema, many=True, validate=Length(min=1)) to accept a list of numeric subscriber metric queries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether fields.Nested(NumericSubscriberMetricsSchema, many=True) enforces that all elements of the list have the same query kind (e.g. "total_active_periods"), or whether this would allow a list of heterogeneous query kinds - I guess we'd want the former, so it's worth checking that.

Copy link
Contributor Author

@Thingus Thingus Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would fields.List(fields.Nested(NumericSubscriberMetrics), validate=Length(min=1)) be better here? I think that this would enforce that every member of the List has the NumericSubscriberMetrics fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I'm not sure what the difference is between fields.List(fields.Nested(...)) and fields.Nested(..., many=True) (or whether there is a difference). Ideally, I think we'd like to enforce that every item in the list is the same type of subscriber metric (because it doesn't necessarily make sense to aggregate over a union of "total_active_periods" and "topup_amount", for example, and in some cases that might not even be a valid operation - e.g. if one of the queries has additional columns, or 'value' has a different data type). I'm not sure whether either List or many=True enforces that, though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also not sure whether this choice has an implication for the construction of API scopes for token generation - I'll check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like both approaches are functionally identical - both produce the same API spec, and both allow heterogeneous lists. I'm fairly sure that the permissions-checking will also allow a heterogeneous list, provided the token gives permissions for each of the query kinds present in the list, so we can't get around the problem that way either.

So we can either find a way to enforce homogeneity (perhaps by defining a custom Field, or by using @validates_schema), or just accept heterogeneous lists. I think @validates_schema might be an appropriate approach - I'm not sure the additional constraint will be reflected in the generated API spec, but I suppose it's better to have a non-advertised validation constraint than to accept parameters that will produce a broken query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this respect, is heterogeneous inputs a problem? As long as they all have a subscriber and value column, they should produce a valid result, and may be useful (say you wanted to select the maximum value from two different sets of subscriber metrics)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they both have only 'subscriber' and 'value' columns, and the value columns are of compatible type, then heterogeneous is fine. If either query kind has additional columns then 'UNION' will fail (e.g. we might want to allow LocationVisits here once that's exposed, and that also has location columns). Similarly if the 'value' columns have different types, that cannot be implicitly cast to each other, then the queries cannot be unioned (although given we're explicitly choosing "numeric subscriber metric" queries here, this point is unlikely to be an issue). But you're right, there might be times where a heterogeneous union is actually useful. For now, provided all the currently-allowed metrics have the same columns, and the 'value' columns all have compatible numeric types, let's allow heterogeneous lists - we can revisit the question if we decide to add LocationVisits as an option here.

subscriber_queries = fields.List(
fields.Nested(NumericSubscriberMetricsSchema), validate=Length(min=1)
)
agg_method = fields.String(validate=OneOf(agg_methods))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got a custom field for this (custom_fields.Statistic), which is used in a number of other query schemas. Unfortunately, it doesn't specify quite the same set of aggregation methods as are allowed here - that's something we should look to make more consistent at some point.

@@ -66,6 +67,7 @@ class FlowmachineQuerySchema(OneOfSchema):
"unmoving_counts": UnmovingCountsSchema,
"unmoving_at_reference_location_counts": UnmovingAtReferenceLocationCountsSchema,
"trips_od_matrix": TripsODMatrixSchema,
"per_subscriber_aggregate": PerSubscriberAggregateSchema,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any query schema included in FlowmachineQuerySchema is exposed as a top-level query (which users can directly run and get results). We don't want that here (subscriber-level queries shouldn't be exposed directly), so PerSubscriberAggregateSchema should only be exposed as an allowed option for appropriate parameters of other queries - in this case, we'll want to add it to joined_spatial_aggregate's JoinableMetrics. Thinking about it, it should probably be an allowed input to HistogramAggregate as well - so we'll need to re-introduce HistogrammableMetrics after all. If there's a way to nest a OneOfSchema within another OneOfSchema to avoid duplicating the set of schemas in NumericSubscriberMetricsSchema, that would be good.

subscriber_query: SubscriberFeature
A query with a `subscriber` column
agg_method: {"count", "sum", "avg", "max", "min", "median", "stddev", "variance"} default "avg"
The method of aggregation to perform
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-adding my comment here following the file move, for visibility:

The descriptions of the parameters here need updating to reflect the parameters of the flowclient function, rather than of the flowmachine class.

subscriber_query = reduce(
# TODO: Replace with Jono's new list input to union
lambda x, y: x._flowmachine_query_obj.union(y._flowmachine_query_obj),
self.subscriber_queries,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this won't work - sorry. You want reduce(lambda x, y: x.union(y), (q._flowmachine_query_obj for q in self.subscriber_queries)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's going to be moot soon - Jono's working on #4593 , we may as well wait for that and incorporate it into this PR.

@greenape
Copy link
Member

Is this still alive, or would it need to be reworked now?

@jc-harrison
Copy link
Member

Is this still alive, or would it need to be reworked now?

Would need to be reworked a bit, to use OneOfQuerySchema, but I don't think there are any fundamental changes required.

@greenape greenape removed the P-Later label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request FlowClient Issues related to FlowClient FlowMachine Issues related to FlowMachine
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose PerSubscriberAggregate
3 participants