-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exposing PerSubscriberAggregate #4590
base: master
Are you sure you want to change the base?
Conversation
Test summaryRun details
View run in Cypress Dashboard ➡️ Flakiness
This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard |
Codecov Report
@@ Coverage Diff @@
## master #4590 +/- ##
==========================================
- Coverage 93.42% 93.25% -0.18%
==========================================
Files 262 263 +1
Lines 10356 10376 +20
Branches 895 896 +1
==========================================
+ Hits 9675 9676 +1
- Misses 549 568 +19
Partials 132 132
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments here
flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py
Outdated
Show resolved
Hide resolved
flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py
Outdated
Show resolved
Hide resolved
class PerSubscriberAggregateSchema(BaseSchema): | ||
query_kind = fields.String(validate=OneOf(["per_subscriber_aggregate"])) | ||
# TODO: Can we make this a list or a single query? | ||
subscriber_query = fields.List(SubscriberSubset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SubscriberSubset
is not what you want here - that's a field that takes a query ID for a pre-defined query, specifically for use as the 'subscriber_subset' parameter. You'll want to define a new schema, something like NumericSubscriberMetricsSchema(OneOfSchema)
(see histogram_aggregate.HistogrammableMetrics
for an example - in fact, I think you want exactly that here, so it might be worth splitting HistogrammableMetrics
into its own file and renaming it, similar to ReferenceLocationSchema
.
You can then do subscriber_query = fields.Nested(NumericSubscriberMetricsSchema, many=True, validate=Length(min=1))
to accept a list of numeric subscriber metric queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether fields.Nested(NumericSubscriberMetricsSchema, many=True)
enforces that all elements of the list have the same query kind (e.g. "total_active_periods"), or whether this would allow a list of heterogeneous query kinds - I guess we'd want the former, so it's worth checking that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would fields.List(fields.Nested(NumericSubscriberMetrics), validate=Length(min=1))
be better here? I think that this would enforce that every member of the List
has the NumericSubscriberMetrics
fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I'm not sure what the difference is between fields.List(fields.Nested(...))
and fields.Nested(..., many=True)
(or whether there is a difference). Ideally, I think we'd like to enforce that every item in the list is the same type of subscriber metric (because it doesn't necessarily make sense to aggregate over a union of "total_active_periods" and "topup_amount", for example, and in some cases that might not even be a valid operation - e.g. if one of the queries has additional columns, or 'value' has a different data type). I'm not sure whether either List
or many=True
enforces that, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also not sure whether this choice has an implication for the construction of API scopes for token generation - I'll check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like both approaches are functionally identical - both produce the same API spec, and both allow heterogeneous lists. I'm fairly sure that the permissions-checking will also allow a heterogeneous list, provided the token gives permissions for each of the query kinds present in the list, so we can't get around the problem that way either.
So we can either find a way to enforce homogeneity (perhaps by defining a custom Field, or by using @validates_schema
), or just accept heterogeneous lists. I think @validates_schema
might be an appropriate approach - I'm not sure the additional constraint will be reflected in the generated API spec, but I suppose it's better to have a non-advertised validation constraint than to accept parameters that will produce a broken query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this respect, is heterogeneous inputs a problem? As long as they all have a subscriber
and value
column, they should produce a valid result, and may be useful (say you wanted to select the maximum value from two different sets of subscriber metrics)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they both have only 'subscriber' and 'value' columns, and the value columns are of compatible type, then heterogeneous is fine. If either query kind has additional columns then 'UNION' will fail (e.g. we might want to allow LocationVisits here once that's exposed, and that also has location columns). Similarly if the 'value' columns have different types, that cannot be implicitly cast to each other, then the queries cannot be unioned (although given we're explicitly choosing "numeric subscriber metric" queries here, this point is unlikely to be an issue). But you're right, there might be times where a heterogeneous union is actually useful. For now, provided all the currently-allowed metrics have the same columns, and the 'value' columns all have compatible numeric types, let's allow heterogeneous lists - we can revisit the question if we decide to add LocationVisits as an option here.
Co-authored-by: James Harrison <[email protected]>
subscriber_queries = fields.List( | ||
fields.Nested(NumericSubscriberMetricsSchema), validate=Length(min=1) | ||
) | ||
agg_method = fields.String(validate=OneOf(agg_methods)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got a custom field for this (custom_fields.Statistic
), which is used in a number of other query schemas. Unfortunately, it doesn't specify quite the same set of aggregation methods as are allowed here - that's something we should look to make more consistent at some point.
@@ -66,6 +67,7 @@ class FlowmachineQuerySchema(OneOfSchema): | |||
"unmoving_counts": UnmovingCountsSchema, | |||
"unmoving_at_reference_location_counts": UnmovingAtReferenceLocationCountsSchema, | |||
"trips_od_matrix": TripsODMatrixSchema, | |||
"per_subscriber_aggregate": PerSubscriberAggregateSchema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any query schema included in FlowmachineQuerySchema
is exposed as a top-level query (which users can directly run and get results). We don't want that here (subscriber-level queries shouldn't be exposed directly), so PerSubscriberAggregateSchema should only be exposed as an allowed option for appropriate parameters of other queries - in this case, we'll want to add it to joined_spatial_aggregate's JoinableMetrics
. Thinking about it, it should probably be an allowed input to HistogramAggregate
as well - so we'll need to re-introduce HistogrammableMetrics
after all. If there's a way to nest a OneOfSchema
within another OneOfSchema
to avoid duplicating the set of schemas in NumericSubscriberMetricsSchema
, that would be good.
subscriber_query: SubscriberFeature | ||
A query with a `subscriber` column | ||
agg_method: {"count", "sum", "avg", "max", "min", "median", "stddev", "variance"} default "avg" | ||
The method of aggregation to perform |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-adding my comment here following the file move, for visibility:
The descriptions of the parameters here need updating to reflect the parameters of the flowclient function, rather than of the flowmachine class.
subscriber_query = reduce( | ||
# TODO: Replace with Jono's new list input to union | ||
lambda x, y: x._flowmachine_query_obj.union(y._flowmachine_query_obj), | ||
self.subscriber_queries, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this won't work - sorry. You want reduce(lambda x, y: x.union(y), (q._flowmachine_query_obj for q in self.subscriber_queries)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's going to be moot soon - Jono's working on #4593 , we may as well wait for that and incorporate it into this PR.
Is this still alive, or would it need to be reworked now? |
Would need to be reworked a bit, to use |
Closes #4581
I have:
Description
Exposing
PerSubscriberAggregate