Exposing PerSubscriberAggregate #4590

Thingus · 2021-11-22T16:18:39Z

Closes #4581

I have:

Formatted any Python files with black
Brought the branch up to date with master
Added any relevant Github labels
Added tests for any new additions
Added or updated any relevant documentation
Added an Architectural Decision Record (ADR), if appropriate
Added an MPLv2 License Header if appropriate
Updated the Changelog

Description

Exposing PerSubscriberAggregate

cypress · 2021-11-22T16:31:30Z

Test summary

43 • 0 • 0 • 0 • 1

Run details


Project	FlowAuth
Status	Passed
Commit	`b12e04c`
Started	Nov 24, 2021 11:57 AM
Ended	Nov 24, 2021 12:02 PM
Duration	05:40 💡
OS	Linux Debian - 10.5
Browser	Electron 94

View run in Cypress Dashboard ➡️

Flakiness

	cypress/integration/add_new_server.js		1
1	Server management > Add server

This comment has been generated by cypress-bot as a result of this project's GitHub integration settings. You can manage this integration in this project's settings in the Cypress Dashboard

codecov · 2021-11-22T17:31:17Z

Codecov Report

Merging #4590 (ef90f5f) into master (d869364) will decrease coverage by 0.17%.
The diff coverage is 5.00%.

❗ Current head ef90f5f differs from pull request most recent head 0ae659e. Consider uploading reports for the commit 0ae659e to get more accurate results

@@            Coverage Diff             @@
##           master    #4590      +/-   ##
==========================================
- Coverage   93.42%   93.25%   -0.18%     
==========================================
  Files         262      263       +1     
  Lines       10356    10376      +20     
  Branches      895      896       +1     
==========================================
+ Hits         9675     9676       +1     
- Misses        549      568      +19     
  Partials      132      132

Flag	Coverage Δ
autoflow_unit_tests	`93.03% <ø> (ø)`
flowapi_unit_tests	`72.76% <ø> (ø)`
flowauth_unit_tests	`78.41% <ø> (ø)`
flowclient_unit_tests	`74.65% <50.00%> (-0.08%)`	⬇️
flowetl_unit_tests	`95.49% <ø> (ø)`
flowkit_jwt_generator_unit_tests	`95.10% <ø> (ø)`
flowmachine_unit_tests	`90.53% <0.00%> (-0.24%)`	⬇️
integration_tests	`66.04% <5.00%> (-0.14%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
...e/server/query_schemas/per_subscriber_aggregate.py	`0.00% <0.00%> (ø)`
flowclient/flowclient/aggregates.py	`94.89% <50.00%> (-0.94%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d869364...0ae659e. Read the comment docs.

jc-harrison

A few comments here

flowclient/flowclient/aggregates.py

flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py

jc-harrison · 2021-11-23T12:59:48Z

flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py

+class PerSubscriberAggregateSchema(BaseSchema):
+    query_kind = fields.String(validate=OneOf(["per_subscriber_aggregate"]))
+    # TODO: Can we make this a list or a single query?
+    subscriber_query = fields.List(SubscriberSubset)


SubscriberSubset is not what you want here - that's a field that takes a query ID for a pre-defined query, specifically for use as the 'subscriber_subset' parameter. You'll want to define a new schema, something like NumericSubscriberMetricsSchema(OneOfSchema) (see histogram_aggregate.HistogrammableMetrics for an example - in fact, I think you want exactly that here, so it might be worth splitting HistogrammableMetrics into its own file and renaming it, similar to ReferenceLocationSchema.

You can then do subscriber_query = fields.Nested(NumericSubscriberMetricsSchema, many=True, validate=Length(min=1)) to accept a list of numeric subscriber metric queries.

I'm not sure whether fields.Nested(NumericSubscriberMetricsSchema, many=True) enforces that all elements of the list have the same query kind (e.g. "total_active_periods"), or whether this would allow a list of heterogeneous query kinds - I guess we'd want the former, so it's worth checking that.

Would fields.List(fields.Nested(NumericSubscriberMetrics), validate=Length(min=1)) be better here? I think that this would enforce that every member of the List has the NumericSubscriberMetrics fields.

To be honest, I'm not sure what the difference is between fields.List(fields.Nested(...)) and fields.Nested(..., many=True) (or whether there is a difference). Ideally, I think we'd like to enforce that every item in the list is the same type of subscriber metric (because it doesn't necessarily make sense to aggregate over a union of "total_active_periods" and "topup_amount", for example, and in some cases that might not even be a valid operation - e.g. if one of the queries has additional columns, or 'value' has a different data type). I'm not sure whether either List or many=True enforces that, though.

Also not sure whether this choice has an implication for the construction of API scopes for token generation - I'll check.

Looks like both approaches are functionally identical - both produce the same API spec, and both allow heterogeneous lists. I'm fairly sure that the permissions-checking will also allow a heterogeneous list, provided the token gives permissions for each of the query kinds present in the list, so we can't get around the problem that way either.

So we can either find a way to enforce homogeneity (perhaps by defining a custom Field, or by using @validates_schema), or just accept heterogeneous lists. I think @validates_schema might be an appropriate approach - I'm not sure the additional constraint will be reflected in the generated API spec, but I suppose it's better to have a non-advertised validation constraint than to accept parameters that will produce a broken query.

In this respect, is heterogeneous inputs a problem? As long as they all have a subscriber and value column, they should produce a valid result, and may be useful (say you wanted to select the maximum value from two different sets of subscriber metrics)

If they both have only 'subscriber' and 'value' columns, and the value columns are of compatible type, then heterogeneous is fine. If either query kind has additional columns then 'UNION' will fail (e.g. we might want to allow LocationVisits here once that's exposed, and that also has location columns). Similarly if the 'value' columns have different types, that cannot be implicitly cast to each other, then the queries cannot be unioned (although given we're explicitly choosing "numeric subscriber metric" queries here, this point is unlikely to be an issue). But you're right, there might be times where a heterogeneous union is actually useful. For now, provided all the currently-allowed metrics have the same columns, and the 'value' columns all have compatible numeric types, let's allow heterogeneous lists - we can revisit the question if we decide to add LocationVisits as an option here.

Co-authored-by: James Harrison <[email protected]>

…g_api

… PR comments

jc-harrison · 2021-11-23T18:39:39Z

flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py

+    subscriber_queries = fields.List(
+        fields.Nested(NumericSubscriberMetricsSchema), validate=Length(min=1)
+    )
+    agg_method = fields.String(validate=OneOf(agg_methods))


We've got a custom field for this (custom_fields.Statistic), which is used in a number of other query schemas. Unfortunately, it doesn't specify quite the same set of aggregation methods as are allowed here - that's something we should look to make more consistent at some point.

jc-harrison · 2021-11-23T19:04:48Z

flowmachine/flowmachine/core/server/query_schemas/flowmachine_query.py

@@ -66,6 +67,7 @@ class FlowmachineQuerySchema(OneOfSchema):
        "unmoving_counts": UnmovingCountsSchema,
        "unmoving_at_reference_location_counts": UnmovingAtReferenceLocationCountsSchema,
        "trips_od_matrix": TripsODMatrixSchema,
+        "per_subscriber_aggregate": PerSubscriberAggregateSchema,


Any query schema included in FlowmachineQuerySchema is exposed as a top-level query (which users can directly run and get results). We don't want that here (subscriber-level queries shouldn't be exposed directly), so PerSubscriberAggregateSchema should only be exposed as an allowed option for appropriate parameters of other queries - in this case, we'll want to add it to joined_spatial_aggregate's JoinableMetrics. Thinking about it, it should probably be an allowed input to HistogramAggregate as well - so we'll need to re-introduce HistogrammableMetrics after all. If there's a way to nest a OneOfSchema within another OneOfSchema to avoid duplicating the set of schemas in NumericSubscriberMetricsSchema, that would be good.

jc-harrison · 2021-11-23T19:12:46Z

flowclient/flowclient/query_specs.py

+        subscriber_query: SubscriberFeature
+            A query with a `subscriber` column
+        agg_method: {"count", "sum", "avg", "max", "min", "median", "stddev", "variance"} default "avg"
+            The method of aggregation to perform


Re-adding my comment here following the file move, for visibility:

The descriptions of the parameters here need updating to reflect the parameters of the flowclient function, rather than of the flowmachine class.

jc-harrison · 2021-11-23T19:19:44Z

flowmachine/flowmachine/core/server/query_schemas/per_subscriber_aggregate.py

+        subscriber_query = reduce(
+            # TODO: Replace with Jono's new list input to union
+            lambda x, y: x._flowmachine_query_obj.union(y._flowmachine_query_obj),
+            self.subscriber_queries,


Actually, this won't work - sorry. You want reduce(lambda x, y: x.union(y), (q._flowmachine_query_obj for q in self.subscriber_queries)

It's going to be moot soon - Jono's working on #4593 , we may as well wait for that and incorporate it into this PR.

greenape · 2023-02-10T12:58:31Z

Is this still alive, or would it need to be reworked now?

jc-harrison · 2023-03-03T10:25:01Z

Is this still alive, or would it need to be reworked now?

Would need to be reworked a bit, to use OneOfQuerySchema, but I don't think there are any fundamental changes required.

Exposing PerSubscriberAggregate

6ee584d

Adding to flowclient

a568b85

Thingus requested a review from jc-harrison November 22, 2021 16:34

Merge branch 'master' into per_sub_agg_api

e2fa1cd

Thingus added enhancement New feature or request FlowClient Issues related to FlowClient FlowMachine Issues related to FlowMachine labels Nov 22, 2021

Thingus marked this pull request as ready for review November 22, 2021 16:35

Thingus added 2 commits November 22, 2021 16:49

Moving to aggregates

916789d

Moving to aggregates

e5dcba3

in_progress

858f53d

jc-harrison requested changes Nov 23, 2021

View reviewed changes

Thingus and others added 4 commits November 23, 2021 13:57

Update flowclient/flowclient/aggregates.py

ef90f5f

Co-authored-by: James Harrison <[email protected]>

Merge remote-tracking branch 'origin/per_sub_agg_api' into per_sub_ag…

333b460

…g_api

Refactoring HistogramMetics to NumericSubscriberMetricsSchema + other…

0ae659e

… PR comments

per_susbscriber_aggregate now in approvals for flowmachine

36d4811

jc-harrison reviewed Nov 23, 2021

View reviewed changes

Merge branch 'master' into per_sub_agg_api

b12e04c

Thingus added the P-Later label Nov 25, 2021

greenape removed the P-Later label Apr 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exposing PerSubscriberAggregate #4590

Exposing PerSubscriberAggregate #4590

Thingus commented Nov 22, 2021 •

edited

Loading

cypress bot commented Nov 22, 2021 •

edited

Loading

codecov bot commented Nov 22, 2021 •

edited

Loading

jc-harrison left a comment

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

Thingus Nov 23, 2021 •

edited

Loading

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

Thingus Nov 23, 2021

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

jc-harrison Nov 23, 2021

Thingus Nov 23, 2021

greenape commented Feb 10, 2023

jc-harrison commented Mar 3, 2023

Exposing PerSubscriberAggregate #4590

Are you sure you want to change the base?

Exposing PerSubscriberAggregate #4590

Conversation

Thingus commented Nov 22, 2021 • edited Loading

I have:

Description

cypress bot commented Nov 22, 2021 • edited Loading

Test summary

Run details

Flakiness

codecov bot commented Nov 22, 2021 • edited Loading

Codecov Report

jc-harrison left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Thingus Nov 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

greenape commented Feb 10, 2023

jc-harrison commented Mar 3, 2023

Thingus commented Nov 22, 2021 •

edited

Loading

cypress bot commented Nov 22, 2021 •

edited

Loading

codecov bot commented Nov 22, 2021 •

edited

Loading

Thingus Nov 23, 2021 •

edited

Loading