🚀 Perform overpass queries in batches #1026

Draft · wants to merge 4 commits into base: master
Conversation

JGreenlee (Contributor)

The existing implementation predicted modes for sections one by one: if a section's sensed mode was IN_VEHICLE or UNKNOWN, we queried transit stops at its start and end locations.

This was inefficient in 2 ways:

  • for consecutive IN_VEHICLE / UNKNOWN sections, we queried twice for the same point, because the first section's end is the same as the second section's start
  • we queried everything one-by-one, which led to a lot of small queries in succession

This can be optimized by doing the predictions in 2 passes.
On the first pass, we make whatever predictions we can without any transit stops.
Then we query Overpass for stops near all the remaining sections' locations.
Then we do a second pass on those sections, now that the transit stops at each location have been retrieved.

In match_stops.py, get_stops_near now accepts a list of locations rather than a single location. Updated TestOverpass to reflect this, and also made test_get_stops_near a little more comprehensive.

Testing done:
Tests pass. I also manually inspected the created analysis/inferred_section entries for shankari_2016-07-27 and shankari_2016-08-04 on this branch vs. master to ensure the sensed_modes were the same.
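The two-pass flow described above could be sketched roughly like this. The section dicts, predict_from_stops, and the dict return shape of the batched get_stops_near are hypothetical stand-ins for illustration, not the PR's actual code:

```python
def predict_modes(sections, get_stops_near, predict_from_stops):
    # Pass 1: predict what we can without transit stops, and collect
    # the IN_VEHICLE / UNKNOWN sections that still need stop lookups.
    results, pending = {}, []
    for s in sections:
        if s["sensed_mode"] not in ("IN_VEHICLE", "UNKNOWN"):
            results[s["id"]] = s["sensed_mode"]
        else:
            pending.append(s)

    # One batched Overpass query for all remaining locations,
    # de-duplicating shared points (a section's end == the next one's start).
    locations, seen = [], set()
    for s in pending:
        for loc in (s["start_loc"], s["end_loc"]):
            if loc not in seen:
                seen.add(loc)
                locations.append(loc)
    stops_by_loc = get_stops_near(locations, 150.0)  # assumed {location: [stops]}

    # Pass 2: finish the predictions now that stops are in hand.
    for s in pending:
        results[s["id"]] = predict_from_stops(
            stops_by_loc[s["start_loc"]], stops_by_loc[s["end_loc"]])
    return results
```

With two consecutive vehicle sections sharing a point, this issues a single batched query over three unique locations instead of four per-point queries.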

@JGreenlee (Contributor Author)

Results are promising so far:

[Image: pipeline stage runtimes for MODE_INFERENCE]

@JGreenlee (Contributor Author)

TODO:

  • Add a limit to how many locations go into one query
    • While batching multiple locations into one query does appear to be faster overall, the current implementation does all sections at once, and we risk creating a query so large that it consistently times out
  • Address configurability regressions (if they matter)
    • section.endStopRadius is no longer used. Can it be removed, or might it be needed in the future?
    • The query template is now defined inline instead of in a separate file in conf/. Is it important for this to be configurable?
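A per-query limit could be sketched as follows. chunk_list matches the shape of the helper exercised in the tests below; get_stops_for_locations, query_fn, and the limit of 10 are hypothetical names and values for illustration:

```python
# Hypothetical cap on bounding boxes per Overpass request; the real value
# would need tuning against observed timeout behavior.
MAX_BBOXES_PER_QUERY = 10

def chunk_list(data, chunk_size):
    # Yield successive chunks of at most chunk_size elements.
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

def get_stops_for_locations(locations, query_fn):
    # Issue one Overpass request per chunk instead of one giant request,
    # then merge the per-chunk results back into a single flat list.
    stops = []
    for chunk in chunk_list(locations, MAX_BBOXES_PER_QUERY):
        stops.extend(query_fn(chunk))  # one network request per chunk
    return stops
```

This keeps the speed benefit of batching while bounding the size of any single request, so no individual query grows large enough to time out consistently.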

@TeachMeTW (Contributor)

  • Add limit to how many locations are in one query

Added MAX_BBOXES_PER_QUERY

  • The query template is now defined inline instead of a separate file in conf/. Is it important for this to be configurable?

Added the option to load the query from config, otherwise falling back to the default inline query



def get_query_for_bboxes(bboxes):
    query = '[out:json][timeout:25];\n'
Contributor Author

The hard part about making the query configurable is that the query now has 2 distinct parts: the header, which is defined here, and the body, which is repeated for each location.
The current solution only makes the body configurable. Now that I think about it, I'm not sure the body even needs to be configurable, because the code expects those specific features when it parses the result. If someone wants to change which features are queried, they will have to make code changes anyway.

If there is one thing that definitely should be configurable, I think it is the timeout threshold, which goes in the header and is currently 25.
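The header/body split with a configurable timeout could look something like this. This is a minimal sketch: the body template shown (public_transport nodes) and the union syntax are illustrative assumptions, not the project's actual template:

```python
QUERY_HEADER_TEMPLATE = '[out:json][timeout:{timeout}];\n'
# Illustrative body; the real template must query exactly the features
# that the result-parsing code expects.
QUERY_BODY_TEMPLATE = 'node["public_transport"]({bbox});\n'

def get_query_for_bboxes(bboxes, timeout=25):
    # The header appears once (with the configurable timeout); the body is
    # repeated for each bounding box inside a union, so a single request
    # covers all locations.
    query = QUERY_HEADER_TEMPLATE.format(timeout=timeout) + '(\n'
    for bbox in bboxes:
        query += QUERY_BODY_TEMPLATE.format(
            bbox=','.join(str(coord) for coord in bbox))
    query += ');\nout;'
    return query
```

Under this split, only the timeout (and perhaps the body) needs to come from config; the header structure stays fixed in code.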

        enetm.MAX_BBOXES_PER_QUERY = original_max
        enetm.make_request_and_catch = original_make_request_and_catch

    def test_get_predicted_transit_mode_different_sizes(self):
Contributor Author

Can you explain what this one is supposed to do?

Contributor

The goal here is to verify that the function behaves correctly regardless of the number of stops provided. It checks that the function scales correctly and returns the expected result for a range of input sizes. Though it could be argued that it is redundant, since we have test_get_predicted_transit_mode_many_chunks.

Comment on lines 113 to 115

        stops = enetm.get_stops_near(coords, 150.0)
        # Expect one chunk per coordinate = 20 chunks.
        self.assertEqual(len(stops), 20)
Contributor Author

This works, but not for the reason you think it does.

Looking at your implementation of get_stops_near, the return value is not separated by chunk; the chunks have already been merged together.
The length of this list will match the length of coords (in this case 20), regardless of what MAX_BBOXES_PER_QUERY is.
So this isn't really validating the chunking at all.

Come up with a better way to test the chunking, and do it with MAX_BBOXES_PER_QUERY=10.

Hint: maybe you can measure how many API calls were made?

Contributor

@JGreenlee So you suggest that with MAX_BBOXES_PER_QUERY=10 and 20 dummy coordinates, we expect 2 API calls?

Comment on lines 73 to 91

    def test_chunk_list(self):
        # Case 1: List of 10 elements with chunk size of 3.
        data = list(range(1, 11))  # [1, 2, ..., 10]
        chunk_size = 3
        chunks = list(enetm.chunk_list(data, chunk_size))
        expected_chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
        self.assertEqual(chunks, expected_chunks)

        # Case 2: Exact division
        data_exact = list(range(1, 10))  # [1, 2, ..., 9]
        chunk_size = 3
        chunks_exact = list(enetm.chunk_list(data_exact, chunk_size))
        expected_chunks_exact = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
        self.assertEqual(chunks_exact, expected_chunks_exact)

        # Case 3: Empty list
        data_empty = []
        chunks_empty = list(enetm.chunk_list(data_empty, chunk_size))
        self.assertEqual(chunks_empty, [])
Contributor Author

I don't see much value in testing this function by itself.
Validate the behavior of get_public_transit_stops instead; that will tell us whether the chunking is working correctly.

Contributor Author

If there are going to be a lot of new tests and they are specific to functions in match_stops.py, then we should probably move those to a new file called TestMatchStops or similar.

However, I don't think we really need all these new tests; we just need one that ensures the correct number of Overpass requests are made (e.g. 1 call for <=10 locations, 2 calls for <=20 locations, etc)
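One way to write such a test is to spy on the request function and count calls. The sketch below runs against a self-contained stand-in class; in the actual suite you would patch make_request_and_catch on emission's match_stops module instead:

```python
from unittest import mock

# Self-contained stand-in for the real match_stops module, so the sketch
# runs without the emission codebase or network access.
class FakeMatchStops:
    MAX_BBOXES_PER_QUERY = 10

    def make_request_and_catch(self, query):
        # Pretend Overpass returned no stops for this chunk.
        return []

    def get_stops_near(self, coords, radius):
        stops = []
        for i in range(0, len(coords), self.MAX_BBOXES_PER_QUERY):
            chunk = coords[i:i + self.MAX_BBOXES_PER_QUERY]
            stops.extend(self.make_request_and_catch(chunk))
        return stops

def count_overpass_calls(module, coords):
    # Wrap the request function in a spy and count how many times it fires.
    with mock.patch.object(module, "make_request_and_catch",
                           wraps=module.make_request_and_catch) as spy:
        module.get_stops_near(coords, 150.0)
        return spy.call_count
```

With MAX_BBOXES_PER_QUERY=10, 20 coordinates should produce exactly 2 requests and 21 should produce 3, which directly validates the chunking rather than the merged result length.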

JGreenlee and others added 4 commits February 20, 2025 15:17
Modified test overpass to skip IFF geofabrik_overpass_key is not configured (which is for local testing and should pass in gh)

Modified match_stops to add a limit to bboxes per query (MAX_BBOXES_PER_QUERY)

Modified get_query_for_bbox to have the default inline query but add option for future configuration.
Update emission/individual_tests/TestOverpass.py

Co-authored-by: Jack Greenlee <[email protected]>
Updated tests to one that ensures the correct number of Overpass requests are made (e.g. 1 call for <=10 locations, 2 calls for <=20 locations, etc)