✂️ Optimize BuiltinTimeSeries to reduce DB calls #1032
base: master
Conversation
@JGreenlee this is great, but I think we can go a bit further. See my comment here:
This should drop the DB queries by another 50%. For the record, we used to return the cursor because it allows us to iterate over entries one by one instead of reading the entire list into memory and introducing memory pressure. But right now, it is better for us to optimize for DB calls than for memory usage. And we wrap the cursor in a list or a dataframe anyway. I did take a look at all uses of find_entries, but didn't feel confident enough to make the change at 11pm the weekend before going on reduced hours. But now that we have more time, I think we can optimize that as well. I am happy to merge this and address that as a separate optimization, or just wait and merge both together. LMK what you think
@shankari In the current implementation, does the construction of the iterator (via `itertools.chain`) consume the Cursor anyway? If yes, then the iterator seems pointless.
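For reference, `itertools.chain` is lazy: constructing it does not consume its inputs; only iterating does. A minimal sketch, using plain generators as stand-ins for pymongo Cursors:

```python
import itertools

def fake_cursor(name, n):
    # stand-in for a pymongo Cursor; prints only when actually iterated
    for i in range(n):
        print(f"{name} yielded doc {i}")
        yield {"collection": name, "i": i}

combined = itertools.chain(fake_cursor("orig_ts_db", 2), fake_cursor("analysis_ts_db", 2))
print("chain constructed, nothing consumed yet")

entries = list(combined)   # consumption happens here, not at construction
print(len(entries))        # 4
```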
For `find_entries` and `find_entries_count` we split the queried keys by which collection we expect them to exist in. If there are no keys for one of orig_ts_db or analysis_ts_db, we end up making an `INVALID` query: `{'metadata.key': 'invalid'}` to that DB.

Unsure why this was the case, I dug into the blame and found e-mission@14aa503

> Return a blank cursor instead of an empty iterator if there are no matches
> If there are no matches for a particular query, perform an invalid query and return the resulting empty cursor instead of an empty list iterator.
> This allows us to use cursor methods such as `.count()` on the result of `get_entries_for_timeseries`.

However, after a few years of code churn I believe this is no longer necessary; the methods that call `_get_entries_for_timeseries`, which are `find_entries` and `find_entries_count`, do not use Cursor methods or return a Cursor type. `_get_entries_for_timeseries` now includes both the count and the Cursor in its return value. `find_entries_count` just uses the count, and `find_entries` returns an iterator created from the Cursor via `itertools.chain`, not the Cursor itself.

Although these queries will theoretically never return any matches, they are still queries and could involve some overhead (particularly on production, where we've seen the DB not using indices properly). If they are pointless, they should be eliminated, as that could reduce the total number of queries significantly (almost by half in many cases).
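A simplified sketch of the kind of guard that would eliminate the placeholder query; the real `_get_entries_for_timeseries` is a method with more parameters, so treat the signature below as illustrative only:

```python
def _get_entries_for_timeseries(tsdb, key_list, base_query):
    """Query one collection, or skip the round trip entirely when no keys target it."""
    if not key_list:
        # Previously this fell back to a {'metadata.key': 'invalid'} query that still
        # hit the DB; returning an empty result avoids the query altogether.
        return 0, []
    query = dict(base_query, **{"metadata.key": {"$in": key_list}})
    entries = list(tsdb.find(query))
    return len(entries), entries
```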
Force-pushed from 1f0f347 to 4a4c815
pymongo queries return a Cursor, which we had been converting to an iterator before returning it from `find_entries`. To get the count of entries, we were making a separate `count_documents` query. In most places where `find_entries` is called, we were consuming the iterator into a list, so we might as well do that upfront in `_get_entries_for_timeseries`. That way, we can use the `len` of the list rather than making a second `count_documents` query.

The main changes are in builtin_timeseries; the rest of the changes are just removing the numerous `list(...)` conversions in places where `find_entries` is called, as well as renaming variables to reflect that `find_entries` returns a list, not an iterator.

> # This list() conversion causes the cursor to be consumed, which is memory-expensive,
> # but faster as it allows us to get the count without having to make an additional
> # "count_documents" DB call.
> # In a future situation where memory is constrained or DB calls are less expensive,
> # we could skip the list() conversion and return the cursor directly.
> # e-mission#1032 (comment)
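In outline, the pattern described above looks roughly like this (simplified; the real methods live on `BuiltinTimeSeries` and take the usual time/geo query arguments):

```python
def _get_entries_for_timeseries(tsdb, query):
    # Consuming the cursor into a list is memory-expensive, but it lets us get the
    # count with len() instead of issuing a second count_documents() DB call.
    entries = list(tsdb.find(query))
    return len(entries), entries

def find_entries(orig_ts_db, analysis_ts_db, orig_query, analysis_query):
    orig_count, orig_entries = _get_entries_for_timeseries(orig_ts_db, orig_query)
    analysis_count, analysis_entries = _get_entries_for_timeseries(analysis_ts_db, analysis_query)
    # Callers now receive a plain list; previously this was
    # itertools.chain(orig_cursor, analysis_cursor).
    return orig_entries + analysis_entries

def find_entries_count(orig_ts_db, analysis_ts_db, orig_query, analysis_query):
    orig_count, _ = _get_entries_for_timeseries(orig_ts_db, orig_query)
    analysis_count, _ = _get_entries_for_timeseries(analysis_ts_db, analysis_query)
    return orig_count + analysis_count
```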
✂️ Optimize BuiltinTimeSeries to reduce DB calls
After 290a48d, the number of DB calls appears to be noticeably reduced (but see disclaimer below). [chart omitted] Execution time (locally) is similar or slightly reduced. [chart omitted]
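For anyone who wants to reproduce this kind of per-call accounting locally, pymongo's command monitoring API is one way to count server commands; this is just a sketch of the approach, not necessarily how the chart above was produced:

```python
from collections import Counter
from pymongo import MongoClient, monitoring

class CommandCounter(monitoring.CommandListener):
    """Counts every server command by name (find, aggregate, insert, createIndexes, ...)."""
    def __init__(self):
        self.counts = Counter()
    def started(self, event):
        self.counts[event.command_name] += 1
    def succeeded(self, event):
        pass
    def failed(self, event):
        pass

counter = CommandCounter()
monitoring.register(counter)   # register before creating the client
client = MongoClient()         # ... then run the pipeline or queries under test ...
print(counter.counts.most_common())
```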
After investigation, the duplicate calls come from the `get_*_db` functions, which call `create_index` every time they are invoked. According to Mongo docs, duplicate calls to `create_index` are ignored (the index is not recreated). But I do wonder if they are introducing some overhead. We could easily prevent the duplicate calls with something like this:

diff --git a/emission/core/get_database.py b/emission/core/get_database.py
index b1585ba2..3b7dcc5f 100644
--- a/emission/core/get_database.py
+++ b/emission/core/get_database.py
@@ -193,16 +193,19 @@ def _migrate_sparse_to_dense(collection, geo_index):
collection.drop_index(geo_index)
print("Found sparse geosphere index, dropping %s, index list after=%s" % (geo_index, collection.index_information().keys()))
+TimeSeries = None
def get_timeseries_db():
- #current_db = MongoClient().Stage_database
- TimeSeries = _get_current_db().Stage_timeseries
- TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
- TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
- TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
- TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
- TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
- _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
- TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
+ global TimeSeries
+ if TimeSeries is None:
+ TimeSeries = _get_current_db().Stage_timeseries
+ print('called get_timeseries_db')
+ TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+ TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
+ TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
+ TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
+ TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
+ _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
+ TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
return TimeSeries
def get_timeseries_error_db():
@@ -210,14 +213,16 @@ def get_timeseries_error_db():
TimeSeriesError = _get_current_db().Stage_timeseries_error
return TimeSeriesError
+AnalysisTimeSeries = None
def get_analysis_timeseries_db():
"""
" Stores the results of the analysis performed on the raw timeseries
"""
- #current_db = MongoClient().Stage_database
- AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
- AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
- _create_analysis_result_indices(AnalysisTimeSeries)
+ global AnalysisTimeSeries
+ if AnalysisTimeSeries is None:
+ AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
+ AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+ _create_analysis_result_indices(AnalysisTimeSeries)
return AnalysisTimeSeries
def get_non_user_timeseries_db():
Right, this is the reason that I chose to call […]. To test the impact on dev, I would suggest:
What is the segmentation on the bars? Is it a split per user?
It's split by the location in the codebase where the DB call happened.
With:
Running the pipeline on
This chart is split by user. [chart omitted] But excluding […]: [chart omitted] I am going to leave […]
For collections where we add indexes, we do so inside the get_*_db function, resulting in thousands of duplicate calls to create_index (e-mission#1032 (comment)). This change persists references to these collections in memory and ensures the indexes are only created the first time the get_*_db function is called.
Tests failed because pip dependencies failed: I don't know why, but I don't think it's related to these changes. They are passing locally:
I wondered if the version of pymongo that we use ([…]) might be involved. Retrying in case there was a temporary glitch while downloading.
It failed again, so trying to install it locally in an environment which didn't have it; succeeded
But checking the install on
This appears to be a Python 3.9 versus 3.10 issue. Note that we should also actually fail if the install fails!!
We had originally introduced pykov as a dependency because it was used by the markov-based tour models. However, we never integrated the `user_model_josh` into the system, and while we integrated the tour model into common places and trips, we have since replaced it with the trip model.

$ grep -r pykov emission/
emission//analysis/modelling/user_model_josh/utility_model.py:import pykov as pk
emission//storage/decorations/common_place_queries.py:import pykov as pk

`pykov` is no longer available through pip (e-mission#1032 (comment)), so this can be a forcing function to finally delete it. To fix the usages, we also delete the entire `user_model_josh` directory since it has never been used, and comment out the one reference to it in `common_place_queries`. That one reference is in `get_successor`, which is only used in `emission//incomplete_tests/TestCommonPlaceQueries.py`, so it should not cause any tests to fail.

Testing done:
- Automated tests pass (checked via CI)
…into remove_invalid_query
@JGreenlee I fixed the `pykov` dependency failure. The new error does seem to be related to this change, and in fact, reminds me of 21bea50
You should look to see how you can incorporate something similar into your fix. If that is too complex, you may want to move the
For the record, I have reviewed everything other than the index creation changes, and all of them are fine. I am tempted to just remove that change and merge this
ts_db_entries = list(ts_db_result)
ts_db_count = len(ts_db_entries)
We can move this into the `if` block above, since the `else` already has a list and a 0 count.
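A sketch of the suggested restructuring, with variable names guessed from the snippet above:

```python
if ts_db_keys:
    # only query when some keys are destined for this collection
    ts_db_result = ts_db.find(ts_db_query)
    ts_db_entries = list(ts_db_result)
    ts_db_count = len(ts_db_entries)
else:
    # the else branch already provides an empty list and a zero count
    ts_db_entries = []
    ts_db_count = 0
```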