✂️ Optimize BuiltinTimeSeries to reduce DB calls #1032

Open · wants to merge 4 commits into master from remove_invalid_query

Conversation

@JGreenlee (Contributor)

For `find_entries` and `find_entries_count` we split the queried keys by which collection we expect them to exist in. If there are no keys for one of orig_ts_db or analysis_ts_db, we end up making an `INVALID` query, `{'metadata.key': 'invalid'}`, to that DB.

Unsure why this was the case, I dug into the git blame and found 14aa503:

> Return a blank cursor instead of an empty iterator if there are no matches
>
> If there are no matches for a particular query, perform an invalid query and return the resulting empty cursor instead of an empty list iterator.
> This allows us to use cursor methods such as `.count()` on the result of `get_entries_for_timeseries`.

However, after a few years of code churn, I believe this is no longer necessary. The methods that call `_get_entries_for_timeseries` (namely `find_entries` and `find_entries_count`) do not use Cursor methods or return a Cursor type. `_get_entries_for_timeseries` now includes both the count and the Cursor in its return value; `find_entries_count` uses only the count, and `find_entries` returns an iterator created from the Cursor via `itertools.chain`, not the Cursor itself.

Although these queries will theoretically never return any matches, they are still queries and could involve some overhead (particularly on production, where we've seen the DB not using indices properly). If they are pointless, they should be eliminated, as that could reduce the total number of queries significantly (by almost half in many cases).
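To make this concrete, a rough sketch of the guard (simplified names and query shape of my own, not the actual BuiltinTimeSeries code): if no keys map to a collection, skip the query entirely instead of issuing the placeholder query.

```python
# Rough sketch only: names and query shape are assumptions, not the actual
# BuiltinTimeSeries code.
def _get_entries_for_timeseries(ts_db, key_list, extra_query):
    if key_list:
        query = dict(extra_query, **{"metadata.key": {"$in": key_list}})
        count = ts_db.count_documents(query)
        cursor = ts_db.find(query)
    else:
        # Previously: ts_db.find({'metadata.key': 'invalid'}) just to get an empty Cursor.
        # Now: skip the DB round-trip entirely and return an empty result.
        count, cursor = 0, iter([])
    return count, cursor
```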

@shankari (Contributor)

@JGreenlee this is great, but I think we can go a bit further.

See my comment here: e-mission/e-mission-docs#1109 (comment), in particular:

> Regardless, the count of the entries is a nice-to-have, and is extremely useful for debugging, but we don't even return it externally.
> I also looked through all uses of find_entries and we almost always wrap the values returned in the cursor in a list, so just converting the cursor to a list right here and getting the count should also work.

This should drop the DB queries by another 50%

For the record, we used to return the cursor because it allows us to iterate over entries one by one instead of reading the entire list into memory and introducing memory pressure. But right now, it is better for us to optimize for DB calls than for memory usage. And we wrap the cursor in a list or a dataframe anyway.
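A minimal sketch of what I mean (names here are approximate, not the final code): materialize the cursor into a list once, so `len()` gives the count without a separate `count_documents` query, and callers no longer need their own `list(...)` wrappers.

```python
import pymongo

def _query_as_list(ts_db, query):
    # Hypothetical helper for illustration; names and the sort key are assumptions
    ts_db_result = ts_db.find(query).sort("metadata.write_ts", pymongo.ASCENDING)
    ts_db_entries = list(ts_db_result)   # one find round-trip; entries held in memory
    ts_db_count = len(ts_db_entries)     # replaces a separate count_documents() query
    return ts_db_count, ts_db_entries
```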

I did take a look at all uses of find_entries, but didn't feel confident enough to make the change at 11pm the weekend before going on reduced hours. But now that we have more time, I think we can optimize that as well.

I am happy to merge this and address that as a separate optimization, or just wait and merge both together.

LMK what you think

@JGreenlee (Contributor, Author)

@shankari In the current implementation, does the construction of the iterator (`return itertools.chain(orig_ts_db_result, analysis_ts_db_result)`) cause all entries to be read into memory?

If so, then returning an iterator seems pointless.
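For context, constructing the chain itself is lazy, so the question is really whether anything downstream consumes it eagerly. A small standalone demo (not e-mission code) of that behavior:

```python
import itertools

def fake_cursor(name, n):
    # Stand-in for a pymongo Cursor; the print shows when entries are actually fetched
    for i in range(n):
        print(f"fetching {name}[{i}]")
        yield {"collection": name, "i": i}

chained = itertools.chain(fake_cursor("orig_ts_db", 2), fake_cursor("analysis_ts_db", 2))
print("chain constructed, nothing fetched yet")
first = next(chained)  # only now is the first entry pulled from the first cursor
print("got", first)
```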

@JGreenlee mentioned this pull request on Feb 26, 2025
@JGreenlee force-pushed the remove_invalid_query branch from 1f0f347 to 4a4c815 on February 27, 2025
@JGreenlee changed the title from "✂️ remove INVALID_QUERY as it is no longer necessary" to "✂️ Optimize BuiltinTimeSeries to reduce DB calls" on Feb 27, 2025
@JGreenlee (Contributor, Author) commented Feb 27, 2025

After 290a48d,

> have timeseries find_entries return entries as list, not iterable
>
> pymongo queries return a Cursor, which we had been converting to an iterator before returning it from `find_entries`. To get the count of entries, we were making a separate `count_documents` query.
> In most places where `find_entries` is called, we were consuming the iterator into a list, so we might as well do that upfront in `_get_entries_for_timeseries`. That way, we can use the `len` of the list rather than making a second `count_documents` query.
>
> The main changes are in builtin_timeseries; the rest of the changes are just removing the numerous `list(...)` conversions in places where `find_entries` is called, as well as renaming variables to reflect that `find_entries` returns a list, not an iterator

# This list() conversion causes the cursor to be consumed, which is memory-expensive,
# but faster as it allows us to get the count without having to make an additional
# "count_documents" DB call.
# In a future situation where memory is constrained or DB calls are less expensive,
# we could skip the list() conversion and return the cursor directly.
# #1032 (comment)

the number of DB calls appears to be noticeably reduced (but see disclaimer below)


Execution time (locally) is similar or slightly reduced


@JGreenlee (Contributor, Author) commented Feb 27, 2025

Disclaimer: the above counts were taken by monitoring the following commands to the DB: `find`, `aggregate`, `insert`, `update`, `delete`.
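(For reference, here is one way such per-command counts can be gathered with pymongo's monitoring API; this is a sketch of my own, not necessarily the exact instrumentation used for the charts here.)

```python
from collections import Counter
import pymongo
from pymongo import monitoring

class CommandCounter(monitoring.CommandListener):
    def __init__(self):
        self.counts = Counter()

    def started(self, event):
        # event.command_name is e.g. 'find', 'aggregate', 'createIndexes', 'listIndexes'
        self.counts[event.command_name] += 1

    def succeeded(self, event):
        pass

    def failed(self, event):
        pass

counter = CommandCounter()
client = pymongo.MongoClient(event_listeners=[counter])
# ... run a pipeline stage against this client, then inspect the counts ...
print(counter.counts.most_common())
```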

There are also `createIndexes` and `listIndexes` commands, which I had been excluding because I thought they were only called on database initialization; however, they appear to be called throughout the pipeline as indices are updated.

When we include those, our improvements appear a lot more modest, because `createIndexes` accounts for a very high proportion of the total commands and our optimizations only reduce `find` and `aggregate` commands.

@JGreenlee (Contributor, Author) commented Feb 27, 2025

> There are also createIndexes and listIndexes commands, which I had been excluding because I thought they were only called on database initialization; however they appear to be called throughout the pipeline as indices are updated

After investigation, the `createIndexes` commands are not just a byproduct of the indices being updated; we actually do call `create_index` repeatedly (every time `get_timeseries_db` or `get_analysis_timeseries_db` is called).

According to Mongo docs, duplicate calls to create_index are ignored: https://www.mongodb.com/docs/manual/reference/method/db.collection.createIndex/#recreating-an-existing-index

But I do wonder if they are introducing some overhead. We could easily prevent the duplicate calls with something like this:

diff --git a/emission/core/get_database.py b/emission/core/get_database.py
index b1585ba2..3b7dcc5f 100644
--- a/emission/core/get_database.py
+++ b/emission/core/get_database.py
@@ -193,16 +193,19 @@ def _migrate_sparse_to_dense(collection, geo_index):
         collection.drop_index(geo_index)
         print("Found sparse geosphere index, dropping %s, index list after=%s" % (geo_index, collection.index_information().keys()))
 
+TimeSeries = None
 def get_timeseries_db():
-    #current_db = MongoClient().Stage_database
-    TimeSeries = _get_current_db().Stage_timeseries
-    TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
-    TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
-    TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
-    TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
-    TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
-    _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
-    TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
+    global TimeSeries
+    if TimeSeries is None:
+        TimeSeries = _get_current_db().Stage_timeseries
+        print('called get_timeseries_db')
+        TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+        TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
+        TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
+        TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
+        TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
+        _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
+        TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
     return TimeSeries
 
 def get_timeseries_error_db():
@@ -210,14 +213,16 @@ def get_timeseries_error_db():
     TimeSeriesError = _get_current_db().Stage_timeseries_error
     return TimeSeriesError
 
+AnalysisTimeSeries = None
 def get_analysis_timeseries_db():
     """
     " Stores the results of the analysis performed on the raw timeseries
     """
-    #current_db = MongoClient().Stage_database
-    AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
-    AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
-    _create_analysis_result_indices(AnalysisTimeSeries)
+    global AnalysisTimeSeries
+    if AnalysisTimeSeries is None:
+        AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
+        AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+        _create_analysis_result_indices(AnalysisTimeSeries)
     return AnalysisTimeSeries
 
 def get_non_user_timeseries_db():

@JGreenlee (Contributor, Author) commented Feb 27, 2025

That does appear to have helped significantly, especially for CLEAN_RESAMPLING, CREATE_CONFIRMED_OBJECTS, and CREATE_COMPOSITE_OBJECTS.


@shankari (Contributor) commented Feb 27, 2025

> According to Mongo docs, duplicate calls to create_index are ignored

Right, this is the reason that I chose to call create_index every time. We can take a look at removing it and see what the impact on production is.

To test the impact on dev, I would suggest:

  1. making sure that you are on mongo 8.0.4
  2. limiting CPU (1-2 vCPU) and memory (2 GB)
  3. using a real-world snapshot in addition to a single run

@shankari (Contributor)

What is the segmentation on the bars? Is it a split per user?

@JGreenlee (Contributor, Author)

It's split by the location in the codebase where the DB call happened.
This was all test data; I'll try a real world snapshot next

@JGreenlee (Contributor, Author)

With:

  • openpath_prod_stm_community_feb_13.tar.gz
  • Docker set to 2 CPUs and 2 GB RAM
  • MongoDB 8.0.4
  • USE_HINTS=True

Running the pipeline on master and remove_invalid_query took a little over an hour total

MODE_INFERENCE and TRIP_SEGMENTATION remain the slowest stages by far under these conditions, and they do not benefit much from these changes (neither of these branches includes the optimizations from #1017 or #1026).

This chart is split by user.

[chart]

But excluding MODE_INFERENCE and TRIP_SEGMENTATION, we can see some performance gains on the other stages:

[chart]

I am going to leave ccebikes running overnight. I am not sure if it will finish because this one took over an hour and ccebikes has 90x as many trips

For collections where we add indexes, we do so inside the `get_*_db` function, resulting in thousands of duplicate calls to `create_index` (e-mission#1032 (comment)).
This change persists references to these collections in memory and ensures the indexes are only created the first time the `get_*_db` function is called.
@JGreenlee (Contributor, Author)

Tests failed because pip dependencies failed: `web-server-1 | CondaEnvException: Pip failed`

I don't know why, but I don't think it's related to these changes.

They are passing locally:

Ran 373 tests in 552.994s

OK

@shankari (Contributor) commented Mar 1, 2025

I wondered if the version of pymongo that we use (pymongo==4.3.3) is too old, but it seems like it is still a valid release
https://pypi.org/project/pymongo/4.3.3/

Retrying in case there was a temporary glitch while downloading

@shankari (Contributor) commented Mar 1, 2025

It failed again, so I tried installing it locally in an environment which didn't have it; that succeeded:

$ pip install pymongo==4.3.3
Collecting pymongo==4.3.3
  Downloading pymongo-4.3.3-cp310-cp310-macosx_10_9_universal2.whl.metadata (8.6 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo==4.3.3)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.3.3-cp310-cp310-macosx_10_9_universal2.whl (413 kB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.3.3

But checking the install on osx-ubuntu, I see an error

Collecting pymongo==4.3.3
  Downloading pymongo-4.3.3-cp39-cp39-macosx_10_9_universal2.whl (413 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 413.2/413.2 kB 16.9 MB/s eta 0:00:00

failed

CondaEnvException: Pip failed

About to activate environment
# conda environments:
#
base                     /Users/runner/miniconda-23.5.2
emission              *  /Users/runner/miniconda-23.5.2/envs/emission

This appears to be a 3.9 versus 3.10 issue. Note that we should also actually fail if the install fails!!

@shankari (Contributor) commented Mar 1, 2025

Although the subsequent failures were for pymongo, pip is actually able to install pymongo. The failure is while installing pykov (note that the pymongo download succeeds):

Collecting pymongo==4.3.3
  Downloading pymongo-4.3.3-cp39-cp39-macosx_10_9_universal2.whl (413 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 413.2/413.2 kB 959.5 kB/s eta 0:00:00

Pip subprocess error:
  Running command git clone --filter=blob:none --quiet https://github.com/JGreenlee/e-mission-common /private/var/folders/y5/cx3cfzrd2q116myv9ly86sw1rnlmdj/T/pip-req-build-8rjf6qd9
  Running command git checkout -q eb37a5c36fd72c30926770b8784c4479eb20520c
ERROR: Could not find a version that satisfies the requirement pykov==0.1 (from versions: none)
ERROR: No matching distribution found for pykov==0.1

failed

CondaEnvException: Pip failed

It looks like pykovi may be a replacement, but let's start with seeing whether pykov is even needed in the first place

shankari added a commit to shankari/e-mission-server that referenced this pull request Mar 2, 2025
We had originally introduced pykov as a dependency because it was used in the markov-based tour models. However, we never integrated the `user_model_josh` into the system, and while we integrated the tour model into common places and trips, we have since replaced it with the trip model.

$ grep -r pykov emission/
emission//analysis/modelling/user_model_josh/utility_model.py:import pykov as pk
emission//storage/decorations/common_place_queries.py:import pykov as pk

`pykov` is no longer available through pip
e-mission#1032 (comment)

So this can be a forcing function to finally delete it.

To fix the usages, we also delete the entire `user_model_josh` directory, since it has never been used, and comment out the one reference to it in `common_place_queries`.

That one reference is in `get_successor`, which is only used in `emission//incomplete_tests/TestCommonPlaceQueries.py`, so it should not cause any tests to fail.

Testing done:
- Automated tests pass (checked via CI)
@shankari (Contributor) commented Mar 2, 2025

@JGreenlee I fixed the pymongo failure (which was really a pykov issue).

The new error does seem to be related to this change, and in fact reminds me of 21bea50:

 planner returned error :: caused by :: hint provided does not correspond to an existing index, full error: {'ok': 0.0, 'errmsg': 'error processing query: ns=Stage_database.Stage_timeseries limit=250000Tree: $and\n    user_id $eq BinData(3, F5A05F15C0BF4E2488233A375F026D19)\n    metadata.write_ts $lte 1740881739.028104\n    metadata.key $in [ "manual/mode_confirm" "manual/purpose_confirm" "manual/replaced_mode" "manual/trip_user_input" "manual/place_user_input" "manual/trip_addition_input" "manual/place_addition_input"]\n    $not\n        invalid exists\nSort: { metadata.write_ts: 1 }\nProj: {}\n planner returned error :: caused by :: hint provided does not correspond to an existing index', 'code': 2, 'codeName': 'BadValue'}

You should look to see how you can incorporate something similar into your fix. If that is too complex, you may want to move the create_index optimization to a separate PR. It is not the main focus of this change, and it is unclear how much it helps, given:

> According to Mongo docs, duplicate calls to create_index are ignored: https://www.mongodb.com/docs/manual/reference/method/db.collection.createIndex/#recreating-an-existing-index
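A minimal sketch of one way to handle the "hint provided does not correspond to an existing index" error above (an illustration with assumed names, not necessarily what 21bea50 does): try the hinted query and fall back to an un-hinted one if the hint is rejected.

```python
from pymongo.errors import OperationFailure

def find_entries_with_hint(collection, query, hint=None):
    # Illustrative sketch only; not the actual fix in 21bea50 or in this PR
    try:
        cursor = collection.find(query)
        if hint is not None:
            cursor = cursor.hint(hint)
        # list() forces execution, so a rejected hint raises here rather than later
        return list(cursor)
    except OperationFailure as e:
        if hint is not None and e.code == 2:  # BadValue, as in the error above
            return list(collection.find(query))
        raise
```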

@shankari (Contributor) left a review comment

For the record, I have reviewed everything other than the index creation changes, and all of them are fine. I am tempted to just remove that change and merge this

Comment on lines +266 to +267
ts_db_entries = list(ts_db_result)
ts_db_count = len(ts_db_entries)
@shankari (Contributor)

We can move this into the if block above since the else already has a list and a 0 count
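i.e. something along these lines (the surrounding names, including `key_list`, `query`, and `sort_key`, are approximated from context rather than copied from the diff):

```python
if len(key_list) > 0:
    ts_db_result = ts_db.find(query).sort(sort_key, pymongo.ASCENDING)
    ts_db_entries = list(ts_db_result)
    ts_db_count = len(ts_db_entries)
else:
    # the else branch already has an (empty) list and a 0 count
    ts_db_entries = []
    ts_db_count = 0
```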

Labels: none yet · Project status: Ready for review by Shankari · 2 participants