✂️ Optimize BuiltinTimeSeries to reduce DB calls #1032
base: master
Conversation
@JGreenlee this is great, but I think we can go a bit further. See my comment here:
This should drop the DB queries by another 50%. For the record, we used to return the cursor because it allows us to iterate over entries one by one instead of reading the entire list into memory and introducing memory pressure. But right now, it is better for us to optimize for DB calls than for memory usage. And we wrap the cursor in a list or a dataframe anyway. I did take a look at all uses of find_entries, but didn't feel confident enough to make the change at 11pm the weekend before going on reduced hours. But now that we have more time, I think we can optimize that as well. I am happy to merge this and address that as a separate optimization, or just wait and merge both together. LMK what you think
@shankari In the current implementation, does the construction of the iterator (via `itertools.chain`) consume the Cursor anyway? If yes, then the iterator seems pointless.
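For reference, `itertools.chain` is lazy: constructing it does not consume its inputs; only iterating does. A minimal sketch, using plain generators as stand-ins for pymongo Cursors:

```python
import itertools

def fake_cursor(name, n):
    # stand-in for a pymongo Cursor; prints only when actually iterated
    for i in range(n):
        print(f"{name} yielded doc {i}")
        yield {"collection": name, "i": i}

combined = itertools.chain(fake_cursor("orig_ts_db", 2), fake_cursor("analysis_ts_db", 2))
print("chain constructed, nothing consumed yet")

entries = list(combined)   # consumption happens here, not at construction
print(len(entries))        # 4
```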
For `find_entries` and `find_entries_count` we split the queried keys by which collection we expect them to exist in. If there are no keys for one of orig_ts_db or analysis_ts_db, we end up making an `INVALID` query: `{'metadata.key': 'invalid'}` to that DB.

Unsure why this was the case, I dug into the blame and found e-mission@14aa503

> Return a blank cursor instead of an empty iterator if there are no matches
> If there are no matches for a particular query, perform an invalid query and return the resulting empty cursor instead of an empty list iterator.
> This allows us to use cursor methods such as `.count()` on the result of `get_entries_for_timeseries`.

However, after a few years of code churn I believe this is no longer necessary; the methods that call `_get_entries_for_timeseries`, which are `find_entries` and `find_entries_count`, do not use Cursor methods or return a Cursor type. `_get_entries_for_timeseries` now includes both the count and the Cursor in its return value. `find_entries_count` just uses the count, and `find_entries` returns an iterator created from the Cursor via `itertools.chain`, not the Cursor itself.

Although these queries will theoretically never return any matches, they are still queries and could involve some overhead (particularly on production, where we've seen the DB not using indices properly). If they are pointless, they should be eliminated, as that could reduce the total number of queries significantly (almost by half in many cases).
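A simplified sketch of the kind of guard that would eliminate the placeholder query; the real `_get_entries_for_timeseries` is a method with more parameters, so treat the signature below as illustrative only:

```python
def _get_entries_for_timeseries(tsdb, key_list, base_query):
    """Query one collection, or skip the round trip entirely when no keys target it."""
    if not key_list:
        # Previously this fell back to a {'metadata.key': 'invalid'} query that still
        # hit the DB; returning an empty result avoids the query altogether.
        return 0, []
    query = dict(base_query, **{"metadata.key": {"$in": key_list}})
    entries = list(tsdb.find(query))
    return len(entries), entries
```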
Force-pushed from 1f0f347 to 4a4c815
pymongo queries return a Cursor, which we had been converting to an iterator before returning it from `find_entries`. To get the count of entries, we were making a separate `count_documents` query. In most places where `find_entries` is called, we were consuming the iterator into a list, so we might as well do that upfront in `_get_entries_for_timeseries`. That way, we can use the `len` of the list rather than making a second `count_documents` query.

The main changes are in builtin_timeseries; the rest of the changes are just removing the numerous `list(...)` conversions in places where `find_entries` is called, as well as renaming variables to reflect that `find_entries` returns a list, not an iterator.

> # This list() conversion causes the cursor to be consumed, which is memory-expensive,
> # but faster as it allows us to get the count without having to make an additional
> # "count_documents" DB call.
> # In a future situation where memory is constrained or DB calls are less expensive,
> # we could skip the list() conversion and return the cursor directly.
> # e-mission#1032 (comment)
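In outline, the pattern described above looks roughly like this (simplified; the real methods live on `BuiltinTimeSeries` and take the usual time/geo query arguments):

```python
def _get_entries_for_timeseries(tsdb, query):
    # Consuming the cursor into a list is memory-expensive, but it lets us get the
    # count with len() instead of issuing a second count_documents() DB call.
    entries = list(tsdb.find(query))
    return len(entries), entries

def find_entries(orig_ts_db, analysis_ts_db, orig_query, analysis_query):
    orig_count, orig_entries = _get_entries_for_timeseries(orig_ts_db, orig_query)
    analysis_count, analysis_entries = _get_entries_for_timeseries(analysis_ts_db, analysis_query)
    # Callers now receive a plain list; previously this was
    # itertools.chain(orig_cursor, analysis_cursor).
    return orig_entries + analysis_entries

def find_entries_count(orig_ts_db, analysis_ts_db, orig_query, analysis_query):
    orig_count, _ = _get_entries_for_timeseries(orig_ts_db, orig_query)
    analysis_count, _ = _get_entries_for_timeseries(analysis_ts_db, analysis_query)
    return orig_count + analysis_count
```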
✂️ Optimize BuiltinTimeSeries to reduce DB calls
After 290a48d, the number of DB calls appears to be noticeably reduced (but see disclaimer below). [chart omitted] Execution time (locally) is similar or slightly reduced. [chart omitted]
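For anyone who wants to reproduce this kind of per-call accounting locally, pymongo's command monitoring API is one way to count server commands; this is just a sketch of the approach, not necessarily how the chart above was produced:

```python
from collections import Counter
from pymongo import MongoClient, monitoring

class CommandCounter(monitoring.CommandListener):
    """Counts every server command by name (find, aggregate, insert, createIndexes, ...)."""
    def __init__(self):
        self.counts = Counter()
    def started(self, event):
        self.counts[event.command_name] += 1
    def succeeded(self, event):
        pass
    def failed(self, event):
        pass

counter = CommandCounter()
monitoring.register(counter)   # register before creating the client
client = MongoClient()         # ... then run the pipeline or queries under test ...
print(counter.counts.most_common())
```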
After investigation, the duplicate calls come from the `get_*_db` functions, which call `create_index` every time they are invoked. According to Mongo docs, duplicate calls to `create_index` are ignored (the index is not recreated). But I do wonder if they are introducing some overhead. We could easily prevent the duplicate calls with something like this:

diff --git a/emission/core/get_database.py b/emission/core/get_database.py
index b1585ba2..3b7dcc5f 100644
--- a/emission/core/get_database.py
+++ b/emission/core/get_database.py
@@ -193,16 +193,19 @@ def _migrate_sparse_to_dense(collection, geo_index):
collection.drop_index(geo_index)
print("Found sparse geosphere index, dropping %s, index list after=%s" % (geo_index, collection.index_information().keys()))
+TimeSeries = None
def get_timeseries_db():
- #current_db = MongoClient().Stage_database
- TimeSeries = _get_current_db().Stage_timeseries
- TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
- TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
- TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
- TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
- TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
- _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
- TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
+ global TimeSeries
+ if TimeSeries is None:
+ TimeSeries = _get_current_db().Stage_timeseries
+ print('called get_timeseries_db')
+ TimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+ TimeSeries.create_index([("metadata.key", pymongo.ASCENDING)])
+ TimeSeries.create_index([("metadata.write_ts", pymongo.DESCENDING)])
+ TimeSeries.create_index([("data.ts", pymongo.DESCENDING)], sparse=True)
+ TimeSeries.create_index([("data.start_ts", pymongo.DESCENDING)], sparse=True)
+ _migrate_sparse_to_dense(TimeSeries, "data.loc_2dsphere")
+ TimeSeries.create_index([("data.loc", pymongo.GEOSPHERE)])
return TimeSeries
def get_timeseries_error_db():
@@ -210,14 +213,16 @@ def get_timeseries_error_db():
TimeSeriesError = _get_current_db().Stage_timeseries_error
return TimeSeriesError
+AnalysisTimeSeries = None
def get_analysis_timeseries_db():
"""
" Stores the results of the analysis performed on the raw timeseries
"""
- #current_db = MongoClient().Stage_database
- AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
- AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
- _create_analysis_result_indices(AnalysisTimeSeries)
+ global AnalysisTimeSeries
+ if AnalysisTimeSeries is None:
+ AnalysisTimeSeries = _get_current_db().Stage_analysis_timeseries
+ AnalysisTimeSeries.create_index([("user_id", pymongo.ASCENDING)])
+ _create_analysis_result_indices(AnalysisTimeSeries)
return AnalysisTimeSeries
def get_non_user_timeseries_db():
Right, this is the reason that I chose to call […]. To test the impact on dev, I would suggest:
What is the segmentation on the bars? Is it a split per user?
It's split by the location in the codebase where the DB call happened.
With:
Running the pipeline on
This chart is split by user. [chart omitted] But excluding […]: [chart omitted] I am going to leave […]
For collections where we add indexes, we do so inside the get_*_db function, resulting in thousands of duplicate calls to create_index (e-mission#1032 (comment)). This change persists references to these collections in memory and ensures the indexes are only created the first time the get_*_db function is called.
Tests failed because pip dependencies failed: I don't know why, but I don't think it's related to these changes. They are passing locally:
I wondered if the version of pymongo that we use ([…]) might be involved. Retrying in case there was a temporary glitch while downloading.
It failed again, so trying to install it locally in an environment which didn't have it; succeeded
But checking the install on
This appears to be a Python 3.9 versus 3.10 issue. Note that we should also actually fail if the install fails!!
We had originally introduced pykov as a dependency because it was used by the markov-based tour models. However, we never integrated the `user_model_josh` into the system, and while we integrated the tour model into common places and trips, we have since replaced it with the trip model.

$ grep -r pykov emission/
emission//analysis/modelling/user_model_josh/utility_model.py:import pykov as pk
emission//storage/decorations/common_place_queries.py:import pykov as pk

`pykov` is no longer available through pip (e-mission#1032 (comment)), so this can be a forcing function to finally delete it. To fix the usages, we also delete the entire `user_model_josh` directory since it has never been used, and comment out the one reference to it in `common_place_queries`. That one reference is in `get_successor`, which is only used in `emission//incomplete_tests/TestCommonPlaceQueries.py`, so it should not cause any tests to fail.

Testing done:
- Automated tests pass (checked via CI)
…into remove_invalid_query
@JGreenlee I fixed the `pykov` dependency failure. The new error does seem to be related to this change, and in fact, reminds me of 21bea50
You should look to see how you can incorporate something similar into your fix. If that is too complex, you may want to move the
For the record, I have reviewed everything other than the index creation changes, and all of them are fine. I am tempted to just remove that change and merge this
ts_db_entries = list(ts_db_result)
ts_db_count = len(ts_db_entries)
We can move this into the `if` block above, since the `else` already has a list and a 0 count.
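A sketch of the suggested restructuring, with variable names guessed from the snippet above:

```python
if ts_db_keys:
    # only query when some keys are destined for this collection
    ts_db_result = ts_db.find(ts_db_query)
    ts_db_entries = list(ts_db_result)
    ts_db_count = len(ts_db_entries)
else:
    # the else branch already provides an empty list and a zero count
    ts_db_entries = []
    ts_db_count = 0
```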