Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add retry logic to datastore.get_zipfile_resource. #3658

Merged
merged 5 commits into from
Jun 13, 2024

Conversation

jdangerx
Copy link
Member

@jdangerx jdangerx commented May 28, 2024

Overview

Closes #3657.

What problem does this address?

One more step towards figuring out that weird BadZipFile error that shows up sporadically.

Testing

How did you make sure this worked? How can a reviewer verify this?

Bunch of unit tests!

Suppose I could run the pytest-coverage, too, but it might not run into those BadZipFile errors that seem to only pop up very sporadically.

To-do list

Preview Give feedback

@jdangerx jdangerx force-pushed the 3657-add-zipfile-retry branch from 4e2d311 to 1b1bc48 Compare June 10, 2024 14:44
@@ -0,0 +1,37 @@
"""A place for helper functions that *don't* import PUDL and thus can be imported anywhere."""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because I started running into circular import problems.

However, most of the functions in pudl.helpers don't import PUDL. Only these four do:

  • create_datasette_metadata_yaml
  • check_tables_have_metadata
  • expand_timeseries
  • convert_cols_dtypes

Should we move those to a separate file instead of making a fresh file for all the functions where we definitely won't run into circular imports?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the import error coming up? Is it possible that the call to import pudl plus the multiple from calls in pudl.workspace.datastore.py is the source of the import error and could be fixed through removing the first call, rather than splitting helpers into two not clearly defined separate files?

If not, my preference would be to do what you suggest (move these four into their own file for helpers that rely on other PUDL functions), rather than start a new file with a single function and expect us to migrate over time.

Copy link
Member Author

@jdangerx jdangerx Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh. The error was:

> pytest test/unit/helpers_test.py                                 
ImportError while loading conftest '/Users/dazhong-catalyst/work/pudl/test/conftest.py'.                                        
test/conftest.py:15: in <module>                                                                                                
    import pudl                                                                                                                 
src/pudl/__init__.py:5: in <module>                                                                                             
    from . import (                                                                                                             
src/pudl/analysis/__init__.py:9: in <module>                                                                                    
    from . import (                                                                                                             
src/pudl/analysis/allocate_gen_fuel.py:145: in <module>                                                                         
    from pudl.metadata.fields import apply_pudl_dtypes          
src/pudl/metadata/__init__.py:3: in <module>                    
    from . import (             
src/pudl/metadata/classes.py:63: in <module>                    
    from pudl.workspace.datastore import Datastore, ZenodoDoi   
src/pudl/workspace/__init__.py:11: in <module>                  
    from . import datastore, resource_cache, setup              
src/pudl/workspace/datastore.py:26: in <module>                 
    from pudl.helpers import retry                              
src/pudl/helpers.py:34: in <module>                             
    from pudl.metadata.classes import DatasetteMetadata         
E   ImportError: cannot import name 'DatasetteMetadata' from partially initialized module 'pudl.metadata.classes' (most likely d
ue to a circular import) (/Users/dazhong-catalyst/work/pudl/src/pudl/metadata/classes.py)

And for some reason this diff fixed it - currently reading through the import system docs to try to figure out what the heck was going on.

diff --git a/src/pudl/helpers.py b/src/pudl/helpers.py
index 87d9fc41a..f97a27c10 100644
--- a/src/pudl/helpers.py
+++ b/src/pudl/helpers.py
@@ -30,7 +31,7 @@ from dagster import AssetKey, AssetsDefinition, AssetSelection, SourceAsset
 from pandas._libs.missing import NAType
 
 import pudl.logging_helpers
-from pudl.metadata.classes import DatasetteMetadata
+import pudl.metadata.classes
 from pudl.metadata.fields import apply_pudl_dtypes, get_pudl_dtypes
 from pudl.workspace.setup import PudlPaths
 
@@ -2029,7 +2030,7 @@ def create_datasette_metadata_yaml() -> str:
         derived from their datapackage.json as YAML.
     """
     pudl_output = PudlPaths().pudl_output
-    metadata = DatasetteMetadata.from_data_source_ids(pudl_output)
+    metadata = pudl.metadata.classes.DatasetteMetadata.from_data_source_ids(pudl_output)
     return metadata.to_yaml()

Copy link
Member Author

@jdangerx jdangerx Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess my crude mental model is that:

At this specific import time, DatasetteMetadata hasn't been defined in pudl.metadata.classes yet, since pudl.metadata.classes imports pudl.workspace.datastore which imports pudl.helpers. But we can import the pudl.metadata.classes module directly since the import statement doesn't actually check that DatasetteMetadata is defined in the module.

The reason that we can import attributes from pudl.metadata.fields and pudl.workspace.setup is because they don't circularly depend on pudl.helpers.

I think long term we should probably aim to have "truly independent helpers" and "helpers that actually need the rest of PUDL" as separate files, to avoid but (a) I'm not sure what to name the two files and (b) this works - so I'm happy to punt on that until the next circular import happens >_<

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Glad you managed to figure it out, this seems like the much preferable solution.

test/unit/workspace/datastore_test.py Outdated Show resolved Hide resolved
test/unit/workspace/datastore_test.py Outdated Show resolved Hide resolved
@jdangerx jdangerx marked this pull request as ready for review June 10, 2024 16:57
@jdangerx jdangerx requested review from a team and zschira and removed request for a team June 10, 2024 16:57
@e-belfer e-belfer self-requested a review June 11, 2024 13:10
Copy link
Member

@e-belfer e-belfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and a request re: the reorganization of the pudl.helpers.py file. Otherwise, all looks good on my end.

@@ -0,0 +1,37 @@
"""A place for helper functions that *don't* import PUDL and thus can be imported anywhere."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the import error coming up? Is it possible that the call to import pudl plus the multiple from calls in pudl.workspace.datastore.py is the source of the import error and could be fixed through removing the first call, rather than splitting helpers into two not clearly defined separate files?

If not, my preference would be to do what you suggest (move these four into their own file for helpers that rely on other PUDL functions), rather than start a new file with a single function and expect us to migrate over time.

assert retry(func=func, retry_on=(RuntimeError,)) == 1

assert sleep_mock.call_count == 3
sleep_mock.assert_has_calls([mocker.call(2**x) for x in range(3)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will MagicMock add a 7 second delay to our unit tests here? Doesn't seem to be the case when I run unit tests, but maybe I'm missing something. If so, we could reduce the number of RuntimeErrror calls by one to gain a bit of speed back here if this becomes a noticeable lag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! We're bypassing the delay by replacing time.sleep with just an empty MagicMock() - see that with mocker.patch(...) line above.

@jdangerx jdangerx force-pushed the 3657-add-zipfile-retry branch from 645b68a to ebae71d Compare June 11, 2024 22:11
@jdangerx jdangerx requested a review from e-belfer June 11, 2024 22:12
Copy link
Member

@e-belfer e-belfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much preferable solution and a helpful fix that should help make the nightly builds a hair more robust. All works as expected locally.

@jdangerx jdangerx force-pushed the 3657-add-zipfile-retry branch from ebae71d to 4c4ff39 Compare June 13, 2024 14:12
@jdangerx jdangerx enabled auto-merge June 13, 2024 14:12
@jdangerx jdangerx added this pull request to the merge queue Jun 13, 2024
Merged via the queue into main with commit 2c0e8c4 Jun 13, 2024
14 checks passed
@jdangerx jdangerx deleted the 3657-add-zipfile-retry branch June 13, 2024 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Nightly Build Failure 2024-05-28
2 participants