Logging overhaul #137
base: main
Conversation
What a great body of work you've produced here!
The updates to structured logging, the addition of the SIGTERM handler, and the improvements to the notebooks for querying logs are all significant steps forward in making the system more robust and easier to debug. The consistent use of structured logs will undoubtedly enhance traceability and provide more actionable insights from our Cloud Run deployments.
That said, I’ve left a few comments regarding areas where the changes could be refined further. These include avoiding global variables to improve concurrency safety, clarifying log messages for better debugging, ensuring variable naming consistency in logs, and decoupling the model logic from specific scenes. Additionally, there are a couple of places where performance optimizations (e.g., vectorization) and better modularization (e.g., moving shared utilities to a common folder) could further improve the codebase.
Overall, this is a well-thought-out PR, and with a few adjustments, it will be an excellent improvement to our workflow. Great job, and thank you! 🙌
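For readers unfamiliar with the structured-logging pattern being reviewed here, a minimal sketch of the idea: Cloud Run treats a single JSON object printed to stdout as one log entry, mapping its fields into jsonPayload. The function name and the `scene_id` field below are illustrative, not the PR's actual API.

```python
import json
import sys

def log_structured(severity, message, **fields):
    """Emit one JSON line to stdout; Cloud Run ingests it as a jsonPayload entry."""
    entry = {"severity": severity, "message": message, **fields}
    print(json.dumps(entry), file=sys.stdout, flush=True)

# Example: a queryable, structured log line with scene context attached.
log_structured("INFO", "inference started", scene_id="S1A_EXAMPLE_SCENE")
```

Because every field lands in jsonPayload, notebooks can then filter logs with queries like `jsonPayload.scene_id="..."` rather than grepping free-text messages.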
cerulean_cloud/models.py
Outdated
@@ -42,7 +41,7 @@ class BaseModel:
     making predictions, stacking results, and stitching outputs.
     """

-    def __init__(self, model_dict=None, model_path_local=None):
+    def __init__(self, model_dict=None, model_path_local=None, scene_id=None):
Similar to earlier feedback: the scene_id argument in the model constructor tightly couples the model to a single scene. Is there a way to pass the scene identifier elsewhere, perhaps through the inference call or a separate context object, so the model stays scene-agnostic?
For instance via a new property added to the InferenceInput which is passed into model.predict()?
It can be done that way if you think that's better; it's just the difference between changing the input args in one place versus changing them in several others, but there's no reason it can't be changed. Is there a reason for keeping the model scene-agnostic? One model class will be used for one scene, so I thought that would be ok, but I might not be following best practices 😅
You're absolutely correct that "One model class will be used for one scene." The core of the suggestion is more about maintaining a clean separation of responsibilities in the code. The model itself doesn't need to know about the scene_id, since it doesn't impact its internal operation. Instead, any issues in the model's operation can bubble up to a higher-level handler that has access to contextual metadata like scene_id, timestamps, etc., and can enrich the error messages accordingly.
I'm NOT super knowledgeable in this area, but GPT suggests this approach may align with common software design principles like separation of concerns (https://chatgpt.com/share/6793e0b8-8fbc-800d-b857-dc9049c90737). That said, I completely understand your point about the tradeoff between consolidating changes in one place versus propagating them across multiple areas.
From what I've seen, two solutions could work here:
- Add scene_id to the InferenceInput object since it directly relates to the context of inference.
- Implement exception handling in a hierarchical structure, where higher-level components that are aware of the context (like scene_id) enrich error information, without requiring the model itself to track that.
Given that this PR is focused on logging and error handling, option #2 might better align with the current goals. But ultimately, I'm happy to hear your perspective if you think option #1 simplifies things more for this project!
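Option #2 can be sketched as a thin wrapper that owns the contextual metadata, so the model never sees `scene_id`. The wrapper and model names below are hypothetical, not the repo's actual classes:

```python
class DummyModel:
    """Stand-in for the real, scene-agnostic model class."""
    def predict(self, payload):
        if payload is None:
            raise ValueError("empty payload")
        return "ok"

def run_inference(model, payload, scene_id):
    """Higher-level handler: knows the scene context, enriches errors with it."""
    try:
        return model.predict(payload)
    except Exception as err:
        # The model raised a plain error; attach scene_id here, not inside the model.
        raise RuntimeError(f"inference failed for scene_id={scene_id}: {err}") from err
```

The model stays reusable and testable in isolation, while every error that escapes `run_inference` already carries the scene identifier for the structured logs.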
Ok, I think I have fixed this - instead of passing to the model object, I am passing to the methods, like predict, postprocess, etc.
I do want to flag that I updated the PredictPayload schema to include a scene_id string, and wanted to double check with you that this is a good way to log the scene_id on model.predict (invoked by CloudRunInferenceClient, which hits cloud_run_offset_tiles/handler.py I believe).
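For concreteness, the shape of that change might look like the sketch below. The field names other than `scene_id` are placeholders, since the real PredictPayload schema lives in the repo (and is likely a pydantic model rather than the stdlib dataclass used here):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PredictPayload:
    # Illustrative fields; the actual schema in the repo differs.
    inf_stack: list = field(default_factory=list)
    scene_id: Optional[str] = None

def handler(payload: PredictPayload) -> int:
    # The handler can now log scene_id alongside each prediction batch.
    print(f"predicting {len(payload.inf_stack)} tiles for scene {payload.scene_id}")
    return len(payload.inf_stack)
```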
Would you consider moving utils.py into a common/ folder or another shared module? Centralizing utility code can help keep the project organized and easier to maintain in the future.
This feels like it might be an opportunity to start moving in the direction recommended by Anthony: #134
Please check out his primary suggestions, and see if any of them make sense for where this file could live despite being distinct PRs.
I do like the refactoring suggestion, but I feel that would warrant its own branch. Maybe let's keep a note to do this next, which I'm happy to take on.
Thanks for thinking through this.
The setup as implemented here feels like it breaks the design intent established by devseed, where folders are meant to fully encapsulate the GCP assets they support—except for shared files, which are typically stored in a common or higher-level location.
Since this file is being created from scratch in this PR, would it be possible to simply place it directly in a shared space, like one directory up or in a common/ folder? This could help align with the existing structure and make things more maintainable in the long run.
Let me know what you think, and I’m happy to discuss further if needed!
Ya, I'm open to doing this however you feel is best. If you think it's ok, I'll just move it up one directory (cerulean_cloud/utils.py). Let me know what you think before I move it because I just have to update in several places.
(note to self: must update each file utils is called in and in the github actions yaml once this is resolved)
That sounds appropriate.
app = FastAPI(title="Cloud Run orchestrator", dependencies=[Depends(api_key_auth)])
# Allow CORS for local debugging
app.add_middleware(CORSMiddleware, allow_origins=["*"])

landmask_gdf = None

# Set current step globally
In Cloud Run deployments, global variables can cause tricky concurrency issues and may not behave as expected when multiple instances are scaled up or down. Could you please refactor the code to avoid using global variables? This will help keep the system more robust and easier to maintain.
A few points here
- current_step isn't used (I was messing around and then never removed this) so I will take this out.
- However, keeping scene_id as a global variable is needed for retaining the scene_id when handling sigterms. As far as I can gather, there's really no other way to be able to track the scene_id in the sigterm_handler, and this helps understand which scenes timed out (and potentially other sigterms down the line). The only time this variable is used is in the context of sigterm, but I see how it might get messed up if we have a variable called scene_id elsewhere. Maybe we can name global variables in some convention that would not be confused with other variables (i.e. ALL_CAPS, ___triple_underscore, ...)?
Thanks for sharing the context and explaining your reasoning. I appreciate the thought you've put into handling sigterm effectively!
That said, my main concern isn't with current_step or capitalization conventions, but rather with the use of global variables in concurrency-dependent environments like Cloud Run. When Cloud Run concurrency is set to more than 1, a single container instance can process multiple requests in parallel. If you store scene_id in a module-level global variable, any concurrent request that updates it will step on the others' toes.
It sounds like retaining scene_id for sigterm_handler is critical, and I can see why you're leaning toward globals here. Have you had a chance to try using ContextVar for this? It's designed for managing variables with different states in concurrent workflows and could provide a cleaner alternative. Here's an example that might be helpful: https://chatgpt.com/share/6793e7f8-cf98-800d-8fc2-83e5fbb7c58a.
Let me know your thoughts; happy to brainstorm solutions if ContextVar doesn't meet your needs!
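A minimal sketch of the ContextVar approach, assuming the orchestrator's request handlers all run on one asyncio event loop (the function and variable names here are illustrative, not the repo's actual code). One caveat worth noting: the signal handler executes in the main thread's context, so this pairs most cleanly with async handlers on that same loop.

```python
import signal
import sys
from contextvars import ContextVar

# Each in-flight request sees its own value, even with Cloud Run concurrency > 1.
current_scene_id: ContextVar[str] = ContextVar("current_scene_id", default="unknown")

def sigterm_handler(signum, frame):
    # On shutdown, report which scene was being processed when SIGTERM arrived.
    print(f"SIGTERM received while processing scene {current_scene_id.get()}")
    sys.exit(0)

def install_sigterm_handler():
    signal.signal(signal.SIGTERM, sigterm_handler)

def handle_request(scene_id: str):
    token = current_scene_id.set(scene_id)
    try:
        pass  # ... run orchestration for this scene ...
    finally:
        current_scene_id.reset(token)  # restore the previous context value
```

Unlike a module-level global, two overlapping requests can each `set()` their own `scene_id` without clobbering one another, and no naming convention (ALL_CAPS, underscores) is needed to signal "danger, shared state."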
… scene_id deep into every single function?
I think that maybe all of this file should instead be moved to be inside utils.py?
That way any time something fundamental changes in the way the structured logs are built, or deciphered, the changes all happen next to each other?
Then any notebooks that need to use these will import whatever functions of these it needs from the utils.py file.
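The "build and decipher in one place" idea could look roughly like this pair of inverse helpers living side by side in utils.py. These names are a sketch, not the PR's actual functions:

```python
import json

def build_log(severity, message, **context):
    """Construct a structured log line (the format Cloud Run parses into jsonPayload)."""
    return json.dumps({"severity": severity, "message": message, **context})

def parse_log(line):
    """Inverse of build_log, for notebooks that decipher exported log lines."""
    entry = json.loads(line)
    context = {k: v for k, v in entry.items() if k not in ("severity", "message")}
    return entry["severity"], entry["message"], context
```

If the log format ever changes, both directions get updated in the same file, and notebooks just `from cerulean_cloud.utils import parse_log`.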
cerulean_cloud/utils.py
Outdated
I'm thinking we should rename this file to be "structured_logger.py", and eventually it will sit INSIDE a directory called utils/ or common/.
What do you think?
@sstill88 I've tried illustrating a potential solution using ContextVars in this commit [ed1c774]. This is explicitly concurrency-safe, so I am inclined to use it over global variables. Because this alternative solution requires a […] Please check it out, and let me know what you think. If you like it, can you try it on your machine to make sure the behavior is consistent with what you expect?
service_name = "cerulean-cloud-test-cr-orchestrator"

# Update to current revision name
revision_name = "cerulean-cloud-test-cr-orchestrator-00072-9x7"
Would be REALLY nice if you could make this automatically grab the latest, if not specified.
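One way to default to the latest revision: fetch the revision names (e.g. via `gcloud run revisions list --service=...`) and pick the newest by the numeric suffix. The selection logic below assumes Cloud Run's usual `<service>-NNNNN-xxx` naming; how the list is fetched is left out:

```python
def latest_revision(revision_names):
    """Pick the newest revision, assuming Cloud Run's <service>-NNNNN-xxx naming."""
    # rsplit("-", 2) yields [service, "NNNNN", "xxx"]; compare the numeric part.
    return max(revision_names, key=lambda name: int(name.rsplit("-", 2)[-2]))
```

Then the notebook could do `revision_name = revision_name or latest_revision(names)` so a hard-coded value is only needed when pinning a specific deploy.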
# set start time to 1 day ago
utc = datetime.timezone.utc
start_time = datetime.datetime.now(utc) - datetime.timedelta(days=1)
It's not clear what start time is, and what window of time this will capture logs from...
Also, start_time is a variable name that is used across the code to indicate when the picture was taken, so it would be nicer if this didn't clash.
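A small sketch of how the notebook cell could address both points: compute an explicit query window with names that don't clash with the codebase's `start_time` (the scene-capture timestamp). The function name is illustrative:

```python
import datetime

def log_query_window(days_back=1, now=None):
    """Return (query_start_time, query_end_time) bounding the log search window."""
    utc = datetime.timezone.utc
    query_end_time = now or datetime.datetime.now(utc)
    query_start_time = query_end_time - datetime.timedelta(days=days_back)
    return query_start_time, query_end_time
```

Naming the pair `query_start_time` / `query_end_time` makes it obvious the window bounds the log search, not the image acquisition.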
Description
This updates all of the logs to be structured (json_payload) logs. It also includes a notebook that demonstrates how to query the structured logs.
Changes
method="GET" (this test wasn't catching errors because it didn't have the right method)
Reviewers