
Logging overhaul #137

Merged: 47 commits, Jan 28, 2025

Commits
027da3f
added structured_log util and updated github actions file
sstill88 Jan 2, 2025
dae4832
updated structured logs in orchestrator handler
sstill88 Jan 2, 2025
f445f5e
Merge branch 'main' into feat/logging-overhaul
sstill88 Jan 2, 2025
840d4c2
towards adding logs to clients.py
sstill88 Jan 2, 2025
833f002
add scene_id to model initiation for logging
sstill88 Jan 8, 2025
a486d20
Merge branch 'main' into feat/logging-overhaul
sstill88 Jan 8, 2025
7d16538
more logs
sstill88 Jan 8, 2025
850d769
few more logs
sstill88 Jan 8, 2025
271151a
typo
sstill88 Jan 8, 2025
f845076
maxscale=1
sstill88 Jan 8, 2025
656b459
added notebook and utils to filter logs
sstill88 Jan 8, 2025
a39c2e6
maxScale back to 45
sstill88 Jan 8, 2025
41c14f5
testing
sstill88 Jan 9, 2025
ffe0c4a
attempt to fix database schema...
sstill88 Jan 9, 2025
606d6c9
attempt to fix database schema take 2...
sstill88 Jan 9, 2025
99fe3df
add sigterm handler with logs
sstill88 Jan 9, 2025
d442060
flush logs
sstill88 Jan 9, 2025
3c94513
merge/conflict resolution
sstill88 Jan 9, 2025
021bd85
make scene_id global
sstill88 Jan 9, 2025
fd329ee
debugging...
sstill88 Jan 9, 2025
3a52dd0
removed life cycle event logs
sstill88 Jan 9, 2025
7d274cf
cleaned up logging in send_inference_request_and_handle_response retries
sstill88 Jan 9, 2025
99fb200
clean up notebook and add SIGTERM examples
sstill88 Jan 9, 2025
eb25f7a
review comments
sstill88 Jan 10, 2025
d5c7911
Merge branch 'main' into feat/logging-overhaul
sstill88 Jan 23, 2025
c48af62
update logger notebook/utils to give state when no logs exist instead…
sstill88 Jan 24, 2025
511b326
update logging message for when there are no oceanic tiles
sstill88 Jan 24, 2025
9e69ceb
warning-->info
sstill88 Jan 24, 2025
5c19ff6
moved utils up one directory and adjusted code and action yaml accord…
sstill88 Jan 24, 2025
0522a0b
make model scene agnostic
sstill88 Jan 25, 2025
ed1c774
Does this contextVar and loggingFilters work instead of injecting the…
jonaraphael Jan 25, 2025
83054bb
oops!
jonaraphael Jan 26, 2025
827e391
Ugh. sceneid vs scene_id mixup. I've started a new branch to get rid …
jonaraphael Jan 26, 2025
38f8333
Oh no. forgot to edit the function definition
jonaraphael Jan 26, 2025
3ba7fa9
Shoot. Missed two more.
jonaraphael Jan 26, 2025
c14a5a6
try consistent naming
jonaraphael Jan 26, 2025
ee1e0fc
Linting
jonaraphael Jan 28, 2025
65e433d
Automatically choose last revision if none provided.
jonaraphael Jan 28, 2025
12f28b0
Rename start_time
jonaraphael Jan 28, 2025
3028925
Combine logging utils into one file
jonaraphael Jan 28, 2025
05507ba
Rename utils.py to structured_logger.py
jonaraphael Jan 28, 2025
2e5fa4f
Try fixing the google-cloud-logging import issue in the tests
jonaraphael Jan 28, 2025
6a703a0
Oops. The auto-renaming tool missed this one.
jonaraphael Jan 28, 2025
ece53aa
test removing logger import
sstill88 Jan 28, 2025
bc7c24f
add google-cloud-logging cloud_run_offset_tiles requirements and undo…
sstill88 Jan 28, 2025
8d9538b
also add logging to the cloud_run_orchestrator requirements (eye roll)
sstill88 Jan 28, 2025
48d76ca
updated utils --> structured_logger in CLoudRunLogs.ipynb
sstill88 Jan 28, 2025
1 change: 1 addition & 0 deletions .github/actions/deploy_infrastructure/action.yml
@@ -83,6 +83,7 @@ runs:
mkdir -p cerulean_cloud/cloud_function_ais_analysis/cerulean_cloud/
cp cerulean_cloud/database_client.py cerulean_cloud/cloud_function_ais_analysis/cerulean_cloud/database_client.py
cp cerulean_cloud/database_schema.py cerulean_cloud/cloud_function_ais_analysis/cerulean_cloud/database_schema.py
cp cerulean_cloud/structured_logger.py cerulean_cloud/cloud_function_ais_analysis/cerulean_cloud/structured_logger.py
cp cerulean_cloud/__init__.py cerulean_cloud/cloud_function_ais_analysis/cerulean_cloud/__init__.py

- name: Deploy Infrastructure
8 changes: 8 additions & 0 deletions cerulean_cloud/cloud_run_offset_tiles/handler.py
@@ -15,6 +15,10 @@
PredictPayload,
)
from cerulean_cloud.models import get_model
from cerulean_cloud.structured_logger import (
configure_structured_logger,
context_dict_var,
)

# mypy: ignore-errors

@@ -23,6 +27,9 @@
app.add_middleware(CORSMiddleware, allow_origins=["*"])
add_timing_middleware(app, prefix="app")

# Configure the logger once at startup
configure_structured_logger("cerulean_cloud")


@app.get("/", description="Health Check", tags=["Health Check"])
def ping() -> Dict:
@@ -39,6 +46,7 @@ def ping() -> Dict:
def predict(request: Request, payload: PredictPayload) -> InferenceResultStack:
"""Run prediction using the loaded model."""
record_timing(request, note="Started")
context_dict_var.set({"scene_id": payload.scene_id}) # Set the scene_id for logging

model = get_model(payload.model_dict)
record_timing(request, note="Model loaded")
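The handler above imports `configure_structured_logger` and `context_dict_var` from `cerulean_cloud/structured_logger.py`, whose contents are not shown in this diff. A minimal sketch of how a ContextVar plus a logging filter can stamp `scene_id` onto every record (the class names and JSON shape here are assumptions, not the PR's implementation):

```python
import json
import logging
from contextvars import ContextVar

# Per-request context; the request handler sets this once per payload.
context_dict_var: ContextVar[dict] = ContextVar("context_dict", default={})


class ContextFilter(logging.Filter):
    """Attach the current context dict to every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.context = context_dict_var.get()
        return True


class StructuredFormatter(logging.Formatter):
    """Emit records as one JSON object so Cloud Logging can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"severity": record.levelname}
        # Dict messages (as used throughout this PR) merge in directly.
        msg = record.msg if isinstance(record.msg, dict) else {"message": record.getMessage()}
        payload.update(msg)
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def configure_structured_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(StructuredFormatter())
    handler.addFilter(ContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


# Usage mirroring the handler: set context once, then log dicts.
configure_structured_logger("cerulean_cloud")
context_dict_var.set({"scene_id": "S1A_EXAMPLE"})
logging.getLogger("cerulean_cloud").info({"message": "Generated image"})
```

With this shape, every `logger.info({...})` call in the diffs below would be emitted as JSON carrying the active `scene_id` without threading it through each call site.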
3 changes: 2 additions & 1 deletion cerulean_cloud/cloud_run_offset_tiles/requirements.txt
@@ -13,4 +13,5 @@ geojson
shapely==2.0.1
scipy
geopandas
networkx
networkx
google-cloud-logging==3.11.3
1 change: 1 addition & 0 deletions cerulean_cloud/cloud_run_offset_tiles/schema.py
@@ -20,6 +20,7 @@ class PredictPayload(BaseModel):

inf_stack: List[InferenceInput]
model_dict: Dict[str, Any]
scene_id: str


class InferenceResult(BaseModel):
61 changes: 47 additions & 14 deletions cerulean_cloud/cloud_run_orchestrator/clients.py
@@ -2,7 +2,9 @@

import asyncio
import json
import logging
import os
import traceback
import zipfile
from base64 import b64encode
from datetime import datetime
@@ -24,6 +26,8 @@
PredictPayload,
)

logger = logging.getLogger("cerulean_cloud")


def img_array_to_b64_image(img_array: np.ndarray, to_uint8=False) -> str:
"""convert input b64image to torch tensor"""
@@ -139,35 +143,48 @@ async def send_inference_request_and_handle_response(self, http_client, img_arra
- This function constructs the inference payload by encoding the image and specifying the geographic bounds and any additional inference parameters through `self.model_dict`.
"""

def _status_code_warning(attempt, status_code):
return {
"message": "Error getting inference; Retrying . . .",
"attempt": attempt,
"status_code": status_code,
}

def _exception_warning(attempt, exception):
return {
"message": "Error getting inference; Retrying . . .",
"attempt": attempt,
"exception": str(exception),
"traceback": traceback.format_exc(),
}

encoded = img_array_to_b64_image(img_array, to_uint8=True)
inf_stack = [InferenceInput(image=encoded)]
payload = PredictPayload(inf_stack=inf_stack, model_dict=self.model_dict)
payload = PredictPayload(
inf_stack=inf_stack, model_dict=self.model_dict, scene_id=self.sceneid
)

max_retries = 2 # Total attempts including the first try
retry_delay = 5 # Delay in seconds between retries

for attempt in range(max_retries):
for attempt in range(1, max_retries + 1):
try:
res = await http_client.post(
self.url + "/predict", json=payload.dict(), timeout=None
)
if res.status_code == 200:
return InferenceResultStack(**res.json())
else:
print(
f"Attempt {attempt + 1}: Failed with status code {res.status_code}. Retrying..."
)
if attempt < max_retries - 1:
await asyncio.sleep(retry_delay) # Wait before retrying
logger.warning(_status_code_warning(attempt, res.status_code))
except Exception as e:
print(f"Attempt {attempt + 1}: Exception occurred: {e}")
if attempt < max_retries - 1:
await asyncio.sleep(retry_delay) # Wait before retrying
logger.warning(_exception_warning(attempt, e))

if attempt < max_retries:
await asyncio.sleep(retry_delay) # Wait before retrying
else:
logger.error("Failed to get inference")

# If all attempts fail, raise an exception
raise Exception(
f"All attempts failed after {max_retries} retries. Last known error: {res.content}"
)
raise Exception(f"All attempts failed after {max_retries} retries.")
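The rewritten loop above switches to 1-indexed attempts, replaces `print` with structured warnings, and moves the sleep to a single site. That control flow reduces to this small pattern; a synchronous sketch for illustration, not the project's code:

```python
import time


def with_retries(fn, max_retries=2, retry_delay=0.01):
    """Attempt fn() up to max_retries times, emitting one structured
    warning per failure and sleeping only between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Stand-in for logger.warning(_exception_warning(...)).
            print({"message": "Error; Retrying . . .", "attempt": attempt, "exception": str(exc)})
        if attempt < max_retries:
            time.sleep(retry_delay)
    raise Exception(f"All attempts failed after {max_retries} retries.")
```

Counting attempts from 1 keeps the logged `attempt` field human-readable, and the single `if attempt < max_retries` guard avoids sleeping after the final failure.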

async def get_tile_inference(self, http_client, tile_bounds, rescale=(0, 255)):
"""
@@ -190,6 +207,14 @@ async def get_tile_inference(self, http_client, tile_bounds, rescale=(0, 255)):
return InferenceResultStack(stack=[])
if self.aux_datasets:
img_array = await self.process_auxiliary_datasets(img_array, tile_bounds)

logger.info(
{
"message": "Generated image",
"tile_bounds": json.dumps(tile_bounds),
}
)

return await self.send_inference_request_and_handle_response(
http_client, img_array
)
@@ -204,6 +229,14 @@ async def run_parallel_inference(self, tileset):
Returns:
- list: List of inference results, with exceptions filtered out.
"""

logger.info(
{
"message": "Starting parallel inference",
"n_tiles": len(tileset),
}
)

async with httpx.AsyncClient(
headers={"Authorization": f"Bearer {os.getenv('API_KEY')}"}
) as async_http_client:
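The body of `run_parallel_inference` is collapsed in this view, but its docstring ("exceptions filtered out") matches the standard `asyncio.gather` fan-out pattern. A self-contained sketch of that shape (the helper names are invented, not the PR's code):

```python
import asyncio


async def infer_tile(i: int) -> int:
    """Stand-in for a per-tile inference request."""
    if i == 2:
        raise ValueError("tile failed")
    return i * 10


async def run_parallel(items):
    # return_exceptions=True keeps one failed tile from cancelling the
    # rest of the batch; failures are filtered out of the final list.
    results = await asyncio.gather(
        *(infer_tile(i) for i in items), return_exceptions=True
    )
    return [r for r in results if not isinstance(r, Exception)]


print(asyncio.run(run_parallel([1, 2, 3])))  # prints [10, 30]
```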