Add exploration notebooks #13

hannahker · 2024-10-31T20:33:04Z

Adding some work to QA the raster stats outputs by comparing against other calculation methods. See #12 for one of the outcomes of this testing.

zackarno · 2024-11-05T19:53:45Z

exploration/test_grid_alignment.md

+
+```python
+# Or replace with other locally saved shapefile
+gdf = gpd.read_file("data/eth_adm2_simplified.shp")


can we put this on blob? or if you want to share data folder we can also do that

zackarno · 2024-11-05T20:02:45Z

exploration/validate_outputs.md

+import numpy as np
+
+# os.chdir("..")
+from src.utils.cloud_utils import get_cog_url


maybe something weird happening here w/ the methods we are mixing to import local modules... getting the error:

{ "name": "ImportError", "message": "cannot import name 'get_cog_url' from 'src.utils.cloud_utils' (/Users/zackarno/Documents/CHD/repos/hdx-floodscan/src/utils/cloud_utils.py)", "stack": "--------------------------------------------------------------------------- ImportError Traceback (most recent call last) Cell In[5], line 12 9 import numpy as np 11 # os.chdir(\"..\") ---> 12 from src.utils.cloud_utils import get_cog_url 13 from src.utils.raster_utils import upsample_raster, prep_raster 15 load_dotenv() ImportError: cannot import name 'get_cog_url' from 'src.utils.cloud_utils' (/Users/zackarno/Documents/CHD/repos/hdx-floodscan/src/utils/cloud_utils.py)" }

To fix it I had to get

pip install -e .

working. However, to do this I had to copy over pyproject.toml from my other branch. I guess I have it tracked whereas you maybe don't? However, once it's in the repo I the pip install -e. works and then the error above goes away.

Is it better to gitignore the pyproject.toml or not ? I don't see a good reason we would, but IDK the convention

Oh sorry that's on me for not making these fully reproducible! I put these up as some quick-and-dirty notebooks and didn't put a ton of thought into the setup there. The os.chdir("..") is just my hack for making these work since I don't love the overhead of requiring pyproject.toml/setup.cfg files and doing the install.

no worries! i hear you on the pyproject.toml/setup.cfg overhead point - but I guess good if python team can choose a nice standard way to deal with it. I think if the files were just generated and git tracked there would not really be any overhead...

i believe some of this was handled by the cookiecutter repo (which you helped set up long ago!), however cookiecutter might be outdated now as i think it also gives you a bunch of anticipy configs which we don't necessarily want? Also perhaps the files were added to gitignore instead of tracked? I think gitignoring them is kind of annoying because if someone clones your repo then they don't have them!

Yes I agree that if we're using that setup then those files should definitely be tracked. It's a great point to think about standard setups as our different kinds of python projects change -- eg. between pipelines and AA projects and dashboards. I remember the cookiecutter was set up in the context of AA projects so there are probably a number of things that don't really apply to pipelines. I think notebooks are a good example of how some of these use cases are different. For example, in AA projects we usually use notebooks as "the" output, so it makes sense we have a good repo structure around that. In this repo, I'm (perhaps a bit lazily..) less concerned with notebook setup since the main pipelines code is the main output here.

Although not a great excuse since it's still important to have others able to run the code -- especially when I ask them to review it! But at least in the former case with AA projects I think it's absolutely important to have a robust setup, whereas here I'm more comfortable with a bit more of a lightweight hack (eg. using os.chdir(..) to manage dependencies).

zackarno · 2024-11-05T20:10:04Z

exploration/validate_outputs.md

+engine = create_engine(
+    f"postgresql+psycopg2://chdadmin:{AZURE_DB_PW_PROD}@chd-rasterstats-prod.postgres.database.azure.com/postgres"
+)
+gdf = gpd.read_file(f"data/{iso3.lower()}/{iso3.lower()}_adm{adm_level}.shp")


would need this data folder

Also how is this script controlling the working directory? Because for me this path won't work (even with the data) folder because the cwd is in the directory of code running the scripts (exploration in this case) not root directory of project. Therefore in my setup the correct path is "../data/{iso3.lower()}/{iso3.lower()}_adm{adm_level}.shp" ?

Using the os.chdir("..") sets the working directory one level higher, so that this path works.

zackarno · 2024-11-05T20:11:07Z

exploration/test_grid_alignment.md

+
+```python
+# Or replace with other locally saved shapefile
+gdf = gpd.read_file("data/eth_adm2_simplified.shp")


need the data folder- also flagged here: https://github.com/OCHA-DAP/ds-raster-stats/pull/13/files#r1829942399

zackarno · 2024-11-07T22:27:49Z

I think i figured out why the results are different in the test_grid_alignment script. I think something is going awry in the upsample function. I believe if you correctly (not sure if my example is perfect- but it works) supply a transform argument to .rio.reproject() function the results will be be the same.

I got the results to be the same by adding this to the up_sample function

 bounds = ds.rio.bounds()

    # Calculate the target transform using bounds and new dimensions
    target_transform = rasterio.transform.from_bounds(
        bounds[0],  # left
        bounds[1],  # bottom
        bounds[2],  # right
        bounds[3],  # top
        new_width,
        new_height,
    )

just before this part:

    # Forecast data will have 4 dims, since we have a leadtime
    nd = len(list(ds.dims))

and then later supplying target_transform like:

 ds_ = ds_.rio.reproject(
                ds_.rio.crs,
                shape=(new_height, new_width),
                resampling=Resampling.nearest,
                transform=target_transform,
                nodata=np.nan,
            )

not sure if I implemented 100% correctly since still wrapping my head around python xarray, rioxarray more generally, but either way i think it might be a good lead

Add exploration notebooks

cc5fa59

hannahker requested a review from zackarno October 31, 2024 20:33

zackarno reviewed Nov 5, 2024

View reviewed changes

hannahker mentioned this pull request Nov 8, 2024

Grid changes when resampling after clip #12

Open

Add comment

ea20a2c

hannahker merged commit e621481 into main Nov 15, 2024
1 check passed

hannahker deleted the exploration-notebooks branch November 15, 2024 16:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add exploration notebooks #13

Add exploration notebooks #13

hannahker commented Oct 31, 2024

zackarno Nov 5, 2024

zackarno Nov 5, 2024 •

edited

Loading

hannahker Nov 5, 2024

zackarno Nov 6, 2024

hannahker Nov 6, 2024

zackarno Nov 5, 2024 •

edited

Loading

zackarno Nov 6, 2024 •

edited

Loading

hannahker Nov 6, 2024

zackarno Nov 5, 2024

zackarno commented Nov 7, 2024 •

edited

Loading

Add exploration notebooks #13

Add exploration notebooks #13

Conversation

hannahker commented Oct 31, 2024

zackarno Nov 5, 2024

Choose a reason for hiding this comment

zackarno Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

hannahker Nov 5, 2024

Choose a reason for hiding this comment

zackarno Nov 6, 2024

Choose a reason for hiding this comment

hannahker Nov 6, 2024

Choose a reason for hiding this comment

zackarno Nov 5, 2024 • edited Loading

Choose a reason for hiding this comment

zackarno Nov 6, 2024 • edited Loading

Choose a reason for hiding this comment

hannahker Nov 6, 2024

Choose a reason for hiding this comment

zackarno Nov 5, 2024

Choose a reason for hiding this comment

zackarno commented Nov 7, 2024 • edited Loading

zackarno Nov 5, 2024 •

edited

Loading

zackarno Nov 5, 2024 •

edited

Loading

zackarno Nov 6, 2024 •

edited

Loading

zackarno commented Nov 7, 2024 •

edited

Loading