
Rp calcs #7

Closed
wants to merge 7 commits into from

Conversation


@zackarno zackarno commented Oct 17, 2024

Implement Log Pearson III RP/RV calculations which we will use for FloodScan tabular dataset.

I wrote a return_periods.py module to hold the functions and then created an exploratory .py script that can be opened w/ jupytext to show how one might implement the functions.

More theoretical background on RP methods can be found here.

I'm not sure if this will end up being put on this repo in the final pipeline, but will still do the PR so that the calculations themselves can be reviewed by @t-downing . Also tagging @hannahker & @isatotun to think/recommend how we'd actually orchestrate implementation.

@zackarno zackarno self-assigned this Oct 17, 2024
# %autoreload 2

# %%
from src.utils import cloud_utils
Collaborator

Would also add

import os
os.chdir('..')

So that the src module is imported properly

Author

Can you explain this to me?

  • Is this required on top of the setup.cfg, pyproject.toml, __init__.py files and pip install -e . ?
  • Or does this replace the need for some combination of the above?

The modules seem to be importing properly on my machine using the top bullet point


I think running pip install -e . obviates the need for os.chdir('..'). Then you can import the src module the same way regardless of what directory you're in, with a notebook or a .py. Then I also put pip install -e . in whatever GH Action workflow file so it runs the same in there too. I am however not sure what the best practice is, it's just based on our coding guidelines.
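For context, here is a minimal sketch of the configuration that makes pip install -e . work (the package name and layout below are assumptions for illustration; the repo's actual setup.cfg is only partially shown in this diff):

```ini
# setup.cfg -- minimal sketch, not the repo's actual file
[metadata]
name = src

[options]
packages = find:
```

After pip install -e ., from src.utils import cloud_utils resolves regardless of the working directory, with no os.chdir('..') needed.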

Collaborator

Oh nice! I'm not very familiar with using pip install -e . Looks great then.

# We can also run them all at once with this wrapper

# %%
rp.lp3_params_all2(df_max[df_max["adm1_en"]==sample_admin].value)
Collaborator

Getting an error since the lp3_params_all2 function doesn't exist. Is this just lp3_params_all?

Author

yes sorry - leftover dev code - fixed 846210a

from lmoments3 import distr, lmom_ratios


def lp3_params(x, est_method="lmoments"):
Collaborator

@hannahker hannahker Oct 18, 2024

Is there a use case for having these params as an intermediate output? Otherwise I wonder if it might make sense to just put this function inside lp3_rp so the user only has to specify the method.

This could also avoid some potential confusion as the output format for each method is different too.

Author

@zackarno zackarno Oct 21, 2024

Yeah, for the purpose of the actual dataset pipeline I think it's fine to wrap them together.

However, they are more useful separate for general analysis, as lp3_params() is calculated on an aggregated dataset where the block maxima are calculated over any desired time block. The lp3_rp() command then takes the calculated params as inputs and applies them to any random vector -- which for our purpose is the values from the "raw" dataset (not aggregated).

You could combine them in a more sophisticated wrapper where you specify the time step for the first aggregation, but it seems like it might end up less flexible than we often want in various analyses.
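To illustrate the two-step split being discussed, here is a minimal sketch (the function names and the method-of-moments estimation are illustrative assumptions, not the PR's exact implementation; the PR also supports an L-moments estimation method via lmoments3, which is not reproduced here):

```python
import numpy as np
from scipy.stats import pearson3, skew


def lp3_params_sketch(block_maxima):
    """Fit Log-Pearson III by method of moments on log10 of block maxima."""
    logs = np.log10(np.asarray(block_maxima, dtype=float))
    return {
        "skew": skew(logs, bias=False),
        "loc": logs.mean(),
        "scale": logs.std(ddof=1),
    }


def lp3_rp_sketch(values, params):
    """Return period = 1 / P(exceedance) under the fitted LP3 distribution."""
    log_v = np.log10(np.asarray(values, dtype=float))
    p_exceed = pearson3.sf(
        log_v, params["skew"], loc=params["loc"], scale=params["scale"]
    )
    return 1.0 / p_exceed
```

The first function runs on the block-maxima aggregate; the second can then be applied to any vector of raw values, which is the flexibility described above.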

setup.cfg Outdated
@@ -0,0 +1,5 @@
[metadata]
Collaborator

Would .gitignore this?

Author

@zackarno zackarno Oct 21, 2024

Sure - I'm not sure which of these setup files are typically ignored? If we remove it, someone else has to recreate it?

Anyways, I ignored it: b6729ab


I've always left it in to be able to run pip install -e . (and allow a GH Action to run it). So I just ended up recreating this file locally. But if there's a better way of doing this I am definitely open to it!

Collaborator

Ok yes this makes sense! I didn't understand the connection with pip install -e

# Now we can loop through each admin and calculate RPs associated with each value

# %%
unique_adm1s = df_params_long["adm1_en"].unique()
Collaborator

Also df_params_long isn't defined!

Author

woopsy - leftover code from exploration - fixed. 846210a

def lp3_rp(x, params, est_method="lmoments"):
x = np.asarray(x)
x_sorted = np.sort(x)
x_log = np.log10(x_sorted)
Collaborator

This is throwing a lot of runtime warnings from the demo notebook: RuntimeWarning: divide by zero encountered in log10 -- is that expected?

Author

@zackarno zackarno Oct 21, 2024

Good call. Should deal with that.

I think for the purpose of calculating RP values for FloodScan the best and easiest way to deal with it would be to remove all 0 values for the log-transform step and then re-insert them at the end with an RP value of 1. What do you think @t-downing ? Note this is different treatment of 0s than the lp3_params() func, which runs only on the block maxima aggregate. In that calculation we just set 0 equal to the next lowest value (this is less of an issue because there are not many 0s in the annual maxima aggregate).

Maybe one of you can explain to me all the other warnings that chunk generates? I think I am doing something weird in pandas that's not recommended, but works.
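The proposed zero treatment could look something like this (a hypothetical sketch; rp_with_zero_handling and rp_func are illustrative names, with rp_func standing in for whatever RP function is applied to the nonzero values):

```python
import numpy as np


def rp_with_zero_handling(values, rp_func):
    """Drop zeros before the log-transform step, then re-insert
    them at the end with an RP value of 1, as proposed above."""
    values = np.asarray(values, dtype=float)
    rp = np.ones_like(values)  # zeros default to RP = 1
    nonzero = values > 0
    rp[nonzero] = rp_func(values[nonzero])
    return rp
```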

Collaborator

Yeah, there's also the classic

/var/folders/rv/xmclt0vn5y7cqt46s5xq3h080000gn/T/ipykernel_90173/1646057511.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Although from this stack trace it's hard to tell where this is coming from... Doesn't actually look like it's your code? So might be ok to ignore that for now.


It's from here! #7 (comment)

@hannahker
Collaborator

Added some minor comments around the implementation approach and reproducibility of the notebook code. What are your thoughts on which method is most appropriate here?

@zackarno
Author

zackarno commented Oct 22, 2024

> Added some minor comments around the implementation approach and reproducibility of the notebook code. What are your thoughts on which method is most appropriate here?

Thank you very much!! I think I've addressed them all, but still need to implement #7 (comment) - also curious if you or Tristan would be able to demystify the other warning I mentioned on this comment.

But yeah, I'm not 100% sure on which LP3 method yet. I think we will be able to make a more informed decision once we have the internal zonal stats pipeline complete, where we can do a more thorough comparison of the implications of the decision (some already done on just a subset here). This is kind of why I provided the argument, so that we can play with both for now. This will also be useful to inform how we want to treat the 0s, as we will be able to see how much of an issue this is. I think it's actually a bit more important in the lp3_params() calc, which will be much more sensitive to this treatment (if there is any substantial amount of 0s).

@t-downing t-downing left a comment

Looks great! Just quickly submitting this review now so you can see my comments as they're relevant to the question you just asked about pandas warnings. Will do another review in later today to address the distribution modeling.


@@ -0,0 +1,123 @@
import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3


May want to include the libraries:

  • scipy
  • matplotlib
  • lmoments3
  • pyarrow

that I had to install manually, either in requirements.txt or some other requirements file.
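A sketch of the corresponding requirements.txt additions (unpinned here for brevity; lmoments3 is the package name matching the import used in the module above):

```
scipy
matplotlib
lmoments3
pyarrow
```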


for adm in unique_adm1s:
df_params_adm_filt = df_params[df_params["adm1_en"] == adm]
df_adm_filt = df[df["adm1_en"] == adm]


If you replace this line with df_adm_filt = df[df["adm1_en"] == adm].copy() you can avoid all the

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

(which to be totally honest I never really understood, but I do know that you can get rid of them by putting a .copy() somewhere)
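A small self-contained illustration of the pattern (hypothetical data; the column name mirrors the snippet above):

```python
import pandas as pd

df = pd.DataFrame(
    {"adm1_en": ["Awdal", "Awdal", "Bari"], "value": [0.1, 0.2, 0.3]}
)

# .copy() makes an explicit new DataFrame, so assigning a new column
# below does not trigger SettingWithCopyWarning on a view of `df`
df_adm_filt = df[df["adm1_en"] == "Awdal"].copy()
df_adm_filt["value_pct"] = df_adm_filt["value"] * 100
```

The warning exists because filtering can return a view rather than an independent frame, so pandas cannot tell whether the assignment is meant to modify the original df; copying removes the ambiguity.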

# %%
# %matplotlib inline
# %load_ext autoreload
# %autoreload 2


Could chuck in a %load_ext jupyter_black here if you want to black-format the cells as you run them - up to you.

import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3
from lmoments3 import distr, lmom_ratios


Slightly reluctant to use lmoments3 as it only has 7 stars on GH. If there's a simpler way we can calculate the distributions it may be better to use in production. But for now no prob.

Author

Hmm, good point on the stars! Let's just leave it in for the experimental stage and potentially remove it in the main pipeline.

@hannahker
Collaborator

Also -- as just discussed with @zackarno we'll create a separate repo for the Floodscan HDX outputs, so let's not merge this and leave as an exploratory PR we can pull from.

@zackarno
Author

zackarno commented Oct 22, 2024

> Also -- as just discussed with @zackarno we'll create a separate repo for the Floodscan HDX outputs, so let's not merge this and leave as an exploratory PR we can pull from.

Sounds good - let's keep the PR going here though so the convo is all in one place. Once approved -- I'll open a PR in the separate repo and link it.

@t-downing

I think my main concern for the fits is handling values that get close to 1 (i.e., totally flooded). For the areas that we've checked already (Somalia), this isn't really an issue because the fraction of the distribution that is >1 is negligible. But it might be more of an issue elsewhere, and if we want to set LP3 as our standard methodology for RP, we should just make sure it's robust to these situations. I haven't done a thorough investigation, but this is just an example I've picked from some recent analysis in Logone-et-Chari department (admin2) in northern Cameroon. Here's the maximum SFED_AREA value for each pixel (admin3 outlines in black):

[image: maximum SFED_AREA per pixel, admin3 outlines in black]

Basically, values near 1 definitely happen (and there was one pixel that actually hit 1). Now, taking the pixel with the highest median value, we can do the various LP3 fits to its maximum yearly value (treating the pixel exactly the same as the previous sample_admin of Awdal). Plotting the RP curve:

[image: RP curve for the selected pixel]

We see that we exceed a value of 100% at a return period just above 10 years (whereas of course the empirical levels off). We can also see this by looking at the PDF (histogram of the actual yearly maxima in blue):

[image: PDF with histogram of the actual yearly maxima in blue]

For all distributions there is a decent chunk above 100%.

Now, of course, this is a totally fabricated example by cherry-picking a single very wet pixel. But I was able to find this in a tiny part of the world I happened to be looking at already, so it's possible that there will be an admin somewhere else that has more of an overlap with very wet pixels, and we get the same problem with the actual admin bounds.

So basically I think it's worth running the RP calcs for more admins to see whether this is actually a problem or whether I'm just being fussy.

@zackarno
Author

zackarno commented Oct 23, 2024

@t-downing good points. It makes sense to me that we should try to understand the implications of how and when an RP calculated from the modelled LP3 distribution would exceed 100% flood fraction. However, I'm not sure I understand why we'd be concerned about a single pixel being at 100%. The calculations are run on SFED zonal means, so even in your maximum composite example (fig 1), the zonal mean of any of those polygons would be far below 100%.

I'd think it makes sense to consider this at the zonal (admin 2) level, and that once we get the admin2 stats for the entire FloodScan AOI, we could examine more with something like:

  1. For each admin 2, fit the LP3
  2. For each admin 2, find the RP that corresponds to 100% flood of that admin (I think this should be extremely high)
  3. Depending how this looks... potentially get the minimum RP that this occurs at, add a step to set any returned RPs that exceed this to Inf, and include a method note on this in the readme.
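Step 3 above could be sketched as follows (a hypothetical helper; it assumes the per-admin RPs and the RP corresponding to 100% flood have already been computed):

```python
import numpy as np


def cap_impossible_rps(rp_values, rp_at_full_flood):
    """Set any RP beyond the one corresponding to a physically
    impossible 100% flood fraction to infinity."""
    rp = np.asarray(rp_values, dtype=float)
    return np.where(rp > rp_at_full_flood, np.inf, rp)
```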

But maybe I'm missing something about why we should be concerned about the pixel-level RP, so please explain! I guess it would be nice to have a method that handles that well, but I'm not sure that calculating at the pixel level is all that useful (or even comparable to zonal) -- especially not for this dataset.

@t-downing

No no I don't think we need to do anything at a pixel level, I just grabbed a pixel because it was an easy way to illustrate my point.

I think your proposed way to go forward makes sense. I suspect that in practice this won't really be an issue. Re: point 3 though- I think the issue is more that bounding the distribution to [0, 1] would actually change the shape of the whole distribution (slightly), not just return periods above a certain threshold.

Overall though, in this case, it's probably being overly pedantic to bound the distributions like this given it most likely will make no difference. More just a case of keeping this in mind to make sure we don't use distributions that theoretically include impossible values, for future return period modeling.

@zackarno
Author

zackarno commented Oct 23, 2024

> No no I don't think we need to do anything at a pixel level, I just grabbed a pixel because it was an easy way to illustrate my point.
>
> I think your proposed way to go forward makes sense. I suspect that in practice this won't really be an issue. Re: point 3 though- I think the issue is more that bounding the distribution to [0, 1] would actually change the shape of the whole distribution (slightly), not just return periods above a certain threshold.
>
> Overall though, in this case, it's probably being overly pedantic to bound the distributions like this given it most likely will make no difference. More just a case of keeping this in mind to make sure we don't use distributions that theoretically include impossible values, for future return period modeling.

I understand. That's a good point on how the overall shape would change. I don't think it's overly pedantic ;-).

A couple more thoughts on dealing with the bounding:

  • I was curious what would happen if we just did a logit transform rather than log in the Pearson III processing: log(x / (1 - x)). However, I'm not really sure about any implications.
  • Something like the above or an adaptation of the beta distr method (like you have done) seems like a potentially good way to handle it; I'm just a bit concerned about making up a new method for a standard calculation. Overall, I can't really find much literature on either. However, last night I did find this paper: Regional Index Insurance Using Satellite-Based Fractional Flooded Area, which seems to deal with the exact problem. Perhaps we can look into it a bit... I'd still like to base a method on more than just this single paper, but it seems promising anyways.
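The logit idea mentioned above, as a quick sketch (the eps clipping is my assumption, added to keep exact 0s and 1s from mapping to -inf/+inf):

```python
import numpy as np


def logit_transform(x, eps=1e-6):
    """log(x / (1 - x)) for flood fractions bounded to (0, 1)."""
    x = np.clip(np.asarray(x, dtype=float), eps, 1.0 - eps)
    return np.log(x / (1.0 - x))
```

Fitting Pearson III on logit-transformed fractions instead of log10 values would bound the back-transformed distribution to [0, 1], at the cost of changing its shape.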

@zackarno
Author

zackarno commented Oct 23, 2024

One additional point on bounding [0,1]:

  • There are upper bounds in many places based on many factors (phys geography, hydro, etc.) and we won't be able to factor these in. Not sure that a conceptual bathtub model that can just fill any terrain to 100% makes sense, as it wouldn't be physically possible?
  • Wondering if the above point pushes more toward a 4-param beta distr where upper bounds are calculated rather than assumed, or rather just not dealing with the upper bound and following something like the steps here: Rp calcs #7 (comment), acknowledging that the shape is not perfect... because I think the shape might actually also get distorted even with an enforced 1 bound.

@t-downing

Ah yeah that paper looks to be doing exactly what we need to do! But yes, it's only one paper.

I think we would avoid trying to explicitly estimate an upper bound, because we still know the only hard upper bound is 100%. According to their documentation, that is the fraction of unmasked area, so even areas where much of the pixel is "persistent water" can still theoretically reach 100% from their algorithm (even if this is physically impossible). This does, I've just realized, beg the question - should we take into account the fraction of persistent water? Because in a pixel that is e.g. 50% persistent water, a flood fraction of 25% really only means 12.5% of the pixel is flooded. For flood exposure I think it's ok to ignore, because we can assume that no people live in persistent water areas. But for taking raster stats of purely the SFED this might mean we should weight pixels based on their fraction of persistent water. Possibly splitting hairs here though - it would probably only make a difference on coastlines where there would be a lot of persistent water (less so with rivers).

Re: logit - maybe this is a good middle ground that still uses the "industry-standard" LP3 distribution, but bounded to [0, 1]. But of course it would change the distribution shape a lot. Maybe we see what it looks like?

Or- I am still good with just moving forward with standard LP3, and just double checking it doesn't put a non-negligible fraction of the distribution above 100% for any admins.

@zackarno
Author

zackarno commented Oct 30, 2024

@t-downing - yeah, I think it's definitely tricky to figure out what the exact best method is. I'd say the upper bound in some pixels can be 100%, but in many admins it cannot be, as the hydrological modeling algorithm used for FloodScan would have made some areas not floodable at the high resolution. Since some areas could not be 100%, when they are aggregated to SFED_AREA the areas excluded from flooding would automatically lower the % below 100.

This being said, I'm sure we could justify an upper limit of 100 if we really wanted... but it doesn't appear to be necessary, as I am seeing the LP3 curve flatten out quite quickly: here's a screenshot from this rendered notebook. Interestingly there are some odd cases where GEV exceeds 100%, but LP3 and Gumbel flatten out quite a lot.

[image: RP curves from the rendered notebook]

It's true that the upper limit would potentially change the shape to a degree, but if these are leveling out so far below 100% I'm not sure how necessary it is, or whether it could even have an unintended effect. Also btw, I extended the plots to 100,000 years and they look the same, just harder to read....

Still think it will be worth checking on the full admin-level stats.

@t-downing

Interesting how GEV really goes off the charts. Guess that's a good reason not to use it!

And yeah I think comparison for all admins (or at least ones that are more different than Somalia) will be good. Based on their algorithm performance doc there's definitely variation in flood detection based on land cover.

Like, the Cameroon area I was looking at has a lot of "herbaceous wetland", and looks like this (from WorldCover):

Whereas the most "herbaceous wetland"-y part of Somalia only looks like this (to scale):

Basically, I think in addition to any difference in actual flooding, there'll be a systematic difference in flood detection that would push those SFED values even higher.

Anyways, sorry, enough speculation on my side! We can just check the actual values when we've fitted them for all admins.

@zackarno
Author

zackarno commented Nov 5, 2024

So what do you say we consider this PR "finished" and:

  1. Migrate the current branch to the new repo: hdx-floodscan
  2. Open a PR and merge in the new repo
  3. Close the PR here

I think we can still continue the interesting discussion on methods in the other repo as the raster stats become available for testing, but as long as the methods are implemented correctly I don't see any reason to hold up merging there, as we are just writing functions to implement different method options, not actual pipeline code yet.

@t-downing @hannahker - thoughts on the proposal?

@hannahker
Collaborator

@zackarno good from my end!

@t-downing

@zackarno yes sounds good to me! We can keep talking about which RP distribution model to actually use for the HDX dataset in hdx_floodscan. Agreed that the models are implemented correctly, just a matter now of deciding which one to go with based on how the RP distributions look for all the admins.

@zackarno
Author

Closing PR as it has been merged in the new repo: OCHA-DAP/hdx-floodscan#2

@zackarno zackarno closed this Nov 14, 2024