
Rp calcs #7

Closed
wants to merge 7 commits into from

Conversation


@zackarno zackarno commented Oct 17, 2024

Implement Log Pearson III RP/RV calculations which we will use for FloodScan tabular dataset.

I wrote a return_periods.py module to hold the functions and then created an exploratory .py script that can be opened w/ jupytext to show how one might implement the functions.

More theoretical background on RP methods can be found here.

I'm not sure if this will end up being put on this repo in the final pipeline, but will still do the PR so that the calculations themselves can be reviewed by @t-downing . Also tagging @hannahker & @isatotun to think/recommend how we'd actually orchestrate implementation.

@zackarno zackarno self-assigned this Oct 17, 2024
# %autoreload 2

# %%
from src.utils import cloud_utils
Collaborator

Would also add

import os
os.chdir('..')

So that the src module is imported properly

Author

Can you explain this to me?

  • Is this required on top of the setup.cfg, pyproject.toml, __init__.py files and pip install -e . ?
  • Or does this replace the need for some combination of the above?

The modules seem to be importing properly on my machine using the top bullet point


I think running pip install -e . obviates the need for os.chdir('..'). Then you can import the src module the same way regardless of what directory you're in, with a notebook or a .py. Then I also put pip install -e . in whatever GH Action workflow file so it runs the same in there too. I am however not sure what the best practice is, it's just based on our coding guidelines.
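For context, here is a minimal sketch of the configuration that makes pip install -e . work (the package name and layout below are assumptions for illustration; the repo's actual setup.cfg is only partially shown in this diff):

```ini
# setup.cfg -- minimal sketch, not the repo's actual file
[metadata]
name = src

[options]
packages = find:
```

After pip install -e ., from src.utils import cloud_utils resolves regardless of the working directory, with no os.chdir('..') needed.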

Collaborator

Oh nice! I'm not very familiar with using pip install -e . Looks great then.

# We can also run them all at once with this wrapper

# %%
rp.lp3_params_all2(df_max[df_max["adm1_en"]==sample_admin].value)
Collaborator

Getting an error since the lp3_params_all2 function doesn't exist. Is this just lp3_params_all?

Author

yes sorry - leftover dev code - fixed 846210a

from lmoments3 import distr, lmom_ratios


def lp3_params(x, est_method="lmoments"):
Collaborator

@hannahker hannahker Oct 18, 2024

Is there a use case for having these params as an intermediate output? Otherwise I wonder if it might make sense to just put this function inside lp3_rp so the user only has to specify the method.

This could also avoid some potential confusion as the output format for each method is different too.

Author

@zackarno zackarno Oct 21, 2024

Yeah, for the purpose of the actual dataset pipeline I think it's fine to wrap them together.

However, they are more useful separate for general analysis, as lp3_params() is calculated on an aggregated dataset where the block maxima are calculated over any desired time block. The lp3_rp() command then takes the calculated params as inputs and applies them to any random vector -- which for our purpose is the values from the "raw" dataset (not aggregated).

You could combine them in a more sophisticated wrapper where you specify the time step for the first aggregation, but it seems like it might end up less flexible than we often want in various analyses.
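To illustrate the two-step split being discussed, here is a minimal sketch (the function names and the method-of-moments estimation are illustrative assumptions, not the PR's exact implementation; the PR also supports an L-moments estimation method via lmoments3, which is not reproduced here):

```python
import numpy as np
from scipy.stats import pearson3, skew


def lp3_params_sketch(block_maxima):
    """Fit Log-Pearson III by method of moments on log10 of block maxima."""
    logs = np.log10(np.asarray(block_maxima, dtype=float))
    return {
        "skew": skew(logs, bias=False),
        "loc": logs.mean(),
        "scale": logs.std(ddof=1),
    }


def lp3_rp_sketch(values, params):
    """Return period = 1 / P(exceedance) under the fitted LP3 distribution."""
    log_v = np.log10(np.asarray(values, dtype=float))
    p_exceed = pearson3.sf(
        log_v, params["skew"], loc=params["loc"], scale=params["scale"]
    )
    return 1.0 / p_exceed
```

The first function runs on the block-maxima aggregate; the second can then be applied to any vector of raw values, which is the flexibility described above.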

setup.cfg Outdated
@@ -0,0 +1,5 @@
[metadata]
Collaborator

Would .gitignore this?

Author

@zackarno zackarno Oct 21, 2024

Sure - I'm not sure which of these setup files are typically ignored? If we remove it, someone else has to recreate it?

Anyways, I ignored it: b6729ab


I've always left it in to be able to run pip install -e . (and allow a GH Action to run it). So I just ended up recreating this file locally. But if there's a better way of doing this I am definitely open to it!

Collaborator

Ok yes this makes sense! I didn't understand the connection with pip install -e

# Now we can loop through each admin and calculate RPs associated with each value

# %%
unique_adm1s = df_params_long["adm1_en"].unique()
Collaborator

Also df_params_long isn't defined!

Author

woopsy - leftover code from exploration - fixed. 846210a

def lp3_rp(x, params, est_method="lmoments"):
x = np.asarray(x)
x_sorted = np.sort(x)
x_log = np.log10(x_sorted)
Collaborator

This is throwing a lot of runtime warnings from the demo notebook: RuntimeWarning: divide by zero encountered in log10 -- is that expected?

Author

@zackarno zackarno Oct 21, 2024

Good call. Should deal with that.

I think for the purpose of calculating RP values for FloodScan the best and easiest way to deal with it would be to remove all 0 values for the log-transform step and then re-insert them at the end with an RP value of 1. What do you think @t-downing ? Note this is different treatment of 0s than the lp3_params() func, which runs only on the block maxima aggregate. In that calculation we just set 0 equal to the next lowest value (this is less of an issue because there are not many 0s in the annual maxima aggregate).

Maybe one of you can explain to me all the other warnings that chunk generates? I think I am doing something weird in pandas that's not recommended, but works.
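The proposed zero treatment could look something like this (a hypothetical sketch; rp_with_zero_handling and rp_func are illustrative names, with rp_func standing in for whatever RP function is applied to the nonzero values):

```python
import numpy as np


def rp_with_zero_handling(values, rp_func):
    """Drop zeros before the log-transform step, then re-insert
    them at the end with an RP value of 1, as proposed above."""
    values = np.asarray(values, dtype=float)
    rp = np.ones_like(values)  # zeros default to RP = 1
    nonzero = values > 0
    rp[nonzero] = rp_func(values[nonzero])
    return rp
```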

Collaborator

Yeah, there's also the classic

/var/folders/rv/xmclt0vn5y7cqt46s5xq3h080000gn/T/ipykernel_90173/1646057511.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Although from this stack trace it's hard to tell where this is coming from... Doesn't actually look like it's your code? So might be ok to ignore that for now.


It's from here! #7 (comment)

@hannahker
Collaborator

Added some minor comments around the implementation approach and reproducibility of the notebook code. What are your thoughts on which method is most appropriate here?

@zackarno
Author

zackarno commented Oct 22, 2024

> Added some minor comments around the implementation approach and reproducibility of the notebook code. What are your thoughts on which method is most appropriate here?

Thank you very much!! I think I've addressed them all, but still need to implement #7 (comment) - also curious if you or Tristan would be able to demystify the other warning I mentioned on this comment.

But yeah, I'm not 100% sure on which LP3 method yet. I think we will be able to make a more informed decision once we have the internal zonal stats pipeline complete, where we can do a more thorough comparison of the implications of the decision (some already done on just a subset here). This is kind of why I provided the argument, so that we can play with both for now. This will also be useful to inform how we want to treat the 0s, as we will be able to see how much of an issue this is. I think it's actually a bit more important in the lp3_params() calc, which will be much more sensitive to this treatment (if there is any substantial amount of 0s).

@t-downing t-downing left a comment

Looks great! Just quickly submitting this review now so you can see my comments as they're relevant to the question you just asked about pandas warnings. Will do another review in later today to address the distribution modeling.


@@ -0,0 +1,123 @@
import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3


May want to include the libraries:

  • scipy
  • matplotlib
  • lmoments3
  • pyarrow

that I had to install manually, either in requirements.txt or some other requirements file.
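A sketch of the corresponding requirements.txt additions (unpinned here for brevity; lmoments3 is the package name matching the import used in the module above):

```
scipy
matplotlib
lmoments3
pyarrow
```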


for adm in unique_adm1s:
df_params_adm_filt = df_params[df_params["adm1_en"] == adm]
df_adm_filt = df[df["adm1_en"] == adm]


If you replace this line with df_adm_filt = df[df["adm1_en"] == adm].copy() you can avoid all the

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

(which to be totally honest I never really understood, but I do know that you can get rid of them by putting a .copy() somewhere)
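A small self-contained illustration of the pattern (hypothetical data; the column name mirrors the snippet above):

```python
import pandas as pd

df = pd.DataFrame(
    {"adm1_en": ["Awdal", "Awdal", "Bari"], "value": [0.1, 0.2, 0.3]}
)

# .copy() makes an explicit new DataFrame, so assigning a new column
# below does not trigger SettingWithCopyWarning on a view of `df`
df_adm_filt = df[df["adm1_en"] == "Awdal"].copy()
df_adm_filt["value_pct"] = df_adm_filt["value"] * 100
```

The warning exists because filtering can return a view rather than an independent frame, so pandas cannot tell whether the assignment is meant to modify the original df; copying removes the ambiguity.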

# %%
# %matplotlib inline
# %load_ext autoreload
# %autoreload 2


Could chuck in a %load_ext jupyter_black here if you want to black-format the cells as you run them - up to you.

import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3
from lmoments3 import distr, lmom_ratios


Slightly reluctant to use lmoments3 as it only has 7 stars on GH. If there's a simpler way we can calculate the distributions it may be better to use in production. But for now no prob.

Author

Hmm, good point on the stars! Let's just leave it in for the experimental stage and potentially remove it in the main pipeline.

@hannahker
Collaborator

Also -- as just discussed with @zackarno we'll create a separate repo for the Floodscan HDX outputs, so let's not merge this and leave as an exploratory PR we can pull from.

@zackarno
Author

zackarno commented Oct 22, 2024

> Also -- as just discussed with @zackarno we'll create a separate repo for the Floodscan HDX outputs, so let's not merge this and leave as an exploratory PR we can pull from.

Sounds good - let's keep the PR going here though so the convo is all in one place. Once approved -- I'll open a PR in the separate repo and link it.

@t-downing

I think my main concern for the fits is handling values that get close to 1 (i.e., totally flooded). For the areas that we've checked already (Somalia), this isn't really an issue because the fraction of the distribution that is >1 is negligible. But it might be more of an issue elsewhere, and if we want to set LP3 as our standard methodology for RP, we should just make sure it's robust to these situations. I haven't done a thorough investigation, but this is just an example I've picked from some recent analysis in Logone-et-Chari department (admin2) in northern Cameroon. Here's the maximum SFED_AREA value for each pixel (admin3 outlines in black):

[image: maximum SFED_AREA per pixel, admin3 outlines in black]

Basically, values near 1 definitely happen (and there was one pixel that actually hit 1). Now, taking the pixel with the highest median value, we can do the various LP3 fits to its maximum yearly value (treating the pixel exactly the same as the previous sample_admin of Awdal). Plotting the RP curve:

[image: RP curve for the selected pixel]

We see that we exceed a value of 100% at a return period just above 10 years (whereas of course the empirical levels off). We can also see this by looking at the PDF (histogram of the actual yearly maxima in blue):

[image: PDF with histogram of the actual yearly maxima in blue]

For all distributions there is a decent chunk above 100%.

Now, of course, this is a totally fabricated example by cherry-picking a single very wet pixel. But I was able to find this in a tiny part of the world I happened to be looking at already, so it's possible that there will be an admin somewhere else that has more of an overlap with very wet pixels, and we get the same problem with the actual admin bounds.

So basically I think it's worth running the RP calcs for more admins to see whether this is actually a problem or whether I'm just being fussy.

@zackarno
Author

zackarno commented Oct 23, 2024

@t-downing good points. It makes sense to me that we should try to understand the implications of how and when an RP calculated from the modelled LP3 distribution would exceed 100% flood fraction. However, I'm not sure I understand why we'd be concerned about a single pixel being at 100%. The calculations are run on SFED zonal means, so even in your maximum composite example (fig 1), the zonal mean of any of those polygons would be far below 100%.

I'd think it makes sense to consider this at the zonal (admin 2) level, and that once we get the admin2 stats for the entire FloodScan AOI, we could examine more with something like:

  1. For each admin 2, fit the LP3
  2. For each admin 2, find the RP that corresponds to 100% flood of that admin (I think this should be extremely high)
  3. Depending how this looks... potentially get the minimum RP that this occurs at, add a step to set any returned RPs that exceed this to Inf, and include a method note on this in the readme.
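Step 3 above could be sketched as follows (a hypothetical helper; it assumes the per-admin RPs and the RP corresponding to 100% flood have already been computed):

```python
import numpy as np


def cap_impossible_rps(rp_values, rp_at_full_flood):
    """Set any RP beyond the one corresponding to a physically
    impossible 100% flood fraction to infinity."""
    rp = np.asarray(rp_values, dtype=float)
    return np.where(rp > rp_at_full_flood, np.inf, rp)
```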

But maybe I'm missing something about why we should be concerned about the pixel-level RP, so please explain! I guess it would be nice to have a method that handles that well, but I'm not sure that calculating at the pixel level is all that useful (or even comparable to zonal) -- especially not for this dataset.

@t-downing

No no I don't think we need to do anything at a pixel level, I just grabbed a pixel because it was an easy way to illustrate my point.

I think your proposed way to go forward makes sense. I suspect that in practice this won't really be an issue. Re: point 3 though- I think the issue is more that bounding the distribution to [0, 1] would actually change the shape of the whole distribution (slightly), not just return periods above a certain threshold.

Overall though, in this case, it's probably being overly pedantic to bound the distributions like this given it most likely will make no difference. More just a case of keeping this in mind to make sure we don't use distributions that theoretically include impossible values, for future return period modeling.

@zackarno
Author

zackarno commented Oct 23, 2024

> No no I don't think we need to do anything at a pixel level, I just grabbed a pixel because it was an easy way to illustrate my point.
>
> I think your proposed way to go forward makes sense. I suspect that in practice this won't really be an issue. Re: point 3 though- I think the issue is more that bounding the distribution to [0, 1] would actually change the shape of the whole distribution (slightly), not just return periods above a certain threshold.
>
> Overall though, in this case, it's probably being overly pedantic to bound the distributions like this given it most likely will make no difference. More just a case of keeping this in mind to make sure we don't use distributions that theoretically include impossible values, for future return period modeling.

I understand. That's a good point on how the overall shape would change. I don't think it's overly pedantic ;-).

A couple more thoughts on dealing with the bounding:

  • I was curious what would happen if we just did a logit transform rather than log in the Pearson III processing: log(x / (1 - x)). However, I'm not really sure about any implications.
  • Something like the above or an adaptation of the beta distr method (like you have done) seems like a potentially good way to handle it; I'm just a bit concerned about making up a new method for a standard calculation. Overall, I can't really find much literature on either. However, last night I did find this paper: Regional Index Insurance Using Satellite-Based Fractional Flooded Area, which seems to deal with the exact problem. Perhaps we can look into it a bit... I'd still like to base a method on more than just this single paper, but it seems promising anyways.
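The logit idea mentioned above, as a quick sketch (the eps clipping is my assumption, added to keep exact 0s and 1s from mapping to -inf/+inf):

```python
import numpy as np


def logit_transform(x, eps=1e-6):
    """log(x / (1 - x)) for flood fractions bounded to (0, 1)."""
    x = np.clip(np.asarray(x, dtype=float), eps, 1.0 - eps)
    return np.log(x / (1.0 - x))
```

Fitting Pearson III on logit-transformed fractions instead of log10 values would bound the back-transformed distribution to [0, 1], at the cost of changing its shape.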

@zackarno
Author

zackarno commented Oct 23, 2024

One additional point on bounding [0,1]:

  • There are upper bounds in many places based on many factors (phys geography, hydro, etc.) and we won't be able to factor these in. Not sure that a conceptual bathtub model that can just fill any terrain to 100% makes sense, as it wouldn't be physically possible?
  • Wondering if the above point pushes more toward a 4-param beta distr where upper bounds are calculated rather than assumed, or rather just not dealing with the upper bound and following something like the steps here: Rp calcs #7 (comment), acknowledging that the shape is not perfect... because I think the shape might actually also get distorted even with an enforced 1 bound.

@t-downing

Ah yeah that paper looks to be doing exactly what we need to do! But yes, it's only one paper.

I think we would avoid trying to explicitly estimate an upper bound, because we still know the only hard upper bound is 100%. According to their documentation, that is the fraction of unmasked area, so even areas where much of the pixel is "persistent water" can still theoretically reach 100% from their algorithm (even if this is physically impossible). This does, I've just realized, beg the question - should we take into account the fraction of persistent water? Because in a pixel that is e.g. 50% persistent water, a flood fraction of 25% really only means 12.5% of the pixel is flooded. For flood exposure I think it's ok to ignore, because we can assume that no people live in persistent water areas. But for taking raster stats of purely the SFED this might mean we should weight pixels based on their fraction of persistent water. Possibly splitting hairs here though - it would probably only make a difference on coastlines where there would be a lot of persistent water (less so with rivers).

Re: logit - maybe this is a good middle ground that still uses the "industry-standard" LP3 distribution, but bounded to [0, 1]. But of course it would change the distribution shape a lot. Maybe we see what it looks like?

Or- I am still good with just moving forward with standard LP3, and just double checking it doesn't put a non-negligible fraction of the distribution above 100% for any admins.

@zackarno
Author

zackarno commented Oct 30, 2024

@t-downing - yeah, I think it's definitely tricky to figure out what the exact best method is. I'd say the upper bound in some pixels can be 100%, but in many admins it cannot be, as the hydrological modeling algorithm used for FloodScan would have made some areas not floodable at the high resolution. Since some areas could not be 100%, when they are aggregated to SFED_AREA the areas excluded from flooding would automatically lower the % below 100.

This being said, I'm sure we could justify an upper limit of 100 if we really wanted... but it doesn't appear to be necessary, as I am seeing the LP3 curve flatten out quite quickly: here's a screenshot from this rendered notebook. Interestingly there are some odd cases where GEV exceeds 100%, but LP3 and Gumbel flatten out quite a lot.

[image: RP curves from the rendered notebook]

It's true that the upper limit would potentially change the shape to a degree, but if these are leveling out so far below 100% I'm not sure how necessary it is, or whether it could even have an unintended effect. Also btw, I extended the plots to 100,000 years and they look the same, just harder to read....

Still think it will be worth checking on the full admin-level stats.

@t-downing

Interesting how GEV really goes off the charts. Guess that's a good reason not to use it!

And yeah I think comparison for all admins (or at least ones that are more different than Somalia) will be good. Based on their algorithm performance doc there's definitely variation in flood detection based on land cover.

Like, the Cameroon area I was looking at has a lot of "herbaceous wetland", and looks like this (from WorldCover):

Whereas the most "herbaceous wetland"-y part of Somalia only looks like this (to scale):

Basically, I think in addition to any difference in actual flooding, there'll be a systematic difference in flood detection that would push those SFED values even higher.

Anyways, sorry, enough speculation on my side! We can just check the actual values when we've fitted them for all admins.

@zackarno
Author

zackarno commented Nov 5, 2024

So what do you say we consider this PR "finished" and:

  1. Migrate the current branch to the new repo: hdx-floodscan
  2. Open a PR and merge in the new repo
  3. Close the PR here

I think we can still continue the interesting discussion on methods in the other repo as the raster stats become available for testing, but as long as the methods are implemented correctly I don't see any reason to hold up merging there, as we are just writing functions to implement different method options, not actual pipeline code yet.

@t-downing @hannahker - thoughts on the proposal?

@hannahker
Collaborator

@zackarno good from my end!

@t-downing

@zackarno yes sounds good to me! We can keep talking about which RP distribution model to actually use for the HDX dataset in hdx_floodscan. Agreed that the models are implemented correctly, just a matter now of deciding which one to go with based on how the RP distributions look for all the admins.

@zackarno
Author

Closing PR as it has been merged in the new repo: OCHA-DAP/hdx-floodscan#2

@zackarno zackarno closed this Nov 14, 2024