Rp calcs #7
Conversation
# %autoreload 2

# %%
from src.utils import cloud_utils
Would also add:

import os
os.chdir('..')

so that the src module is imported properly.
Can you explain this to me?
- Is this required on top of the setup.cfg, pyproject.toml, and __init__.py files and pip install -e .?
- Or does this replace the need for some combination of the above?

The modules seem to be importing properly on my machine using the top bullet point.
I think running pip install -e . obviates the need for os.chdir('..'). Then you can import the src module the same way regardless of what directory you're in, with a notebook or a .py. I also put pip install -e . in whatever GH Action workflow file so it runs the same in there too. I'm not sure what the best practice is, though - it's just based on our coding guidelines.
Oh nice! I'm not very familiar with using pip install -e. Looks great then.
# We can also run them all at once with this wrapper

# %%
rp.lp3_params_all2(df_max[df_max["adm1_en"] == sample_admin].value)
Getting an error since the lp3_params_all2 function doesn't exist. Is this just lp3_params_all?
Yes, sorry - leftover dev code - fixed in 846210a.
from lmoments3 import distr, lmom_ratios


def lp3_params(x, est_method="lmoments"):
Is there a use case for having these params as an intermediate output? Otherwise I wonder if it might make sense to just put this function inside lp3_rp so the user only has to specify the method. This could also avoid some potential confusion, as the output format for each method is different too.
Yeah, for the purpose of the actual dataset pipeline I think it's fine to wrap them together. However, they are more useful separate for general analysis, as lp3_params() is calculated on an aggregated dataset where the block maxima are calculated over any desired time block. The lp3_rp() function then takes the calculated params as inputs and applies them to any vector -- which for our purpose is the values from the "raw" dataset (not aggregated). You could combine them in a more sophisticated wrapper where you specify the time step for the first aggregation, but it might end up less flexible than we often want in various analyses.
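The two-step split described above could be sketched roughly like this (a minimal illustration using scipy's pearson3 MLE fit rather than the repo's actual lmoments implementation; the function names and bodies here are illustrative, not the PR's code):

```python
import numpy as np
from scipy.stats import pearson3


def fit_lp3(block_maxima):
    # Step 1: fit a Pearson III distribution to log10 of the block maxima
    # (scipy MLE here; the real module also supports an L-moments method)
    x_log = np.log10(np.asarray(block_maxima, dtype=float))
    skew, loc, scale = pearson3.fit(x_log)
    return skew, loc, scale


def apply_lp3_rp(values, params):
    # Step 2: apply previously fitted params to any vector of raw values,
    # converting exceedance probability into a return period
    skew, loc, scale = params
    x_log = np.log10(np.asarray(values, dtype=float))
    p_exceed = 1 - pearson3.cdf(x_log, skew, loc=loc, scale=scale)
    return 1 / p_exceed
```

The point is that the fit runs once on the aggregated block maxima, while the second function can be applied to the raw (non-aggregated) values afterwards.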
setup.cfg (Outdated)
@@ -0,0 +1,5 @@
[metadata]
Would .gitignore this?
Sure - I'm not sure which of these setup files are typically ignored? If we remove it, someone else has to recreate it.
Anyways, I ignored it: b6729ab
I've always left it in to be able to run pip install -e . (and allow a GH Action to run it). So I just ended up recreating this file locally. But if there's a better way of doing this, I am definitely open to it!
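For context, the kind of minimal setup.cfg that lets pip install -e . work might look something like this (a sketch only - the name and version values are assumptions, not the actual file contents from this PR):

```ini
[metadata]
name = src          ; hypothetical package name
version = 0.1.0     ; hypothetical version

[options]
packages = find:
```

With packages = find:, setuptools discovers the src package (given it has an __init__.py), so import src works from any directory after the editable install.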
Ok, yes, this makes sense! I didn't understand the connection with pip install -e.
# Now we can loop through each admin and calculate RPs associated with each value

# %%
unique_adm1s = df_params_long["adm1_en"].unique()
Also, df_params_long isn't defined!
Woopsy - leftover code from exploration - fixed: 846210a
def lp3_rp(x, params, est_method="lmoments"):
    x = np.asarray(x)
    x_sorted = np.sort(x)
    x_log = np.log10(x_sorted)
This is throwing a lot of runtime warnings from the demo notebook: RuntimeWarning: divide by zero encountered in log10 -- is that expected?
Good call - we should deal with that.
I think for the purpose of calculating RP values for FloodScan, the best and easiest way to deal with it would be to remove all 0 values for the log-transform step and then re-insert them at the end with an RP value of 1. What do you think @t-downing? Note this is different treatment of 0s than the lp3_params() func, which runs only on the block maxima aggregate. In that calculation we just set 0 equal to the next lowest value (this is less of an issue because there are not many 0s in the annual maxima aggregate).
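That drop-then-reinsert treatment could be sketched as a small wrapper (an illustration only - rp_with_zeros and the rp_func argument are hypothetical names, not functions from this PR):

```python
import numpy as np


def rp_with_zeros(values, rp_func):
    # Drop zeros before the log-transform-based RP calculation,
    # then re-insert them with an RP of 1, as proposed above.
    values = np.asarray(values, dtype=float)
    rp = np.ones_like(values)       # zeros default to RP = 1
    nonzero = values > 0
    rp[nonzero] = rp_func(values[nonzero])
    return rp
```

This keeps the zeros out of np.log10 entirely, so the divide-by-zero warnings never fire.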
Maybe one of you can explain to me all the other warnings that chunk generates - I think I am doing something weird in pandas that's not recommended, but works?
Yeah, there's also the classic:

/var/folders/rv/xmclt0vn5y7cqt46s5xq3h080000gn/T/ipykernel_90173/1646057511.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Although from this stack trace it's hard to tell where this is coming from... It doesn't actually look like it's your code? So it might be ok to ignore for now.
It's from here! #7 (comment)
Added some minor comments around the implementation approach and reproducibility of the notebook code. What are your thoughts on which method is most appropriate here?
Thank you very much!! I think I've addressed them all, but still need to implement #7 (comment) - also curious if you or Tristan would be able to demystify the other warning I mentioned on this comment. But yeah, I'm not 100% sure on which LP3 method yet. I think we will be able to make a more informed decision once we have the internal zonal stats pipeline complete, where we can do a more thorough comparison of the implications of the decision (some already done on just a subset here). This is kind of why I provided the argument, so that we can play with both for now. This will also be useful to inform how we want to treat the 0s, as we will be able to see how much of an issue this is. I think it's actually a bit more important in the
Looks great! Just quickly submitting this review now so you can see my comments, as they're relevant to the question you just asked about pandas warnings. Will do another review later today to address the distribution modeling.
@@ -0,0 +1,123 @@
import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3
May want to include the libraries that I had to install manually:
- scipy
- matplotlib
- lmoments3
- pyarrow

either in requirements.txt or some other requirements file.
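A possible requirements.txt addition covering those (unpinned here; exact version pins would be a project decision):

```text
scipy
matplotlib
lmoments3
pyarrow
```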
for adm in unique_adm1s:
    df_params_adm_filt = df_params[df_params["adm1_en"] == adm]
    df_adm_filt = df[df["adm1_en"] == adm]
If you replace this line with df_adm_filt = df[df["adm1_en"] == adm].copy(), you can avoid all the:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

(which, to be totally honest, I never really understood, but I do know that you can get rid of them by putting a .copy() somewhere)
# %%
# %matplotlib inline
# %load_ext autoreload
# %autoreload 2
Could chuck in a %load_ext jupyter_black here if you want to black-format the cells as you run them - up to you.
import pandas as pd
import numpy as np
from scipy.stats import skew, norm, pearson3
from lmoments3 import distr, lmom_ratios
Slightly reluctant to use lmoments3 as it only has 7 stars on GH. If there's a simpler way we can calculate the distributions, it may be better for production. But for now, no problem.
Hmm, good point on the stars! Let's just leave it in for the experimental stage and potentially remove it in the main pipeline.
Also -- as just discussed with @zackarno, we'll create a separate repo for the FloodScan HDX outputs, so let's not merge this and leave it as an exploratory PR we can pull from.
Sounds good - let's keep the PR going here though, so the convo is all in one place. Once approved, I'll open a PR in the separate repo and link it.
I think my main concern for the fits is handling values that get close to 1 (i.e., totally flooded). For the areas that we've checked already (Somalia), this isn't really an issue because the fraction of the distribution that is >1 is negligible. But it might be more of an issue elsewhere, and if we want to set LP3 as our standard methodology for RP, we should just make sure it's robust to these situations. I haven't done a thorough investigation, but this is just an example I've picked from some recent analysis in Logone-et-Chari department (admin2) in northern Cameroon. Here's the maximum

Basically, values near 1 definitely happen (and there was one pixel that actually hit 1). Now, taking the pixel with the highest median value, we can do the various LP3 fits to its maximum yearly value (treating the pixel exactly the same as the previous

We see that we exceed a value of 100% at a return period just above 10 years (whereas of course the empirical levels off). We can also see this by looking at the PDF (histogram of the actual yearly maxima in blue):

For all distributions there is a decent chunk above 100%. Now, of course, this is a totally fabricated example by cherry-picking a single very wet pixel. But I was able to find this in a tiny part of the world I happened to be looking at already, so it's possible that there will be an admin somewhere else that has more of an overlap with very wet pixels, and we get the same problem with the actual admin bounds. So basically I think it's worth running the RP calcs for more admins to see whether this is actually a problem or whether I'm just being fussy.
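One quick way to quantify this per admin (a sketch only - it assumes the LP3 params are Pearson III params fitted in log10 space, as elsewhere in this PR, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import pearson3


def mass_above_full_flood(skew, loc, scale):
    # Probability mass the fitted LP3 assigns above a flood fraction of 1.0;
    # log10(1.0) == 0, so this is the upper tail beyond 0 in log space.
    return 1 - pearson3.cdf(0.0, skew, loc=loc, scale=scale)
```

Running this over all admins would show directly which (if any) put a non-negligible fraction of the distribution above 100%.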
@t-downing good points. It makes sense to me that we should try to understand the implications of how and when an RP calculated from the modelled LP3 distribution would exceed 100% flood fraction. However, I'm not sure I understand why we'd be concerned about a single pixel being at 100%. The calculations are run on I'd think it makes sense to consider at the zonal (admin 2) level, and that once we get the admin2 stats for the entire FloodScan AOI, we could examine more with something like:
But maybe I'm missing something about why we should be concerned about the pixel-level RP, so please explain! I guess it would be nice to have a method that handles that well, but I'm not sure that calculating at the pixel level is all that useful (or even comparable to zonal) -- especially not for this data set.
No no, I don't think we need to do anything at a pixel level - I just grabbed a pixel because it was an easy way to illustrate my point. I think your proposed way forward makes sense. I suspect that in practice this won't really be an issue. Re: point 3 though - I think the issue is more that bounding the distribution to [0, 1] would actually change the shape of the whole distribution (slightly), not just return periods above a certain threshold. Overall though, in this case, it's probably overly pedantic to bound the distributions like this, given it most likely will make no difference. It's more just a case of keeping this in mind to make sure we don't use distributions that theoretically include impossible values, for future return period modeling.
I understand. That's a good point on how the overall shape would change. I don't think it's overly pedantic ;-). A couple more thoughts on dealing with the bounding:
One additional point on bounding [0,1]:
Ah yeah, that paper looks to be doing exactly what we need to do! But yes, it's only one paper. I think we would avoid trying to explicitly estimate an upper bound, because we still know the only hard upper bound is 100%. According to their documentation, that is the fraction of unmasked area, so even areas where much of the pixel is "persistent water" can still theoretically reach 100% from their algorithm (even if this is physically impossible).

This does, I've just realized, beg the question - should we take into account the fraction of persistent water? Because in a pixel that is e.g. 50% persistent water, a flood fraction of 25% really only means 12.5% of the pixel is flooded. For flood exposure I think it's ok to ignore, because we can assume that no people live in persistent water areas. But for taking raster stats of purely the SFED, this might mean we should weight pixels based on their fraction of persistent water. Possibly splitting hairs here though - it would probably only make a difference on coastlines, where there would be a lot of persistent water (less so with rivers).

Re: logit - maybe this is a good middle ground that still uses the "industry-standard" LP3 distribution, but bounded to [0, 1]. But of course it would change the distribution shape a lot. Maybe we see what it looks like? Or - I am still good with just moving forward with standard LP3, and just double-checking it doesn't put a non-negligible fraction of the distribution above 100% for any admins.
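If we did want to try the logit route, a minimal sketch might look like the following (an assumed approach for discussion, not code from this PR - fitting a Pearson III in logit space so back-transformed values can never leave (0, 1)):

```python
import numpy as np
from scipy.stats import pearson3


def fit_logit_p3(frac, eps=1e-6):
    # Map flood fractions from (0, 1) to the real line via logit,
    # then fit a Pearson III there; eps guards against exact 0 or 1.
    frac = np.clip(np.asarray(frac, dtype=float), eps, 1 - eps)
    z = np.log(frac / (1 - frac))
    return pearson3.fit(z)


def logit_p3_rv(rp_years, params):
    # Return value for a given return period, mapped back to a
    # fraction via the inverse logit (sigmoid), so it stays in (0, 1).
    skew, loc, scale = params
    z = pearson3.ppf(1 - 1 / rp_years, skew, loc=loc, scale=scale)
    return 1 / (1 + np.exp(-z))
```

Even at very long return periods the back-transformed return value asymptotes below 1, which is the bounding behaviour being discussed - at the cost of changing the distribution's shape relative to plain LP3.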
@t-downing - yeah, I think it's definitely tricky to figure out the exact best method. I'd say the upper bound in some pixels can be 100%, but in many admins it cannot be, as the hydrological modeling algorithm used for FloodScan would have made some areas not floodable at the high resolution. As some areas could not be 100% when they are aggregated to

This being said, I'm sure we could justify an upper limit of 100 if we really wanted... but it doesn't appear to be necessary, as I am seeing the LP3 curve flatten out quite quickly: here's a screenshot from this rendered notebook. Interestingly, there are some odd cases where GEV exceeds 100%, but LP3 and Gumbel flatten out quite a lot. It's true that the upper limit would potentially change the shape to a degree, but if these are leveling out so far below 100%, I'm not sure how necessary it is, or if it could even have an unintended effect. Also, btw, I extended the plots to 100,000 years and they look the same, just harder to read... still think it will be worth checking on the full admin-level stats.
Interesting where GEV really goes off the charts - guess that's a good reason not to use it! And yeah, I think a comparison for all admins (or at least ones that are more different than Somalia) will be good. Based on their algorithm performance doc, there's definitely variation in flood detection based on land cover. Like, the Cameroon area I was looking at has a lot of "herbaceous wetland", and looks like this (from WorldCover):

Whereas the most "herbaceous wetland"-y part of Somalia only looks like this (to scale):

Basically, I think in addition to any difference in actual flooding, there'll be a systematic difference in flood detection that would push those

Anyways, sorry, enough speculation on my side! We can just check the actual values when we've fitted them for all admins.
So what do we say we consider this PR "finished" and:
I think we can still continue the interesting discussion on methods in the other repo as the raster stats become available for testing, but as long as the methods are implemented correctly, I don't see any reason to hold up merging there, as we are just writing functions to implement different method options, not actual pipeline code yet. @t-downing @hannahker - thoughts on the proposal?
@zackarno good from my end!
@zackarno yes, sounds good to me! We can keep talking about which RP distribution model to actually use for the HDX dataset in
Closing PR as it has been merged in the new repo: OCHA-DAP/hdx-floodscan#2
Implement Log Pearson III RP/RV calculations, which we will use for the FloodScan tabular dataset.

I wrote a return_periods.py module to hold the functions and then created an exploratory.py script that can be opened w/ jupytext to show how one might implement the functions. More theoretical background on RP methods can be found here.

I'm not sure if this will end up being put in this repo in the final pipeline, but will still do the PR so that the calculations themselves can be reviewed by @t-downing. Also tagging @hannahker & @isatotun to think/recommend how we'd actually orchestrate implementation.