-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DEP: Deprecate all data reading functionality via pandas-datareader #97
Conversation
@twiecki @richafrank requesting review from research and engineering, ready to merge otherwise. |
LGTM. One thing we should add, though, is a function that takes prices and computes correctly indexed returns which one can pass into |
@twiecki done |
Thanks, this definitely requires engineering review. CC @richafrank |
empyrical/utils.py
Outdated
@@ -24,6 +24,16 @@ | |||
import pandas as pd | |||
from pandas.tseries.offsets import BDay | |||
from pandas_datareader import data as web |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twiecki @richafrank there is a known import error with pandas_datareader
: it has been fixed upstream here, but has yet to be released in a new version. See this StackOverflow thread, and this pyfolio issue that brought this to my attention.
Wondering what we should do about this. Perhaps wrap it in a try-except, and raise a warning if the import fails? If we're deprecating pandas-datareader
functionality anyways, it doesn't make sense to keep this line around if it raises an import error.
That sounds like a good solution to me.
…On Fri, Jun 1, 2018 at 4:39 PM, George Ho ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In empyrical/utils.py
<#97 (comment)>:
> @@ -24,6 +24,16 @@
import pandas as pd
from pandas.tseries.offsets import BDay
from pandas_datareader import data as web
@twiecki <https://github.com/twiecki> @richafrank
<https://github.com/richafrank> there is a known import error with
pandas_datareader: it has been fixed upstream here
<pydata/pandas-datareader#520>, but has yet to be
released in a new version. See this StackOverflow thread
<https://stackoverflow.com/questions/50394873/import-pandas-datareader-gives-importerror-cannot-import-name-is-list-like>
.
Wondering what we should do about this. Perhaps wrap it in a try-except,
and raise a warning if the import fails? If we're deprecating
pandas-datareader functionality anyways, it doesn't make sense to keep
this line around if it raises an import error.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#97 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AApJmHDvFXNEj4bVyfBRWfVs4zWxUQnrks5t4VIkgaJpZM4UFLcn>
.
|
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eigenfoo I took a pass here.
In general the deprecation warning implementation seems reasonable. I had some concerns about how strongly we want to push people toward Quantopian research, and I had some more philosophical comments on the newly-added returns function.
README.md
Outdated
|
||
For alternative data sources, we suggest the following: | ||
|
||
1. Migrate your research workflow to the Quantopian Research environment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little torn on whether we want to include this here. On the one hand, it's definitely a useful and relevant resource for a consumer of empyrical. On the other hand, it feels a bit like advertising, which feels a little odd in an open source project.
I'd be curious to hear thoughts on this from others who regularly work on open source (cc @twiecki @richafrank @llllllllll @ehebert, @TimShawver)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I agree...I've been thinking about some similar questions with qgrid since we're going to direct people to research from there. It feels OK to me in the qgrid case because we're going to be using research to host a demo of qgrid. In this case it feels a bit weird because we're not directing them to anything empyrical-specific within research, we're just saying here's another place you can get data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if making the statement less imperative would help it make it less of an advertisement.
i.e. "we suggest" -> "these options are available".
I lean on the side of removing, so that we don't have to worry about semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, removed all suggestions.
empyrical/utils.py
Outdated
@@ -230,10 +274,12 @@ def get_utc_timestamp(dt): | |||
_1_bday = BDay() | |||
|
|||
|
|||
@deprecated(msg=DATAREADER_DEPRECATION_WARNING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this being deprecated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only called in functions that this PR is deprecating, and since this function is for internal use only (i.e. starts with an underscore), I thought it would make sense to deprecate it. Is that not good practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could, though the particular message is seems a bit confusing for this function. If this is called from deprecated functions, then I'd also be worried about warning spam from this. Since this is a "private" function (in general, functions prefixed with an _
are private by convention), it doesn't seem all that important to me to deprecate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed deprecation warning.
empyrical/utils.py
Outdated
@@ -205,14 +224,39 @@ def ensure_directory(path): | |||
raise | |||
|
|||
|
|||
def compute_returns(prices): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this function being added in this PR? This seems out of place with the rest of the changes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do want this, I would probably call this something like simple_returns
to distinguish it from log returns. As a general naming note, I'm usually not a fan of including generic verbs like compute
in function names. All functions compute things, so the compute_
doesn't really add any useful information for a reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, get_symbol_returns_from_yahoo
(which is being deprecated) fetched pricing data from Yahoo, and computed returns, as in this line:
https://github.com/quantopian/empyrical/blob/master/empyrical/utils.py#L420
This function is merely to retain support for the conversion from prices to returns. Thinking about it, I agree that putting it in utils
is a bit out of place, and that it probably belongs in stats
with the name simple_returns
.
empyrical/utils.py
Outdated
""" | ||
|
||
rets = prices.pct_change().dropna() | ||
rets.index = rets.index.tz_localize('UTC') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note as on your other PR re: not updating values in place here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stops being a problem if we implement the other changes discussed above.
empyrical/utils.py
Outdated
and index coerced to be tz-aware. | ||
""" | ||
|
||
rets = prices.pct_change().dropna() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling dropna()
and modifying the index here feel philosophically out of place to me for empyrical.
In my mind at least, the primary purpose of empyrical is to be low-level infrastructure that can be shared betwen Zipline, PyFolio, and Alphalens. Zipline pretty much never wants to call a function that silently discards bad data, or that changes the localization of a datetime index, because both of those behaviors prevent us from noticing bugs, and they incur nontrivial performance overhead.
Calling dropna()
in particular here seems a bit fraught if we expect DataFrame input, because the behavior there is that pandas will drop an entire row if there are any NaNs in the row. If we expect the user to have missing data, then that seems like not a great behavior (if they pass enough columns, it seems likely that they'll drop a large percentage of their input data). If we don't expect them to have missing data, then we're just incurring extra cost for no reason.
If we want to include a simple returns function in empyrical, I would probably put it in stats
and write it as something like:
def simple_returns(prices):
"""Compute simple returns from a timeseries of prices.
<rest of the docstring>
"""
if isinstance(prices, (pd.DataFrame, pd.Series)):
return prices.pct_change().iloc[1:]
else:
# Assume ndarray
out = np.diff(prices, axis=0)
np.divide(out, prices[:-1], out=out)
return out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs a test if we decide to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will sound bad but I was calling dropna
because that was the way it was done before, and I assumed there was a reason for that 😦 I believe that dropna
was only used to drop the first row, which would be NaN. Given that, using .iloc[1:]
is more watertight.
I agree that this function belongs in stats
, though. The only question is whether or not we want empyrical
to tz-localize anything. Philosophically, I feel as if that should be the responsibility of downstream packages like alphalens and pyfolio, since empyrical is only responsible for low-level financial calculations, and is not meant to be user-facing at all. This relates to the input sanitization PR for pyfolio.
@twiecki thoughts? I've went ahead and implemented these changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, .iloc[1:]
is a bit clearer on the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yeah, I'd be happy if we just completely did away with the whole tz-aware thing. Since we don't provide any data in the future ourselves, we can probably just ignore it and the user has to ensure that it matches up.
empyrical/utils.py
Outdated
"version." | ||
"\n" | ||
"Please use empyrical in the Quantopian Research environment, or " | ||
"supply your own data. See README.md for more details.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user reading this error message isn't necessarily going to know what README.md you mean. I'd probably make this a link the github README, and/or add a section in the docs.
empyrical/utils.py
Outdated
"in empyrical has been deprecated and will be removed in a future " | ||
"version." | ||
"\n" | ||
"Please use empyrical in the Quantopian Research environment, or " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inclination would be to drop the reference to Quantopian Research here, especially if we include a reference to it in the README.
empyrical/utils.py
Outdated
@@ -205,14 +224,39 @@ def ensure_directory(path): | |||
raise | |||
|
|||
|
|||
def compute_returns(prices): | |||
""" | |||
Computes correctly-indexed returns from prices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a reader of this function, I'm not sure I would know what "correctly-indexed" means in this context. Is there something more specific we can say here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to mean 'tz-localized', but with the other comments below I don't think we want to do that after all.
Note to self: still need to write a test for the |
Not quite sure why tests are failing. The checks fail on my local machine as well, but using Any pointers @ssanderson @twiecki? |
@twiecki fixed unit tests. Ready for another pass, or merging. |
README.md
Outdated
- `empyrical.utils.cache_dir` | ||
- `empyrical.utils.data_path` | ||
- `empyrical.utils.ensure_directory` | ||
- `empyrical.utils._1_bday_ago` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't deprecated anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops
@eigenfoo I had one more comment on the README. Otherwise this LGTM. |
Thanks @eigenfoo! |
Closes #96 #89 #65 #64
Deprecates all data-fetching functionality via
pandas-datareader
, and adds asimple_returns
helper function to replace that functionality that was implicitly done in the data-fetchers.To quote the REAME:
As of early 2018, Yahoo Finance has suffered major API breaks with no stable replacement, and the Google Finance API has not been stable since late 2017 (source). In recent months it has become a greater and greater strain on the
empyrical
development team to maintain support for fetching data throughpandas-datareader
and other third-party libraries, as these APIs are known to be unstable.As a result, all
empyrical
support for data reading functionality has been deprecated and will be removed in a future version.Users should beware that the following functions are now deprecated:
empyrical.utils.cache_dir
empyrical.utils.data_path
empyrical.utils.ensure_directory
empyrical.utils.get_utc_timestamp
empyrical.utils._1_bday_ago
empyrical.utils.get_fama_french
empyrical.utils.load_portfolio_risk_factors
empyrical.utils.default_returns_func
empyrical.utils.get_symbol_returns_from_yahoo
Users should expect regular failures from the following functions, pending patches to the Yahoo or Google Finance API:
empyrical.utils.default_returns_func
empyrical.utils.get_symbol_returns_from_yahoo