Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEP: Deprecate all data reading functionality via pandas-datareader #97

Merged
merged 22 commits into from
Jun 14, 2018

Conversation

eigenfoo
Copy link
Contributor

@eigenfoo eigenfoo commented May 18, 2018

Closes #96 #89 #65 #64

Deprecates all data-fetching functionality via pandas-datareader, and adds a simple_returns helper function to replace that functionality that was implicitly done in the data-fetchers.

To quote the REAME:

As of early 2018, Yahoo Finance has suffered major API breaks with no stable replacement, and the Google Finance API has not been stable since late 2017 (source). In recent months it has become a greater and greater strain on the empyrical development team to maintain support for fetching data through pandas-datareader and other third-party libraries, as these APIs are known to be unstable.

As a result, all empyrical support for data reading functionality has been deprecated and will be removed in a future version.

Users should beware that the following functions are now deprecated:

  • empyrical.utils.cache_dir
  • empyrical.utils.data_path
  • empyrical.utils.ensure_directory
  • empyrical.utils.get_utc_timestamp
  • empyrical.utils._1_bday_ago
  • empyrical.utils.get_fama_french
  • empyrical.utils.load_portfolio_risk_factors
  • empyrical.utils.default_returns_func
  • empyrical.utils.get_symbol_returns_from_yahoo

Users should expect regular failures from the following functions, pending patches to the Yahoo or Google Finance API:

  • empyrical.utils.default_returns_func
  • empyrical.utils.get_symbol_returns_from_yahoo

@eigenfoo
Copy link
Contributor Author

eigenfoo commented May 18, 2018

@twiecki @richafrank requesting review from research and engineering, ready to merge otherwise.

@twiecki
Copy link
Contributor

twiecki commented May 22, 2018

LGTM. One thing we should add, though, is a function that takes prices and computes correctly indexed returns which one can pass into pyfolio.

@eigenfoo
Copy link
Contributor Author

@twiecki done

@twiecki
Copy link
Contributor

twiecki commented May 22, 2018

Thanks, this definitely requires engineering review. CC @richafrank

@@ -24,6 +24,16 @@
import pandas as pd
from pandas.tseries.offsets import BDay
from pandas_datareader import data as web
Copy link
Contributor Author

@eigenfoo eigenfoo Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twiecki @richafrank there is a known import error with pandas_datareader: it has been fixed upstream here, but has yet to be released in a new version. See this StackOverflow thread, and this pyfolio issue that brought this to my attention.

Wondering what we should do about this. Perhaps wrap it in a try-except, and raise a warning if the import fails? If we're deprecating pandas-datareader functionality anyways, it doesn't make sense to keep this line around if it raises an import error.

@twiecki
Copy link
Contributor

twiecki commented Jun 4, 2018 via email

@eigenfoo
Copy link
Contributor Author

eigenfoo commented Jun 4, 2018

Done.

Copy link

@ssanderson ssanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eigenfoo I took a pass here.

In general the deprecation warning implementation seems reasonable. I had some concerns about how strongly we want to push people toward Quantopian research, and I had some more philosophical comments on the newly-added returns function.

README.md Outdated

For alternative data sources, we suggest the following:

1. Migrate your research workflow to the Quantopian Research environment,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little torn on whether we want to include this here. On the one hand, it's definitely a useful and relevant resource for a consumer of empyrical. On the other hand, it feels a bit like advertising, which feels a little odd in an open source project.

I'd be curious to hear thoughts on this from others who regularly work on open source (cc @twiecki @richafrank @llllllllll @ehebert, @TimShawver)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I agree...I've been thinking about some similar questions with qgrid since we're going to direct people to research from there. It feels OK to me in the qgrid case because we're going to be using research to host a demo of qgrid. In this case it feels a bit weird because we're not directing them to anything empyrical-specific within research, we're just saying here's another place you can get data.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if making the statement less imperative would help it make it less of an advertisement.
i.e. "we suggest" -> "these options are available".

I lean on the side of removing, so that we don't have to worry about semantics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, removed all suggestions.

@@ -230,10 +274,12 @@ def get_utc_timestamp(dt):
_1_bday = BDay()


@deprecated(msg=DATAREADER_DEPRECATION_WARNING)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being deprecated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only called in functions that this PR is deprecating, and since this function is for internal use only (i.e. starts with an underscore), I thought it would make sense to deprecate it. Is that not good practice?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could, though the particular message is seems a bit confusing for this function. If this is called from deprecated functions, then I'd also be worried about warning spam from this. Since this is a "private" function (in general, functions prefixed with an _ are private by convention), it doesn't seem all that important to me to deprecate this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed deprecation warning.

@@ -205,14 +224,39 @@ def ensure_directory(path):
raise


def compute_returns(prices):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this function being added in this PR? This seems out of place with the rest of the changes here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do want this, I would probably call this something like simple_returns to distinguish it from log returns. As a general naming note, I'm usually not a fan of including generic verbs like compute in function names. All functions compute things, so the compute_ doesn't really add any useful information for a reader.

Copy link
Contributor Author

@eigenfoo eigenfoo Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, get_symbol_returns_from_yahoo (which is being deprecated) fetched pricing data from Yahoo, and computed returns, as in this line:
https://github.com/quantopian/empyrical/blob/master/empyrical/utils.py#L420

This function is merely to retain support for the conversion from prices to returns. Thinking about it, I agree that putting it in utils is a bit out of place, and that it probably belongs in stats with the name simple_returns.

"""

rets = prices.pct_change().dropna()
rets.index = rets.index.tz_localize('UTC')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as on your other PR re: not updating values in place here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stops being a problem if we implement the other changes discussed above.

and index coerced to be tz-aware.
"""

rets = prices.pct_change().dropna()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling dropna() and modifying the index here feel philosophically out of place to me for empyrical.

In my mind at least, the primary purpose of empyrical is to be low-level infrastructure that can be shared betwen Zipline, PyFolio, and Alphalens. Zipline pretty much never wants to call a function that silently discards bad data, or that changes the localization of a datetime index, because both of those behaviors prevent us from noticing bugs, and they incur nontrivial performance overhead.

Calling dropna() in particular here seems a bit fraught if we expect DataFrame input, because the behavior there is that pandas will drop an entire row if there are any NaNs in the row. If we expect the user to have missing data, then that seems like not a great behavior (if they pass enough columns, it seems likely that they'll drop a large percentage of their input data). If we don't expect them to have missing data, then we're just incurring extra cost for no reason.

If we want to include a simple returns function in empyrical, I would probably put it in stats and write it as something like:

def simple_returns(prices):
    """Compute simple returns from a timeseries of prices.

    <rest of the docstring>
    """
    if isinstance(prices, (pd.DataFrame, pd.Series)):
        return prices.pct_change().iloc[1:]
    else:
        # Assume ndarray
        out = np.diff(prices, axis=0)
        np.divide(out, prices[:-1], out=out)
        return out

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also needs a test if we decide to keep it.

Copy link
Contributor Author

@eigenfoo eigenfoo Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will sound bad but I was calling dropna because that was the way it was done before, and I assumed there was a reason for that 😦 I believe that dropna was only used to drop the first row, which would be NaN. Given that, using .iloc[1:] is more watertight.

I agree that this function belongs in stats, though. The only question is whether or not we want empyrical to tz-localize anything. Philosophically, I feel as if that should be the responsibility of downstream packages like alphalens and pyfolio, since empyrical is only responsible for low-level financial calculations, and is not meant to be user-facing at all. This relates to the input sanitization PR for pyfolio.

@twiecki thoughts? I've went ahead and implemented these changes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, .iloc[1:] is a bit clearer on the intent.

Copy link
Contributor

@twiecki twiecki Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And yeah, I'd be happy if we just completely did away with the whole tz-aware thing. Since we don't provide any data in the future ourselves, we can probably just ignore it and the user has to ensure that it matches up.

"version."
"\n"
"Please use empyrical in the Quantopian Research environment, or "
"supply your own data. See README.md for more details.")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A user reading this error message isn't necessarily going to know what README.md you mean. I'd probably make this a link the github README, and/or add a section in the docs.

"in empyrical has been deprecated and will be removed in a future "
"version."
"\n"
"Please use empyrical in the Quantopian Research environment, or "

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination would be to drop the reference to Quantopian Research here, especially if we include a reference to it in the README.

@@ -205,14 +224,39 @@ def ensure_directory(path):
raise


def compute_returns(prices):
"""
Computes correctly-indexed returns from prices.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reader of this function, I'm not sure I would know what "correctly-indexed" means in this context. Is there something more specific we can say here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to mean 'tz-localized', but with the other comments below I don't think we want to do that after all.

@eigenfoo
Copy link
Contributor Author

eigenfoo commented Jun 7, 2018

Note to self: still need to write a test for the simple_returns function.

@eigenfoo
Copy link
Contributor Author

eigenfoo commented Jun 8, 2018

Not quite sure why tests are failing. The checks fail on my local machine as well, but using pdb doesn't reproduce the error.

Any pointers @ssanderson @twiecki?

@eigenfoo
Copy link
Contributor Author

@twiecki fixed unit tests. Ready for another pass, or merging.

README.md Outdated
- `empyrical.utils.cache_dir`
- `empyrical.utils.data_path`
- `empyrical.utils.ensure_directory`
- `empyrical.utils._1_bday_ago`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't deprecated anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

@ssanderson
Copy link

@eigenfoo I had one more comment on the README. Otherwise this LGTM.

@twiecki twiecki merged commit 30a5c4c into quantopian:master Jun 14, 2018
@twiecki
Copy link
Contributor

twiecki commented Jun 14, 2018

Thanks @eigenfoo!

@eigenfoo eigenfoo mentioned this pull request Dec 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants