
STEP - Causality Model Proposal #40

Open
wants to merge 2 commits into base: main

Conversation


@Spinachboul Spinachboul commented Feb 2, 2025

  • This PR proposes adopting causality-based models, enabling users to analyze and incorporate causal relationships in time series data.
  • The initial implementation defines Granger causality using the statsmodels package.
  • Plans are to expand to other models such as SCMs (Structural Causal Models) and Causal Impact Analysis.

@fkiraly
I would appreciate your review of and feedback on this proposal.

This was discussed recently in one of the community meetups.

@felipeangelimvieira
Contributor

Hi @Spinachboul ! Really interesting. Your prototypical implementation is more focused on causal discovery, isn't it? Since get_causal_relationships returns a dict of booleans.

Should we separate causal discovery (identifying edges in a causal graph) and causal estimation (quantifying the causal effects)?

@Spinachboul
Author

Spinachboul commented Feb 4, 2025

@felipeangelimvieira
Yes, you're right! It would be great to quantify the relationships while keeping estimation separate from discovery, for a more modular approach (combining both might also be okay, but I would prefer to keep them separate).

@Spinachboul
Author

@felipeangelimvieira
But I am not sure about the implementation of causal_estimation; I would need some time to come up with good methods.

@Spinachboul
Author

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

@felipeangelimvieira
Contributor

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔
Was there a discussion somewhere in Discord or in other repos about having such causal capabilities? I see that it would be a really nice-to-have! I was wondering if these causal discovery algorithms could be implemented as transformations that take series and output a primitive dataframe with the causal relationship between each series, similar to a feature selection algorithm.

@Spinachboul
Author

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔 Was there a discussion somewhere in Discord or in other repos about having such causal capabilities? I see that it would be a really nice-to-have! I was wondering if these causal discovery algorithms could be implemented as transformations that take series and output a primitive dataframe with the causal relationship between each series, similar to a feature selection algorithm.

I mentioned this idea to @fkiraly earlier in a meetup.
I was planning to first implement this on a smaller scale (for variables), which would show how the results look.
Of course, we can also produce relationships between series, where a df would give n*(n-1)/2 results (unordered pairs), given n series.
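As a side note, the pair count mentioned above can be sketched with a quick stdlib example (series names are illustrative). One caveat: Granger causality is directional, so testing both directions of each pair gives n*(n-1) ordered pairs, while n*(n-1)/2 counts unordered pairs only:

```python
# Illustrative sketch: how many series pairs a causal-discovery
# transformer would need to test, given n series.
from itertools import combinations, permutations

series = ["series_1", "series_2", "series_3", "series_4"]  # n = 4

# unordered pairs: n * (n - 1) / 2 rows in the output dataframe
unordered = list(combinations(series, 2))  # 6 pairs for n = 4

# Granger causality is directional, so testing both directions
# gives n * (n - 1) ordered pairs instead
ordered = list(permutations(series, 2))  # 12 pairs for n = 4
```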

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔 Was there a discussion somewhere in Discord or in other repos about having such causal capabilities?

There is at least one other discussion stream; it is in the Waterloo Diabetes project - FYI @RobotPsychologist, @dvirzg

Contributor

@fkiraly fkiraly left a comment


Very interesting proposal!

I think this needs more details:

  • What are the API constraints on the return of get_causal_relationships? Can this be any object, or does it need to have a specific structure?
  • What does fit take as input?

Specifically, could you write complete docstrings?

Further, the design looks much like the BaseParamFitter API in the param_est module, with y added - currently the design does not have a y, but the PR of @satvshr sktime/sktime#7737 does add this argument.

@Spinachboul
Author

Very interesting proposal!

I think this needs more details:

  • What are the API constraints on the return of get_causal_relationships? Can this be any object, or does it need to have a specific structure?
  • What does fit take as input?

Specifically, could you write complete docstrings?

Further, the design looks much like the BaseParamFitter API in the param_est module, with y added - currently the design does not have a y, but the PR of @satvshr sktime/sktime#7737 does add this argument.

get_causal_relationships
Right now, it returns a dictionary of booleans indicating causal relationships, but we should refine this. Should it return a structured DataFrame, adjacency matrix, or something else?
fit()
Currently, it takes in multivariate time series data, but we need to clarify whether exogenous variables should be supported explicitly.
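To make the return-type options concrete, here is a hedged sketch (assuming pandas; all variable names are illustrative, not part of the proposed API) of converting the current boolean dict into the two candidate formats:

```python
# Illustrative sketch: converting the boolean dict into the two
# candidate return formats under discussion.
import pandas as pd

relationships = {
    ("series_1", "series_2"): True,
    ("series_1", "series_3"): False,
    ("series_2", "series_1"): False,
}

# option 1: a long-format DataFrame, one row per tested pair
long_df = pd.DataFrame(
    [(cause, effect, flag) for (cause, effect), flag in relationships.items()],
    columns=["cause", "effect", "causes"],
)

# option 2: an adjacency matrix, rows = cause, columns = effect
names = sorted({name for pair in relationships for name in pair})
adj = pd.DataFrame(False, index=names, columns=names)
for (cause, effect), flag in relationships.items():
    adj.loc[cause, effect] = flag
```

The adjacency-matrix form generalizes naturally if booleans are later replaced by effect sizes or p-values.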

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

Right now, it returns a dictionary of booleans indicating causal relationships, but we should refine this. Should it return a structured DataFrame, adjacency matrix, or something else?

Can you please be more specific, what do these booleans mean, and what are keys/values?

@Spinachboul
Author

@fkiraly
For get_causal_relationships, the idea is to return a dictionary where the keys represent the pairs of time series being evaluated for causal relationships, and the values are booleans indicating whether a causal relationship exists between them.
Keys: each key would be a tuple of two series, ("series_1", "series_2")
Values: True or False

Example

{
  ("series_1", "series_2"): True,  
  ("series_1", "series_3"): False, 
  ("series_2", "series_1"): False, 
}

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

hm, I see. This seems a bit coarse though, since the usual type of causal model includes information such as a lag and/or temporal information?

What I conclude is that the return type could differ, so to me it looks like conceptually this could be a BaseParamFitter.

@Spinachboul
Author

hm, I see. This seems a bit coarse though, since the usual type of causal model includes information such as a lag and/or temporal information?

What I conclude is that the return type could differ, so to me it looks like conceptually this could be a BaseParamFitter.

I see the alignment with BaseParamFitter—especially since causal discovery involves estimating parameters from data.
Would you suggest a specific format for the output? Maybe something similar to how BaseParamFitter structures parameter estimation results?

@Spinachboul
Author

@fkiraly @felipeangelimvieira

{
  'GNPDEFL': {
    'p_values': {'lag_1': 0.026476931316108947, 'lag_2': 0.009733634400436127, 'lag_3': 0.38419852258973297},
    'f_statistics': {'lag_1': 4.92459997027204, 'lag_2': 9.264335854605449, 'lag_3': 3.048560568286733},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'GNP': {
    'p_values': {'lag_1': 0.0006408724209669563, 'lag_2': 0.0008142205309970811, 'lag_3': 0.20728479688777332},
    'f_statistics': {'lag_1': 11.653343604411486, 'lag_2': 14.226558612107596, 'lag_3': 4.55673893254005},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'UNEMP': {
    'p_values': {'lag_1': 0.004885397945988011, 'lag_2': 0.046851094174294124, 'lag_3': 0.4830204408659219},
    'f_statistics': {'lag_1': 7.92138055609516, 'lag_2': 6.121561831459139, 'lag_3': 2.457501797798271},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'ARMED': {
    'p_values': {'lag_1': 0.7072092336214633, 'lag_2': 0.8866305644628261, 'lag_3': 0.9872402888611982},
    'f_statistics': {'lag_1': 0.14108046927690157, 'lag_2': 0.2406537669494021, 'lag_3': 0.1356494262747804},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'POP': {
    'p_values': {'lag_1': 0.003734244452874032, 'lag_2': 2.988839245657152e-07, 'lag_3': 6.973137157524612e-06},
    'f_statistics': {'lag_1': 8.408725069699363, 'lag_2': 30.04642110216874, 'lag_3': 26.649289355650488},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
}
  Predictor  Max F-Statistic  Max Adjusted R²
0   GNPDEFL         9.264336              NaN
1       GNP        14.226559              NaN
2     UNEMP         7.921381              NaN
3     ARMED         0.240654              NaN
4       POP        30.046421              NaN

This is the output from the initial implementation of the GrangerCausalityFitter class.

  • I did use BaseParamFitter for accessing the functions, though the fit() method wasn't completely implemented
  • To work around that, I hardcoded a custom ParamFitter class, defining its own fitted_params_:
class ParamFitter(BaseParamFitter):
    def _fit(self, X):
        self.fitted_params_ = {"mean": X.mean()}
        return self
  • Also took into consideration the significant parameters such as max_lag, test_method, significance_level
    (which were mentioned earlier)

Return Type

A dictionary with predictors as keys, and p-values and F-statistics as values.

Would love to know your feedback on this!!

@satvshr

satvshr commented Feb 7, 2025

Hey @Spinachboul!

  • So in counter to that, I harcoded a custom ParamFitter class, defining own fitted_params

I do not think you should do that, given that it conflicts with the API design throughout the codebase (any core dev can verify that claim). Just to make a suggestion, I think we should pause this PR (i.e., the incomplete _fit part) until I have implemented y in the base class.

@Spinachboul
Author

Hey @Spinachboul!

  • So in counter to that, I harcoded a custom ParamFitter class, defining own fitted_params

I do not think you should do that, given that it conflicts with the API design throughout the codebase (any core dev can verify that claim). Just to make a suggestion, I think we should pause this PR (i.e., the incomplete _fit part) until I have implemented y in the base class.

Yeah, that works too; I would then simply go with the abstract methods of BaseParamFitter.
Cool, then I will wait for your PR and resume working on mine.
Thanks for the heads up!!

@RobotPsychologist

RobotPsychologist commented Feb 13, 2025

Sorry I lost track of this discussion thread, it seems like this may be a significant and interesting project. Perhaps some world class expertise would be appropriate to bring in on this discussion? @dvirzg would your supervisors at the Perimeter Institute be interested in contributing to this?

@Spinachboul
Author

@RobotPsychologist That would be great!
Would love to have you on board!!

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

@Spinachboul, thanks for your ideas, and sorry for the late reply!

Regarding the Granger causality fitter, I think this does not justify a new base class, but it does justify a new instance of the parameter fitter. I would use the version by @satvshr here: sktime/sktime#7737, which allows a second argument y.

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

Regarding causality models overall: I think we should look at a broader class of examples first. Granger causality is a misnomer and - despite its name - is not actually a causality model (along the lines of "do operator" in the Pearl calculus). It is a simple lag association model, which is the deeper reason that we can map it onto the parameter fitter model.

I agree with @RobotPsychologist that this PR would benefit from some expertise, more specifically, examples of time series specific modelling approaches that should be expected to be covered.

@Spinachboul
Author

Spinachboul commented Feb 17, 2025

@fkiraly @RobotPsychologist
Thanks for the insightful review!
Based on the previous comments, I have written a draft prototype; I would like to know if we are on the same page with this:

from sktime.param_est.base import BaseParamFitter
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

class GrangerCausalityFitter(BaseParamFitter):

    def __init__(self, max_lag=5, significance_level=0.05):
        self.max_lag = max_lag
        self.significance_level = significance_level
        super().__init__()

    def _fit(self, X, y=None):
        if y is None:
            raise ValueError("GrangerCausalityFitter requires a target series y")

        # ensure y is a DataFrame for compatibility
        y = y if isinstance(y, pd.DataFrame) else y.to_frame()

        # grangercausalitytests evaluates whether the second column
        # Granger-causes the first, i.e., whether X causes y
        data = pd.concat([y, X], axis=1)

        # note: the statsmodels keyword is maxlag, not max_lag
        results = grangercausalitytests(data, maxlag=self.max_lag, verbose=False)

        p_values = {}
        test_statistics = {}
        for lag, result in results.items():
            p_values[lag] = result[0]["ssr_ftest"][1]  # p-value of the ssr F-test
            test_statistics[lag] = result[0]["ssr_ftest"][0]  # F-statistic

        significant_lags = {
            lag: p for lag, p in p_values.items() if p < self.significance_level
        }
        most_significant_lag = (
            min(significant_lags, key=significant_lags.get)
            if significant_lags
            else None
        )

        self.fitted_params_ = {
            "most_significant_lag": most_significant_lag,
            "p_values": p_values,
            "test_statistics": test_statistics,
        }

        return self

    def get_fitted_params(self):
        return self.fitted_params_
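As a quick sanity check of the lag-selection rule in _fit, the same logic can be exercised standalone (the p-values below are made up for illustration): among lags whose p-value clears the significance level, pick the one with the smallest p-value; otherwise return None.

```python
# Standalone check of the lag-selection rule used in _fit above.
def most_significant_lag(p_values, significance_level=0.05):
    # keep only lags that clear the significance threshold
    significant = {lag: p for lag, p in p_values.items() if p < significance_level}
    # smallest p-value wins; None if nothing is significant
    return min(significant, key=significant.get) if significant else None

# illustrative p-values, not real output
print(most_significant_lag({1: 0.03, 2: 0.01, 3: 0.40}))  # -> 2
print(most_significant_lag({1: 0.20, 2: 0.50}))           # -> None
```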

@Spinachboul
Author

So the above code addresses the following points:

  • No new base class introduced
  • Accepts the args (X, y)
  • Provides significant lags and stats
  • Handles Granger causality as a lag association model

@satvshr

satvshr commented Feb 17, 2025

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

ARlagorder was added in #7693 and merged, if that reference is required here.

@Spinachboul
Author

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

ARlagorder was added in #7693 and merged, if that reference is required here.

Thanks for the heads up, @satvshr! This will be helpful in determining the optimal lag value.

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

So the above code checks the following points:

* No new Base Class introduced

* Would accept the args `(x, Y)`

* Provides significant **lags** and **stats**

* Also handles `Granger Causality` as a lag association model

Great!

Except that I would put the fitted params not in a single dict, but as direct attributes ending in underscore.

I would also call the different lags simply lags_, because these can easily be plugged into the various estimators that take lags.

It would be appreciated if you could open an issue - or comment in an existing one, if one exists - for Granger causality based lag estimation.
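A minimal sketch of the attribute layout being suggested, in plain Python with the sktime base class omitted; the class name and method signature are illustrative, not sktime API:

```python
# Illustrative sketch: fitted params exposed as underscore-suffixed
# attributes rather than a single dict, with lags_ as the value that
# downstream estimators taking lags could plug in directly.
class GrangerLagSketch:
    def __init__(self, significance_level=0.05):
        self.significance_level = significance_level

    def fit(self, p_values, f_statistics):
        # direct attributes ending in underscore, set during fit
        self.p_values_ = p_values
        self.test_statistics_ = f_statistics
        # lags_ collects all lags that clear the significance level
        self.lags_ = sorted(
            lag for lag, p in p_values.items() if p < self.significance_level
        )
        return self
```

For example, GrangerLagSketch().fit({1: 0.03, 2: 0.20}, {1: 4.9, 2: 1.1}).lags_ gives [1], ready to hand to a lag-taking estimator.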

@Spinachboul
Author

So the above code checks the following points:

* No new Base Class introduced

* Would accept the args `(x, Y)`

* Provides significant **lags** and **stats**

* Also handles `Granger Causality` as a lag association model

Great!

Except that I would put the fitted params not in a single dict, but direct attributes ending in underscore.

I would also call the different lags simply lags_, because that can easily be plugged into the various estimators that take lags.

It would be appreciated if you could open - or comment in an existing issue, if exists - for Granger causality based lag estimation.

@fkiraly Sure! I have just raised an issue for this on sktime,
and made some changes to the prototype as per your suggestions.
Thanks!!
