
STEP - Causality Model Proposal #40

Open
wants to merge 2 commits into base: main

Conversation


@Spinachboul Spinachboul commented Feb 2, 2025

  • This PR proposes adopting causality-based models, enabling users to analyze and incorporate causal relationships in time series data.
  • The initial implementation defines Granger causality using the statsmodels package.
  • Plans are to expand to other models such as SCMs (Structural Causal Models) and Causal Impact Analysis.

@fkiraly
I would appreciate your review of and feedback on this proposal.

This was discussed recently in one of the community meetups.

@felipeangelimvieira
Contributor

Hi @Spinachboul ! Really interesting. Your prototypical implementation is more focused on causal discovery, isn't it? Since get_causal_relationships returns a dict of booleans.

Should we separate causal discovery (identifying edges in a causal graph) and causal estimation (quantifying the causal effects)?

@Spinachboul
Author

Spinachboul commented Feb 4, 2025

@felipeangelimvieira
Yes, you're right! It would be great to quantify the relationships while keeping estimation separate from discovery, for a more modular approach (combining both might also be okay, but I would prefer to keep them separate).

@Spinachboul
Author

@felipeangelimvieira
But I am not sure about the implementation of causal_estimation; I would need some time to come up with good methods.

@Spinachboul
Author

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

@felipeangelimvieira
Contributor

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔
Was there a discussion somewhere in Discord or in other repos about having such causal capabilities? I see that it would be a really nice-to-have! I was wondering if these causal discovery algorithms could be implemented as transformations that take series and output a primitive dataframe with the causal relationship between each series, similar to a feature selection algorithm.

@Spinachboul
Author

@felipeangelimvieira
Would you like me to work on that as well, since that's also a great feature which will be needed!!

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔 Was there a discussion somewhere in Discord or in other repos about having such causal capabilities? I see that it would be a really nice-to-have! I was wondering if these causal discovery algorithms could be implemented as transformations that take series and output a primitive dataframe with the causal relationship between each series, similar to a feature selection algorithm.

I mentioned this idea to @fkiraly earlier in a meetup.
I was planning to first implement this on a smaller scale (for variables), which would show how the results look.
Of course, we can also produce relationships between series, where a df would give n*(n-1)/2 results (unordered pairs), given n series.
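As a side note, the pair count mentioned above can be sketched with a quick stdlib example (series names are illustrative). One caveat: Granger causality is directional, so testing both directions of each pair gives n*(n-1) ordered pairs, while n*(n-1)/2 counts unordered pairs only:

```python
# Illustrative sketch: how many series pairs a causal-discovery
# transformer would need to test, given n series.
from itertools import combinations, permutations

series = ["series_1", "series_2", "series_3", "series_4"]  # n = 4

# unordered pairs: n * (n - 1) / 2 rows in the output dataframe
unordered = list(combinations(series, 2))  # 6 pairs for n = 4

# Granger causality is directional, so testing both directions
# gives n * (n - 1) ordered pairs instead
ordered = list(permutations(series, 2))  # 12 pairs for n = 4
```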

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

It seems like something huge; we should possibly study the possibilities if we are willing to go that way 🤔 Was there a discussion somewhere in Discord or in other repos about having such causal capabilities?

There is at least one other discussion stream; it is in the Waterloo Diabetes project - FYI @RobotPsychologist, @dvirzg

Contributor

@fkiraly fkiraly left a comment


Very interesting proposal!

I think this needs more details:

  • What are the API constraints on the return of get_causal_relationships? Can this be any object, or does it need to have a specific structure?
  • What does fit take as input?

Specifically, could you write complete docstrings?

Further, the design looks much like the BaseParamFitter API in the param_est module, with y added - currently the design does not have a y, but the PR of @satvshr sktime/sktime#7737 does add this argument.

@Spinachboul
Author

Very interesting proposal!

I think this needs more details:

  • What are the API constraints on the return of get_causal_relationships? Can this be any object, or does it need to have a specific structure?
  • What does fit take as input?

Specifically, could you write complete docstrings?

Further, the design looks much like the BaseParamFitter API in the param_est module, with y added - currently the design does not have a y, but the PR of @satvshr sktime/sktime#7737 does add this argument.

get_causal_relationships
Right now, it returns a dictionary of booleans indicating causal relationships, but we should refine this. Should it return a structured DataFrame, adjacency matrix, or something else?
fit()
Currently, it takes in multivariate time series data, but we need to clarify whether exogenous variables should be supported explicitly.
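To make the return-type options concrete, here is a hedged sketch (assuming pandas; all variable names are illustrative, not part of the proposed API) of converting the current boolean dict into the two candidate formats:

```python
# Illustrative sketch: converting the boolean dict into the two
# candidate return formats under discussion.
import pandas as pd

relationships = {
    ("series_1", "series_2"): True,
    ("series_1", "series_3"): False,
    ("series_2", "series_1"): False,
}

# option 1: a long-format DataFrame, one row per tested pair
long_df = pd.DataFrame(
    [(cause, effect, flag) for (cause, effect), flag in relationships.items()],
    columns=["cause", "effect", "causes"],
)

# option 2: an adjacency matrix, rows = cause, columns = effect
names = sorted({name for pair in relationships for name in pair})
adj = pd.DataFrame(False, index=names, columns=names)
for (cause, effect), flag in relationships.items():
    adj.loc[cause, effect] = flag
```

The adjacency-matrix form generalizes naturally if booleans are later replaced by effect sizes or p-values.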

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

Right now, it returns a dictionary of booleans indicating causal relationships, but we should refine this. Should it return a structured DataFrame, adjacency matrix, or something else?

Can you please be more specific, what do these booleans mean, and what are keys/values?

@Spinachboul
Author

@fkiraly
For get_causal_relationships, the idea is to return a dictionary where the keys represent the pairs of time series being evaluated for causal relationships, and the values are booleans indicating whether a causal relationship exists between them.
Keys: each key would be a tuple of two series, ("series_1", "series_2")
Values: True or False

Example

{
  ("series_1", "series_2"): True,  
  ("series_1", "series_3"): False, 
  ("series_2", "series_1"): False, 
}

@fkiraly
Contributor

fkiraly commented Feb 6, 2025

hm, I see. This seems a bit coarse though, since the usual type of causal model includes information such as a lag and/or temporal information?

What I conclude is that the return type could differ, so to me it looks like conceptually this could be a BaseParamFitter.

@Spinachboul
Author

hm, I see. This seems a bit coarse though, since the usual type of causal model includes information such as a lag and/or temporal information?

What I conclude is that the return type could differ, so to me it looks like conceptually this could be a BaseParamFitter.

I see the alignment with BaseParamFitter—especially since causal discovery involves estimating parameters from data.
Would you suggest a specific format for the output? Maybe something similar to how BaseParamFitter structures parameter estimation results?

@Spinachboul
Author

@fkiraly @felipeangelimvieira

{
  'GNPDEFL': {
    'p_values': {'lag_1': 0.026476931316108947, 'lag_2': 0.009733634400436127, 'lag_3': 0.38419852258973297},
    'f_statistics': {'lag_1': 4.92459997027204, 'lag_2': 9.264335854605449, 'lag_3': 3.048560568286733},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'GNP': {
    'p_values': {'lag_1': 0.0006408724209669563, 'lag_2': 0.0008142205309970811, 'lag_3': 0.20728479688777332},
    'f_statistics': {'lag_1': 11.653343604411486, 'lag_2': 14.226558612107596, 'lag_3': 4.55673893254005},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'UNEMP': {
    'p_values': {'lag_1': 0.004885397945988011, 'lag_2': 0.046851094174294124, 'lag_3': 0.4830204408659219},
    'f_statistics': {'lag_1': 7.92138055609516, 'lag_2': 6.121561831459139, 'lag_3': 2.457501797798271},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'ARMED': {
    'p_values': {'lag_1': 0.7072092336214633, 'lag_2': 0.8866305644628261, 'lag_3': 0.9872402888611982},
    'f_statistics': {'lag_1': 0.14108046927690157, 'lag_2': 0.2406537669494021, 'lag_3': 0.1356494262747804},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
  'POP': {
    'p_values': {'lag_1': 0.003734244452874032, 'lag_2': 2.988839245657152e-07, 'lag_3': 6.973137157524612e-06},
    'f_statistics': {'lag_1': 8.408725069699363, 'lag_2': 30.04642110216874, 'lag_3': 26.649289355650488},
    'adj_r_squared': {'lag_1': nan, 'lag_2': nan, 'lag_3': nan},
  },
}
  Predictor  Max F-Statistic  Max Adjusted R²
0   GNPDEFL         9.264336              NaN
1       GNP        14.226559              NaN
2     UNEMP         7.921381              NaN
3     ARMED         0.240654              NaN
4       POP        30.046421              NaN

This is the output from the initial implementation of the GrangerCausalityFitter class.

  • I did use BaseParamFitter for accessing the functions, though the fit() method wasn't completely implemented
  • To work around that, I hardcoded a custom ParamFitter class, defining its own fitted_params_:
class ParamFitter(BaseParamFitter):
    def _fit(self, X):
        self.fitted_params_ = {"mean": X.mean()}
        return self
  • Also took into consideration the significant parameters such as max_lag, test_method, significance_level
    (which were mentioned earlier)

Return Type

A dictionary with predictors as keys, and p-values and F-statistics as values.

Would love to know your feedback on this!!

@satvshr

satvshr commented Feb 7, 2025

Hey @Spinachboul!

  • So in counter to that, I harcoded a custom ParamFitter class, defining own fitted_params

I do not think you should do that, given that it conflicts with the API design throughout the codebase (any core dev can verify that claim). Just to make a suggestion, I think we should pause this PR (i.e., the incomplete _fit part) until I have implemented y in the base class.

@Spinachboul
Author

Hey @Spinachboul!

  • So in counter to that, I harcoded a custom ParamFitter class, defining own fitted_params

I do not think you should do that, given that it conflicts with the API design throughout the codebase (any core dev can verify that claim). Just to make a suggestion, I think we should pause this PR (i.e., the incomplete _fit part) until I have implemented y in the base class.

Yeah, that works too; I would then simply go with the abstract methods of BaseParamFitter.
Cool, then I will wait for your PR and resume working on mine.
Thanks for the heads up!!

@RobotPsychologist

RobotPsychologist commented Feb 13, 2025

Sorry I lost track of this discussion thread, it seems like this may be a significant and interesting project. Perhaps some world class expertise would be appropriate to bring in on this discussion? @dvirzg would your supervisors at the Perimeter Institute be interested in contributing to this?

@Spinachboul
Author

@RobotPsychologist That would be great!
Would love to have you on board!!

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

@Spinachboul, thanks for your ideas, and sorry for the late reply!

Regarding the Granger causality fitter, I think this does not justify a new base class, but it does justify a new instance of the parameter fitter. I would use the version by @satvshr here: sktime/sktime#7737, which allows a second argument y.

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

Regarding causality models overall: I think we should look at a broader class of examples first. Granger causality is a misnomer and - despite its name - is not actually a causality model (along the lines of "do operator" in the Pearl calculus). It is a simple lag association model, which is the deeper reason that we can map it onto the parameter fitter model.

I agree with @RobotPsychologist that this PR would benefit from some expertise, more specifically, examples of time series specific modelling approaches that should be expected to be covered.

@Spinachboul
Author

Spinachboul commented Feb 17, 2025

@fkiraly @RobotPsychologist
Thanks for the insightful review!
Based on the previous comments, I have written a draft prototype; I would like to know if we are on the same page with this:

from sktime.param_est.base import BaseParamFitter
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

class GrangerCausalityFitter(BaseParamFitter):

    def __init__(self, max_lag=5, significance_level=0.05):
        self.max_lag = max_lag
        self.significance_level = significance_level
        super().__init__()

    def _fit(self, X, y=None):
        if y is None:
            raise ValueError("GrangerCausalityFitter requires a target series y")

        # ensure y is a DataFrame for compatibility
        y = y if isinstance(y, pd.DataFrame) else y.to_frame()

        # grangercausalitytests evaluates whether the second column
        # Granger-causes the first, i.e., whether X causes y
        data = pd.concat([y, X], axis=1)

        # note: the statsmodels keyword is maxlag, not max_lag
        results = grangercausalitytests(data, maxlag=self.max_lag, verbose=False)

        p_values = {}
        test_statistics = {}
        for lag, result in results.items():
            p_values[lag] = result[0]["ssr_ftest"][1]  # p-value of the ssr F-test
            test_statistics[lag] = result[0]["ssr_ftest"][0]  # F-statistic

        significant_lags = {
            lag: p for lag, p in p_values.items() if p < self.significance_level
        }
        most_significant_lag = (
            min(significant_lags, key=significant_lags.get)
            if significant_lags
            else None
        )

        self.fitted_params_ = {
            "most_significant_lag": most_significant_lag,
            "p_values": p_values,
            "test_statistics": test_statistics,
        }

        return self

    def get_fitted_params(self):
        return self.fitted_params_
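As a quick sanity check of the lag-selection rule in _fit, the same logic can be exercised standalone (the p-values below are made up for illustration): among lags whose p-value clears the significance level, pick the one with the smallest p-value; otherwise return None.

```python
# Standalone check of the lag-selection rule used in _fit above.
def most_significant_lag(p_values, significance_level=0.05):
    # keep only lags that clear the significance threshold
    significant = {lag: p for lag, p in p_values.items() if p < significance_level}
    # smallest p-value wins; None if nothing is significant
    return min(significant, key=significant.get) if significant else None

# illustrative p-values, not real output
print(most_significant_lag({1: 0.03, 2: 0.01, 3: 0.40}))  # -> 2
print(most_significant_lag({1: 0.20, 2: 0.50}))           # -> None
```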

@Spinachboul
Author

So the above code addresses the following points:

  • No new base class introduced
  • Accepts the args (X, y)
  • Provides significant lags and stats
  • Handles Granger causality as a lag association model

@satvshr

satvshr commented Feb 17, 2025

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

ARlagorder was added in #7693 and merged, if that reference is required here.

@Spinachboul
Author

The fitted parameters would be the most significant lag, and the other parameters such as significances, statistics per lag, etc. This would go in the same API reference category as the AR-based lag fitter that @satvshr was going to add.

ARlagorder was added in #7693 and merged, if that reference is required here.

Thanks for the heads up, @satvshr! This will be helpful in determining the optimal lag value.

@fkiraly
Contributor

fkiraly commented Feb 17, 2025

So the above code checks the following points:

* No new Base Class introduced

* Would accept the args `(x, Y)`

* Provides significant **lags** and **stats**

* Also handles `Granger Causality` as a lag association model

Great!

Except that I would put the fitted params not in a single dict, but as direct attributes ending in underscore.

I would also call the different lags simply lags_, because these can easily be plugged into the various estimators that take lags.

It would be appreciated if you could open an issue - or comment in an existing one, if one exists - for Granger causality based lag estimation.
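A minimal sketch of the attribute layout being suggested, in plain Python with the sktime base class omitted; the class name and method signature are illustrative, not sktime API:

```python
# Illustrative sketch: fitted params exposed as underscore-suffixed
# attributes rather than a single dict, with lags_ as the value that
# downstream estimators taking lags could plug in directly.
class GrangerLagSketch:
    def __init__(self, significance_level=0.05):
        self.significance_level = significance_level

    def fit(self, p_values, f_statistics):
        # direct attributes ending in underscore, set during fit
        self.p_values_ = p_values
        self.test_statistics_ = f_statistics
        # lags_ collects all lags that clear the significance level
        self.lags_ = sorted(
            lag for lag, p in p_values.items() if p < self.significance_level
        )
        return self
```

For example, GrangerLagSketch().fit({1: 0.03, 2: 0.20}, {1: 4.9, 2: 1.1}).lags_ gives [1], ready to hand to a lag-taking estimator.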

@Spinachboul
Author

So the above code checks the following points:

* No new Base Class introduced

* Would accept the args `(x, Y)`

* Provides significant **lags** and **stats**

* Also handles `Granger Causality` as a lag association model

Great!

Except that I would put the fitted params not in a single dict, but direct attributes ending in underscore.

I would also call the different lags simply lags_, because that can easily be plugged into the various estimators that take lags.

It would be appreciated if you could open - or comment in an existing issue, if exists - for Granger causality based lag estimation.

@fkiraly Sure! I have just raised an issue for this on sktime,
and made some changes to the prototype as per your suggestions.
Thanks!!
