Tisane API: Definitions and Overview
- (more important) API alternatives for Study Design: Where should "repeated measures" and the notion of "between"/"within" subjects lie in the design specification?
- (less important) API updates for Statistical Model: Are explicit demarcations of fixed vs. random vs. interaction variables helpful?
Definition: Study designs describe how and from whom data were collected. Experiments and observational studies both have study designs. In particular, designs consist of three aspects:
- data type and cardinality (data schema),
- hierarchies of observations, and
- clusters in the data that are due to collection procedures
Variables (for both examples)
math = ts.Numeric('MathAchievement')
hw = ts.Numeric('HomeWork')
race = ts.Nominal('Race', cardinality=5)
tutoring = ts.Nominal('Tutoring', cardinality=3)
mean_ses = ts.Numeric('MeanSES')
Observational study:
design = ts.Design(
    dv=math,
    ivs=ts.Level(identifier='student', measures=[hw, race]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses]))
)
I read this as... "In this study, we are interested in math as the dependent variable. There are two levels of nested IVs. Students have hw and race. Schools have mean_ses. Students are nested under schools."
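To make the nesting concrete, here is a minimal sketch of how `Level` and `nest_under` could behave. It is not Tisane's implementation: the `Variable` and `Level` classes, the `parent` attribute, and the `chain` helper are all hypothetical stand-ins written only to illustrate the reading above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variable:
    # Hypothetical stand-in for ts.Numeric / ts.Nominal.
    name: str
    cardinality: Optional[int] = None  # None for numeric variables


@dataclass
class Level:
    # Hypothetical stand-in for ts.Level.
    identifier: str
    measures: List[Variable]
    parent: Optional["Level"] = None  # the level this one is nested under

    def nest_under(self, other: "Level") -> "Level":
        # Record the nesting and return self so the call can be chained.
        self.parent = other
        return self

    def chain(self) -> List[str]:
        # Identifiers from this (lowest) level up to the top level.
        names, level = [], self
        while level is not None:
            names.append(level.identifier)
            level = level.parent
        return names


hw = Variable("HomeWork")
race = Variable("Race", cardinality=5)
mean_ses = Variable("MeanSES")

student = Level("student", [hw, race]).nest_under(Level("school", [mean_ses]))
print(student.chain())  # ['student', 'school']
```

Returning the lower level from `nest_under` is one design choice among several; it keeps the lowest level (where the DV is assumed to live) as the entry point of the hierarchy.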
Experiment:
design = ts.Design(
    dv=math,
    ivs=ts.Level(identifier='student', measures=[ts.Repeat(tutoring)]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses]))
)
I read this as... "In this study, we are interested in math as the dependent variable. There are two levels of nested IVs. Students have multiple values for tutoring. Schools have mean_ses. Students are nested under schools."
Alternative: Associate "repeats" with the DV:
design = ts.Design(
    # 'for' is a reserved word in Python, so the keyword is spelled for_each here
    dv=ts.Repeat(outcome=math, according_to=tutoring, for_each='student'),
    ivs=ts.Level(identifier='student', measures=[tutoring]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses]))
)
Assumptions:
- The DV is at the lowest level. Otherwise, it would not make sense to use the lowest level in an analysis.
- Without Repeat, the default assumption about measures is that they are "between" subjects, meaning each subject (at each level) provides or has one value.
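The "between by default" assumption can be sketched as a simple type check. The `Measure` and `Repeat` classes and the `collection_kind` function below are hypothetical illustrations, not part of the proposed API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    # Hypothetical stand-in for a Tisane variable.
    name: str


@dataclass(frozen=True)
class Repeat:
    # Hypothetical wrapper: the measure is observed more than once per subject.
    measure: Measure


def collection_kind(m) -> str:
    # Wrapped in Repeat -> "within"; anything else falls back to "between".
    return "within" if isinstance(m, Repeat) else "between"


tutoring = Measure("Tutoring")
hw = Measure("HomeWork")

kinds = {
    "Tutoring": collection_kind(Repeat(tutoring)),
    "HomeWork": collection_kind(hw),
}
print(kinds)
```

Under this sketch, a measure never has to be declared "between" explicitly; only the exception (repetition) is marked.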
API Constructs:
- Variables contain data type (e.g., ts.Numeric, ts.Nominal) and cardinality information.
- Levels specify any hierarchies in the data.
- Repeat specifies a measure that has been -- QUESTION
Previously, I had identified:
- data type and cardinality (data schema),
- nesting structures,
- the presence of manipulations (e.g., treatment vs. control), and
- how the manipulations, if any, were distributed (e.g., between-, within-subjects).
(I revised this list because the third and fourth aspects seem related and are not always applicable for experiments vs. observational studies.)
Definition: A statistical model is a Generalized Linear Mixed-effects Model, which inherits properties from "generalized linear models" and "mixed-effects models." Namely, a statistical model consists of the following:
- a dependent variable
- a set of independent variables (both fixed effects and random slopes and random intercepts)
- a family describing the distribution of the dependent variable (also called "response variable"), which is allowed to be non-Gaussian (e.g., Poisson, Binomial)
- a link function describing how the estimated/predicted values of the dependent variable relate to the dependent variable (e.g., identity, log, loglog)
Note: A "variance function" describes how a family defines its variance. However, a "variance function" is only one part of a family, and a family is more descriptive. Thus, I moved away from having a "variance function" to having a "family" in the Statistical Model.
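For reference, the four components in the definition map onto the standard GLMM formulation below. The symbols are my own notation rather than anything fixed by the API: i indexes the lowest-level unit (e.g., student), j indexes the grouping unit (e.g., school), β holds the fixed effects, u_j holds the random intercepts and slopes for group j, g is the link function, and φ is a dispersion parameter where the family has one.

```latex
\begin{aligned}
y_{ij} \mid u_j &\sim \mathrm{Family}(\mu_{ij}, \phi) \\
g(\mu_{ij}) &= x_{ij}^\top \beta + z_{ij}^\top u_j \\
u_j &\sim \mathcal{N}(0, \Sigma)
\end{aligned}
```

With a Gaussian family and the identity link, this reduces to the ordinary linear mixed-effects model.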
Statistical model for observational study above without separation:
sm = ts.StatisticalModel(
    dv=math,
    ivs=[ts.FixedVariable(race),
         ts.FixedVariable(mean_ses),
         ts.RandomSlope(slope_for_each=hw, slopes_vary_among=school),
         ts.RandomIntercept(intercept_for_each=hw, intercepts_vary_among=school),
         ts.Interaction(hw, mean_ses)
    ],
    family='Gaussian',
    link_func='identity'
)
Pros: (1) No arbitrary distinction between different types of IVs; prioritizes the IV over the IV type. (2) The connection to the mathematical formula is a bit easier to read/detect.
Cons: (1) Could easily get confusing if IVs are listed in arbitrary order.
Alternative that separates out fixed, random, and interaction effects:
sm = ts.StatisticalModel(
    dv=math,
    fixed_ivs=[ts.FixedVariable(race), ts.FixedVariable(mean_ses)],
    random_ivs=[ts.RandomSlope(slope_for_each=hw, slopes_vary_among=school),
                ts.RandomIntercept(intercept_for_each=hw, intercepts_vary_among=school)],
    interaction_ivs=[ts.Interaction(hw, mean_ses)],
    family='Gaussian',
    link_func='identity'
)
Pros: (1) Effect types are more detectable. (2) Maintains some similarities to formulae.
Cons: (1) Is emphasizing effect types the right thing to do here? (Prioritizes the IV type over the IV.)
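The two alternatives carry the same information, which suggests the choice is mostly about what the user sees. As a rough sketch, a flat `ivs` list can be partitioned into the three separated groups by type. The effect classes here are hypothetical stand-ins that take plain strings instead of Tisane variables, and `separate` is an illustrative helper, not an existing function.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FixedVariable:
    name: str


@dataclass(frozen=True)
class RandomSlope:
    slope_for_each: str
    slopes_vary_among: str


@dataclass(frozen=True)
class RandomIntercept:
    intercept_for_each: str
    intercepts_vary_among: str


@dataclass(frozen=True)
class Interaction:
    left: str
    right: str


def separate(ivs: List[object]) -> Dict[str, List[object]]:
    # Partition a flat IV list into the three keyword groups by effect type.
    groups: Dict[str, List[object]] = {
        "fixed_ivs": [], "random_ivs": [], "interaction_ivs": []
    }
    for iv in ivs:
        if isinstance(iv, FixedVariable):
            groups["fixed_ivs"].append(iv)
        elif isinstance(iv, (RandomSlope, RandomIntercept)):
            groups["random_ivs"].append(iv)
        elif isinstance(iv, Interaction):
            groups["interaction_ivs"].append(iv)
        else:
            raise TypeError(f"unknown effect type: {type(iv).__name__}")
    return groups


flat = [
    FixedVariable("Race"),
    FixedVariable("MeanSES"),
    RandomSlope("HomeWork", "school"),
    RandomIntercept("HomeWork", "school"),
    Interaction("HomeWork", "MeanSES"),
]
groups = separate(flat)
print({key: len(value) for key, value in groups.items()})
```

If a conversion like this is acceptable, the flat form could be the single underlying representation, with the separated form offered as a view for readers who want effect types grouped.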
API constructs:
- dv
- Sets of ivs
- family
- link function
Misc question: Is there a better way to represent family and link_func than arbitrary strings, e.g., as ts objects?
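One possible answer, sketched under the assumption that families and links become small immutable objects. The `Link` and `Family` classes and the constants below are hypothetical; only the pairings (Gaussian with identity, Poisson with log) are the standard canonical links.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Link:
    # Hypothetical link-function object.
    name: str
    forward: Callable[[float], float]  # g: mean -> linear predictor
    inverse: Callable[[float], float]  # g^-1: linear predictor -> mean


IDENTITY = Link("identity", lambda mu: mu, lambda eta: eta)
LOG = Link("log", math.log, math.exp)


@dataclass(frozen=True)
class Family:
    # Hypothetical family object that knows its canonical link.
    name: str
    default_link: Link


GAUSSIAN = Family("Gaussian", IDENTITY)
POISSON = Family("Poisson", LOG)

# A misspelled family is now a NameError at definition time
# rather than a silently accepted string.
eta = POISSON.default_link.forward(2.0)
mu = POISSON.default_link.inverse(eta)
print(POISSON.name, POISSON.default_link.name, mu)
```

Objects like these would also give the API a natural place to validate family/link combinations, which strings cannot do on their own.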