Tisane API: Definitions and Overview

QUESTIONS:

(more important) API alternatives for Study Design: Where should "repeated measures" and the notion of "between"/"within" subjects lie in the design specification?
(less important) API updates for Statistical Model: Are explicit demarcations of fixed vs. random vs. interaction variables helpful?

Study Design

Definition: Study designs describe how and from whom data were collected. Experiments and observational studies both have study designs. In particular, designs consist of three aspects:

data type and cardinality (data schema),
hierarchies of observations, and
clusters in the data that are due to collection procedures

Variables (for both examples)

import tisane as ts  # `ts` is the alias used throughout this page

math = ts.Numeric('MathAchievement')
hw = ts.Numeric('HomeWork')
race = ts.Nominal('Race', cardinality=5)
tutoring = ts.Nominal('Tutoring', cardinality=3)
mean_ses = ts.Numeric('MeanSES')

Observational study:

design = ts.Design(
    dv=math,
    ivs=ts.Level(identifier='student', measures=[hw, race]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses])
    ),
)

I read this as... "In this study, we are interested in math as the dependent variable. There are two levels of nested IVs. Students have hw and race. Schools have mean_ses. Students are nested under schools."

Experiment:

design = ts.Design(
    dv=math,
    # Repeat marks tutoring as measured more than once per student
    ivs=ts.Level(identifier='student', measures=[ts.Repeat(tutoring)]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses])
    ),
)

I read this as... "In this study, we are interested in math as the dependent variable. There are two levels of nested IVs. Students have multiple values for tutoring. Schools have mean_ses. Students are nested under schools."

Alternative: Associate "repeats" with the DV:

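design = ts.Design(
    # `for` is a reserved word in Python and cannot be a keyword argument,
    # so the parameter is spelled `for_each` here (placeholder name).
    dv=ts.Repeat(outcome=math, according_to=tutoring, for_each='student'),
    ivs=ts.Level(identifier='student', measures=[tutoring]).nest_under(
        ts.Level(identifier='school', measures=[mean_ses])
    ),
)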

Assumptions:

The DV is at the lowest level. Otherwise, it would not make sense to use the lowest level in an analysis.
Without Repeat, the default assumption about measures is that they are "between" subjects, meaning each subject (at each level) provides or has exactly one value.
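
To make the "between" vs. "within" distinction concrete, here is a minimal sketch that checks whether a measure takes more than one value per subject in long-format data. The helper `is_repeated` and the example rows are hypothetical and are not part of Tisane; they only illustrate the property that Repeat is meant to declare.

```python
from collections import defaultdict


def is_repeated(rows, unit_key, measure_key):
    """Return True if any unit has more than one distinct value for the measure."""
    values_per_unit = defaultdict(set)
    for row in rows:
        values_per_unit[row[unit_key]].add(row[measure_key])
    return any(len(values) > 1 for values in values_per_unit.values())


# Hypothetical long-format data: one row per observation.
between_rows = [
    {"student": "s1", "Tutoring": "none"},
    {"student": "s2", "Tutoring": "peer"},
]
within_rows = [
    {"student": "s1", "Tutoring": "none"},
    {"student": "s1", "Tutoring": "peer"},
    {"student": "s2", "Tutoring": "none"},
    {"student": "s2", "Tutoring": "peer"},
]

print(is_repeated(between_rows, "student", "Tutoring"))  # False: "between" subjects
print(is_repeated(within_rows, "student", "Tutoring"))   # True: "within" subjects
```

A design specification could in principle be validated against the data this way, although whether Tisane should do so is a separate question.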

API Constructs:

Variables contain data type (e.g., ts.Numeric, ts.Nominal) and cardinality information.
Levels specify any hierarchies in the data
Repeat specifies a measure that has been -- QUESTION
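
One way to picture how these three constructs fit together is the sketch below. The classes are hypothetical stand-ins written for this page, not Tisane's implementation; the field names (`dtype`, `parent`) and the `path` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variable:
    """Stand-in for ts.Numeric / ts.Nominal: a data type plus optional cardinality."""
    name: str
    dtype: str  # "numeric" or "nominal"
    cardinality: Optional[int] = None


@dataclass
class Repeat:
    """Stand-in wrapper marking a measure as taken more than once per unit."""
    measure: Variable


@dataclass
class Level:
    """Stand-in for ts.Level: a unit of observation with its measures."""
    identifier: str
    measures: List[object] = field(default_factory=list)
    parent: Optional["Level"] = None

    def nest_under(self, other: "Level") -> "Level":
        # Record the enclosing level and return self so calls can be chained.
        self.parent = other
        return self

    def path(self) -> List[str]:
        # Walk from this level up to the outermost level.
        level, names = self, []
        while level is not None:
            names.append(level.identifier)
            level = level.parent
        return names


tutoring = Variable("Tutoring", "nominal", cardinality=3)
mean_ses = Variable("MeanSES", "numeric")

student = Level("student", measures=[Repeat(tutoring)]).nest_under(
    Level("school", measures=[mean_ses])
)
print(student.path())  # ['student', 'school']
```

Returning `self` from `nest_under` keeps the lowest level (where the DV is assumed to live) as the handle that gets passed around.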

Previously, I had identified:

data type and cardinality (data schema),
nesting structures,
the presence of manipulations (e.g., treatment vs. control), and
how the manipulations, if any, were distributed (e.g., between-, within-subjects).

(I revised this list because the third and fourth aspects seem related and do not apply equally to experiments and observational studies.)

Statistical Model

Definition: A statistical model is a Generalized Linear Mixed-effects Model, which inherits properties from "generalized linear models" and "mixed-effects models." Namely, a statistical model consists of the following:

a dependent variable
a set of independent variables (fixed effects as well as random slopes and random intercepts)
a family describing the distribution of the dependent variable (also called "response variable"), which is allowed to be non-Gaussian (e.g., Poisson, Binomial)
a link function describing how the expected value of the dependent variable relates to the linear combination of the independent variables (e.g., identity, log, loglog)
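
In symbols, the four components can be written in the standard GLMM form below. This is textbook notation rather than anything taken from Tisane; the symbols $X$, $Z$, $\beta$, $u$, and $G$ are introduced here for illustration.

```latex
g\bigl(\mathbb{E}[\, y \mid u \,]\bigr) = X\beta + Zu,
\qquad u \sim \mathcal{N}(0, G),
\qquad y \mid u \sim \text{family}
```

Here $y$ is the dependent variable, $X\beta$ collects the fixed effects, $Zu$ collects the random slopes and intercepts, $g$ is the link function, and the family gives the conditional distribution of $y$. With a Gaussian family and the identity link, this reduces to an ordinary linear mixed-effects model.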

Note: A "variance function" describes how a family defines its variance. However, a "variance function" is only one part of a family, and a family is more descriptive. Thus, I moved away from having a "variance function" to having a "family" in the Statistical Model.

Statistical model for observational study above without separation:

        sm = ts.StatisticalModel(
            dv=math,
            # All effects live in one flat list, regardless of type
            ivs=[
                ts.FixedVariable(race),
                ts.FixedVariable(mean_ses),
                ts.RandomSlope(slope_for_each=hw, slopes_vary_among=school),
                ts.RandomIntercept(intercept_for_each=hw, intercepts_vary_among=school),
                ts.Interaction(hw, mean_ses),
            ],
            family='Gaussian',
            link_func='identity',
        )

Pros: (1) No arbitrary distinction between different types of IVs; prioritizes the IV over the IV type. (2) The connection to the mathematical formula is a bit easier to read/detect.
Cons: (1) Could easily get confusing if IVs are listed in arbitrary order.

Alternative that separates out fixed, random, and interaction effects:

        sm = ts.StatisticalModel(
            dv=math,
            # Effects are grouped by type: fixed, random, interaction
            fixed_ivs=[ts.FixedVariable(race), ts.FixedVariable(mean_ses)],
            random_ivs=[
                ts.RandomSlope(slope_for_each=hw, slopes_vary_among=school),
                ts.RandomIntercept(intercept_for_each=hw, intercepts_vary_among=school),
            ],
            interaction_ivs=[ts.Interaction(hw, mean_ses)],
            family='Gaussian',
            link_func='identity',
        )

Pros: (1) Effect types are more detectable. (2) Maintains some similarity to formulae.
Cons: (1) Is emphasizing effect types the right thing to do here? (Prioritizes the IV type over the IV.)
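
The two alternatives carry the same information, which the following sketch tries to show: a flat `ivs` list, in any order, can be partitioned into the separated form by effect type. The dataclasses are hypothetical stand-ins for the `ts.*` effect classes, and `separate` is a helper invented for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FixedVariable:
    name: str


@dataclass
class RandomSlope:
    slope_for_each: str
    slopes_vary_among: str


@dataclass
class RandomIntercept:
    intercept_for_each: str
    intercepts_vary_among: str


@dataclass
class Interaction:
    left: str
    right: str


def separate(ivs: List[object]) -> Dict[str, List[object]]:
    """Group a flat list of effects into fixed, random, and interaction effects."""
    groups: Dict[str, List[object]] = {
        "fixed_ivs": [],
        "random_ivs": [],
        "interaction_ivs": [],
    }
    for iv in ivs:
        if isinstance(iv, FixedVariable):
            groups["fixed_ivs"].append(iv)
        elif isinstance(iv, (RandomSlope, RandomIntercept)):
            groups["random_ivs"].append(iv)
        elif isinstance(iv, Interaction):
            groups["interaction_ivs"].append(iv)
        else:
            raise TypeError(f"Unknown effect type: {type(iv).__name__}")
    return groups


# Deliberately listed in arbitrary order, as in the "Cons" of the flat form.
flat_ivs = [
    Interaction("HomeWork", "MeanSES"),
    FixedVariable("Race"),
    RandomSlope("HomeWork", "school"),
    FixedVariable("MeanSES"),
    RandomIntercept("HomeWork", "school"),
]

groups = separate(flat_ivs)
print([iv.name for iv in groups["fixed_ivs"]])  # ['Race', 'MeanSES']
print(len(groups["random_ivs"]))                # 2
```

If the flat form were adopted, a grouping step like this could run internally, so the choice between the two alternatives is mostly about what the person writing the model sees.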

API constructs:

dv
sets of ivs
family
link function

Misc. question: Is there a better way to represent family and link_func than arbitrary strings, e.g., as ts objects?
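
One possible answer, sketched under the assumption that enumerations would be acceptable: the `Family` and `Link` classes below are hypothetical and are not part of Tisane. Each family carries its variance function and canonical link, so a misspelled name fails immediately instead of passing through as a string. Note that `logit` is used here as the canonical Binomial link even though it is not among the link examples listed above.

```python
import math
from enum import Enum


class Link(Enum):
    IDENTITY = "identity"
    LOG = "log"
    LOGIT = "logit"

    def apply(self, mu: float) -> float:
        """Map the expected value mu onto the scale of the linear predictor."""
        if self is Link.IDENTITY:
            return mu
        if self is Link.LOG:
            return math.log(mu)
        return math.log(mu / (1.0 - mu))  # logit


class Family(Enum):
    GAUSSIAN = "Gaussian"
    POISSON = "Poisson"
    BINOMIAL = "Binomial"

    def variance(self, mu: float) -> float:
        """Variance function V(mu) implied by the family."""
        if self is Family.GAUSSIAN:
            return 1.0
        if self is Family.POISSON:
            return mu
        return mu * (1.0 - mu)  # Binomial

    @property
    def canonical_link(self) -> Link:
        return {
            Family.GAUSSIAN: Link.IDENTITY,
            Family.POISSON: Link.LOG,
            Family.BINOMIAL: Link.LOGIT,
        }[self]


print(Family("Gaussian").canonical_link)  # Link.IDENTITY
print(Family.POISSON.variance(2.0))       # 2.0

try:
    Family("Gausian")  # misspelled on purpose
except ValueError as err:
    print("rejected:", err)
```

Lookup by value (`Family("Gaussian")`) would let the string-based calls shown earlier keep working while still being validated.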
