StepMix

For StepMixR, please refer to this repository.

A Python package following the scikit-learn API for generalized mixture modeling. The package supports categorical data (Latent Class Analysis) and continuous data (Gaussian Mixtures/Latent Profile Analysis). StepMix can be used for both clustering and supervised learning.

Additional features include:

  • Support for missing values through Full Information Maximum Likelihood (FIML);
  • Multiple stepwise Expectation-Maximization (EM) estimation methods based on pseudolikelihood theory;
  • Covariates and distal outcomes;
  • Parametric and non-parametric bootstrapping.

Reference

If you find StepMix useful, please leave a ⭐ and consider citing our arXiv preprint:

@article{morin2023stepmix,
  title={StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables},
  author={Morin, Sacha and Legault, Robin and Lalibert{\'e}, F{\'e}lix and Bakk, Zsuzsa and Gigu{\`e}re, Charles-{\'E}douard and de la Sablonni{\`e}re, Roxane and Lacourse, {\'E}ric},
  journal={arXiv preprint arXiv:2304.03853},
  year={2023}
}

Install

You can install StepMix with pip, preferably in a virtual environment:

pip install stepmix

Quickstart

The following fits a StepMix mixture with categorical variables on a preloaded data matrix. StepMix accepts either a numpy.array or a pandas.DataFrame. Categories should be integer-encoded and 0-indexed.

from stepmix.stepmix import StepMix

# Categorical StepMix Model with 3 latent classes
model = StepMix(n_components=3, measurement="categorical")
model.fit(data)

# Allow missing values
model_nan = StepMix(n_components=3, measurement="categorical_nan")
model_nan.fit(data_nan)
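The categorical models above expect integer-encoded, 0-indexed categories. If the raw data holds string labels, one way to prepare it is sketched below using only the standard library. The helper `encode_zero_indexed` and the toy `raw` table are hypothetical and not part of StepMix; any encoder that yields codes 0 to K-1 per column should serve equally well.

```python
def encode_zero_indexed(column):
    """Map the labels of one column to integer codes 0..K-1 (sorted label order)."""
    categories = sorted(set(column))
    mapping = {label: code for code, label in enumerate(categories)}
    return [mapping[label] for label in column], mapping


# Toy data (illustrative only): one row per observation, one column per variable
raw = [
    ["low", "yes"],
    ["high", "no"],
    ["medium", "yes"],
    ["low", "no"],
]

# Encode each column independently, then transpose back to rows
columns = list(zip(*raw))
encoded_columns, mappings = zip(*(encode_zero_indexed(col) for col in columns))
data = [list(row) for row in zip(*encoded_columns)]

print(data)
print(mappings)
```

The resulting `data` can then be converted with `numpy.array(data)` before being passed to `fit`. Keeping `mappings` around makes it possible to translate codes back to the original labels later.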

For binary data you can also use measurement="binary" or measurement="binary_nan". For continuous data, you can fit a Gaussian Mixture with diagonal covariances using measurement="continuous" or measurement="continuous_nan".

Set verbose=1 for detailed output.

Please refer to the StepMix tutorials to learn how to combine continuous and categorical data in the same model.

Tutorials

Detailed tutorials are available in notebooks:

  1. Generalized Mixture Models with StepMix: an in-depth look at how mixture models can be defined with StepMix. The tutorial uses the Iris Dataset as an example and covers:
    1. Gaussian Mixtures (Latent Profile Analysis);
    2. Binary Mixtures (LCA);
    3. Categorical Mixtures (LCA);
    4. Mixed Categorical and Continuous Mixtures;
    5. Missing Values through Full-Information Maximum Likelihood.
  2. Stepwise Estimation with StepMix: a tutorial demonstrating how to define measurement and structural models. The tutorial discusses:
    1. LCA models with distal outcomes;
    2. LCA models with covariates;
    3. 1-step, 2-step and 3-step estimation;
    4. Corrections (BCH or ML) and other options for 3-step estimation;
    5. Putting it All Together: A Complete Model with Missing Values.
  3. Model Selection:
    1. Selecting the number of components in a mixture model (n_components) with cross-validation;
    2. Selecting the number of components with the Parametric Bootstrapped Likelihood Ratio Test (BLRT);
    3. Fit indices: AIC, BIC and other metrics.
  4. Parameters, Bootstrapping and CI: a tutorial discussing how to:
    1. Access StepMix parameters;
    2. Bootstrap StepMix estimators;
    3. Quickly plot confidence intervals.
  5. Supervised and Semi-Supervised Learning with StepMix:
    1. Binary Classification;
    2. Multiclass Classification;
    3. Semi-Supervised Learning;
    4. Cross-Validation.
  6. Deriving p-values in StepMix: a tutorial demonstrating how to transform StepMix parameters into conventional regression coefficients and how to derive p-values. The tutorial covers models with:
    1. Continuous covariate;
    2. Binary covariate;
    3. Categorical covariate;
    4. Multiple covariates (different distributions);
    5. Binary distal outcome;