This package provides a Python interface for partial least squares (PLS) analysis, a multivariate statistical technique used to relate two sets of variables.
This package requires Python >= 3.5. Assuming you have the correct version of Python installed, you can install this package by opening a terminal and running the following:
```bash
git clone https://github.com/rmarkello/pyls.git
cd pyls
python setup.py install
```
There are plans (hopes?) to get this set up on PyPI for an easier installation process, but that is a long-term goal!
Partial least squares (PLS) is a statistical technique that aims to find shared information between two sets of variables. If you're unfamiliar with PLS and are interested in a thorough (albeit quite technical) treatment of it, Abdi et al. (2013) is a good resource. There are multiple "flavors" of PLS tailored to different use cases; this package implements two functions that fall within the category typically referred to as PLS-C (PLS correlation) or PLS-SVD (PLS singular value decomposition), and one function that falls within the category typically referred to as PLS-R (PLS regression).
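To make the PLS-C/PLS-SVD idea concrete, here is a minimal NumPy sketch of the core computation: the singular value decomposition of the cross-correlation matrix between two z-scored blocks of variables. This is an illustrative simplification, not the `pyls` implementation, and the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((80, 200))  # e.g., a neural data block
Y = rng.standard_normal((80, 5))    # e.g., a behavioral data block

# z-score each column so the cross-product below is a correlation matrix
Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

# cross-correlation matrix relating the two blocks of variables
R = Yz.T @ Xz / (len(X) - 1)  # shape: (5, 200)

# the SVD of R yields paired latent variables (one weight vector per block)
# and singular values indicating the strength of each shared pattern
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# proportion of cross-block covariance captured by each latent variable
varexp = s ** 2 / np.sum(s ** 2)
```

Each column of `U` weights the `Y` variables and each row of `Vt` weights the `X` variables, such that the projected scores of the two blocks are maximally covarying.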
The functionality of the current package largely mirrors that originally introduced by McIntosh et al., (1996) in their Matlab toolbox. However, while the Matlab toolbox has a significant number of tools dedicated to integrating neuroimaging-specific paradigms (i.e., loading M/EEG and fMRI data), the current Python package aims to implement and expand on only the core statistical functions of that toolbox.
While the core algorithms of PLS implemented in this package are present (to a degree) in scikit-learn, this package provides a different API and includes some additional functionality. Namely, `pyls`:
- Has integrated significance and reliability testing via built-in permutation testing and bootstrap resampling,
- Implements mean-centered PLS for multivariate group/condition comparisons,
- Uses the SIMPLS algorithm instead of NIPALS for PLS regression.
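To illustrate what the built-in permutation testing does conceptually, here is a from-scratch sketch (not the `pyls` API, and the data and permutation count are arbitrary): the observed singular value is compared against a null distribution built by shuffling the rows of one block, which breaks the pairing between `X` and `Y`.

```python
import numpy as np

rng = np.random.default_rng(1234)
X = rng.standard_normal((80, 50))
Y = rng.standard_normal((80, 5))

def first_singval(X, Y):
    # largest singular value of the cross-covariance of column-centered blocks
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return np.linalg.svd(Yc.T @ Xc, compute_uv=False)[0]

observed = first_singval(X, Y)

# permutation test: shuffle rows of Y to destroy any X-Y correspondence
n_perm = 200
null = np.array([first_singval(X, Y[rng.permutation(len(Y))])
                 for _ in range(n_perm)])

# p-value with the +1 correction so it can never be exactly zero
pval = (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Bootstrap resampling follows the same spirit but resamples subjects *with* replacement to estimate the reliability of the weights, rather than the significance of the singular values.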
`pyls` implements two subtypes of PLS-C: a more traditional form that we call "behavioral PLS" (`pyls.behavioral_pls`) and a somewhat newer form that we call "mean-centered PLS" (`pyls.meancentered_pls`). It also implements one type of PLS-R, which uses the SIMPLS algorithm (`pyls.pls_regression`); this is, in principle, very similar to "behavioral PLS."
As the more "traditional" form of PLS-C, `pyls.behavioral_pls` looks to find relationships between two sets of variables.
To run a behavioral PLS we would do the following:
```python
>>> import numpy as np
# let's create two data arrays with 80 observations
>>> X = np.random.rand(80, 10000)  # a 10000-feature (e.g., neural) data array
>>> Y = np.random.rand(80, 10)     # a 10-feature (e.g., behavioral) data array
# we're going to pretend that these data come from 2 groups of 20 subjects each,
# and that each subject participated in 2 task conditions
>>> groups = [20, 20]  # a list with the number of subjects in each group
>>> n_cond = 2         # the number of tasks or conditions
# run the analysis and look at the results structure
>>> from pyls import behavioral_pls
>>> bpls = behavioral_pls(X, Y, groups=groups, n_cond=n_cond)
>>> bpls
PLSResults(x_weights, y_weights, x_scores, y_scores, y_loadings, singvals, varexp, permres,
           bootres, splitres, cvres, inputs)
```
In contrast to behavioral PLS, `pyls.meancentered_pls` doesn't look for relationships between two sets of variables; instead, it tries to find relationships between groupings within a single set of variables. As such, we provide it with only one of our data arrays (`X`), and it will examine how the features of that array differ between groups and/or conditions. To run a mean-centered PLS we would do the following:
```python
>>> from pyls import meancentered_pls
>>> mpls = meancentered_pls(X, groups=groups, n_cond=n_cond)
>>> mpls
PLSResults(x_weights, y_weights, x_scores, y_scores, singvals, varexp, permres, bootres, splitres,
           inputs)
```
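Conceptually, mean-centered PLS decomposes the differences among group/condition means rather than a cross-block correlation. The sketch below is one common variant written from scratch in NumPy, not the `pyls` implementation; the row layout (four cells of 20 subjects, stacked in order) and the grand-mean centering scheme are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((80, 100))

# assume rows are stacked as 2 groups x 2 conditions = 4 cells of 20 subjects
labels = np.repeat(np.arange(4), 20)

# mean-centered PLS core: SVD of the cell means after removing the grand mean
cell_means = np.stack([X[labels == g].mean(axis=0) for g in range(4)])
centered = cell_means - cell_means.mean(axis=0)

U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# proportion of between-cell variance captured by each latent variable
varexp = s ** 2 / np.sum(s ** 2)
```

Each latent variable pairs a contrast across the cells (rows of `U`) with the feature pattern (rows of `Vt`) that best expresses that contrast.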
Whereas `pyls.behavioral_pls` aims to maximize the symmetric relationship between `X` and `Y`, `pyls.pls_regression` performs a directed decomposition. That is, it aims to find components in `X` that explain the most variance in `Y` (but not necessarily vice versa).
To run a PLS regression analysis we would do the following:
```python
>>> from pyls import pls_regression
>>> plsr = pls_regression(X, Y, n_components=5)
>>> plsr
PLSResults(x_weights, x_scores, y_scores, y_loadings, varexp, permres, bootres, inputs)
```
Currently, `pyls.pls_regression()` does not support groups or conditions.
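To see what "directed" means here, consider this minimal single-component sketch in plain NumPy. It is a deliberate simplification (SIMPLS extracts several components with deflation between them; here we extract only the first, and the toy data are an assumption), but it shows the asymmetry: the component is chosen to be the direction in `X` with maximal covariance with `y`, and `y` is then regressed on that component's scores.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((80, 10))
y = X[:, 0] + 0.1 * rng.standard_normal(80)  # y mostly driven by one feature

# center both blocks
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# first component: the unit direction in X with maximal covariance with y
w = Xc.T @ yc
w /= np.linalg.norm(w)

t = Xc @ w                # component scores in X
b = (t @ yc) / (t @ t)    # regress y on the component scores
y_hat = t * b + y.mean()

# proportion of variance in y explained by the single component
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the weight vector `w` is built from the covariance with `y`, the decomposition privileges variance in `Y` over variance in `X`, unlike the symmetric PLS-C decomposition above.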
The docstrings of the results objects (`bpls`, `plsr`, and `mpls` in the above examples) describe what each output represents, so while we work on improving our documentation you can rely on those for some insight! Try typing `help(bpls)`, `help(plsr)`, or `help(mpls)` to get more information on what the different values represent.
If you are at all familiar with the Matlab PLS toolbox, you might notice that the results structures have a dramatically different naming convention; despite this, all the same information should be present!