
Commit: paper edits
lbluque committed Nov 29, 2023
1 parent abe571d commit 576cc6a
Showing 1 changed file with 23 additions and 26 deletions.
49 changes: 23 additions & 26 deletions paper/paper.md
@@ -32,12 +32,12 @@ easier to fit and interpret compared to dense models [@Hastie:2015]. Regression
resulting in sparse linear models such as the Lasso [@Tibshirani:1996; @Zou:2006] and
Best Subset Selection [@Hocking:1967] have been widely used in a variety of fields.
However, many regression problems involve covariates that have a natural underlying
structure, such as group or hierarchical relationships, that can be
leveraged to obtain improved model performance and interpretability. Such structured
regression problems occur in a wide range of fields including genomics [@Chen:2021],
bioinformatics [@Ma:2007], medicine [@Kim:2012], econometrics [@Athey:2017],
chemistry [@Gu:2018], and materials science [@Leong:2019]. Several generalizations of the
Lasso [@Yuan:2006; @Friedman:2010; @Simon:2013; @Wang:2019] and Best Subset Selection
[@Bertsimas:2016-a; @Bertsimas:2016-b] have been developed to effectively exploit
additional structure in linear regression.

@@ -55,7 +55,7 @@ open-source and proprietary solvers. In particular, for regression problems with
integer programming objectives, access to state-of-the-art proprietary solvers enables
solving larger problems that would otherwise be unsolvable within reasonable time limits.

A handful of pre-existing Python libraries implement a subset of sparse linear
regression models that are also `scikit-learn` compatible. `celer` [@Massias:2018] and
`groupyr` [@Richie-Halford:2021] include efficient implementations of the Lasso and
Group Lasso, among other linear models. `group-lasso` [@Moe:2020] is another
@@ -73,34 +73,24 @@ available solvers. `sparse-lm` satisfies the need for a flexible and comprehensive
library that enables easy experimentation and comparisons of different sparse
linear regression algorithms within a single package.

# Background

![Schematic of a linear model with grouped covariates with hierarchical relations.
Groups of covariates are represented with different colors and hierarchical
relationships are represented with arrows (i.e. group 3 depends on group 1). The figure
was inspired by Ref. [@Richie-Halford:2021].](linear-model.pdf){ width=55% }
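The grouped and hierarchical structure in the schematic can be encoded directly in code. The sketch below (plain Python, with hypothetical group labels and helper names) represents each covariate's group membership as a list of group ids and checks a hierarchy rule like the one in the figure, where group 3 may only be active if group 1 is active; it is illustrative only, not `sparse-lm` internals:

```python
# Illustrative encoding of grouped covariates with a hierarchy rule.
# `groups[i]` is the group id of covariate i; `hierarchy` maps a group to the
# groups it depends on (here: group 3 depends on group 1, as in the figure).

groups = [1, 1, 2, 2, 3, 3]   # group id per covariate
hierarchy = {3: [1]}          # group 3 may be nonzero only if group 1 is

def active_groups(coefs, groups, tol=1e-10):
    """Return the set of groups with at least one nonzero coefficient."""
    return {g for c, g in zip(coefs, groups) if abs(c) > tol}

def respects_hierarchy(coefs, groups, hierarchy):
    """Check that every active group's parent groups are also active."""
    active = active_groups(coefs, groups)
    return all(parent in active
               for g in active
               for parent in hierarchy.get(g, []))

print(respects_hierarchy([0.5, 0.0, 0.0, 0.0, 1.2, 0.0], groups, hierarchy))  # groups 1 and 3 active -> True
print(respects_hierarchy([0.0, 0.0, 0.7, 0.0, 1.2, 0.0], groups, hierarchy))  # group 3 active without group 1 -> False
```

In the mixed integer formulations discussed below, such dependencies become linear constraints between the binary indicator variables of the groups.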

Structured sparsity can be introduced into regression problems in one of two ways:
convex group regularization or mixed integer quadratic programming with linear constraints.
The first way to obtain structured sparsity is by using regularization based on
generalizations of the Lasso, such as the Group Lasso and the Sparse Group
Lasso [@Yuan:2006; @Friedman:2010; @Simon:2013; @Wang:2019]. The Sparse Group Lasso
regression problem can be expressed as follows,

\begin{equation}
\beta^* = \underset{\beta}{\text{argmin}}\;||\mathbf{X}\beta - \mathbf{y}||^2_2 + (1-\alpha)\lambda\sum_{\mathbf{g}\in G}\sqrt{|\mathbf{g}|}\,||\beta_{\mathbf{g}}||_2 + \alpha\lambda||\beta||_1
\end{equation}

where $\mathbf{X}$ is the design matrix, $\mathbf{y}$ is the response vector, and
@@ -148,21 +138,28 @@ introduce hierarchical structure into the model. Finally, we have also included an
$\ell_2$ regularization term controlled by the hyperparameter $\lambda$, which is useful
when dealing with poorly conditioned design matrices.
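As a concrete illustration, the Sparse Group Lasso objective above can be evaluated term by term: the squared residual, the group penalty weighted by $\sqrt{|\mathbf{g}|}$, and the $\ell_1$ penalty. The sketch below, in plain Python on tiny hypothetical data, is a didactic evaluation of the objective value only, not the convex solver `sparse-lm` relies on:

```python
import math

def sparse_group_lasso_objective(X, y, beta, groups, alpha, lam):
    """Evaluate ||X b - y||_2^2 + (1-alpha)*lam*sum_g sqrt(|g|)*||b_g||_2
    + alpha*lam*||b||_1, where `groups[j]` is the group label of coefficient j."""
    residual = sum((sum(xij * bj for xij, bj in zip(row, beta)) - yi) ** 2
                   for row, yi in zip(X, y))
    group_pen = sum(
        math.sqrt(groups.count(g)) *                                  # sqrt(|g|) weight
        math.sqrt(sum(b * b for b, gi in zip(beta, groups) if gi == g))  # ||beta_g||_2
        for g in set(groups)
    )
    l1_pen = sum(abs(b) for b in beta)
    return residual + (1 - alpha) * lam * group_pen + alpha * lam * l1_pen

# Tiny hypothetical problem: 2 samples, 2 covariates in a single group.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
beta = [3.0, 4.0]
# alpha = 0 recovers the plain Group Lasso penalty; alpha = 1 recovers the Lasso.
print(sparse_group_lasso_objective(X, y, beta, groups=[0, 0], alpha=0.5, lam=1.0))
```

Setting `alpha` between 0 and 1 interpolates between group-level and individual-coefficient sparsity, which is exactly the mixing role $\alpha$ plays in the equation above.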

Statistical regression models with structured sparsity
parametrized via Group Lasso or Best Subset Selection based objectives have been used in a
wide range of scientific disciplines, including genomics [@Chen:2021], bioinformatics [@Ma:2007],
medicine [@Kim:2012], econometrics [@Athey:2017], chemistry [@Gu:2018], and materials science
[@Leong:2019]. The flexible implementation of sparse linear regression models in `sparse-lm`
allows researchers to easily experiment and choose the best regression model for their
specific problem. `sparse-lm` has already been used to build linear models with
structured sparsity in a handful of materials science studies
[@Barroso-Luque:2022; @Zhong:2022; @Xie:2023; @Zhong:2023].

# Usage

Since the linear regression models in `sparse-lm` are implemented to be compatible with
`scikit-learn` [@Pedregosa:2011; @Buitinck:2013], they can be used independently or as
part of a workflow---such as in a hyperparameter selection class or a pipeline---
in similar fashion to any of the available models in the `sklearn.linear_model` module.
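The interchangeability described here rests on the `scikit-learn` estimator contract: hyperparameters are set in `__init__`, `fit` learns underscore-suffixed attributes and returns `self`, and `get_params`/`set_params` expose hyperparameters to model-selection utilities. The minimal sketch below, a hypothetical one-covariate least-squares estimator in plain Python, illustrates that contract; it is not `sparse-lm` code:

```python
class TinyLeastSquares:
    """Minimal estimator following scikit-learn conventions: hyperparameters in
    __init__, learned attributes end with an underscore, fit returns self."""

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept

    def get_params(self, deep=True):
        return {"fit_intercept": self.fit_intercept}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, x, y):
        # Closed-form simple linear regression on a single covariate.
        n = len(x)
        if self.fit_intercept:
            mx, my = sum(x) / n, sum(y) / n
        else:
            mx = my = 0.0
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        sxx = sum((xi - mx) ** 2 for xi in x)
        self.coef_ = sxy / sxx
        self.intercept_ = my - self.coef_ * mx
        return self

    def predict(self, x):
        return [self.intercept_ + self.coef_ * xi for xi in x]

model = TinyLeastSquares().fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(model.coef_, model.intercept_)  # slope 2, intercept 1 for this exact-fit data
```

Any object honoring this contract, including the `sparse-lm` estimators, can be dropped into pipelines and hyperparameter-search classes that expect a `scikit-learn` model.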

A variety of linear regression models with flexible regularization and feature selection
options are implemented; the available models are listed below.

## Implemented regression models

The table below shows the regression models that are implemented in `sparse-lm` as well
as available implementations in other Python packages. A $\checkmark$ indicates that the
model in a given row is implemented in the package in the corresponding column.


| Model | `sparse-lm` | `celer` | `groupyr` | `group-lasso` | `skglm` | `abess` |
