`pcLasso` is a package that fits the principal components lasso, a new method for obtaining sparse models for supervised learning problems. pcLasso shrinks predictions toward the top principal components of the feature matrix, combining the power of Principal Components Regression with the sparsity of the lasso. The method can also handle grouped features, where the features belong to one of many (possibly overlapping) groups. In that case, pcLasso shrinks each group's component of the parameter vector toward the top principal components of the corresponding feature matrix.
We introduce some notation that we will use throughout this vignette. Let there be \(n\) observations, each with feature vector \(x_i \in \mathbb{R}^p\) and response \(y_i\). Let \(X \in \mathbb{R}^{n \times p}\) denote the overall feature matrix, and let \(y \in \mathbb{R}^n\) denote the vector of responses. Assume our data features come in \(K\) groups, and let \(X_k \in \mathbb{R}^{n \times p_k}\) denote the feature matrix for group \(k\). In the simplest case with \(K=1\), there is no feature grouping.
For each feature matrix \(X_k\), let \(X_k = U_k D_k V_k^T\) be its singular value decomposition (SVD). Let \(D_k\) have diagonal entries \(d_{k1}, d_{k2}, \dots\), and let \(D_{d_{k1}^2 - d_{kj}^2}\) denote the diagonal matrix whose \(j\)th diagonal entry is \(d_{k1}^2 - d_{kj}^2\).
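As a concrete illustration of this notation, the matrix \(V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T\) for one group can be built directly from R's `svd` function (a minimal sketch with made-up data; `Xk` and `penalty_mat` are our own names, not package objects):

```r
set.seed(1)
Xk <- matrix(rnorm(50 * 5), 50, 5)  # feature matrix for one group
sv <- svd(Xk)
d2 <- sv$d^2                        # squared singular values, largest first
penalty_mat <- sv$v %*% diag(d2[1] - d2) %*% t(sv$v)
```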
`pcLasso` solves the optimization problem

\(\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{n} \displaystyle\sum_{i=1}^n w_i \ell (y_i, \beta_0 + \beta^T x_i) + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.\)
In the above, \(\ell(y_i, \eta_i)\) is the negative log-likelihood contribution for observation \(i\); e.g. in the Gaussian case \(\ell(y, \eta) = \frac{1}{2}(y - \eta)^2\). \(w_i\) is the observation weight given to observation \(i\) (by default \(w_i = 1\) for all \(i\)). \(\beta_k\) is the subvector of \(\beta\) which corresponds to group \(k\), and \(\lambda\) and \(\theta\) are non-negative hyperparameters. `pcLasso` solves this optimization problem over a grid of \(\lambda\) values; \(\theta\) must be specified in advance by the user.
`pcLasso` uses cyclical coordinate descent, which successively optimizes the objective function over each parameter with all other parameters fixed, cycling repeatedly until convergence.
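To make this concrete, here is a minimal sketch of cyclical coordinate descent for the plain lasso (Gaussian loss, without the principal components penalty); `soft_threshold` and `cd_lasso` are illustrative names, not package functions:

```r
# Soft-thresholding operator: the closed-form solution of each coordinate step
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Cyclical coordinate descent for (1/(2n)) ||y - X beta||^2 + lambda ||beta||_1
cd_lasso <- function(x, y, lambda, niter = 100) {
  n <- nrow(x); p <- ncol(x)
  beta <- rep(0, p)
  r <- as.vector(y)                  # residual; beta starts at zero
  for (it in seq_len(niter)) {
    for (j in seq_len(p)) {
      r <- r + x[, j] * beta[j]      # partial residual, excluding feature j
      zj <- sum(x[, j] * r) / n
      beta[j] <- soft_threshold(zj, lambda) / (sum(x[, j]^2) / n)
      r <- r - x[, j] * beta[j]      # restore residual with the new beta[j]
    }
  }
  beta
}
```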
The package also includes methods for prediction and plotting, and a function which performs \(k\)-fold cross-validation.
For more details, please see our paper on arXiv.
The simplest way to obtain `pcLasso` is to install it directly from CRAN. Type the following command in an R console:

```r
install.packages("pcLasso")
```
This command downloads the R package and installs it to the default directories. Users may add a `repos` option to the function call to specify which repository to download from, depending on their location and preferences.

Alternatively, users can download the package source from CRAN and install it to the desired location from the command line (e.g. with `R CMD INSTALL`).
The purpose of this section is to give users a general sense of the package. We will briefly go over the main functions of the package, as well as some key options. More details are given in later sections.
First, we load the `pcLasso` package:

```r
library(pcLasso)
```
Let’s generate some data:
```r
set.seed(944)
n <- 100
p <- 20
x <- matrix(rnorm(n*p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)
```
We fit the model using the most basic call to `pcLasso`:

```r
fit <- pcLasso(x, y, ratio = 0.8)
```
In addition to the feature matrix `x` and the response `y`, the user must specify either `theta` or `ratio` (but not both). `theta` is the hyperparameter value for the last term in the objective function. The scale for `theta` depends on `x` and `y` and hence can be hard to determine a priori. Instead of specifying `theta`, we recommend that the user specify `ratio`, a hyperparameter whose value lies in the interval \((0,1]\). Roughly speaking, smaller values of `ratio` represent greater shrinkage toward the top principal components. (`ratio = 1` gives the usual lasso.)
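For instance, with the simulated `x` and `y` above, one can compare fits across this spectrum (the variable names below are our own):

```r
fit_lasso <- pcLasso(x, y, ratio = 1)    # no PC shrinkage: the usual lasso
fit_pc    <- pcLasso(x, y, ratio = 0.5)  # stronger shrinkage toward top PCs
```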
If the argument `verbose` is set to `TRUE`, `pcLasso` reports its progress while running by printing to the console the value in the `lambda` sequence that it is currently processing.
If `ratio` is passed to `pcLasso`, the function internally computes the corresponding value of `theta` and uses that in minimizing the objective function. This value can be retrieved as the value of `theta` in the output list.
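For example, with the fit above, the internally computed value can be inspected directly:

```r
fit$theta
```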
The function `pcLasso` returns a `pcLasso` object. At the moment, the only way to extract model coefficients is to use list syntax on the returned object. For example, the code below extracts the intercept and coefficients for the model at the 20th `lambda` value:
```r
# intercept
fit$a0[20]
#> [1] -0.05855159

# coefficients
fit$beta[, 20]
#> [1] 0.26955362 0.34871182 0.31431625 0.44873422 0.17694987 0.21373963
#> [7] 0.00000000 -0.05236147 0.00000000 0.00000000 -0.17767456 0.11606285
#> [13] 0.01645924 0.11182802 0.00000000 -0.10867816 0.06198964 0.12127325
#> [19] -0.04744480 -0.07523572
```
A `pcLasso` object has an `nzero` key which tells us the number of non-zero model coefficients at each value of `lambda`:
```r
fit$nzero
#> [1] 0 1 1 4 4 4 5 5 5 6 7 7 11 12 13 14 15 16 16 16 17 17 17 18 18
#> [26] 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20
#> [51] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
#> [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
```
We can make predictions from a `pcLasso` object by calling the `predict` method. Each column gives the predictions for one value of `lambda`.

```r
# get predictions for 20th model
predict(fit, x[1:5, ])[, 20]
#> [1] 1.0830384 -0.3009271 1.8452090 1.3588647 -1.1403960
```
By default, `pcLasso` assumes that all the features in `x` belong to one group. If the features come in groups, `pcLasso` can make use of this information in the model-fitting procedure.
Assume our features come in 4 (non-overlapping) groups of size 5:
vector("list", 4)
groups <-for (k in 1:4) {
5 * (k-1) + 1:5
groups[[k]] <-
}
groups#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> [1] 11 12 13 14 15
#>
#> [[4]]
#> [1] 16 17 18 19 20
We can use this information in our fit by specifying the `groups` option. `groups` must be a list of length \(K\) (the number of groups), where `groups[[k]]` is a vector of column indices representing the features which belong to group \(k\). (For example, `groups[[1]] <- c(3, 4, 6)` means that columns 3, 4 and 6 of `x` belong to group 1.) Every feature must belong to at least one group.
```r
fit <- pcLasso(x, y, ratio = 0.8, groups = groups)
```
One advantage of `pcLasso` is that the algorithm works with overlapping groups as well. For example, we can modify the groups list so that features 6 and 7 also belong to group 1, and then make the same call to `pcLasso`:
```r
groups[[1]] <- 1:7
groups
#> [[1]]
#> [1] 1 2 3 4 5 6 7
#>
#> [[2]]
#> [1] 6 7 8 9 10
#>
#> [[3]]
#> [1] 11 12 13 14 15
#>
#> [[4]]
#> [1] 16 17 18 19 20
```
```r
fit <- pcLasso(x, y, ratio = 0.8, groups = groups)
```
One thing to note with overlapping groups is that the model coefficients and the number of non-zero coefficients are stored differently in the output object. To get the coefficients in the original feature space, look at the `origbeta` key:
```r
# intercept at 20th model: same as before
fit$a0[20]
#> [1] -0.03609611

# coefficients at 20th model: look at origbeta instead
fit$origbeta[, 20]
#> [1] 0.9453523 1.0498390 1.0494266 1.3574616 0.7357579 0.3295442 0.0000000
#> [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
```
To get the number of non-zero coefficients in the original feature space, look at `orignzero`:
```r
fit$orignzero
#> [1] 0 1 1 3 4 4 5 5 5 5 5 6 6 6 6 6 6 6 6 6 7 8 8 8 10
#> [26] 10 12 12 12 13 13 13 15 15 16 17 17 17 17 17 17 18 19 19 19 19 19 19 19 19
#> [51] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
#> [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
```
For more information on what the algorithm does in the case of overlapping groups, see Section 3.2 of our paper.
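As a rough sketch of the expanded feature space idea (our own illustration of the concept, not the package's internal code): each group's block of columns is stacked side by side, so a feature that belongs to two groups appears twice.

```r
# Features 6 and 7 belong to both group 1 and group 2, so they show up twice
x_expanded <- do.call(cbind, lapply(groups, function(g) x[, g]))
dim(x_expanded)  # 100 x 22: features 6 and 7 are duplicated
```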
We can perform \(k\)-fold cross-validation (CV) with `cv.pcLasso`. It does 10-fold cross-validation by default:

```r
cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8)
```
We can change the number of folds using the `nfolds` option:

```r
cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, nfolds = 5)
```
If we want to specify which observation belongs to which fold, we can do so with the `foldid` option: a vector of length \(n\) whose \(i\)th element is the fold number for observation \(i\).

```r
foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, foldid = foldid)
```
A `cv.pcLasso` call returns a `cv.pcLasso` object. We can plot this object to get the CV curve with error bars (one standard error in each direction). The left vertical dotted line represents `lambda.min`, the `lambda` value which attains the minimum CV error, while the right vertical dotted line represents `lambda.1se`, the largest `lambda` value with CV error within one standard error of the minimum.
```r
plot(cvfit)
```
The numbers at the top represent the number of non-zero coefficients for each model in the original feature space. If the groups are overlapping, we can instead plot the number of non-zero coefficients for each model in the expanded feature space by setting `orignz = FALSE`:

```r
plot(cvfit, orignz = FALSE)
```
The two special `lambda` values can be extracted directly from the `cv.pcLasso` object as well:

```r
cvfit$lambda.min
#> [1] 0.0002570626

cvfit$lambda.1se
#> [1] 0.1899878
```
Predictions can be made from the fitted `cv.pcLasso` object. By default, predictions are given for `lambda` equal to `lambda.1se`. To get predictions at `lambda.min`, set `s = "lambda.min"`.
```r
predict(cvfit, x[1:5, ])  # s = lambda.1se
#> [1] 4.850616 -2.192256 5.120815 3.127318 -4.748510

predict(cvfit, x[1:5, ], s = "lambda.min")
#> [1] 5.059693 -2.553863 7.064400 4.051717 -4.870323
```
Here are some other options that one may specify for the `pcLasso` and `cv.pcLasso` functions:
`w`: The user can pass a vector of length \(n\) representing observation weights. The squared residuals of the observations are weighted according to this vector. By default, this is set to 1 for all observations.
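As a usage sketch (the weight values and the `fit_w` name below are our own):

```r
# Give the first 50 observations twice the weight of the rest (n = 100 here)
w <- c(rep(2, 50), rep(1, 50))
fit_w <- pcLasso(x, y, ratio = 0.8, w = w)
```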
`family`: The default value for the `family` option of the `pcLasso` and `cv.pcLasso` functions is `gaussian`. Use this default when `y` is a quantitative variable (i.e. takes values along the real number line). The objective function for the Gaussian family is
\(\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{2n} \displaystyle\sum_{i=1}^n w_i (y_i - \beta_0 - \beta^T x_i)^2 + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.\)
As before, \(\lambda\) and \(\theta\) are hyperparameters. The user passes a specific value of `theta` (or its `ratio` equivalent) to the `pcLasso` and `cv.pcLasso` functions, and the function computes the model fit for a path of `lambda` values.
For binary prediction, use `family = "binomial"`. In this setting, the response `y` should be a numeric vector containing just 0s and 1s.
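A minimal sketch, with a 0/1 response manufactured from the simulated data above (`y_bin` and `fit_bin` are our own names):

```r
# Binarize the response for illustration
y_bin <- as.numeric(y > median(y))
fit_bin <- pcLasso(x, y_bin, ratio = 0.8, family = "binomial")
```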
`SVD_info`: A list containing SVD information that the algorithm uses. The user typically does not need to provide this option to `pcLasso`; `pcLasso` will compute it from the given data. Internally, `cv.pcLasso` provides this option to `pcLasso` for computational efficiency.
`lambda`: The `lambda` sequence at which the model fit is computed. This is typically not provided by the user: the program can construct the sequence on its own. When automatically generated, the `lambda` sequence is determined by `lambda.max` (internally computed) and `lambda.min.ratio`. (`lambda.min.ratio` is the ratio of the smallest value of the generated `lambda` sequence, say `lambda.min`, to `lambda.max`.) The program generates `nlam` values (default 100) equally spaced on the log scale from `lambda.max` down to `lambda.min`.
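If desired, a custom sequence can nonetheless be supplied (the values below are arbitrary):

```r
# 30 log-spaced lambda values from 1 down to 0.001
lams <- exp(seq(log(1), log(0.001), length.out = 30))
fit_path <- pcLasso(x, y, ratio = 0.8, lambda = lams)
```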
`standardize`: If set to `TRUE`, the columns of the feature matrix `x` are scaled to have unit variance before the algorithm is run.
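For example (`fit_std` is our own name):

```r
fit_std <- pcLasso(x, y, ratio = 0.8, standardize = TRUE)
```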
For more information, type `?pcLasso` or `?cv.pcLasso`.