Large-Scale Eigenvalue Decomposition and SVD with RSpectra

Introduction

Eigenvalue decomposition is a commonly used technique in numerous statistical problems. For example, principal component analysis (PCA) basically conducts eigenvalue decomposition on the sample covariance of a data matrix: the eigenvalues are the component variances, and eigenvectors are the variable loadings.

In R, the standard way to compute eigenvalues is the eigen() function. However, when the matrix becomes large, eigen() can be very time-consuming: the complexity to calculate all eigenvalues of a \(n \times n\) matrix is \(O(n^3)\).

While in real applications, we usually only need to compute a few eigenvalues or eigenvectors, for example to visualize high dimensional data using PCA, we may only use the first two or three components to draw a scatterplot. Unfortunately in eigen(), there is no option to limit the number of eigenvalues to be computed. This means that we always need to do the full eigen decomposition, which can cause a huge waste in computation.

The same thing happens in Singular Value Decomposition (SVD). It is often the case that only a Partial SVD or Truncated SVD is needed, and moreover the matrix is usually stored in sparse format. Base R does not provide functions suitable for these special needs.

And this is why the RSpectra package was developed. RSpectra provides an R interface to the Spectra library, which is used to solve large scale eigenvalue problems. The core part of Spectra and RSpectra was written in efficient C++ code, and they can handle large scale matrices in either dense or sparse formats.

Eigenvalue Problem

Basic Usage

The RSpectra package provides functions eigs() and eigs_sym() to calculate eigenvalues of general and symmetric matrices respectively. If the matrix is known to be symmetric, eigs_sym() is preferred since it guarantees that the eigenvalues are real.

To obtain eigenvalues of a square matrix A, simply call the eigs() or eigs_sym() function, tell it how many eigenvalues are requested (argument k), and which ones are going to be selected (argument which). By default, which = "LM" means to pick the eigenvalues with the largest magnitude (modulus for complex numbers and absolute value for real numbers).

Below we first generate some random matrices and display some of the eigenvalues and eigenvectors:

set.seed(123)
n = 100  # matrix size
k = 5    # number of eigenvalues to calculate
# Some random data
M = matrix(rnorm(n^2), n)
# Make it symmetric
A = crossprod(M)
# Show its largest 5 eigenvalues and associated eigenvectors
head(eigen(A)$values, 5)

## [1] 391.3649 367.1143 353.5425 322.6301 315.4341

head(eigen(A)$vectors[, 1:5])

##             [,1]         [,2]          [,3]         [,4]        [,5]
## [1,]  0.12981428 -0.011305031 -0.0001022712  0.088485329  0.05035607
## [2,] -0.06690188  0.001012109  0.0103448033 -0.009411417 -0.07992121
## [3,] -0.05333064 -0.057256548  0.0998913793 -0.171712454  0.01334678
## [4,]  0.13423510 -0.165303749  0.0076458165 -0.006789840  0.10246642
## [5,] -0.13023611 -0.019154947  0.0425667905 -0.160273074 -0.06246803
## [6,] -0.06632982 -0.053457170  0.0323196492 -0.074827181 -0.16365506

Now we use RSpectra to directly obtain the largest 5 eigenvalues:

library(RSpectra)
res = eigs_sym(A, k, which = "LM")  # "LM" is the default
res$values

## [1] 391.3649 367.1143 353.5425 322.6301 315.4341

head(res$vectors)

##             [,1]         [,2]          [,3]         [,4]        [,5]
## [1,] -0.12981428  0.011305031  0.0001022712  0.088485329  0.05035607
## [2,]  0.06690188 -0.001012109 -0.0103448033 -0.009411417 -0.07992121
## [3,]  0.05333064  0.057256548 -0.0998913793 -0.171712454  0.01334678
## [4,] -0.13423510  0.165303749 -0.0076458165 -0.006789840  0.10246642
## [5,]  0.13023611  0.019154947 -0.0425667905 -0.160273074 -0.06246803
## [6,]  0.06632982  0.053457170 -0.0323196492 -0.074827181 -0.16365506

If only eigenvalues are requested, the retvec option can be used:

eigs_sym(A, k, opts = list(retvec = FALSE))

## $values
## [1] 391.3649 367.1143 353.5425 322.6301 315.4341
## 
## $vectors
## NULL
## 
## $nconv
## [1] 5
## 
## $niter
## [1] 4
## 
## $nops
## [1] 59

Matrix in Sparse Format

For really large data, the matrix is usually stored in sparse format. RSpectra supports the dgCMatrix and dgRMatrix matrix types defined in the Matrix package.

library(Matrix)
Msp = as(M, "dgCMatrix")
Asp = as(A, "dgRMatrix")

## 'as(<matrix>, "dgRMatrix")' is deprecated.
## Use 'as(as(as(., "dMatrix"), "generalMatrix"), "RsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").

eigs(Msp, k, which = "LR", opts = list(retvec = FALSE))$values  # largest real part

## [1] 9.685399-1.964917i 9.685399+1.964917i 9.696766+0.000000i 8.270271+1.751652i
## [5] 8.270271-1.751652i

eigs_sym(Asp, k, opts = list(retvec = FALSE))$values

## [1] 391.3649 367.1143 353.5425 322.6301 315.4341

An even more general way to specify the matrix A is to define a function that calculates A %*% x for any vector x. This representation is called the Function Interface, which can also be regarded as a sparse format, since users do not need to store the matrix A, but only the operator x -> Ax. For example, to represent a diagonal matrix \(diag(1, 2, \ldots, 10)\) and calculate its eigenvalues, we can define the function f and call the eigs_sym() function:

# Implicitly define the matrix by a function that calculates A %*% x
# Below represents a diagonal matrix whose elements are stored in the `args` parameter
f = function(x, args)
{
    # diag(args) %*% x == x * args
    return(x * args)
}
eigs_sym(f, k = 3, n = 10, args = 1:10)

## $values
## [1] 10  9  8
## 
## $vectors
##                [,1]          [,2]          [,3]
##  [1,] -1.059650e-16  2.610725e-16  5.640223e-17
##  [2,] -1.807759e-16 -6.140243e-16  4.250073e-17
##  [3,] -6.381546e-17 -1.564504e-16 -3.348016e-16
##  [4,]  2.081668e-17 -7.979728e-17 -5.551115e-17
##  [5,] -2.656295e-17  6.245005e-17  0.000000e+00
##  [6,]  1.919648e-16 -5.780966e-16  4.544976e-16
##  [7,]  4.240010e-16 -2.579967e-15 -5.160802e-17
##  [8,] -2.201744e-16  1.368480e-15  1.000000e+00
##  [9,] -1.502446e-15 -1.000000e+00  9.278602e-16
## [10,]  1.000000e+00 -1.280419e-15  1.693354e-16
## 
## $nconv
## [1] 3
## 
## $niter
## [1] 1
## 
## $nops
## [1] 10

n gives the dimension of the matrix, and args contains user data that will be passed to f.

Smallest Eigenvalues

Sometimes you may need to calculate the smallest (in magnitude) k eigenvalues of a matrix. A direct way to do so is to use eigs(..., which = "SM"). However, the algorithm of RSpectra is good at finding large eigenvalues rather than small ones, so which = "SM" tends to require much more iterations or even may not converge.

The recommended way to retrieve the smallest eigenvalues is to utilize the spectral transformation: If \(A\) has eigenvalues \(\lambda_i\), then \((A-\sigma I)^{-1}\) has eigenvalues \(1/(\lambda_i-\sigma)\). Therefore, we can set \(\sigma = 0\) and calculate the largest eigenvalues of \(A^{-1}\), whose reciprocals are exactly the smallest eigenvalues of \(A\).

eigs() implements the spectral transformation via the parameter sigma, so the following code returns the smallest eigenvalues of A. Note that eigs() always returns eigenvalues in the original scale (i.e., \(\lambda_i\) instead of \(1/(\lambda_i-\sigma)\)).

eigs_sym(A, k, which = "LM", sigma = 0)$values  # recommended way

## [1] 0.39278178 0.21130356 0.11411209 0.02151974 0.00473068

eigs_sym(A, k, which = "SM")$values             # not recommended

## [1] 0.39278178 0.21130356 0.11411209 0.02151974 0.00473068

More generally, the option which = "LM", sigma = s selects eigenvalues that are closest to s.

Truncated SVD

Truncated SVD (or Partial SVD) is frequently used in text mining and image compression, which computes the leading singular values and singular vectors of a rectangular matrix.

RSpectra has the svds() function to compute Truncated SVD:

set.seed(123)
m = 100
n = 20
k = 5
A = matrix(rnorm(m * n), m)

str(svds(A, k, nu = k, nv = k))

## List of 5
##  $ d    : num [1:5] 14.1 13.8 13 11.8 11.3
##  $ u    : num [1:100, 1:5] 0.1616 0.0871 0.0536 0.0537 0.1271 ...
##  $ v    : num [1:20, 1:5] -0.186 0.116 0.27 -0.273 0.1 ...
##  $ niter: num 1
##  $ nops : num 45

Similar to eigs(), svds() supports sparse matrices:

Asp = as(A, "dgCMatrix")
svds(Asp, k, nu = 0, nv = 0)

## $d
## [1] 14.13079 13.76937 12.95764 11.82205 11.30081
## 
## $u
## NULL
## 
## $v
## NULL
## 
## $niter
## [1] 1
## 
## $nops
## [1] 40

svds() also has the function interface: users provide functions A and Atrans that calcualtes \(Ax\) and \(A'x\) respectively, and svds() will compute the Truncated SVD.

Af      = function(x, args)  as.numeric(args %*% x)
Atransf = function(x, args)  as.numeric(crossprod(args, x))
str(svds(Af, k, Atrans = Atransf, dim = c(m, n), args = Asp))

## List of 5
##  $ d    : num [1:5] 14.1 13.8 13 11.8 11.3
##  $ u    : num [1:100, 1:5] 0.1616 0.0871 0.0536 0.0537 0.1271 ...
##  $ v    : num [1:20, 1:5] -0.186 0.116 0.27 -0.273 0.1 ...
##  $ niter: num 1
##  $ nops : num 45

Performance

RSpectra is written in efficient C++ code. This page gives some benchmarking results of svds() with other similar functions available in R and extension packages.