5. Discretized Partial Least Squares (dPLS)

Koki Tsuyuzaki

Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research
k.t.the-answer@hotmail.co.jp

2024-05-11

Introduction

In this vignette, we consider approximating multiple matrices as a product of ternary (or non-negative) low-rank matrices (a.k.a., factor matrices).

Test data is available from toyModel.

library("dcTensor")
X <- dcTensor::toyModel("dPLS_Easy")

You will see that there are five blocks in the data matrix as follows.

suppressMessages(library("fields"))
layout(t(1:3))
image.plot(X[[1]], main="X1", legend.mar=8)
image.plot(X[[2]], main="X2", legend.mar=8)
image.plot(X[[3]], main="X3", legend.mar=8)

Semi-Ternary Simultaneous Matrix Factorization (STSMF)

Here, we introduce the ternary regularization to take {-1,0,1} values in \(V_{k}\) as below:

\[ \max{\mathrm{tr} \left( V_{j}'X_{j}'X_{k}V_{k} \right)}\ \mathrm{s.t.}\ j ≠k, V \in \{-1,0,1\}, \] where \(j\) and \(k\) range from \(1\) to \(K\), \(K\) is the number of matrices, \(X_{k}\) (\(N \times M_{k}\)) is a \(k\)-th data matrix and \(V_{k}\) (\(M_{k} \times J\)) is a \(k\)-th ternary loading matrix. In dcTensor package, the object function is optimized by combining gradient-descent algorithm (Tsuyuzaki 2020) and ternary regularization.

Basic Usage

In STSMF, a rank parameter \(J\) (\(\leq \min(N, M)\)) is needed to be set in advance. Other settings such as the number of iterations (num.iter) are also available. For the details of arguments of dPLS, see ?dPLS. After the calculation, various objects are returned by dPLS. STSMF is achieved by specifying the ternary regularization parameter as a large value like the below:

set.seed(123456)
out_dPLS <- dPLS(X, Ter_V=1E+5, J=3)
str(out_dPLS, 2)
## List of 6
##  $ U            :List of 3
##   ..$ : num [1:100, 1:3] 8722 8926 8821 8626 8589 ...
##   ..$ : num [1:100, 1:3] 5888 5898 6044 5910 5695 ...
##   ..$ : num [1:100, 1:3] 3879 3904 3961 3806 3909 ...
##  $ V            :List of 3
##   ..$ : num [1:300, 1:3] 0.96 0.966 0.973 0.95 0.925 ...
##   ..$ : num [1:200, 1:3] 0.887 0.892 0.881 0.913 0.913 ...
##   ..$ : num [1:150, 1:3] 0.0252 0.0332 0.0286 0.0346 0.0244 ...
##  $ RecError     : Named num [1:101] 1.00e-09 1.88e+06 1.87e+06 1.83e+06 1.79e+06 ...
##   ..- attr(*, "names")= chr [1:101] "offset" "1" "2" "3" ...
##  $ TrainRecError: Named num [1:101] 1.00e-09 1.88e+06 1.87e+06 1.83e+06 1.79e+06 ...
##   ..- attr(*, "names")= chr [1:101] "offset" "1" "2" "3" ...
##  $ TestRecError : Named num [1:101] 1e-09 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 ...
##   ..- attr(*, "names")= chr [1:101] "offset" "1" "2" "3" ...
##  $ RelChange    : Named num [1:101] 1.00e-09 9.92e-01 5.11e-03 2.20e-02 2.12e-02 ...
##   ..- attr(*, "names")= chr [1:101] "offset" "1" "2" "3" ...

The reconstruction error (RecError) and relative error (RelChange, the amount of change from the reconstruction error in the previous step) can be used to diagnose whether the calculation is converged or not.

layout(t(1:2))
plot(log10(out_dPLS$RecError[-1]), type="b", main="Reconstruction Error")
plot(log10(out_dPLS$RelChange[-1]), type="b", main="Relative Change")

The products of \(U_{k}\) and \(V_{k}\) (\(k = 1 \ldots K\)) show whether the original data matrices are well-recovered by dPLS.

recX <- lapply(seq_along(X), function(x){
  out_dPLS$U[[x]] %*% t(out_dPLS$V[[x]])
})
layout(rbind(1:3, 4:6))
image.plot(t(X[[1]]))
image.plot(t(X[[2]]))
image.plot(t(X[[3]]))
image.plot(t(recX[[1]]))
image.plot(t(recX[[2]]))
image.plot(t(recX[[3]]))

The histograms of \(V_{k}\)s show that all the factor matrices \(V_{k}\) looks ternary.

layout(rbind(1:3, 4:6))
hist(out_dPLS$U[[1]], breaks=100)
hist(out_dPLS$U[[2]], breaks=100)
hist(out_dPLS$U[[3]], breaks=100)
hist(out_dPLS$V[[1]], breaks=100)
hist(out_dPLS$V[[2]], breaks=100)
hist(out_dPLS$V[[3]], breaks=100)

Session Information

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] nnTensor_1.2.0    fields_15.2       viridisLite_0.4.2 spam_2.9-1       
## [5] dcTensor_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4       jsonlite_1.8.7     highr_0.10         compiler_4.3.1    
##  [5] maps_3.4.1         Rcpp_1.0.11        plot3D_1.4         tagcloud_0.6      
##  [9] jquerylib_0.1.4    scales_1.2.1       yaml_2.3.7         fastmap_1.1.1     
## [13] ggplot2_3.4.3      R6_2.5.1           tcltk_4.3.1        knitr_1.43        
## [17] MASS_7.3-60        dotCall64_1.0-2    misc3d_0.9-1       tibble_3.2.1      
## [21] munsell_0.5.0      pillar_1.9.0       bslib_0.5.1        RColorBrewer_1.1-3
## [25] rlang_1.1.1        utf8_1.2.3         cachem_1.0.8       xfun_0.40         
## [29] sass_0.4.7         cli_3.6.1          magrittr_2.0.3     digest_0.6.33     
## [33] grid_4.3.1         rTensor_1.4.8      lifecycle_1.0.3    vctrs_0.6.3       
## [37] evaluate_0.21      glue_1.6.2         fansi_1.0.4        colorspace_2.1-0  
## [41] rmarkdown_2.24     pkgconfig_2.0.3    tools_4.3.1        htmltools_0.5.6

References

Tsuyuzaki, K. et al. 2020. “Benchmarking Principal Component Analysis for Large-Scale Single-Cell RNA-Sequencing.” BMC Genome Biology 21(1): 9.