Transforming data into sparse matrices

Michael Cysouw

2024-05-08

Transforming data into sparse matrices

The basic functions of this package are almost trivial, but they allow for a highly flexible and efficient transformation of data into sparse matrices. The tree basic functions are ttMatrix, pwMatrix, and jMatrix. This vignette will present a gentle instroduction to these three functions.

ttMatrix

The principle of ttMatrix (“type-token Matrix”) is that a sparse matrix is created from a vector. All unique elements in the vector are listed as rows (“types”), and the elements of the vector itself are listed as columns (“tokens”). The matrix shows which types are found in which position of the original vector. Take a simple character vector with repeating elements as an example, then ttMatrix will return the row names ($rownames) by default separately from the matrix ($M). The column names are of course identical to the input in data.

data <- c("a", "b", "a", "c", "b", "c", "A")
ttMatrix(data)
## $M
## 4 x 7 sparse Matrix of class "ngCMatrix"
##                   
## [1,] . . . . . . |
## [2,] | . | . . . .
## [3,] . | . . | . .
## [4,] . . . | . | .
## 
## $rownames
## [1] "A" "a" "b" "c"

You can also include the row and column names into the matrix as output by using the option simplify = TRUE. Note that in many cases of large sparse matrices, the same row or column names will be repeated in different sparse matrices. In such situations it is preferable to store the row or column names separately. Simply duplicating them migth take a lot of space.

Also note that the ordering of the rows (the “types”) is crucial. This ordering is determined by the so-called collation.locale option, which defaults to “C”” (which means the numerical ordering of the Unicode Standard is used). This ordering is different from the default ordering on most systems, which might internally for example use the collation locale `en_US.UTF-8’. The difference can be discerned in the example below, as the capital letters are ordered after the lowercase letters in the US-english locale, but before in the “C” locale (compare with the output of the previous call).

ttMatrix(data, simplify = TRUE, collation.locale = "en_US.UTF-8")
## 4 x 7 sparse Matrix of class "ngCMatrix"
##   a b a c b c A
## A . . . . . . |
## a | . | . . . .
## b . | . . | . .
## c . . . | . | .

pwMatrix

The principle of pwMatrix (“part-whole Matrix”) is so simple that it seems almost trivial. However, in combination with a bit matrix algebra the utility of this simple sparse Matrix is very impressive. First consider a simple vector of strings. Basically, pwMatrix looks at all parts of the strings (split into parts as determined by the separator sep). In the example below, The row and column names are added to the matrix by simplify = TRUE, and a gap has been added between the strings (like treating them as a sentence with boundaries).

strings <- c("this", "is", "that")
(PW <- pwMatrix(strings, sep = "", simplify = TRUE, gap.length = 1, gap.symbol = "_") )
## 12 x 3 sparse Matrix of class "ngCMatrix"
##   this is that
## t    |  .    .
## h    |  .    .
## i    |  .    .
## s    |  .    .
## _    .  .    .
## i    .  |    .
## s    .  |    .
## _    .  .    .
## t    .  .    |
## h    .  .    |
## a    .  .    |
## t    .  .    |

By itself, this part-whole Matrix is not very interesting. It becomes more interesting when we also consider the type-token Matrix of the “parts” (i.e. rownames) of the part-whole Matrix. The matrix product of these two matrices gives a sparse way to count how often each “part” occurs in each “whole”. This is a quick and sparse way to count parts.

TT <- ttMatrix(rownames(PW), simplify = TRUE)
printSpMatrix(TT, col.names = TRUE)
##   t h i s _ i s _ t h a t
## _ . . . . | . . | . . . .
## a . . . . . . . . . . | .
## h . | . . . . . . . | . .
## i . . | . . | . . . . . .
## s . . . | . . | . . . . .
## t | . . . . . . . | . . |
(TT*1) %*% (PW*1)
## 6 x 3 sparse Matrix of class "dgCMatrix"
##   this is that
## _    .  .    .
## a    .  .    1
## h    1  .    1
## i    1  1    .
## s    1  1    .
## t    1  .    2

Even more amazing is the following trick. Using a “upper-shift” matrix (i.e. a matrix with 1s on the diagonal one above the main diagonal), then the following simple matrix multiplication lists the number of adjacent combinations (i.e. how often is one letter followed by another in the strings? First element in the rows, second in the columns):

S <- bandSparse( n = ncol(TT), k = 1)
(TT*1) %*% (S*1) %*% t(TT*1)
## 6 x 6 sparse Matrix of class "dgCMatrix"
##   _ a h i s t
## _ . . . 1 . 1
## a . . . . . 1
## h . 1 . 1 . .
## i . . . . 2 .
## s 2 . . . . .
## t . . 2 . . .

This kind of computations start making a lot of sense once the number of “parts” start rising quickly, e.g. when looking at different words in sentences.