The basic functions of this package are almost trivial, but they allow for a highly flexible and efficient transformation of data into sparse matrices. The three basic functions are ttMatrix, pwMatrix, and jMatrix. This vignette presents a gentle introduction to these three functions.
The principle of ttMatrix (“type-token Matrix”) is that a sparse matrix is created from a vector. All unique elements of the vector are listed as rows (“types”), and the elements of the vector itself are listed as columns (“tokens”). The matrix shows which type is found at which position of the original vector. Take a simple character vector with repeating elements as an example: by default, ttMatrix returns the row names ($rownames) separately from the matrix itself ($M). The column names are of course identical to the input in data.
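A minimal example (a sketch: the input vector here is reconstructed from the output shown below, and we assume the package has been loaded):

library(qlcMatrix)
data <- c("a", "b", "a", "c", "b", "c", "A")
ttMatrix(data)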
## $M
## 4 x 7 sparse Matrix of class "ngCMatrix"
##
## [1,] . . . . . . |
## [2,] | . | . . . .
## [3,] . | . . | . .
## [4,] . . . | . | .
##
## $rownames
## [1] "A" "a" "b" "c"
You can also include the row and column names in the matrix itself by using the option simplify = TRUE. Note that in many cases of large sparse matrices, the same row or column names will be repeated in different sparse matrices. In such situations it is preferable to store the row and column names separately, as simply duplicating them might take a lot of space.
Also note that the ordering of the rows (the “types”) is crucial. This ordering is determined by the so-called collation.locale option, which defaults to “C” (meaning that the numerical ordering of the Unicode Standard is used). This ordering is different from the default ordering on most systems, which might internally use, for example, the collation locale “en_US.UTF-8”. The difference shows in the ordering of capital letters: they are ordered after the lowercase letters in the US English locale, but before them in the “C” locale, as in the example below (the same ordering as in the $rownames of the previous call).
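The simplified output below results from a call along the following lines:

ttMatrix(data, simplify = TRUE)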
## 4 x 7 sparse Matrix of class "ngCMatrix"
## a b a c b c A
## A . . . . . . |
## a | . | . . . .
## b . | . . | . .
## c . . . | . | .
The principle of pwMatrix (“part-whole Matrix”) is so simple that it seems almost trivial. However, in combination with a bit of matrix algebra, the utility of this simple sparse matrix is impressive. First consider a simple vector of strings. Basically, pwMatrix looks at all parts of the strings (split into parts as determined by the separator sep). In the example below, the row and column names are added to the matrix by simplify = TRUE, and a gap has been added between the strings (like treating them as a sentence with boundaries).
strings <- c("this", "is", "that")
(PW <- pwMatrix(strings, sep = "", simplify = TRUE, gap.length = 1, gap.symbol = "_"))
## 12 x 3 sparse Matrix of class "ngCMatrix"
## this is that
## t | . .
## h | . .
## i | . .
## s | . .
## _ . . .
## i . | .
## s . | .
## _ . . .
## t . . |
## h . . |
## a . . |
## t . . |
By itself, this part-whole Matrix is not very interesting. It becomes more interesting when we also consider the type-token Matrix of the “parts” (i.e. the rownames) of the part-whole Matrix. The matrix product of these two matrices counts how often each “part” occurs in each “whole”, which is a quick and sparse way to count parts.
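A sketch of this computation, reusing the PW matrix from above (the calls are reconstructed to be consistent with the output that follows):

(TT <- ttMatrix(rownames(PW), simplify = TRUE))
TT %*% PW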
## t h i s _ i s _ t h a t
## _ . . . . | . . | . . . .
## a . . . . . . . . . . | .
## h . | . . . . . . . | . .
## i . . | . . | . . . . . .
## s . . . | . . | . . . . .
## t | . . . . . . . | . . |
## 6 x 3 sparse Matrix of class "dgCMatrix"
## this is that
## _ . . .
## a . . 1
## h 1 . 1
## i 1 1 .
## s 1 1 .
## t 1 . 2
Even more amazing is the following trick. Using an “upper-shift” matrix (i.e. a matrix with ones on the diagonal just above the main diagonal), a simple matrix multiplication lists the number of adjacent combinations, i.e. how often one letter is followed by another in the strings (first element of the pair in the rows, second in the columns):
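One possible construction, assuming Matrix::bandSparse is used to build the shift matrix (the original code may have constructed it differently):

library(Matrix)
S <- bandSparse(n = ncol(TT), k = 1)  # ones on the superdiagonal only
TT %*% S %*% t(TT)                    # entry (i, j): how often type i is followed by type j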
## 6 x 6 sparse Matrix of class "dgCMatrix"
## _ a h i s t
## _ . . . 1 . 1
## a . . . . . 1
## h . 1 . 1 . .
## i . . . . 2 .
## s 2 . . . . .
## t . . 2 . . .
These kinds of computations start making a lot of sense once the number of “parts” rises quickly, e.g. when looking at the different words in sentences.
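As a hypothetical illustration (the sentences and object names here are invented for exposition, not taken from the package documentation):

sentences <- c("this is a sentence", "this is another one")
PW2 <- pwMatrix(sentences, sep = " ", simplify = TRUE)  # wholes = sentences, parts = words
TT2 <- ttMatrix(rownames(PW2), simplify = TRUE)         # types = unique words
TT2 %*% PW2  # counts how often each word occurs in each sentence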