fullRankMatrix - Comparison to other packages

Other available packages that detect linearly dependent columns

Several other packages already offer functions to detect linearly dependent columns. Below we compare the ones we are aware of, using the following example:

library(fullRankMatrix)

# let's say we have 10 fruit salads and indicate which ingredients are present in each salad
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
orange <- c(1,1,1,1,1,1,1,0,0,0)
pear <- c(0,0,0,1,0,0,0,1,1,1)
mint <- c(1,1,0,0,0,0,0,0,0,0)
apple <- c(0,0,0,0,0,0,1,1,1,1)

# let's pretend we know how each fruit influences the sweetness of a fruit salad
# in this case we say that strawberries and oranges have the biggest influence on sweetness
set.seed(30)
strawberry_sweet <- strawberry * rnorm(10, 4)
poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
orange_sweet <- orange * rnorm(10, 5)
pear_sweet <- pear * rnorm(10, 0.5)
mint_sweet <- mint * rnorm(10, 1)
apple_sweet <- apple * rnorm(10, 2)

sweetness <- strawberry_sweet + poppyseed_sweet + orange_sweet + pear_sweet +
  mint_sweet + apple_sweet

mat <- cbind(strawberry,poppyseed,orange,pear,mint,apple)
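
As a quick sanity check before comparing packages, the following base R sketch confirms that this matrix is rank deficient. It uses only `qr()` from base R; the name `mat_rank` is ours and purely illustrative.

```r
# Rebuild the ingredient matrix from the example above
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# Numerical rank from a QR decomposition; a rank below ncol(mat) means
# at least one column is a linear combination of the others
mat_rank <- qr(mat)$rank
mat_rank < ncol(mat)

# The dependency built into this example: orange = strawberry + poppyseed
all(orange == strawberry + poppyseed)
```

Both checks should evaluate to TRUE: every salad that contains strawberry or poppyseed also contains orange, so the three columns cannot be told apart by a linear model.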

caret::findLinearCombos(): https://rdrr.io/cran/caret/man/findLinearCombos.html

This function identifies which columns are linearly dependent and suggests which columns to remove. However, it doesn’t rename the remaining columns to indicate that any significant association with them is actually an association with the space spanned by the originally linearly dependent columns. Simply removing the suggested columns and then fitting the linear model would lead to erroneous interpretation.

caret_result <- caret::findLinearCombos(mat)
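
To illustrate why a single "column to remove" is somewhat arbitrary, here is a base R sketch that locates a redundant column through the pivoting of `qr()`. This is our own illustration under stated assumptions, not a description of how caret::findLinearCombos() is implemented; the names `qr_mat` and `redundant` are hypothetical.

```r
# Same ingredient matrix as in the example above
mat <- cbind(
  strawberry = c(1,1,1,1,0,0,0,0,0,0),
  poppyseed  = c(0,0,0,0,1,1,1,0,0,0),
  orange     = c(1,1,1,1,1,1,1,0,0,0),
  pear       = c(0,0,0,1,0,0,0,1,1,1),
  mint       = c(1,1,0,0,0,0,0,0,0,0),
  apple      = c(0,0,0,0,0,0,1,1,1,1)
)

qr_mat <- qr(mat)

# qr() moves columns that add no new direction to the end of the pivot order,
# so everything after the first `rank` pivots is redundant given the earlier columns
redundant <- colnames(mat)[qr_mat$pivot[-seq_len(qr_mat$rank)]]
redundant
```

Which column ends up flagged depends on the column order: strawberry, poppyseed and orange are jointly dependent, and any one of them could be expressed through the other two.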

Fitting a linear model with the orange column removed would lead to the erroneous interpretation that strawberry and poppyseed have the biggest influence on the fruit salad sweetness, but we know it is actually strawberry and orange.

mat_caret <- mat[, -caret_result$remove]
fit <- lm(sweetness ~ mat_caret + 0)
print(summary(fit))
#> 
#> Call:
#> lm(formula = sweetness ~ mat_caret + 0)
#> 
#> Residuals:
#>        1        2        3        4        5        6        7        8 
#> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
#>        9       10 
#>  1.26385 -2.58670 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)   
#> mat_caretstrawberry   8.9087     2.0267   4.396  0.00705 **
#> mat_caretpoppyseed    6.5427     1.5544   4.209  0.00842 **
#> mat_caretpear         1.2800     2.3056   0.555  0.60269   
#> mat_caretmint         0.6582     2.6242   0.251  0.81193   
#> mat_caretapple        1.2595     2.2526   0.559  0.60019   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.357 on 5 degrees of freedom
#> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
#> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824
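
The labelling problem can be made concrete with a small base R sketch (our own illustration; the names `fit_no_orange` and `fit_no_poppyseed` are hypothetical). Dropping poppyseed instead of orange gives exactly the same fit, and the coefficient that was labelled poppyseed reappears under the label orange.

```r
# Recreate the simulated data from the example above
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed  <- c(0,0,0,0,1,1,1,0,0,0)
orange     <- c(1,1,1,1,1,1,1,0,0,0)
pear       <- c(0,0,0,1,0,0,0,1,1,1)
mint       <- c(1,1,0,0,0,0,0,0,0,0)
apple      <- c(0,0,0,0,0,0,1,1,1,1)

set.seed(30)
sweetness <- strawberry * rnorm(10, 4) + poppyseed * rnorm(10, 0.1) +
  orange * rnorm(10, 5) + pear * rnorm(10, 0.5) +
  mint * rnorm(10, 1) + apple * rnorm(10, 2)

# Model 1 drops orange, model 2 drops poppyseed instead
fit_no_orange    <- lm(sweetness ~ strawberry + poppyseed + pear + mint + apple + 0)
fit_no_poppyseed <- lm(sweetness ~ strawberry + orange + pear + mint + apple + 0)

# Both column sets span the same space, so the fitted values agree
all.equal(fitted(fit_no_orange), fitted(fit_no_poppyseed))

# The same estimate appears under two different ingredient names
all.equal(unname(coef(fit_no_orange)["poppyseed"]),
          unname(coef(fit_no_poppyseed)["orange"]))
```

Since orange = strawberry + poppyseed, the data cannot distinguish the two labellings; the estimate belongs to the space spanned by the three columns rather than to any single ingredient.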

WeightIt::make_full_rank(): https://rdrr.io/cran/WeightIt/man/make_full_rank.html

This function removes some of the linearly dependent columns to create a full rank matrix, but doesn’t rename the remaining columns accordingly. The user cannot tell which columns were linearly dependent, nor choose which column is removed.

mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE)
mat_weightit
#>       strawberry poppyseed pear mint apple
#>  [1,]          1         0    0    1     0
#>  [2,]          1         0    0    1     0
#>  [3,]          1         0    0    0     0
#>  [4,]          1         0    1    0     0
#>  [5,]          0         1    0    0     0
#>  [6,]          0         1    0    0     0
#>  [7,]          0         1    0    0     1
#>  [8,]          0         0    1    0     1
#>  [9,]          0         0    1    0     1
#> [10,]          0         0    1    0     1

As above, fitting a linear model with this full rank matrix would lead to the erroneous interpretation that strawberry and poppyseed influence the sweetness, but we know it is actually strawberry and orange.

fit <- lm(sweetness ~ mat_weightit + 0)
print(summary(fit))
#> 
#> Call:
#> lm(formula = sweetness ~ mat_weightit + 0)
#> 
#> Residuals:
#>        1        2        3        4        5        6        7        8 
#> -2.00934  2.00934 -1.34248  1.34248  0.92807 -2.27054  1.34248 -0.01963 
#>        9       10 
#>  1.26385 -2.58670 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)   
#> mat_weightitstrawberry   8.9087     2.0267   4.396  0.00705 **
#> mat_weightitpoppyseed    6.5427     1.5544   4.209  0.00842 **
#> mat_weightitpear         1.2800     2.3056   0.555  0.60269   
#> mat_weightitmint         0.6582     2.6242   0.251  0.81193   
#> mat_weightitapple        1.2595     2.2526   0.559  0.60019   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.357 on 5 degrees of freedom
#> Multiple R-squared:  0.9504, Adjusted R-squared:  0.9007 
#> F-statistic: 19.15 on 5 and 5 DF,  p-value: 0.002824

plm::detect.lindep(): https://rdrr.io/cran/plm/man/detect.lindep.html

This function reports which columns are potentially linearly dependent.

plm::detect.lindep(mat)
#> [1] "Suspicious column number(s): 1, 2, 3"
#> [1] "Suspicious column name(s):   strawberry, poppyseed, orange"

However, it doesn’t capture all cases. For example, here plm::detect.lindep() reports no dependent columns, although there are several:

c1 <- rbinom(10, 1, .4)
c2 <- 1-c1
c3 <- integer(10)
c4 <- c1
c5 <- 2*c2
c6 <- rbinom(10, 1, .8)
c7 <- c5+c6
mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7))

plm::detect.lindep(mat_test)
#> [1] "No linear dependent column(s) detected."
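
That this matrix does contain dependencies can be verified with base R alone. The sketch below is ours; in particular, `set.seed(1)` is an arbitrary assumption (the example above does not set a seed), so the random draws will differ from those shown elsewhere in this document. The checks hold for any draw because the dependencies follow from how the columns are constructed.

```r
set.seed(1)  # arbitrary seed, our assumption; the original example sets none
c1 <- rbinom(10, 1, .4)
c2 <- 1 - c1
c3 <- integer(10)
c4 <- c1
c5 <- 2 * c2
c6 <- rbinom(10, 1, .8)
c7 <- c5 + c6
mat_test <- as.matrix(data.frame(c1, c2, c3, c4, c5, c6, c7))

# Seven columns, but at most three independent directions (c1, c2 and c6)
ncol(mat_test)
qr(mat_test)$rank

# The dependencies hold by construction
all(c3 == 0)        # zero column
all(c4 == c1)       # duplicate column
all(c5 == 2 * c2)   # scaled column
all(c7 == c5 + c6)  # sum of two columns
```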

fullRankMatrix captures these cases:

result <- make_full_rank_matrix(mat_test)
result$matrix
#>       (c1_AND_c4) SPACE_1_AXIS1 SPACE_1_AXIS2
#>  [1,]           1     0.0000000  4.111431e-16
#>  [2,]           0    -0.4082483 -5.419613e-17
#>  [3,]           1     0.0000000  7.071068e-01
#>  [4,]           0    -0.4082483  1.083923e-17
#>  [5,]           1     0.0000000  7.071068e-01
#>  [6,]           0    -0.4082483  1.083923e-17
#>  [7,]           0    -0.4082483  1.083923e-17
#>  [8,]           0    -0.4082483  1.083923e-17
#>  [9,]           1     0.0000000  0.000000e+00
#> [10,]           0    -0.4082483  1.083923e-17

Smisc::findDepMat(): https://rdrr.io/cran/Smisc/man/findDepMat.html

NOTE: this package was removed from CRAN as of 2020-01-26 (https://CRAN.R-project.org/package=Smisc) due to failing checks.

This function indicates linearly dependent rows/columns, but it doesn’t state which rows/columns are linearly dependent with each other.

However, this function does not seem to work well for one-hot encoded matrices, and the package no longer seems to be maintained (see this issue: https://github.com/pnnl/Smisc/issues/24).

# example provided by Smisc documentation
Y <- matrix(c(1, 3, 4,
              2, 6, 8,
              7, 2, 9,
              4, 1, 7,
              3.5, 1, 4.5), byrow = TRUE, ncol = 3)
Smisc::findDepMat(t(Y), rows = FALSE)
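
For reference, the dependencies in this documentation example can be checked by hand with base R. This is a sketch of ours and does not reproduce the output of Smisc::findDepMat().

```r
Y <- matrix(c(1, 3, 4,
              2, 6, 8,
              7, 2, 9,
              4, 1, 7,
              3.5, 1, 4.5), byrow = TRUE, ncol = 3)

# Row 2 is twice row 1, and row 5 is half of row 3
all(Y[2, ] == 2 * Y[1, ])
all(Y[5, ] == 0.5 * Y[3, ])

# Hence t(Y) has five columns, of which only three are linearly independent
ncol(t(Y))
qr(t(Y))$rank
```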

Trying with the model matrix from our example above:

Smisc::findDepMat(mat, rows=FALSE)
#> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed