A few other packages already offer functions to detect linearly dependent columns. Here are the ones we are aware of:
library(fullRankMatrix)
# let's say we have 10 fruit salads and indicate which ingredients are present in each salad
strawberry <- c(1,1,1,1,0,0,0,0,0,0)
poppyseed <- c(0,0,0,0,1,1,1,0,0,0)
orange <- c(1,1,1,1,1,1,1,0,0,0)
pear <- c(0,0,0,1,0,0,0,1,1,1)
mint <- c(1,1,0,0,0,0,0,0,0,0)
apple <- c(0,0,0,0,0,0,1,1,1,1)
# let's pretend we know how each fruit influences the sweetness of a fruit salad
# in this case we say that strawberries and oranges have the biggest influence on sweetness
set.seed(30)
strawberry_sweet <- strawberry * rnorm(10, 4)
poppyseed_sweet <- poppyseed * rnorm(10, 0.1)
orange_sweet <- orange * rnorm(10, 5)
pear_sweet <- pear * rnorm(10, 0.5)
mint_sweet <- mint * rnorm(10, 1)
apple_sweet <- apple * rnorm(10, 2)
sweetness <- strawberry_sweet + poppyseed_sweet + orange_sweet + pear_sweet +
  mint_sweet + apple_sweet
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)
caret::findLinearCombos(): https://rdrr.io/cran/caret/man/findLinearCombos.html
This function identifies which columns are linearly dependent and suggests which columns to remove. However, it doesn’t rename the remaining columns to indicate that any significant association with them is actually an association with the space spanned by the originally linearly dependent columns. Simply removing the suggested columns and then fitting the linear model would lead to erroneous interpretation.
Fitting a linear model with the orange column removed would lead to the erroneous interpretation that strawberry and poppyseed have the biggest influence on the fruit salad sweetness, but we know it is actually strawberry and orange.
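# find the linearly dependent columns; $remove holds the column indices caret suggests dropping
caret_result <- caret::findLinearCombos(mat)
mat_caret <- mat[, -caret_result$remove]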
fit <- lm(sweetness ~ mat_caret + 0)
print(summary(fit))
#>
#> Call:
#> lm(formula = sweetness ~ mat_caret + 0)
#>
#> Residuals:
#> 1 2 3 4 5 6 7 8
#> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963
#> 9 10
#> 1.26385 -2.58670
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> mat_caretstrawberry 8.9087 2.0267 4.396 0.00705 **
#> mat_caretpoppyseed 6.5427 1.5544 4.209 0.00842 **
#> mat_caretpear 1.2800 2.3056 0.555 0.60269
#> mat_caretmint 0.6582 2.6242 0.251 0.81193
#> mat_caretapple 1.2595 2.2526 0.559 0.60019
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.357 on 5 degrees of freedom
#> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007
#> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824
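The interpretation problem can also be seen without any extra package. The following is a minimal base R sketch, not part of the original example: the response `y` is hypothetical random data, and the names `X_no_orange` and `X_no_poppyseed` are introduced only for illustration. Because strawberry + poppyseed equals orange, dropping orange or dropping poppyseed leaves matrices that span the same space, so both fits agree and the same estimate simply appears under a different column name.

```r
# Rebuild the design matrix from the fruit salad example
strawberry <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
poppyseed  <- c(0, 0, 0, 0, 1, 1, 1, 0, 0, 0)
orange     <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
pear       <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1)
mint       <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
apple      <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

# Hypothetical response, used only for illustration (not the sweetness values above)
set.seed(1)
y <- rnorm(10)

# Two full rank versions of the same matrix: one drops orange, the other drops poppyseed
X_no_orange    <- mat[, colnames(mat) != "orange"]
X_no_poppyseed <- mat[, colnames(mat) != "poppyseed"]

fit_no_orange    <- lm(y ~ X_no_orange + 0)
fit_no_poppyseed <- lm(y ~ X_no_poppyseed + 0)

# Both matrices span the same space, so the fitted values agree
same_fit <- isTRUE(all.equal(unname(fitted(fit_no_orange)),
                             unname(fitted(fit_no_poppyseed))))

# The estimate labelled "poppyseed" in one fit is labelled "orange" in the other
coef_poppyseed <- unname(coef(fit_no_orange)["X_no_orangepoppyseed"])
coef_orange    <- unname(coef(fit_no_poppyseed)["X_no_poppyseedorange"])
same_coef <- isTRUE(all.equal(coef_poppyseed, coef_orange))

print(c(same_fit = same_fit, same_coef = same_coef))
```

Which of the two labels ends up in the model summary depends only on which column happened to be removed, which is why the remaining column names alone can be misleading.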
WeightIt::make_full_rank(): https://rdrr.io/cran/WeightIt/man/make_full_rank.html
This function removes some of the linearly dependent columns to create a full rank matrix, but doesn’t rename the remaining columns accordingly. It isn’t clear to the user which columns were linearly dependent, and they can’t choose which column is removed.
mat_weightit <- WeightIt::make_full_rank(mat, with.intercept = FALSE)
mat_weightit
#> strawberry poppyseed pear mint apple
#> [1,] 1 0 0 1 0
#> [2,] 1 0 0 1 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 1 0 0
#> [5,] 0 1 0 0 0
#> [6,] 0 1 0 0 0
#> [7,] 0 1 0 0 1
#> [8,] 0 0 1 0 1
#> [9,] 0 0 1 0 1
#> [10,] 0 0 1 0 1
As above, fitting a linear model with this full rank matrix would lead to the erroneous interpretation that strawberry and poppyseed influence the sweetness, but we know it is actually strawberry and orange.
fit <- lm(sweetness ~ mat_weightit + 0)
print(summary(fit))
#>
#> Call:
#> lm(formula = sweetness ~ mat_weightit + 0)
#>
#> Residuals:
#> 1 2 3 4 5 6 7 8
#> -2.00934 2.00934 -1.34248 1.34248 0.92807 -2.27054 1.34248 -0.01963
#> 9 10
#> 1.26385 -2.58670
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> mat_weightitstrawberry 8.9087 2.0267 4.396 0.00705 **
#> mat_weightitpoppyseed 6.5427 1.5544 4.209 0.00842 **
#> mat_weightitpear 1.2800 2.3056 0.555 0.60269
#> mat_weightitmint 0.6582 2.6242 0.251 0.81193
#> mat_weightitapple 1.2595 2.2526 0.559 0.60019
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.357 on 5 degrees of freedom
#> Multiple R-squared: 0.9504, Adjusted R-squared: 0.9007
#> F-statistic: 19.15 on 5 and 5 DF, p-value: 0.002824
plm::detect.lindep(): https://rdrr.io/cran/plm/man/detect.lindep.html
The function returns which columns are potentially linearly dependent.
plm::detect.lindep(mat)
#> [1] "Suspicious column number(s): 1, 2, 3"
#> [1] "Suspicious column name(s): strawberry, poppyseed, orange"
However, it doesn’t capture all cases. For example, here plm::detect.lindep() says there are no dependent columns, while there are several:
c1 <- rbinom(10, 1, .4)
c2 <- 1-c1
c3 <- integer(10)
c4 <- c1
c5 <- 2*c2
c6 <- rbinom(10, 1, .8)
c7 <- c5+c6
mat_test <- as.matrix(data.frame(c1,c2,c3,c4,c5,c6,c7))
plm::detect.lindep(mat_test)
#> [1] "No linear dependent column(s) detected."
fullRankMatrix captures these cases:
result <- make_full_rank_matrix(mat_test)
result$matrix
#> (c1_AND_c4) SPACE_1_AXIS1 SPACE_1_AXIS2
#> [1,] 1 0.0000000 4.111431e-16
#> [2,] 0 -0.4082483 -5.419613e-17
#> [3,] 1 0.0000000 7.071068e-01
#> [4,] 0 -0.4082483 1.083923e-17
#> [5,] 1 0.0000000 7.071068e-01
#> [6,] 0 -0.4082483 1.083923e-17
#> [7,] 0 -0.4082483 1.083923e-17
#> [8,] 0 -0.4082483 1.083923e-17
#> [9,] 1 0.0000000 0.000000e+00
#> [10,] 0 -0.4082483 1.083923e-17
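As an independent cross-check of the example above, the numerical rank from base R's `qr()` can be compared with the number of columns. This is only a sketch: the seed is an arbitrary choice made here for reproducibility (the original example sets none), so the random columns differ from those shown above, and `rank_test` and `is_rank_deficient` are names introduced for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only to make this sketch reproducible

# Same construction as in the example above
c1 <- rbinom(10, 1, .4)
c2 <- 1 - c1
c3 <- integer(10)   # all zeros
c4 <- c1            # duplicate of c1
c5 <- 2 * c2        # multiple of c2
c6 <- rbinom(10, 1, .8)
c7 <- c5 + c6       # sum of two other columns
mat_test <- as.matrix(data.frame(c1, c2, c3, c4, c5, c6, c7))

# Numerical rank from a QR decomposition
rank_test <- qr(mat_test)$rank

# Fewer independent directions than columns means linear dependence is present
is_rank_deficient <- rank_test < ncol(mat_test)
print(c(rank = rank_test, columns = ncol(mat_test)))
```

By construction at most c1, c2 and c6 can be independent, so the rank can never exceed 3 for these 7 columns, whatever the random draws are.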
Smisc::findDepMat(): https://rdrr.io/cran/Smisc/man/findDepMat.html
NOTE: this package was removed from CRAN as of 2020-01-26 (https://CRAN.R-project.org/package=Smisc) due to failing checks.
This function indicates linearly dependent rows/columns, but it doesn’t state which rows/columns are linearly dependent with each other.
However, this function does not seem to work well for one-hot encoded matrices, and the package no longer seems to be updated (see this issue: https://github.com/pnnl/Smisc/issues/24).
# example provided by Smisc documentation
Y <- matrix(c(1, 3, 4,
2, 6, 8,
7, 2, 9,
4, 1, 7,
3.5, 1, 4.5), byrow = TRUE, ncol = 3)
Smisc::findDepMat(t(Y), rows = FALSE)
Trying with the model matrix from our example above:
Smisc::findDepMat(mat, rows=FALSE)
#> Error in if (!depends[j]) { : missing value where TRUE/FALSE needed
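For comparison, which columns depend on which can be worked out with base R alone. The sketch below is one possible approach, not how any of the packages above (or fullRankMatrix) work internally: it relies on the default `qr()` moving columns it finds to be dependent to the end of its pivot order, and then regresses each such column on the independent ones. The names `independent_cols`, `dependent_cols` and `combos` are introduced here for illustration.

```r
# Rebuild the design matrix from the fruit salad example
strawberry <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
poppyseed  <- c(0, 0, 0, 0, 1, 1, 1, 0, 0, 0)
orange     <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
pear       <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1)
mint       <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
apple      <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
mat <- cbind(strawberry, poppyseed, orange, pear, mint, apple)

qr_mat   <- qr(mat)
rank_mat <- qr_mat$rank

# The first `rank_mat` pivot entries index independent columns;
# the remaining entries index columns that depend on them
pivot <- qr_mat$pivot
independent_cols <- colnames(mat)[pivot[seq_len(rank_mat)]]
dependent_cols   <- colnames(mat)[pivot[-seq_len(rank_mat)]]

# Express each dependent column as a linear combination of the independent ones
combos <- qr.coef(qr(mat[, independent_cols]),
                  mat[, dependent_cols, drop = FALSE])
combos <- round(combos, 10)  # remove floating point noise

print(dependent_cols)
print(combos)
```

Which column gets flagged as dependent follows from the column order of the input, so this only reports one of several equivalent ways to describe the same dependence.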