In this vignette we consider a simple scenario with a continuous dependent variable and a set of continuous predictors. First, we load the required packages and store the example dataset GSPCRexdata (see the help file ?GSPCRexdata for details) in two separate objects:
# Load R packages
library(gspcr) # this package!
library(superpc) # alternative comparison package
library(patchwork) # combining ggplots
# Load data
X <- GSPCRexdata$X$cont
y <- GSPCRexdata$y$cont
# Define names of measures to compare
fit_measure_vec <- c("LRT", "PR2", "MSE", "F", "AIC", "BIC")
superpc
This package extends the superpc R package by introducing more fit measures and threshold types, which allows for the consideration of a wider range of variable types. Here we want to show that the results obtained by superpc can be replicated by gspcr when using the same fit measure and threshold type.
To use superpc, we need to prepare the data in a different format.
# Prepare data for superpc
data.train <- list(
  x = t(as.matrix(scale(X))),
  y = y,
  featurenames = colnames(X)
)
We can then train the model with the superpc::superpc.train() function and observe the cross-validation solution paths.
# Train the model (computes the scores for each feature)
train.obj <- superpc.train(
  data = data.train,
  type = "regression"
)

# Cross-validate the model
cv.obj <- superpc.cv(
  fit = train.obj,
  data = data.train,
  min.features = 1,
  max.features = nrow(data.train$x),
  n.fold = 10,
  n.threshold = 20,
  n.components = 5
)

# Cross-validation solution paths
cv.obj_plot <- superpc.plotcv(cv.obj)
Finally, we can specify and train gspcr in the same way and compare the cross-validation solution paths.
# Train gspcr with the same specification as superpc
gspcr_superpc <- cv_gspcr(
  dv = y,
  ivs = X,
  fit_measure = "F",
  thrs = "normalized",
  nthrs = 20,
  npcs_range = 1:5,
  K = 10,
  min_features = 1,
  max_features = ncol(X)
)

# Plot the cross-validation curves
plot(
  gspcr_superpc,
  errorBars = TRUE,
  discretize = FALSE
)
We can also see that the threshold values computed are exactly the same:
# Report the threshold values
data.frame(
  superpc = round(cv.obj$thresholds, 3),
  gspcr = round(gspcr_superpc$thr, 3),
  diff = round(cv.obj$thresholds - gspcr_superpc$thr, 3)
)
superpc gspcr diff
1 0.051 0.051 0
2 0.398 0.398 0
3 0.745 0.745 0
4 1.093 1.093 0
5 1.440 1.440 0
6 1.787 1.787 0
7 2.135 2.135 0
8 2.482 2.482 0
9 2.829 2.829 0
10 3.177 3.177 0
11 3.524 3.524 0
12 3.871 3.871 0
13 4.219 4.219 0
14 4.566 4.566 0
15 4.913 4.913 0
16 5.260 5.260 0
17 5.608 5.608 0
18 5.955 5.955 0
19 6.302 6.302 0
20 6.650 6.650 0
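For an exact numerical check, beyond the rounding to three decimals above, we can also compare the two threshold vectors directly (this reuses the cv.obj and gspcr_superpc objects created above):
# Largest absolute difference between the two sets of thresholds
max(abs(cv.obj$thresholds - gspcr_superpc$thr))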
In this section, we want to showcase the effectiveness of using cross-validation to select the number of PCs for GSPCR.
We can estimate the same solution paths for GSPCR with and without K-fold cross-validation by changing the number of folds used. If we set K = 1 in the cv_gspcr() call, the data is assigned to a single fold, and the fit measures are evaluated on the same data the model is trained on.
First, let’s estimate the solution paths with K-fold cross-validation. To keep the comparison simple, we specify two options for the number of PCs (5 and 20), but any other set of values could be used. We train the model with the six available fit measures stored in the object fit_measure_vec.
# Train the GSPCR model with two options for the number of PCs
out_fit_meas_cv <- lapply(fit_measure_vec, function(i) {
  cv_gspcr(
    dv = y,
    ivs = X,
    K = 10,
    npcs_range = c(5, 20),
    fit_measure = i,
    thrs = "normalized"
  )
})
# Plot solution paths
plots <- lapply(seq_along(fit_measure_vec), function(i) {
  # Reverse y?
  rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

  # Make plots
  plot(
    x = out_fit_meas_cv[[i]],
    y = fit_measure_vec[[i]],
    labels = TRUE,
    y_reverse = rev,
    errorBars = FALSE,
    discretize = FALSE,
    print = FALSE
  )
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
Then, we repeat the same procedure, but we set K = 1 to avoid K-fold cross-validation.
# Train the GSPCR model without cross-validation (K = 1)
out_fit_meas_no_CV <- lapply(fit_measure_vec, function(i) {
  cv_gspcr(
    dv = y,
    ivs = X,
    K = 1,
    npcs_range = c(5, 20),
    fit_measure = i,
    thrs = "normalized"
  )
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
  # Reverse y?
  rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

  # Make plots
  plot(
    x = out_fit_meas_no_CV[[i]],
    y = fit_measure_vec[[i]],
    labels = TRUE,
    y_reverse = rev,
    errorBars = TRUE,
    discretize = FALSE,
    print = FALSE
  )
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
The plots already show that, without cross-validation, the LRT, PR2, and MSE fit measures would lead us to select the highest number of PCs provided (20 in this case). With K-fold cross-validation, the solution paths would lead us to choose 5 PCs instead.
We can also look at the solution tables to confirm our reading of the plots. The solutions we would have found using K-fold CV are:
# Standard solutions
res_CV <- sapply(
  seq_along(out_fit_meas_cv),
  function(meth) {
    as.numeric(out_fit_meas_cv[[meth]]$sol_table["standard", ])
  }
)

# Give meaningful names
dimnames(res_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_CV), 3)
thr_value thr_number Q
LRT 1.517 3 5
PR2 1.517 3 5
MSE 1.517 3 5
F 1.517 3 5
AIC 1.517 3 5
BIC 0.784 2 5
The solutions we would have obtained without using K-fold CV are:
# Standard solutions
res_no_CV <- sapply(
  seq_along(out_fit_meas_no_CV),
  function(meth) {
    as.numeric(out_fit_meas_no_CV[[meth]]$sol_table["standard", ])
  }
)

# Give meaningful names
dimnames(res_no_CV) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_no_CV), 3)
thr_value thr_number Q
LRT 1.517 3 20
PR2 1.517 3 20
MSE 1.517 3 20
F 1.517 3 5
AIC 1.517 3 5
BIC 1.517 3 5
As you can see, using CV we find the same solution no matter the fit measure, while without CV only AIC, BIC, and F are able to select a low number of PCs.
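To see the contrast at a glance, we can place the selected number of PCs (Q) from the two runs side by side, reusing the res_CV and res_no_CV objects created above:
# Compare the number of PCs selected with and without K-fold CV
data.frame(
  Q_CV = res_CV["Q", ],
  Q_no_CV = res_no_CV["Q", ]
)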
All fit measures except F struggle to account for model complexity. A simple 1-standard-error rule helps obviate this problem. First, fit the models with a range of possible numbers of components:
# Train the GSPCR model with many possible numbers of components
out_fit_meas <- lapply(fit_measure_vec, function(i) {
  cv_gspcr(
    dv = y,
    ivs = X,
    fam = "gaussian",
    nthrs = 10,
    npcs_range = 1:10,
    K = 10,
    fit_measure = i,
    thrs = "normalized",
    min_features = 1,
    max_features = ncol(X),
    oneSE = TRUE
  )
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
  # Reverse y?
  rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

  # Make plots
  plot(
    x = out_fit_meas[[i]],
    y = fit_measure_vec[[i]],
    labels = TRUE,
    y_reverse = rev,
    errorBars = TRUE,
    discretize = FALSE,
    print = FALSE
  )
})
# Patchwork ggplots
(plots[[1]] + plots[[2]] + plots[[3]]) / (plots[[4]] + plots[[5]] + plots[[6]])
Then, extract the solutions obtained by each:
# Standard solutions
res <- sapply(
  seq_along(out_fit_meas),
  function(meth) {
    as.numeric(out_fit_meas[[meth]]$sol_table["standard", ])
  }
)

# Give meaningful names
dimnames(res) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res), 3)
thr_value thr_number Q
LRT 0.784 2 4
PR2 0.784 2 4
MSE 1.517 3 5
F 2.250 4 1
AIC 1.517 3 3
BIC 2.250 4 1
Finally, you can check which solutions would be chosen by using the 1-standard-error rule:
# 1se solutions
res_1se <- sapply(
  seq_along(out_fit_meas),
  function(meth) {
    as.numeric(out_fit_meas[[meth]]$sol_table["oneSE", ])
  }
)

# Give meaningful names
dimnames(res_1se) <- list(c("thr_value", "thr_number", "Q"), fit_measure_vec)

# Print rounded results
round(t(res_1se), 3)
thr_value thr_number Q
LRT 1.517 3 3
PR2 1.517 3 3
MSE 1.517 3 3
F 1.517 3 1
AIC 2.250 4 1
BIC 2.250 4 2
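The logic behind these choices is simple: among all candidate solutions whose cross-validated fit is within one standard error of the best one, prefer the most parsimonious. The following minimal sketch illustrates that logic on hypothetical CV means and standard errors for an MSE-like measure (lower is better); it is only an illustration, not the internal gspcr implementation.
# Hypothetical cross-validated error and standard errors for 1 to 5 PCs
npcs <- 1:5
cv_mean <- c(2.10, 1.60, 1.45, 1.43, 1.42)
cv_se <- c(0.08, 0.07, 0.06, 0.06, 0.06)

# Best (minimum) CV error and its 1-standard-error bound
best <- which.min(cv_mean)
bound <- cv_mean[best] + cv_se[best]

# Most parsimonious number of PCs whose CV error is within one SE of the best
npcs[min(which(cv_mean <= bound))]
In this toy example the rule selects 3 PCs instead of the 5 PCs that minimize the CV error outright.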
To speed up the model-fitting process, it can be a good idea to use model-building strategies that are less time-consuming than CV. You can use the BIC fit measure without CV to select the appropriate threshold value and the number of components. To do so, set the number of folds to 1 and the fit measure to BIC or AIC.
# Define vector of measures to be used
fit_measure_vec <- c("LRT", "AIC", "BIC")

# Train the GSPCR model with the different fit measures
out_fit_meas <- lapply(fit_measure_vec, function(i) {
  cv_gspcr(
    dv = y,
    ivs = X,
    fam = "gaussian",
    nthrs = 10,
    npcs_range = c(1, 2, 5, 20),
    K = 1,
    fit_measure = i,
    thrs = "normalized",
    min_features = 1,
    max_features = ncol(X),
    oneSE = TRUE
  )
})
# Plot them
plots <- lapply(seq_along(fit_measure_vec), function(i) {
  # Reverse y?
  rev <- grepl("MSE|AIC|BIC", fit_measure_vec[i])

  # Make plots
  plot(
    x = out_fit_meas[[i]],
    y = fit_measure_vec[[i]],
    labels = TRUE,
    y_reverse = rev,
    errorBars = FALSE,
    discretize = FALSE,
    print = FALSE
  )
})
# Patchwork ggplots
plots[[1]] + plots[[2]] + plots[[3]]
You can also look at the solutions:
# Put solutions together
rbind(
  LRT = out_fit_meas[[1]]$sol_table["standard", ],
  AIC = out_fit_meas[[2]]$sol_table["standard", ],
  BIC = out_fit_meas[[3]]$sol_table["standard", ]
)
thr_value thr_number Q
LRT 1.517276 3 20
AIC 1.517276 3 5
BIC 1.517276 3 5