The caret package

The spectacles package focuses on making the handling of spectral data (along with its associated attribute data) easy: by design, the tasks of tuning and fitting prediction models (either for regression or classification) are out of scope. Rather than re-implementing those routines, spectacles delegates these tasks to the numerous R packages that facilitate this. In particular, the package works very well with the caret package.
Here we demonstrate a simple example of tuning and fitting a prediction model for the soil organic carbon content field of the australia dataset that is shipped with the spectacles package. This vignette assumes some basic understanding of the caret package. The reader is in particular referred to the short introduction to caret vignette.
The australia dataset, shipped with the spectacles package, is a collection of 100 visible near-infrared spectra collected on air-dried soils. More information about this dataset is available on its manual page (?australia). The dataset can be loaded quickly using the load_oz function:
# This loads the "australia" example dataset
oz <- load_oz()
#> Wavelength range: 350 to 2500 nm
#> Spectral resolution: 1 nm
This creates the object oz, of class SpectraDataFrame. A dedicated vignette will be available to explain the creation of SpectraDataFrame objects from scratch.
Pre-processing will be the focus of a dedicated vignette. Here we keep things simple, and limit pre-processing to removing the splice steps that affect the spectra. This is done using the splice function:
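This is a minimal sketch; splice is assumed here to be called with its default settings (the exact splice-step wavelengths are not shown in this vignette):

# Remove the splice steps affecting the spectra (default settings assumed)
oz <- splice(oz)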
First, the australia
dataset is split into a calibration
and a validation set. Here we keep things simple, and use a 75%–25%
random split:
set.seed(1) # To make the split reproducible
idx <- sample(1:nrow(oz), size = 75)
oz_calib <- oz[idx, ]
oz_valid <- oz[-idx, ]
Note that SpectraDataFrame objects can be subsetted simply by using the [ operator.
The main spectacles function used to interface with caret is the spectra function, which extracts the spectral matrix that is associated with analytical data. This matrix represents the predictors used to predict a given outcome ("x"), while the outcome of the model ("y") is an analytical attribute, and can be extracted from the SpectraDataFrame object using the $ operator:
# The `spectra` function extracts the spectral matrix...
spec_mat <- spectra(oz)
big.head(spec_mat)
#> X350 X351 X352 X353 X354 ... X2496
#> 28 0.08173233 0.08074037 0.08312159 0.08690414 0.08761925 ... 0.3888433
#> 36 0.10344727 0.09767456 0.09267905 0.09155867 0.10600654 ... 0.3035470
#> 136 0.11363971 0.13762568 0.14234437 0.13608806 0.13451828 ... 0.3524803
#> 194 0.16374375 0.16650400 0.16646151 0.16540240 0.16768558 ... 0.7547905
#> 215 0.09693917 0.10563760 0.10342025 0.09727478 0.10408881 ... 0.4397273
#> X2497 X2498 X2499 X2500
#> 28 0.3942973 0.3934600 0.3889097 0.3864646
#> 36 0.3016855 0.3077460 0.3116423 0.3036998
#> 136 0.3526042 0.3598545 0.3576893 0.3460697
#> 194 0.7552416 0.7525422 0.7513548 0.7567009
#> 215 0.4433315 0.4389216 0.4340123 0.4373087
# ... while analytical data can be accessed using `$`
oz$carbon
#> [1] 0.63 0.69 1.44 0.33 1.12 1.14 0.72 1.88 0.30 1.60 0.97 1.16 1.84 1.52 1.44
#> [16] 1.54 1.04 1.57 1.18 1.10 1.09 1.00 0.58 0.59 0.52 0.55 0.59 0.82 3.12 0.65
#> [31] 0.73 0.67 1.08 0.79 0.55 0.38 1.31 0.87 1.81 0.87 1.11 3.13 0.49 1.10 1.17
#> [46] 0.75 1.42 0.72 0.20 3.11 0.11 0.62 0.56 1.21 2.24 0.09 0.05 0.58 1.25 0.80
#> [61] 1.25 1.09 1.29 1.51 0.98 3.15 1.46 2.10 4.39 2.60 6.44 3.06 3.80 3.94 3.25
#> [76] 2.34 2.27 5.96 7.91 9.19 4.28 3.63 5.09 8.77 2.89 4.94 5.05 4.66 8.71 4.60
#> [91] 2.13 3.44 3.73 3.00 3.43 9.53 7.27 7.38 3.02 4.45
Therefore, the train function can simply be used by populating its x and y arguments:
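The exact call used to create the first model, fit1, is not shown above; the following is a minimal sketch, assuming a PLS model restricted to 3 latent variables (as mentioned further down in this vignette):

library(caret)

# A first, simple PLS model, deliberately limited to 3 latent variables
fit1 <- train(
  x = spectra(oz_calib),
  y = oz_calib$carbon,
  method = "pls",
  tuneGrid = data.frame(ncomp = 1:3)
)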
The spectacles package even provides summary functions akin to those in caret, but that work better for spectroscopy. The spectroSummary function works like the defaultSummary function in caret, but adds indicators that are popular in spectroscopy, such as RPD, RPIQ, or CCC:
fit2 <- train(
x = spectra(oz_calib),
y = oz_calib$carbon,
method = "pls",
tuneLength = 10,
trControl = trainControl(
# Here we can specify the summary function used during parametrisation
summaryFunction = spectroSummary
),
  # Here we can specify which metric to use to optimise the model parameters
metric = "RPIQ"
)
The parametrisation of the resulting model can be plotted and inspected using the usual caret tools:
print(fit2)
#> Partial Least Squares
#>
#> 75 samples
#> 2151 predictors
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 75, 75, 75, 75, 75, 75, ...
#> Resampling results across tuning parameters:
#>
#>   ncomp  RMSE      Rsquared    RPD       RPIQ      CCC          Bias          SE
#>    1     2.270830  0.04109954  1.009247  1.040409  0.06355689    0.028842359  2.313951
#>    2     2.112723  0.18565942  1.086623  1.122197  0.29007126    0.045418074  2.152752
#>    3     1.925908  0.34933470  1.222142  1.264717  0.48486626   -0.001460203  1.962224
#>    4     1.791195  0.45828944  1.301441  1.338876  0.60917078   -0.016411213  1.825044
#>    5     1.506213  0.62339030  1.539812  1.586311  0.75342460    0.034489914  1.534687
#>    6     1.397651  0.68215657  1.673905  1.717462  0.79435174    0.078286879  1.424066
#>    7     1.309171  0.71306886  1.760404  1.802363  0.82416014   -0.006024811  1.333973
#>    8     1.359387  0.69369083  1.714036  1.756475  0.81396694    0.035286883  1.385142
#>    9     1.504208  0.65452690  1.553504  1.581330  0.78533423    0.042897974  1.532704
#>   10     1.572661  0.63015038  1.492047  1.516139  0.76944793    0.020828935  1.602432
#>
#> RPIQ was used to select the optimal model using the largest value.
#> The final value used for the model was ncomp = 7.
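The tuning profile mentioned above can also be plotted with the usual plot method from caret; a minimal sketch:

# Plot the resampling profile (the selection metric, here RPIQ,
# against the number of latent variables)
plot(fit2)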
Different models can also be compared:
preds <- extractPrediction(
# Here we specify the `caret` models we want to compare
models = list(
pls1 = fit1,
pls2 = fit2
),
testX = spectra(oz_valid),
testY = oz_valid$carbon
)
# Necessary so the 2 PLS models can be compared in `plotObsVsPred`
preds$model <- preds$object
plotObsVsPred(preds)
The model fit2 outperforms the model fit1: hardly a surprise, as we limited the latter to 3 latent variables, which is clearly too few in this instance.
Finally, those specific performance indicators can also be used to assess the final results. PLS predictions can be generated using the predict function from the caret package, and its result passed to postResampleSpectro:
# Simple example for the entire dataset
postResampleSpectro(
  pred = predict(fit2, spectra(oz)),
  obs = oz$carbon
)
Again, this function mimics its caret equivalent, postResample.
A more useful thing to do, from a modelling standpoint, is to compare those performance results across the calibration, validation, and bootstrapped sets (especially the latter two):
library(dplyr)

# Run model predictions and extract performance statistics for
# calibration and validation
res_calibration <- postResampleSpectro(pred = predict(fit2, spectra(oz_calib)), obs = oz_calib$carbon)
res_validation <- postResampleSpectro(pred = predict(fit2, spectra(oz_valid)), obs = oz_valid$carbon)
# Bootstrapped results can be extracted from the `train` object:
res_boot <- fit2$results %>%
  filter(ncomp == fit2$bestTune$ncomp) %>%
  select(all_of(names(res_calibration)))
# Assemble the calibration, validation, and
# bootstrapped results in a single data.frame
res <- rbind(
data.frame(type = "Calibration", t(res_calibration)),
data.frame(type = "Validation", t(res_validation)),
data.frame(type = "Bootstrap", res_boot)
)
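The assembled data frame can then be rendered as a table; a minimal sketch, assuming knitr is used for rendering (the original rendering code is not shown in this vignette):

# Render the results, rounded to 2 decimal places
knitr::kable(res, digits = 2, row.names = FALSE)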
Which gives the following results:
| type        | RMSE | Rsquared | RPD  | RPIQ | CCC  | Bias  | SE   |
|-------------|------|----------|------|------|------|-------|------|
| Calibration | 1.09 | 0.78     | 2.15 | 2.09 | 0.88 |  0.00 | 1.10 |
| Validation  | 1.21 | 0.55     | 1.32 | 0.83 | 0.73 |  0.24 | 1.24 |
| Bootstrap   | 1.31 | 0.71     | 1.76 | 1.80 | 0.82 | -0.01 | 1.33 |