Spectral modelling and predictions using the caret package

Pierre Roudier

2023-07-07

The spectacles package focuses on making the handling of spectral data (along with its associated attribute data) easy: by design, the tasks of tuning and fitting prediction models (either for regression or classification) are out-of-scope. Rather than re-implementing those routines, spectacles delegates these tasks to the numerous R packages that facilitates this. In particular, the package works very well with the caret package.

library(dplyr)
library(spectacles)
library(caret)

Here we demonstrate a simple example of tuning and fitting a prediction model for the soil organic carbon content field of the australia dataset that is shipped with the spectacles package. This vignette assumes some basic understanding of the caret package. The reader is in particular referred to the short introduction to caret vignette.

Loading the example dataset

The australia dataset, shipped with the spectacles package is a collection of 100 visible near-infrared spectra collected on air-dried soils. More information about this dataset is available on its manual page (?australia). The dataset can be loaded quickly using the load_oz function:

# This loads the "australia" example dataset
oz <- load_oz()
#> Wavelength range: 350 to 2500 nm
#> Spectral resolution: 1 nm

This creates the object oz, of class SpectraDataFrame. A dedicated vignette will be available to explain the creation of SpectraDataFrame objects from scratch.

Pre-processing

Pre-processing will be the focus of a dedicated vignette. Here we keep things simple, and limit pre-processingto remove the splice steps that affect the spectra. This is done using the splice function:

oz <- splice(oz)

Data split

First, the australia dataset is split into a calibration and a validation set. Here we keep things simple, and use a 75%–25% random split:

set.seed(1) # To make the split reproducible
idx <- sample(1:nrow(oz), size = 75)
oz_calib <- oz[idx, ]
oz_valid <- oz[-idx, ]

Note that SpectraDataFrame objects can be subsetted simply by using the [ operator.

Parametrisation of a PLS model

The main spectacles function used to interface with caret is the spectra function, which extracts the spectral matrix that is associated with analytical data. This matrix represents the predictors used to predict a given outcome (“x”), while the outcome of the model (“y”) is an analytical attribute, and can be extracted from the SpectraDataFrame object using the $ operator:

# The `spectra` function extracts the spectral matrix...
spec_mat <- spectra(oz)
big.head(spec_mat)
#>           X350       X351       X352       X353       X354 ...     X2496
#> 28  0.08173233 0.08074037 0.08312159 0.08690414 0.08761925 ... 0.3888433
#> 36  0.10344727 0.09767456 0.09267905 0.09155867 0.10600654 ... 0.3035470
#> 136 0.11363971 0.13762568 0.14234437 0.13608806 0.13451828 ... 0.3524803
#> 194 0.16374375 0.16650400 0.16646151 0.16540240 0.16768558 ... 0.7547905
#> 215 0.09693917 0.10563760 0.10342025 0.09727478 0.10408881 ... 0.4397273
#>         X2497     X2498     X2499     X2500
#> 28  0.3942973 0.3934600 0.3889097 0.3864646
#> 36  0.3016855 0.3077460 0.3116423 0.3036998
#> 136 0.3526042 0.3598545 0.3576893 0.3460697
#> 194 0.7552416 0.7525422 0.7513548 0.7567009
#> 215 0.4433315 0.4389216 0.4340123 0.4373087

# ... while analytical data can be accessed using `$`
oz$carbon
#>   [1] 0.63 0.69 1.44 0.33 1.12 1.14 0.72 1.88 0.30 1.60 0.97 1.16 1.84 1.52 1.44
#>  [16] 1.54 1.04 1.57 1.18 1.10 1.09 1.00 0.58 0.59 0.52 0.55 0.59 0.82 3.12 0.65
#>  [31] 0.73 0.67 1.08 0.79 0.55 0.38 1.31 0.87 1.81 0.87 1.11 3.13 0.49 1.10 1.17
#>  [46] 0.75 1.42 0.72 0.20 3.11 0.11 0.62 0.56 1.21 2.24 0.09 0.05 0.58 1.25 0.80
#>  [61] 1.25 1.09 1.29 1.51 0.98 3.15 1.46 2.10 4.39 2.60 6.44 3.06 3.80 3.94 3.25
#>  [76] 2.34 2.27 5.96 7.91 9.19 4.28 3.63 5.09 8.77 2.89 4.94 5.05 4.66 8.71 4.60
#>  [91] 2.13 3.44 3.73 3.00 3.43 9.53 7.27 7.38 3.02 4.45

Therefore, the train function can simply be used by populating its x and y arguments:

fit1 <- train(
  # The `spectra` function extract the spectra matrix
  x = spectra(oz_calib), 
  # analytical data can be extracted using `$`
  y = oz_calib$carbon,
  # Here we choose the PLS regression method
  method = "pls",
  # The train function will try 3 possible parameters for the PLS
  tuneLength = 3
)

Using spectroscopy-specific performance metrics

As part of the model tuning

The spectacles even provide a summary functions akin to those in caret, but that work better for spectroscopy. The spectroSummary function works like the defaultSummary function in caret, but adds indicators that are popular in spectroscopy, such as RPD, RPIQ, or CCC:

fit2 <- train(
    x = spectra(oz_calib),
    y = oz_calib$carbon,
    method = "pls",
    tuneLength = 10,
    trControl = trainControl(
      # Here we can specify the summary function used during parametrisation
      summaryFunction = spectroSummary
    ),
    # Here we can specifiy which metric to use to optimise the model parameters
    metric = "RPIQ"
)

The parametrisation of the resulting model can be plotted and inspected using the usual caret tools:

plot(fit2)

print(fit2)
#> Partial Least Squares 
#> 
#>   75 samples
#> 2151 predictors
#> 
#> No pre-processing
#> Resampling: Bootstrapped (25 reps) 
#> Summary of sample sizes: 75, 75, 75, 75, 75, 75, ... 
#> Resampling results across tuning parameters:
#> 
#>   ncomp  RMSE      Rsquared    RPD       RPIQ      CCC         Bias        
#>    1     2.270830  0.04109954  1.009247  1.040409  0.06355689   0.028842359
#>    2     2.112723  0.18565942  1.086623  1.122197  0.29007126   0.045418074
#>    3     1.925908  0.34933470  1.222142  1.264717  0.48486626  -0.001460203
#>    4     1.791195  0.45828944  1.301441  1.338876  0.60917078  -0.016411213
#>    5     1.506213  0.62339030  1.539812  1.586311  0.75342460   0.034489914
#>    6     1.397651  0.68215657  1.673905  1.717462  0.79435174   0.078286879
#>    7     1.309171  0.71306886  1.760404  1.802363  0.82416014  -0.006024811
#>    8     1.359387  0.69369083  1.714036  1.756475  0.81396694   0.035286883
#>    9     1.504208  0.65452690  1.553504  1.581330  0.78533423   0.042897974
#>   10     1.572661  0.63015038  1.492047  1.516139  0.76944793   0.020828935
#>   SE      
#>   2.313951
#>   2.152752
#>   1.962224
#>   1.825044
#>   1.534687
#>   1.424066
#>   1.333973
#>   1.385142
#>   1.532704
#>   1.602432
#> 
#> RPIQ was used to select the optimal model using the largest value.
#> The final value used for the model was ncomp = 7.

Different models can also be compared:

preds <- extractPrediction(
  # Here we specify the `caret` models we want to compare
  models = list(
    pls1 = fit1, 
    pls2 = fit2
  ), 
  testX = spectra(oz_valid), 
  testY = oz_valid$carbon
) 

# necessary so 2 PLS model can be compared in `plotObsVsPred`
preds$model <- preds$object

plotObsVsPred(preds)

The model fit2 outperforms the model fit1: hardly a surprise as we limited the latter to 3 latent variables, which is clearly too few in this instance.

During the assessment of the performance of the model

Finally, those specific performance indicators can also be used to asses the final results. PLS predictions can be generated using the predict function from the caret package, and its result passed to postResampleSpectro:

# Simple example for the entire dataset
postResampleSpectro(
  pred = predict(fit2, spectra(oz)), 
  obs = oz_valid$carbon
)
#>      RMSE  Rsquared       RPD      RPIQ       CCC      Bias        SE 
#> 2.4148046        NA 0.6614048 0.4182533 0.1667738 0.7262200 4.9291993

Again, this function mimics its caret equivalent, postResample.

A more useful thing to do, from a modelling standpoint, is to compare those performance results on the calibration, validation, and bootstrapped sets (especially the two latter ones):

# Run model predictions and extract performance statistics for 
# caliration and validation
res_calibration <- postResampleSpectro(pred = predict(fit2, spectra(oz_calib)), obs = oz_calib$carbon)
res_validation <- postResampleSpectro(pred = predict(fit2, spectra(oz_valid)), obs = oz_valid$carbon)

# Bootstrapped results can be extracted from the `train` object:
res_boot <- fit2$results %>% 
  filter(ncomp == fit2$bestTune$ncomp) %>% 
  select(names(res_calibration))

# Assemble the calibration, validation, and 
# bootstrapped results in a single data.frame
res <- rbind(
  data.frame(type = "Calibration", t(res_calibration)),
  data.frame(type = "Validation", t(res_validation)),
  data.frame(type = "Bootstrap", res_boot)
)

Which gives the following results:

type RMSE Rsquared RPD RPIQ CCC Bias SE
Calibration 1.09 0.78 2.15 2.09 0.88 0.00 1.10
Validation 1.21 0.55 1.32 0.83 0.73 0.24 1.24
Bootstrap 1.31 0.71 1.76 1.80 0.82 -0.01 1.33