Parallel computation of interpretation methods

The iml package can now handle larger datasets. Earlier problems with excessive memory use have been fixed for FeatureEffect, FeatureImp and Interaction. FeatureImp and Interaction can now also be computed in parallel. This document describes how.

First we load some data, fit a random forest and create a Predictor object.

set.seed(42)
library("iml")
library("randomForest")
data("Boston", package = "MASS")
# randomForest() names this argument `ntree`; `n.trees` would be silently ignored
rf <- randomForest(medv ~ ., data = Boston, ntree = 10)
X <- Boston[which(names(Boston) != "medv")]
predictor <- Predictor$new(rf, data = X, y = Boston$medv)

Going parallel

Parallelization is supported via the {future} package. All you need to do is to choose a parallel backend via future::plan().

library("future")
library("future.callr")
# Resolves futures in 2 background R sessions started via callr
plan("callr", workers = 2)
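Before choosing the number of workers, it can help to check how many cores are available. The following is a minimal sketch, not part of the original setup: it assumes the `multisession` backend that ships with {future} instead of callr, and the names `n_cores`, `n_workers` and `old_plan` are made up for illustration.

```r
library("future")

# Number of cores the future framework considers available on this machine
n_cores <- availableCores()

# Use at most 2 workers, but never more than the machine offers
n_workers <- min(2L, n_cores)

# multisession is an alternative backend (assumption: used here instead of callr)
# plan() returns the previously active plan, which we keep for later
old_plan <- plan(multisession, workers = n_workers)

# Number of workers of the plan that is active now
nbrOfWorkers()

# Restore the plan that was active before
plan(old_plan)
```

Storing the previous plan and restoring it afterwards keeps the backend choice from leaking into later code.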

Now we can easily compute feature importance in parallel. The computation per feature is distributed among the 2 workers specified above.

imp <- FeatureImp$new(predictor, loss = "mae")
library("ggplot2")
plot(imp)
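To make "one computation per feature" more concrete, here is a small sketch of how independent per-feature tasks can be handed to workers with plain futures. It is only an illustration of the idea and not how iml implements it internally. The `multisession` backend and the placeholder task (returning the feature name and the worker's process id) are assumptions made for this example.

```r
library("future")
plan(multisession, workers = 2)

# A few feature names from the Boston data, used only as labels here
features <- c("crim", "zn", "indus", "chas")

# Create one future per feature; each is resolved on one of the workers
futures <- lapply(features, function(feature) {
  future(list(feature = feature, pid = Sys.getpid()))
})

# Collect the results; value() blocks until the future is resolved
results <- lapply(futures, value)

# Process ids of the R sessions that did the work
worker_pids <- unique(vapply(results, function(r) r$pid, integer(1)))
worker_pids

# Switch back to sequential processing
plan(sequential)
```

In a real importance computation, the body of each future would hold the permutation and prediction step for that feature.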

That wasn’t very impressive. Let’s see how much speedup we actually get from parallelization.

bench::system_time({
  plan(sequential)
  FeatureImp$new(predictor, loss = "mae")
})
#> process    real 
#>   1.94s   3.31s
bench::system_time({
  plan("callr", workers = 2)
  FeatureImp$new(predictor, loss = "mae")
})
#> process    real 
#>   125ms   6.78s

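In this run, the parallel version was actually slower in real (wall-clock) time: 6.78s versus 3.31s. The process time of the main session dropped because the work moved to the workers, but starting them and transferring data adds overhead that a small task like this does not recoup. Parallelization is more useful when the model uses many features or when the feature importance computation is repeated more often to get more stable results.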
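Why repetitions cost time and stabilize the result can be seen in a small base R sketch of permutation importance. This is an illustration under simplifying assumptions, not iml's implementation: it uses a linear model on `mtcars`, and the helper names `mae` and `permutation_importance` are made up for this example. Importance is reported as the ratio of the error after shuffling a feature to the original error.

```r
set.seed(1)

# A simple model on a built-in dataset (assumption: stands in for the random forest)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Mean absolute error
mae <- function(actual, predicted) mean(abs(actual - predicted))
base_error <- mae(mtcars$mpg, predict(fit, mtcars))

# Shuffle one feature n_rep times and compare the error to the baseline
permutation_importance <- function(feature, n_rep) {
  ratios <- replicate(n_rep, {
    shuffled <- mtcars
    shuffled[[feature]] <- sample(shuffled[[feature]])
    mae(mtcars$mpg, predict(fit, shuffled)) / base_error
  })
  c(mean = mean(ratios), sd = sd(ratios))
}

# One column per feature, rows hold mean and standard deviation over repetitions
imp_sketch <- sapply(c("wt", "hp", "qsec"), permutation_importance, n_rep = 10)
imp_sketch
```

Each repetition needs a fresh round of predictions, so the work grows with both the number of features and the number of repetitions, and every feature can be processed independently of the others.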

bench::system_time({
  plan(sequential)
  FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#> process    real 
#>   2.94s   5.49s

bench::system_time({
  plan("callr", workers = 2)
  FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#>  process     real 
#> 296.88ms    6.79s

With 10 repetitions, the parallel computation took 6.79s of real time versus 5.49s for the sequential one in this run, so on this small dataset the overhead of the workers outweighs the gain.

Interaction

The computation of the interaction statistics can be parallelized in the same way:

bench::system_time({
  plan(sequential)
  Interaction$new(predictor, grid.size = 15)
})
#> process    real 
#>   5.22s    7.8s
bench::system_time({
  plan("callr", workers = 2)
  Interaction$new(predictor, grid.size = 15)
})
#>  process     real 
#> 265.62ms    9.47s

Feature Effects

FeatureEffects can be computed in parallel in the same way:

bench::system_time({
  plan(sequential)
  FeatureEffects$new(predictor)
})
#>  process     real 
#> 765.62ms    1.25s
bench::system_time({
  plan("callr", workers = 2)
  FeatureEffects$new(predictor)
})
#>  process     real 
#> 984.38ms    9.65s
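The timed blocks in this document call plan() inside the measurement, so switching the backend is part of the reported time. A possible way to look at the dispatch cost on its own is to set the plan first and time only the work. The sketch below assumes the `multisession` backend, which starts its workers when the plan is set, and uses a trivial placeholder task. With other backends the startup cost may fall elsewhere.

```r
library("future")

# Set up the backend before timing, so switching plans is not measured
plan(multisession, workers = 2)

# Time only the creation and collection of four small futures
elapsed_parallel <- system.time({
  fs <- lapply(1:4, function(i) future(sqrt(i)))
  vals <- vapply(fs, value, numeric(1))
})[["elapsed"]]

# Switch back to sequential processing
plan(sequential)

elapsed_parallel
```

For tasks this small, the elapsed time mostly reflects communication with the workers rather than computation.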