The iml package can now handle bigger datasets. Earlier problems with exploding memory have been fixed for FeatureEffect, FeatureImp and Interaction. It’s also possible now to compute FeatureImp and Interaction in parallel. This document describes how.
First we load some data, fit a random forest and create a Predictor object.
set.seed(42)
library("iml")
library("randomForest")
data("Boston", package = "MASS")
# randomForest() names this argument ntree; n.trees would be silently ignored
rf <- randomForest(medv ~ ., data = Boston, ntree = 10)
X <- Boston[which(names(Boston) != "medv")]
predictor <- Predictor$new(rf, data = X, y = Boston$medv)
Parallelization is supported via the {future} package. All you need to do is to choose a parallel backend via future::plan().
library("future")
library("future.callr")
#> Warning: package 'future.callr' was built under R version 4.3.3
# Runs futures in 2 background R sessions started via callr
plan("callr", workers = 2)
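Before running anything in parallel, it can help to check how many workers the active plan actually provides. The following is a minimal sketch, assuming the future and future.callr packages are installed; the worker count of 2 simply mirrors the setting above and is not a recommendation.

```r
library("future")
library("future.callr")

# Number of cores the future framework considers usable on this machine
availableCores()

# Same backend and worker count as above
plan("callr", workers = 2)

# Number of workers provided by the currently active plan
nbrOfWorkers()
```

If nbrOfWorkers() reports fewer workers than expected, the plan was probably reset elsewhere, for example by a later call to plan(sequential).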
Now we can easily compute feature importance in parallel. This means that the computation per feature is distributed among the 2 workers specified earlier.
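As a rough illustration of what this looks like in practice, the sketch below fits the model, selects the callr backend and computes the importance in one self-contained script. It assumes that iml, randomForest, future, future.callr and MASS are installed; the object name imp is chosen here for illustration only.

```r
library("iml")
library("randomForest")
library("future")
library("future.callr")

set.seed(42)
data("Boston", package = "MASS")

# Small random forest on the Boston housing data
rf <- randomForest(medv ~ ., data = Boston, ntree = 10)
X <- Boston[which(names(Boston) != "medv")]
predictor <- Predictor$new(rf, data = X, y = Boston$medv)

# Distribute the per-feature computation across 2 callr workers
plan("callr", workers = 2)
imp <- FeatureImp$new(predictor, loss = "mae")

# One row per feature, with the permutation-based importance
head(imp$results)
plot(imp)

# Return to sequential processing when done
plan(sequential)
```

The call to FeatureImp$new() is the same as in the sequential case; only the plan selected beforehand decides where the work runs.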
That wasn’t very impressive. Let’s see how much speedup we actually get from parallelization.
bench::system_time({
  plan(sequential)
  FeatureImp$new(predictor, loss = "mae")
})
#> Warning: package 'processx' was built under R version 4.3.2
#> Warning: package 'lattice' was built under R version 4.3.2
#> Warning: package 'callr' was built under R version 4.3.2
#> Warning: package 'ps' was built under R version 4.3.2
#> Warning: package 'rpart' was built under R version 4.3.2
#> Warning: package 'patchwork' was built under R version 4.3.2
#> Warning: package 'survival' was built under R version 4.3.2
#> Warning: package 'Rcpp' was built under R version 4.3.2
#> process real
#> 1.94s 3.31s
bench::system_time({
  plan("callr", workers = 2)
  FeatureImp$new(predictor, loss = "mae")
})
#> process real
#> 125ms 6.78s
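No improvement in this run: the process time drops because the work moves into the worker processes, but the wall-clock (real) time rises from 3.31s to 6.78s, most likely because starting the workers and transferring data to them costs more than such a small task saves. Parallelization is more useful when the model uses a lot of features or when the feature importance computation is repeated more often to get more stable results.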
bench::system_time({
  plan(sequential)
  FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#> process real
#> 2.94s 5.49s
bench::system_time({
  plan("callr", workers = 2)
  FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#> process real
#> 296.88ms 6.79s
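Here the gap narrows. The parallel run takes about as long as with a single repetition (6.79s of real time), while the sequential run slows down from 3.31s to 5.49s as the work grows. On this small dataset the parallel version is still not faster in wall-clock terms, but its fixed overhead is spread over more work.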
The parallelization also speeds up the computation of the interaction statistics: