literanger: A fast implementation of random forests for multiple imputation


by stephematician

literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001), or recursive partitioning, particularly suited to high-dimensional data (Wright and Ziegler, 2017). literanger redesigns the ranger interface to achieve faster prediction, and is now available as a backend for random forest imputation within ‘Multiple Imputation via Chained Equations’ (Van Buuren, 2007) in the R package mice.
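
For example, recent versions of mice can dispatch random forest imputation to literanger. The following is a minimal sketch; the rfPackage argument used here to select the backend is an assumption, so check ?mice.impute.rf for the argument supported by your version of mice:

require(mice)

# nhanes: a small incomplete dataset bundled with mice
# method = "rf" selects random forest imputation for each incomplete column;
# rfPackage = "literanger" is assumed here to select the literanger backend
imp <- mice(nhanes, method="rf", rfPackage="literanger",
            m=5, seed=1, printFlag=FALSE)
head(complete(imp, 1))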

Efficient serialization, i.e. reading and writing, of a trained random forest is provided via the cereal library (Grant and Voorhies, 2017).

Example

require(literanger)

# train/test split of the iris data
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test  <- iris[-train_idx, ]
# train a classification forest for Species
rf_iris <- train(data=iris_train, response_name="Species")
# 'bagged': the usual ensemble prediction aggregated over all trees
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
# 'inbag': a value drawn from the in-bag samples of a single random tree
pred_iris_inbag  <- predict(rf_iris, newdata=iris_test,
                            prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)

literanger supports reading and writing (serialization) of random forests. We can save the rf_iris object from above using:

write_literanger(rf_iris, "rf_iris.literanger")

In a new R session, we can read the random forest object in and predict for a new test set:

test_idx <- sample(nrow(iris), 1/3 * nrow(iris))
iris_test  <- iris[test_idx, ]
rf_iris_copy <- read_literanger("rf_iris.literanger")
table(iris_test$Species, predict(rf_iris_copy, newdata=iris_test)$values)

Installation

The latest release can be installed from CRAN via:

install.packages('literanger')

The development version can be installed using remotes:

remotes::install_gitlab('stephematician/literanger')

Technical details

A minor variation on mice's use of random forests (Doove et al., 2014) is available: each predicted value is drawn from the in-bag samples of a single, randomly selected tree. The computational effort of prediction is therefore constant with respect to the size of the forest (number of trees), in contrast to the original implementation in mice.
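
As an illustration only (a sketch, not the code used by mice), a single imputation of an artificially incomplete column could be drawn like so:

# make an artificially incomplete copy of iris
iris_mis <- iris
iris_mis$Sepal.Length[sample(nrow(iris), 20)] <- NA
obs <- !is.na(iris_mis$Sepal.Length)

# train on the observed rows only
rf_sl <- train(data=iris_mis[obs, ], response_name="Sepal.Length")

# one imputation: each missing value is drawn from the in-bag samples of a
# single randomly selected tree
pred_cols <- setdiff(names(iris_mis), "Sepal.Length")
iris_mis$Sepal.Length[!obs] <- predict(rf_sl,
                                       newdata=iris_mis[!obs, pred_cols],
                                       prediction_type="inbag")$values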

The interface of ranger was redesigned such that the trained forest object can be recycled between calls, and the data for training and prediction are passed without (unnecessary) copies; see ranger issue #304.
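
For example (an illustrative snippet only), the same trained object can serve repeated predictions on different data, as happens across the iterations of a chained-equations algorithm:

rf_iris <- train(data=iris, response_name="Species")

# the trained forest is reused across calls; only the prediction data changes
for (k in 1:3) {
    batch <- iris[sample(nrow(iris), 10), ]
    print(table(predict(rf_iris, newdata=batch,
                        prediction_type="bagged")$values))
}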

To-do

Non-exhaustive:

References

Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.

Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.

Grant, W. S., and Voorhies, R., 2017. cereal - A C++11 library for serialization. https://uscilab.github.io/cereal/.

Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.

Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), pp. 1-17. doi:10.18637/jss.v077.i01.