by stephematician
literanger is an adaptation of the ranger R package for training and predicting from random forest models within multiple imputation algorithms. ranger is a fast implementation of random forests (Breiman, 2001) or recursive partitioning, particularly suited for high dimensional data (Wright and Ziegler, 2017). literanger redesigns the ranger interface to achieve faster prediction, and is now available as a backend for random forests within ‘Multiple Imputation via Chained Equations’ (Van Buuren, 2007) in the R package mice.
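As a rough sketch of what using literanger as the random forest backend in mice might look like: the call below assumes an installed mice version whose "rf" imputation method accepts rfPackage="literanger", and uses the nhanes example data shipped with mice. The values of m, maxit and seed are arbitrary choices for illustration.

```r
library(mice)

# Assumption: this mice version forwards rfPackage to its "rf" method and
# accepts "literanger" as a value.
imp <- mice(nhanes, method="rf", rfPackage="literanger",
            m=2, maxit=2, seed=1, printFlag=FALSE)

# Extract the first completed data set
nhanes_completed <- complete(imp, 1)
head(nhanes_completed)
```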
Efficient serialization, i.e. reading and writing, of a trained random forest is provided via the cereal library.
require(literanger)
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test <- iris[-train_idx, ]
rf_iris <- train(data=iris_train, response_name="Species")
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")
pred_iris_inbag <- predict(rf_iris, newdata=iris_test,
                           prediction_type="inbag")
# compare bagged vs actual test values
table(iris_test$Species, pred_iris_bagged$values)
# compare bagged prediction vs in-bag draw
table(pred_iris_bagged$values, pred_iris_inbag$values)
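The confusion tables above can be condensed into a single number. The following is a minimal sketch, assuming the same calls as in the example; the seed and the name accuracy are illustrative choices and not part of the package.

```r
library(literanger)

set.seed(42)  # arbitrary seed so that the train/test split is repeatable
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_train <- iris[ train_idx, ]
iris_test <- iris[-train_idx, ]

rf_iris <- train(data=iris_train, response_name="Species")
pred_iris_bagged <- predict(rf_iris, newdata=iris_test,
                            prediction_type="bagged")

# Proportion of test cases where the bagged prediction matches the actual
# species; compared as character to avoid relying on matching factor levels
accuracy <- mean(as.character(pred_iris_bagged$values) ==
                     as.character(iris_test$Species))
accuracy
```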
literanger supports reading and writing random forests (serialization). We can save rf_iris above using the function call:
write_literanger(rf_iris, "rf_iris.literanger")
In a new R session, we can read the random forest object in and predict for a new test set:
require(literanger)
test_idx <- sample(nrow(iris), 1/3 * nrow(iris))
iris_test <- iris[test_idx, ]
rf_iris_copy <- read_literanger("rf_iris.literanger")
table(iris_test$Species, predict(rf_iris_copy, newdata=iris_test)$values)
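To check a write/read round trip within a single session, one option is to write to a temporary file and compare predictions from the original and the restored forest. This is a sketch only: the temporary path, the seeds and the name agreement are illustrative, and the seed is reset before each predict call on the assumption that any random tie-breaking follows R's random number generator.

```r
library(literanger)

set.seed(7)
train_idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris_test <- iris[-train_idx, ]
rf_iris <- train(data=iris[train_idx, ], response_name="Species")

# Serialize to a temporary file and read the forest back in
path <- tempfile(fileext=".literanger")
write_literanger(rf_iris, path)
rf_iris_copy <- read_literanger(path)

# Predict with both forests, using the same seed for each call
set.seed(11)
pred_original <- predict(rf_iris, newdata=iris_test,
                         prediction_type="bagged")
set.seed(11)
pred_copy <- predict(rf_iris_copy, newdata=iris_test,
                     prediction_type="bagged")

# Proportion of test cases where the two forests give the same prediction
agreement <- mean(as.character(pred_original$values) ==
                      as.character(pred_copy$values))
agreement

unlink(path)  # remove the temporary file
```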
The release can be installed via:
install.packages('literanger')
The development version can be installed using remotes:
remotes::install_gitlab('stephematician/literanger')
A minor variation on mice’s use of random forests is available: each prediction is drawn from the in-bag samples of a single randomly chosen tree. The computational effort of a prediction is therefore constant with respect to the size of the forest (number of trees), in contrast to the original implementation in mice.
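The idea can be illustrated with a toy forest in base R. This is a conceptual sketch, not literanger's implementation: the forest is a list of hypothetical one-split "stumps" grown on bootstrap samples, and all names (grow_stump, node_values, predict_bagged, predict_inbag) are invented for the illustration.

```r
set.seed(1)

# Toy data: one predictor and a noisy step-function response
n <- 200
x <- runif(n)
y <- ifelse(x < 0.5, 0, 1) + rnorm(n, sd=0.1)

# A toy "tree": one split at a random threshold on a bootstrap (in-bag)
# sample; each terminal node stores the in-bag responses that fell into it
grow_stump <- function(x, y) {
    in_bag <- sample(length(y), replace=TRUE)
    threshold <- runif(1, 0.2, 0.8)
    list(threshold=threshold,
         left=y[in_bag][x[in_bag] < threshold],
         right=y[in_bag][x[in_bag] >= threshold])
}
forest <- replicate(50, grow_stump(x, y), simplify=FALSE)

# In-bag responses in the terminal node that a new case falls into
node_values <- function(tree, x_new)
    if (x_new < tree$threshold) tree$left else tree$right

# Bagged prediction: visits every tree, so the cost grows with forest size
predict_bagged <- function(forest, x_new)
    mean(vapply(forest, function(tree) mean(node_values(tree, x_new)),
                numeric(1)))

# In-bag draw: one random tree, then one random in-bag response from its
# terminal node, so the cost does not depend on the number of trees
predict_inbag <- function(forest, x_new) {
    tree <- forest[[sample.int(length(forest), 1)]]
    values <- node_values(tree, x_new)
    values[sample.int(length(values), 1)]
}

predict_bagged(forest, 0.9)
predict_inbag(forest, 0.9)
```

Repeated calls to predict_inbag give different values for the same case, which is the kind of variability a multiple imputation algorithm needs from its draws.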
The interface of ranger was redesigned such that the trained forest object can be recycled, and the data for training and prediction are passed without (unnecessary) copies; see ranger issue #304.
Non-exhaustive:
Breiman, L., 2001. Random forests. Machine Learning, 45, pp. 5-32. doi:10.1023/A:1010933404324.
Doove, L.L., Van Buuren, S. and Dusseldorp, E., 2014. Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis, 72, pp. 92-104. doi:10.1016/j.csda.2013.10.025.
Grant, W. S., and Voorhies, R., 2017. cereal - A C++11 library for serialization. https://uscilab.github.io/cereal/.
Van Buuren, S., 2007. Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16(3), pp. 219-242. doi:10.1177/0962280206074463.
Wright, M. N. and Ziegler, A., 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(i01), pp. 1-17. doi:10.18637/jss.v077.i01.