misl implements Multiple Imputation by Super
Learning, a flexible approach to handling missing data
described in:
Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238
Rather than relying on a single parametric imputation model,
misl builds a stacked ensemble of machine learning
algorithms for each incomplete column, producing well-calibrated
imputations for continuous, binary, and categorical variables.
We simulate a small dataset with three types of incomplete variables
to demonstrate misl across all supported outcome types.
library(misl)
set.seed(42)
n <- 300
# Continuous predictors (always observed)
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)
sim_data <- data.frame(
  age = age,
  bmi = bmi,
  # Continuous outcome with missingness, generated from age and bmi
  sbp = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Categorical outcome with missingness
  group = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                        prob = c(0.4, 0.35, 0.25)))
)
# Introduce missing values
sim_data[sample(n, 40), "sbp"] <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"] <- NA
# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
#>    age    bmi    sbp smoker  group
#>      0      0     40     30     30

The primary function is misl(). Supply a dataset and
specify:
- m – the number of multiply imputed datasets to create
- maxit – the number of Gibbs sampling iterations per dataset
- con_method, bin_method, cat_method – learners for each outcome type

Use list_learners() to see available options:
| learner | description | continuous | binomial | categorical | package | installed |
|---|---|---|---|---|---|---|
| glm | Linear / logistic regression | TRUE | TRUE | FALSE | stats | TRUE |
| mars | Multivariate adaptive regression splines | TRUE | TRUE | FALSE | earth | TRUE |
| multinom_reg | Multinomial regression | FALSE | FALSE | TRUE | nnet | TRUE |
| rand_forest | Random forest | TRUE | TRUE | TRUE | ranger | TRUE |
| boost_tree | Gradient boosted trees | TRUE | TRUE | TRUE | xgboost | TRUE |
Running misl() with a single learner per outcome type is
the fastest option and is well suited for exploratory work:
misl_imp <- misl(
sim_data,
m = 5,
maxit = 5,
con_method = "glm",
bin_method = "glm",
cat_method = "multinom_reg",
seed = 42,
quiet = TRUE
)

misl() returns a list of length m. Each element contains:

- $datasets – the fully imputed tibble
- $trace – a long-format tibble of convergence statistics

# Number of imputed datasets
length(misl_imp)
#> [1] 5
# Confirm no missing values remain
anyNA(misl_imp[[1]]$datasets)
#> [1] FALSE
# Compare observed vs imputed distribution for sbp
obs_mean <- mean(sim_data$sbp, na.rm = TRUE)
imp_mean <- mean(misl_imp[[1]]$datasets$sbp)
cat("Observed mean sbp (non-missing):", round(obs_mean, 2), "\n")
#> Observed mean sbp (non-missing): 147.31
cat("Imputed mean sbp (all values): ", round(imp_mean, 2), "\n")
#> Imputed mean sbp (all values): 147.65
# Check binary imputations are valid
table(misl_imp[[1]]$datasets$smoker)
#>
#> 0 1
#> 212 88
# Check categorical imputations stay within observed levels
levels(misl_imp[[1]]$datasets$group)
#> [1] "A" "B" "C"

When multiple learners are supplied, misl uses cross-validation to learn
optimal ensemble weights via the stacks package. Use cv_folds to reduce
the number of folds and speed up computation.
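A sketch of such an ensemble call, reusing the arguments from the single-learner example above (the value cv_folds = 3 is an arbitrary choice to keep computation light; see ?misl for the default):

```r
# Stacked ensemble: two learners per outcome type, fewer CV folds
misl_ensemble <- misl(
  sim_data,
  m = 5,
  maxit = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  cv_folds = 3,
  seed = 42,
  quiet = TRUE
)
```

Ensemble fitting is considerably slower than a single learner, so reserve it for the final analysis rather than exploratory runs.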
After imputation, fit your analysis model to each of the
m datasets and pool the results using Rubin’s rules. Here
we implement pooling manually using base R:
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
lm(sbp ~ age + bmi + smoker + group, data = imp$datasets)
})
# Pool point estimates and standard errors using Rubin's rules
m <- length(models)
ests <- sapply(models, function(fit) coef(fit))
vars <- sapply(models, function(fit) diag(vcov(fit)))
Q_bar <- rowMeans(ests) # pooled estimate
U_bar <- rowMeans(vars) # within-imputation variance
B <- apply(ests, 1, var) # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B # total variance
pooled <- data.frame(
term = names(Q_bar),
estimate = round(Q_bar, 4),
std.error = round(sqrt(T_total), 4),
conf.low = round(Q_bar - 1.96 * sqrt(T_total), 4),
conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)
#>                    term estimate std.error conf.low conf.high
#> (Intercept) (Intercept) 154.9631    5.1777 144.8148  165.1114
#> age                 age  -0.0421    0.0577  -0.1552    0.0711
#> bmi                 bmi  -0.1890    0.1699  -0.5219    0.1439
#> smoker           smoker  -0.7820    1.5082  -3.7381    2.1740
#> groupB           groupB  -0.1393    1.7862  -3.6402    3.3617
#> groupC           groupC  -0.9780    1.6066  -4.1269    2.1709

For a full implementation of Rubin’s rules including degrees of
freedom and p-values, the mice package provides
pool() and can be used directly with misl
output:
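A sketch, assuming mice is installed; as.mira() converts a plain list of fitted models into the mira object that pool() expects:

```r
library(mice)

# Fit the analysis model to each misl-imputed dataset
fits <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group, data = imp$datasets)
})

# Wrap the list of fits as a mira object and pool with Rubin's rules;
# summary() reports estimates, standard errors, df, and p-values
summary(pool(as.mira(fits)))
```

This yields the same pooled estimates as the manual calculation above, plus the Barnard–Rubin degrees-of-freedom adjustment.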
The $trace element records the mean and standard
deviation of imputed values at each iteration, which can be used to
assess convergence:
# Extract trace data across all m datasets
trace <- do.call(rbind, lapply(misl_imp, function(r) r$trace))
# Plot mean of imputed sbp values across iterations for each dataset
sbp_trace <- subset(trace, variable == "sbp" & statistic == "mean")
plot(
sbp_trace$iteration,
sbp_trace$value,
col = sbp_trace$m,
pch = 16,
xlab = "Iteration",
ylab = "Mean imputed sbp",
main = "Trace plot: mean of imputed sbp values",
xaxt = "n"
)
axis(1, at = 1:5)
legend("topright", legend = paste("m =", 1:5),
       col = 1:5, pch = 16, cex = 0.8)

Stable traces that mix well across datasets indicate the algorithm has converged.
The m imputed datasets are generated independently and
can be run in parallel using the future framework. Set a
parallel plan before calling misl():
library(future)
# Use all available cores
plan(multisession)
misl_parallel <- misl(
sim_data,
m = 10,
maxit = 5,
con_method = c("glm", "rand_forest"),
bin_method = c("glm", "rand_forest"),
cat_method = c("multinom_reg", "rand_forest"),
seed = 42
)
# Always reset the plan when done
plan(sequential)

To limit the number of cores, pass a workers argument to plan().
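For example, a minimal sketch restricting the plan to a fixed number of workers (the choice of 2 is arbitrary):

```r
library(future)

# Restrict the multisession plan to 2 parallel workers
plan(multisession, workers = 2)
```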
The largest speedup comes from running the m datasets
simultaneously. On a machine with 10 cores running m = 10,
all 10 datasets can be imputed in parallel. You can check how many cores
are available with parallel::detectCores().
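A quick check using base R's parallel package (future::availableCores() is an alternative that also honours container and scheduler limits):

```r
# Number of logical cores detected by base R's parallel package
parallel::detectCores()
```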
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.7.4
#>
#> Matrix products: default
#> BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] future_1.70.0 misl_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.6 xfun_0.54 bslib_0.9.0
#> [4] ggplot2_4.0.2 recipes_1.3.1 lattice_0.22-7
#> [7] vctrs_0.7.2 tools_4.5.2 generics_0.1.4
#> [10] parallel_4.5.2 tibble_3.3.1 pkgconfig_2.0.3
#> [13] Matrix_1.7-4 data.table_1.18.2.1 RColorBrewer_1.1-3
#> [16] S7_0.2.1 lifecycle_1.0.5 compiler_4.5.2
#> [19] farver_2.1.2 codetools_0.2-20 htmltools_0.5.8.1
#> [22] class_7.3-23 sass_0.4.10 yaml_2.3.10
#> [25] prodlim_2026.03.11 Formula_1.2-5 pillar_1.11.1
#> [28] jquerylib_0.1.4 tidyr_1.3.2 MASS_7.3-65
#> [31] cachem_1.1.0 gower_1.0.2 plotmo_3.7.0
#> [34] rpart_4.1.24 earth_5.3.5 parallelly_1.46.1
#> [37] lava_1.8.2 tidyselect_1.2.1 digest_0.6.39
#> [40] dplyr_1.2.0 purrr_1.2.1 listenv_0.10.1
#> [43] splines_4.5.2 fastmap_1.2.0 parsnip_1.4.1
#> [46] grid_4.5.2 cli_3.6.5 magrittr_2.0.4
#> [49] survival_3.8-3 future.apply_1.20.2 withr_3.0.2
#> [52] scales_1.4.0 plotrix_3.8-14 xgboost_3.2.1.1
#> [55] lubridate_1.9.5 timechange_0.4.0 rmarkdown_2.30
#> [58] globals_0.19.1 nnet_7.3-20 timeDate_4052.112
#> [61] ranger_0.18.0 workflows_1.3.0 evaluate_1.0.5
#> [64] knitr_1.50 hardhat_1.4.2 rlang_1.1.7
#> [67] Rcpp_1.1.1 glue_1.8.0 sparsevctrs_0.3.6
#> [70] ipred_0.9-15 rstudioapi_0.17.1 jsonlite_2.0.0
#> [73] R6_2.6.1