In this tutorial, we assume that you successfully ran an MCMC on a network model, and that it is now time to have a critical look at the output of the run.
This tutorial will present how to check that the MCMC run went fine, but it will explain only very briefly how to check that the model is compatible with the observed data. More details about this important step is presented in the next vignette about posterior predictive checks.
This vignette is using the MCMC run from the Quick start tutorial. Please go back and run the code of that vignette to generate the MCMC data if you haven’t already!
# Load the tidyverse package for easier data manipulation
library(tidyverse)
run_mcmc()
?Note: This short section is a quick reminder of what you learnt in the previous tutorial MCMC output format).
In the Quick Start tutorial, we generated the
run
object by running
run <- run_mcmc(m, iter = 1000)
. The output from
run_mcmc()
is a well-behaved mcmc.list
object
as implemented in the coda
package:
is(run, "mcmc.list")
## [1] TRUE
It makes it very easy to use the tools already available for this
class of objects, such as those implemented in the packages
coda
, bayesplot
, ggmcmc
or
MCMCvis
.
In addition to all those existing tools, the isotracer
package also adds some extra methods to easily calculate derived
parameters from an mcmc.list
. You will learn more about how
to do this in the vignette Derived parameters.
Tip: isotracer
uses Stan behind the scenes to run
the MCMC. If you prefer to get the raw stanfit
object
instead of the processed mcmc.list
, you can set
stanfit = TRUE
when you run the model:
<- run_mcmc(model = m, stanfit = TRUE) run
This might be especially useful if the MCMC sampling is difficult for
your model and you need the stanfit
object to perform some
in-depth diagnostics.
You should always run several chains when performing a Bayesian MCMC. Trace plots allow to get a feeling for:
If you are not satisfied with the traces, you need to run a longer run or maybe to tweak the settings of the Stan run.
This is the trace plot we obtained from the previous vignette:
plot(run)
# Note: the figure below only shows a few of the traceplots for vignette concision
In this case, the chains have converged towards the same region, and the mixing looks good for all of them. There is no obvious problem visible in those traces.
It can be useful to have a more formal test of the convergence of the
chains. The Gelman and Rubin’s convergence diagnostic is implemented in
the coda
package. We can run it with the
coda::gelman.diag()
function. See the coda documentation
?gelman.diag
for more details about this diagnostic.
Let’s have a look at the diagnostic for our chains:
library(coda)
%>% gelman.diag() run
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## eta 1.00 1.00
## lambda_algae 1.01 1.02
## lambda_daphnia 1.01 1.02
## lambda_NH4 1.01 1.02
## upsilon_algae_to_daphnia 1.02 1.03
## upsilon_daphnia_to_NH4 1.00 1.00
## upsilon_NH4_to_algae 1.01 1.02
## zeta 1.00 1.01
##
## Multivariate psrf
##
## 1.02
The diagnostic values should be very, very close to 1: it looks good in this case!
If some values were e.g. \(>1.05\), this would already be enough to cause us to wonder about the sampling quality.
In order to check the quality of the model fit, the consistency between the parameter posteriors and the observed data can be checked by plotting the credible envelopes for the estimated trajectories along with the observed data points. This is called a posterior predictive check and is very important to check that the model can actually predict the observed data reasonably well. If the observed data cannot be satisfactorily predicted from the fitted model, then our model is not a good model of the data!
To do a posterior predictive check, the first step is to generate
predictions for the model based on the MCMC posteriors with
predict()
:
# From the Quick Start tutorial:
# 'm' is the network model we used when calling 'run <- run_mcmc(m, iter = 1000)'
<- predict(m, run, probs = 0.95) predictions
We can then visualize the predictions along with the observations
with the plot()
method:
plot(predictions)
This plot enables to compare both the size and the proportion observations with the predictions.
The quickest way to get parameter estimates is to use the
summary()
function on the posterior:
%>% summary() run
##
## Iterations = 501:1000
## Thinning interval = 1
## Number of chains = 4
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## eta 0.12751 0.049964 0.0011172 0.0019548
## lambda_algae 0.10723 0.067599 0.0015116 0.0021817
## lambda_daphnia 0.03647 0.042346 0.0009469 0.0015642
## lambda_NH4 0.09269 0.068981 0.0015425 0.0027648
## upsilon_algae_to_daphnia 0.07785 0.023643 0.0005287 0.0009433
## upsilon_daphnia_to_NH4 0.04894 0.007169 0.0001603 0.0002337
## upsilon_NH4_to_algae 0.34347 0.045724 0.0010224 0.0017310
## zeta 0.43612 0.237678 0.0053146 0.0099599
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## eta 0.0654264 0.094176 0.11691 0.14817 0.25969
## lambda_algae 0.0085454 0.057541 0.09705 0.14604 0.26499
## lambda_daphnia 0.0007375 0.009573 0.02209 0.04620 0.15846
## lambda_NH4 0.0034180 0.041621 0.07744 0.13014 0.26348
## upsilon_algae_to_daphnia 0.0451278 0.062348 0.07370 0.08860 0.13730
## upsilon_daphnia_to_NH4 0.0361709 0.044109 0.04849 0.05325 0.06439
## upsilon_NH4_to_algae 0.2609275 0.314829 0.34115 0.37062 0.43671
## zeta 0.1844365 0.286386 0.36440 0.51207 1.09344
If you need to store those values, for example to plot them, you can
assign the output of summary()
to an object:
<- run %>% summary()
estimates names(estimates)
## [1] "statistics" "quantiles" "start" "end" "thin" "nchain"
The means and standard deviations are accessible in
$statistics
:
$statistics estimates
## Mean SD Naive SE Time-series SE
## eta 0.12751257 0.049963905 0.0011172269 0.0019547723
## lambda_algae 0.10723050 0.067599025 0.0015115601 0.0021817165
## lambda_daphnia 0.03647363 0.042345690 0.0009468784 0.0015642272
## lambda_NH4 0.09269130 0.068980569 0.0015424524 0.0027648434
## upsilon_algae_to_daphnia 0.07784906 0.023642981 0.0005286731 0.0009432531
## upsilon_daphnia_to_NH4 0.04893872 0.007169036 0.0001603045 0.0002336529
## upsilon_NH4_to_algae 0.34347224 0.045723698 0.0010224130 0.0017310105
## zeta 0.43611523 0.237677637 0.0053146335 0.0099599011
and the quantiles are in $quantiles
:
$quantiles estimates
## 2.5% 25% 50% 75% 97.5%
## eta 0.0654264365 0.094176064 0.11691344 0.14816691 0.25969422
## lambda_algae 0.0085454389 0.057540842 0.09704578 0.14603524 0.26498951
## lambda_daphnia 0.0007375003 0.009573074 0.02208908 0.04620389 0.15846417
## lambda_NH4 0.0034180067 0.041620929 0.07743919 0.13013928 0.26347716
## upsilon_algae_to_daphnia 0.0451277873 0.062348037 0.07370339 0.08859513 0.13729757
## upsilon_daphnia_to_NH4 0.0361709218 0.044109217 0.04849449 0.05324976 0.06439118
## upsilon_NH4_to_algae 0.2609274862 0.314829360 0.34115256 0.37062017 0.43670569
## zeta 0.1844365414 0.286385629 0.36440171 0.51207060 1.09343843
The dependencies between your model parameters might be of interest to you. If you would like to analyse the correlations between parameters during the MCMC run, you can use a few ready-made functions to get a quick overview of the correlation structure.
The isotracer
package comes with the minimalist function
mcmc_heatmap()
to draw the strength of parameter
correlations:
mcmc_heatmap(run)
But of course you could use other functions provided by other packages, such as:
ggmcmc
package
ggs_crosscorrelation()
ggs_pairs()
bayesplot
package
mcmc_pairs()
If you are interested in getting detailed tables containing all the
samples of the parameter posteriors, you can use the
tidy_mcmc()
function:
tidy_mcmc(run)
## # A tibble: 2,000 × 3
## mcmc.chain mcmc.iteration mcmc.parameters
## <int> <int> <list>
## 1 1 1 <dbl [8]>
## 2 1 2 <dbl [8]>
## 3 1 3 <dbl [8]>
## 4 1 4 <dbl [8]>
## 5 1 5 <dbl [8]>
## 6 1 6 <dbl [8]>
## 7 1 7 <dbl [8]>
## 8 1 8 <dbl [8]>
## 9 1 9 <dbl [8]>
## 10 1 10 <dbl [8]>
## # ℹ 1,990 more rows
By default the parameter values are nested into a list column, but
you can also get a flat table with spread = TRUE
:
tidy_mcmc(run, spread = TRUE)
## # A tibble: 2,000 × 10
## mcmc.chain mcmc.iteration eta lambda_algae lambda_daphnia lambda_NH4
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 0.113 0.0383 0.00535 0.0704
## 2 1 2 0.242 0.0941 0.0627 0.0302
## 3 1 3 0.130 0.140 0.0291 0.164
## 4 1 4 0.0921 0.203 0.0176 0.278
## 5 1 5 0.0901 0.299 0.0170 0.252
## 6 1 6 0.103 0.142 0.0596 0.156
## 7 1 7 0.113 0.160 0.0202 0.0660
## 8 1 8 0.0978 0.0302 0.0200 0.0110
## 9 1 9 0.142 0.0859 0.0325 0.0658
## 10 1 10 0.117 0.0398 0.0456 0.0897
## # ℹ 1,990 more rows
## # ℹ 4 more variables: upsilon_algae_to_daphnia <dbl>, upsilon_daphnia_to_NH4 <dbl>,
## # upsilon_NH4_to_algae <dbl>, zeta <dbl>
The above table only contains the primary parameters. If you are
interested in getting the predicted trajectories for individual MCMC
samples, you can use the tidy_trajectories()
function:
# We have to also provide the original network model `m`
<- tidy_trajectories(m, run, n = 200)
tt tt
## # A tibble: 200 × 4
## mcmc.chain mcmc.iteration mcmc.parameters trajectories
## <int> <int> <list> <list>
## 1 4 28 <dbl [8]> <tibble [1 × 5]>
## 2 2 87 <dbl [8]> <tibble [1 × 5]>
## 3 2 319 <dbl [8]> <tibble [1 × 5]>
## 4 4 295 <dbl [8]> <tibble [1 × 5]>
## 5 1 71 <dbl [8]> <tibble [1 × 5]>
## 6 2 184 <dbl [8]> <tibble [1 × 5]>
## 7 1 371 <dbl [8]> <tibble [1 × 5]>
## 8 2 257 <dbl [8]> <tibble [1 × 5]>
## 9 2 198 <dbl [8]> <tibble [1 × 5]>
## 10 1 307 <dbl [8]> <tibble [1 × 5]>
## # ℹ 190 more rows
As you can see, the tt
object is a tidy table which
contains the parameter values and the corresponding trajectories
calculated for 200 randomly selected MCMC samples. The calculated
trajectories are stored in the trajectories
column and
provide the quantities of unmarked and marked tracer (e.g. light and
heavy isotope) for each compartment at each time step:
$trajectories[[1]] tt
## # A tibble: 1 × 5
## timepoints unmarked marked sizes proportions
## <list> <list> <list> <list> <list>
## 1 <dbl [260]> <dbl [260 × 3]> <dbl [260 × 3]> <dbl [260 × 3]> <dbl [260 × 3]>
Because each trajectory is itself a table containing a time series
for each compartment, the output of tidy_trajectories()
has
several levels of nesting. This makes it a bit cumbersome to manipulate.
Note that the output format of this function might change in the
future.
Here is an example of what can be done using the predicted trajectories:
<- tt %>%
algae mutate(prop_algae = map(trajectories, function(tr) {
"proportions"]][[1]][, "algae"]
tr[[%>%
})) pull(prop_algae) %>%
do.call(rbind, .)
<- tt$trajectories[[1]]$timepoints[[1]]
time plot(0, type = "n", xlim = range(time), ylim = range(algae), las = 1,
xlab = "Time", ylab = "Proportion of marked tracer (algae)",
main = "Posterior sample of trajectories (for 15N prop. in algae)")
invisible(sapply(seq_len(nrow(algae)), function(i) {
lines(time, algae[i,], col = adjustcolor("seagreen3", alpha.f = 0.2))
}))
Finally, if what you are interested in are not the trajectories per
se but the actual flows of nutrient during the experiment, you can use
the tidy_flows()
function to extract flows in a similar
way:
#' Again, note that we also provide the original network model `m`
<- tidy_flows(m, run, n = 200)
tf tf
## # A tibble: 200 × 4
## mcmc.chain mcmc.iteration mcmc.parameters flows
## * <int> <int> <list> <list>
## 1 1 372 <dbl [8]> <gropd_df [6 × 3]>
## 2 2 215 <dbl [8]> <gropd_df [6 × 3]>
## 3 4 226 <dbl [8]> <gropd_df [6 × 3]>
## 4 1 172 <dbl [8]> <gropd_df [6 × 3]>
## 5 1 222 <dbl [8]> <gropd_df [6 × 3]>
## 6 2 221 <dbl [8]> <gropd_df [6 × 3]>
## 7 3 310 <dbl [8]> <gropd_df [6 × 3]>
## 8 2 205 <dbl [8]> <gropd_df [6 × 3]>
## 9 2 34 <dbl [8]> <gropd_df [6 × 3]>
## 10 2 110 <dbl [8]> <gropd_df [6 × 3]>
## # ℹ 190 more rows
The returned object is very similar to the output of
tidy_trajectories()
, except that the
trajectories
column is replaced by a flows
column:
$flows[[1]] tf
## # A tibble: 6 × 3
## # Groups: from [3]
## from to average_flow
## <chr> <chr> <dbl>
## 1 NH4 algae 0.0747
## 2 NH4 <NA> 0.0172
## 3 algae daphnia 0.0765
## 4 algae <NA> 0.0322
## 5 daphnia NH4 0.0884
## 6 daphnia <NA> 0.00267
The average flow values are given in flow per unit of time. See
?tidy_flows()
for more details, including the possibility
of calculating steady state flows for network systems that admits a
steady state equilibrium.
The tidy_trajectories()
and tidy_flows()
functions are especially useful when you want to do some calculations
related to some specific properties of the trajectories or of the
nutrient flows over the whole MCMC posterior.