Post-run diagnostics and analyses

2024-05-14

In this tutorial, we assume that you successfully ran an MCMC on a network model, and that it is now time to have a critical look at the output of the run.

This tutorial will present how to check that the MCMC run went fine, but it will explain only very briefly how to check that the model is compatible with the observed data. More details about this important step is presented in the next vignette about posterior predictive checks.

This vignette is using the MCMC run from the Quick start tutorial. Please go back and run the code of that vignette to generate the MCMC data if you haven’t already!

# Load the tidyverse package for easier data manipulation
library(tidyverse)

Reminder: what is the result format returned by run_mcmc()?

In the Quick Start tutorial, we generated the run object by running run <- run_mcmc(m, iter = 1000). The output from run_mcmc() is a well-behaved mcmc.list object as implemented in the coda package:

is(run, "mcmc.list")
## [1] TRUE

It makes it very easy to use the tools already available for this class of objects, such as those implemented in the packages coda, bayesplot, ggmcmc or MCMCvis.

In addition to all those existing tools, the isotracer package also adds some extra methods to easily calculate derived parameters from an mcmc.list. You will learn more about how to do this in the vignette Derived parameters.

General diagnostics

Trace plot

You should always run several chains when performing a Bayesian MCMC. Trace plots allow to get a feeling for:

If you are not satisfied with the traces, you need to run a longer run or maybe to tweak the settings of the Stan run.

This is the trace plot we obtained from the previous vignette:

plot(run)
# Note: the figure below only shows a few of the traceplots for vignette concision

In this case, the chains have converged towards the same region, and the mixing looks good for all of them. There is no obvious problem visible in those traces.

Gelman and Rubin’s convergence diagnostic

It can be useful to have a more formal test of the convergence of the chains. The Gelman and Rubin’s convergence diagnostic is implemented in the coda package. We can run it with the coda::gelman.diag() function. See the coda documentation ?gelman.diag for more details about this diagnostic.

Let’s have a look at the diagnostic for our chains:

library(coda)
run %>% gelman.diag()
## Potential scale reduction factors:
## 
##                          Point est. Upper C.I.
## eta                            1.00       1.00
## lambda_algae                   1.01       1.02
## lambda_daphnia                 1.01       1.02
## lambda_NH4                     1.01       1.02
## upsilon_algae_to_daphnia       1.02       1.03
## upsilon_daphnia_to_NH4         1.00       1.00
## upsilon_NH4_to_algae           1.01       1.02
## zeta                           1.00       1.01
## 
## Multivariate psrf
## 
## 1.02

The diagnostic values should be very, very close to 1: it looks good in this case!

If some values were e.g. \(>1.05\), this would already be enough to cause us to wonder about the sampling quality.

Predicted trajectories

In order to check the quality of the model fit, the consistency between the parameter posteriors and the observed data can be checked by plotting the credible envelopes for the estimated trajectories along with the observed data points. This is called a posterior predictive check and is very important to check that the model can actually predict the observed data reasonably well. If the observed data cannot be satisfactorily predicted from the fitted model, then our model is not a good model of the data!

To do a posterior predictive check, the first step is to generate predictions for the model based on the MCMC posteriors with predict():

# From the Quick Start tutorial:
# 'm' is the network model we used when calling 'run <- run_mcmc(m, iter = 1000)'
predictions <- predict(m, run, probs = 0.95)

We can then visualize the predictions along with the observations with the plot() method:

plot(predictions)

This plot enables to compare both the size and the proportion observations with the predictions.

Post-run analyses

Parameter estimates

The quickest way to get parameter estimates is to use the summary() function on the posterior:

run %>% summary()
## 
## Iterations = 501:1000
## Thinning interval = 1 
## Number of chains = 4 
## Sample size per chain = 500 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                             Mean       SD  Naive SE Time-series SE
## eta                      0.12751 0.049964 0.0011172      0.0019548
## lambda_algae             0.10723 0.067599 0.0015116      0.0021817
## lambda_daphnia           0.03647 0.042346 0.0009469      0.0015642
## lambda_NH4               0.09269 0.068981 0.0015425      0.0027648
## upsilon_algae_to_daphnia 0.07785 0.023643 0.0005287      0.0009433
## upsilon_daphnia_to_NH4   0.04894 0.007169 0.0001603      0.0002337
## upsilon_NH4_to_algae     0.34347 0.045724 0.0010224      0.0017310
## zeta                     0.43612 0.237678 0.0053146      0.0099599
## 
## 2. Quantiles for each variable:
## 
##                               2.5%      25%     50%     75%   97.5%
## eta                      0.0654264 0.094176 0.11691 0.14817 0.25969
## lambda_algae             0.0085454 0.057541 0.09705 0.14604 0.26499
## lambda_daphnia           0.0007375 0.009573 0.02209 0.04620 0.15846
## lambda_NH4               0.0034180 0.041621 0.07744 0.13014 0.26348
## upsilon_algae_to_daphnia 0.0451278 0.062348 0.07370 0.08860 0.13730
## upsilon_daphnia_to_NH4   0.0361709 0.044109 0.04849 0.05325 0.06439
## upsilon_NH4_to_algae     0.2609275 0.314829 0.34115 0.37062 0.43671
## zeta                     0.1844365 0.286386 0.36440 0.51207 1.09344

If you need to store those values, for example to plot them, you can assign the output of summary() to an object:

estimates <- run %>% summary()
names(estimates)
## [1] "statistics" "quantiles"  "start"      "end"        "thin"       "nchain"

The means and standard deviations are accessible in $statistics:

estimates$statistics
##                                Mean          SD     Naive SE Time-series SE
## eta                      0.12751257 0.049963905 0.0011172269   0.0019547723
## lambda_algae             0.10723050 0.067599025 0.0015115601   0.0021817165
## lambda_daphnia           0.03647363 0.042345690 0.0009468784   0.0015642272
## lambda_NH4               0.09269130 0.068980569 0.0015424524   0.0027648434
## upsilon_algae_to_daphnia 0.07784906 0.023642981 0.0005286731   0.0009432531
## upsilon_daphnia_to_NH4   0.04893872 0.007169036 0.0001603045   0.0002336529
## upsilon_NH4_to_algae     0.34347224 0.045723698 0.0010224130   0.0017310105
## zeta                     0.43611523 0.237677637 0.0053146335   0.0099599011

and the quantiles are in $quantiles:

estimates$quantiles
##                                  2.5%         25%        50%        75%      97.5%
## eta                      0.0654264365 0.094176064 0.11691344 0.14816691 0.25969422
## lambda_algae             0.0085454389 0.057540842 0.09704578 0.14603524 0.26498951
## lambda_daphnia           0.0007375003 0.009573074 0.02208908 0.04620389 0.15846417
## lambda_NH4               0.0034180067 0.041620929 0.07743919 0.13013928 0.26347716
## upsilon_algae_to_daphnia 0.0451277873 0.062348037 0.07370339 0.08859513 0.13729757
## upsilon_daphnia_to_NH4   0.0361709218 0.044109217 0.04849449 0.05324976 0.06439118
## upsilon_NH4_to_algae     0.2609274862 0.314829360 0.34115256 0.37062017 0.43670569
## zeta                     0.1844365414 0.286385629 0.36440171 0.51207060 1.09343843

Parameter correlations

The dependencies between your model parameters might be of interest to you. If you would like to analyse the correlations between parameters during the MCMC run, you can use a few ready-made functions to get a quick overview of the correlation structure.

The isotracer package comes with the minimalist function mcmc_heatmap() to draw the strength of parameter correlations:

mcmc_heatmap(run)

But of course you could use other functions provided by other packages, such as:

Extracting parameters, trajectories and flows

If you are interested in getting detailed tables containing all the samples of the parameter posteriors, you can use the tidy_mcmc() function:

tidy_mcmc(run)
## # A tibble: 2,000 × 3
##    mcmc.chain mcmc.iteration mcmc.parameters
##         <int>          <int> <list>         
##  1          1              1 <dbl [8]>      
##  2          1              2 <dbl [8]>      
##  3          1              3 <dbl [8]>      
##  4          1              4 <dbl [8]>      
##  5          1              5 <dbl [8]>      
##  6          1              6 <dbl [8]>      
##  7          1              7 <dbl [8]>      
##  8          1              8 <dbl [8]>      
##  9          1              9 <dbl [8]>      
## 10          1             10 <dbl [8]>      
## # ℹ 1,990 more rows

By default the parameter values are nested into a list column, but you can also get a flat table with spread = TRUE:

tidy_mcmc(run, spread = TRUE)
## # A tibble: 2,000 × 10
##    mcmc.chain mcmc.iteration    eta lambda_algae lambda_daphnia lambda_NH4
##         <int>          <int>  <dbl>        <dbl>          <dbl>      <dbl>
##  1          1              1 0.113        0.0383        0.00535     0.0704
##  2          1              2 0.242        0.0941        0.0627      0.0302
##  3          1              3 0.130        0.140         0.0291      0.164 
##  4          1              4 0.0921       0.203         0.0176      0.278 
##  5          1              5 0.0901       0.299         0.0170      0.252 
##  6          1              6 0.103        0.142         0.0596      0.156 
##  7          1              7 0.113        0.160         0.0202      0.0660
##  8          1              8 0.0978       0.0302        0.0200      0.0110
##  9          1              9 0.142        0.0859        0.0325      0.0658
## 10          1             10 0.117        0.0398        0.0456      0.0897
## # ℹ 1,990 more rows
## # ℹ 4 more variables: upsilon_algae_to_daphnia <dbl>, upsilon_daphnia_to_NH4 <dbl>,
## #   upsilon_NH4_to_algae <dbl>, zeta <dbl>

The above table only contains the primary parameters. If you are interested in getting the predicted trajectories for individual MCMC samples, you can use the tidy_trajectories() function:

# We have to also provide the original network model `m`
tt <- tidy_trajectories(m, run, n = 200)
tt
## # A tibble: 200 × 4
##    mcmc.chain mcmc.iteration mcmc.parameters trajectories    
##         <int>          <int> <list>          <list>          
##  1          4             28 <dbl [8]>       <tibble [1 × 5]>
##  2          2             87 <dbl [8]>       <tibble [1 × 5]>
##  3          2            319 <dbl [8]>       <tibble [1 × 5]>
##  4          4            295 <dbl [8]>       <tibble [1 × 5]>
##  5          1             71 <dbl [8]>       <tibble [1 × 5]>
##  6          2            184 <dbl [8]>       <tibble [1 × 5]>
##  7          1            371 <dbl [8]>       <tibble [1 × 5]>
##  8          2            257 <dbl [8]>       <tibble [1 × 5]>
##  9          2            198 <dbl [8]>       <tibble [1 × 5]>
## 10          1            307 <dbl [8]>       <tibble [1 × 5]>
## # ℹ 190 more rows

As you can see, the tt object is a tidy table which contains the parameter values and the corresponding trajectories calculated for 200 randomly selected MCMC samples. The calculated trajectories are stored in the trajectories column and provide the quantities of unmarked and marked tracer (e.g. light and heavy isotope) for each compartment at each time step:

tt$trajectories[[1]]
## # A tibble: 1 × 5
##   timepoints  unmarked        marked          sizes           proportions    
##   <list>      <list>          <list>          <list>          <list>         
## 1 <dbl [260]> <dbl [260 × 3]> <dbl [260 × 3]> <dbl [260 × 3]> <dbl [260 × 3]>

Because each trajectory is itself a table containing a time series for each compartment, the output of tidy_trajectories() has several levels of nesting. This makes it a bit cumbersome to manipulate. Note that the output format of this function might change in the future.

Here is an example of what can be done using the predicted trajectories:

algae <- tt %>%
  mutate(prop_algae = map(trajectories, function(tr) {
    tr[["proportions"]][[1]][, "algae"]
  })) %>%
  pull(prop_algae) %>%
  do.call(rbind, .)
time <- tt$trajectories[[1]]$timepoints[[1]]
plot(0, type = "n", xlim = range(time), ylim = range(algae), las = 1,
     xlab = "Time", ylab = "Proportion of marked tracer (algae)",
     main = "Posterior sample of trajectories (for 15N prop. in algae)")
invisible(sapply(seq_len(nrow(algae)), function(i) {
  lines(time, algae[i,], col = adjustcolor("seagreen3", alpha.f = 0.2))
}))

Finally, if what you are interested in are not the trajectories per se but the actual flows of nutrient during the experiment, you can use the tidy_flows() function to extract flows in a similar way:

#' Again, note that we also provide the original network model `m`
tf <- tidy_flows(m, run, n = 200)
tf
## # A tibble: 200 × 4
##    mcmc.chain mcmc.iteration mcmc.parameters flows             
##  *      <int>          <int> <list>          <list>            
##  1          1            372 <dbl [8]>       <gropd_df [6 × 3]>
##  2          2            215 <dbl [8]>       <gropd_df [6 × 3]>
##  3          4            226 <dbl [8]>       <gropd_df [6 × 3]>
##  4          1            172 <dbl [8]>       <gropd_df [6 × 3]>
##  5          1            222 <dbl [8]>       <gropd_df [6 × 3]>
##  6          2            221 <dbl [8]>       <gropd_df [6 × 3]>
##  7          3            310 <dbl [8]>       <gropd_df [6 × 3]>
##  8          2            205 <dbl [8]>       <gropd_df [6 × 3]>
##  9          2             34 <dbl [8]>       <gropd_df [6 × 3]>
## 10          2            110 <dbl [8]>       <gropd_df [6 × 3]>
## # ℹ 190 more rows

The returned object is very similar to the output of tidy_trajectories(), except that the trajectories column is replaced by a flows column:

tf$flows[[1]]
## # A tibble: 6 × 3
## # Groups:   from [3]
##   from    to      average_flow
##   <chr>   <chr>          <dbl>
## 1 NH4     algae        0.0747 
## 2 NH4     <NA>         0.0172 
## 3 algae   daphnia      0.0765 
## 4 algae   <NA>         0.0322 
## 5 daphnia NH4          0.0884 
## 6 daphnia <NA>         0.00267

The average flow values are given in flow per unit of time. See ?tidy_flows() for more details, including the possibility of calculating steady state flows for network systems that admits a steady state equilibrium.

The tidy_trajectories() and tidy_flows() functions are especially useful when you want to do some calculations related to some specific properties of the trajectories or of the nutrient flows over the whole MCMC posterior.