In this vignette, we explain how one can estimate the marginal likelihood using the LoRaD method (Wang et al. 2023).
Two DNA sequences, each of length 200 sites, were simulated under the K80 substitution model (Kimura 1980). This model has two parameters:
transition/transversion rate ratio, \(\kappa\): the ratio of the instantaneous rate of transition-type substitutions (A \(\leftrightarrow\)G, C\(\leftrightarrow\)T) to the instantaneous rate of transversion-type substitutions (A\(\leftrightarrow\)C, A\(\leftrightarrow\)T, G\(\leftrightarrow\)C, G\(\leftrightarrow\)T).
edge length, \(v\) : the evolutionary distance between the two sequences measured in expected number of substitutions per site.
A continuous-time, 4-state Markov model was used for both simulation and analysis. The instantaneous rate matrix \({\bf Q}\) that defines the K80 substitution model is
\[\begin{align} {\bf Q} &= \left[ \begin{array}{cccc} -\beta(\kappa + 2) & \beta & \kappa \beta & \beta \\ \beta & -\beta(\kappa + 2) & \beta & \kappa \beta \\ \kappa \beta & \beta & -\beta(\kappa + 2) & \beta \\ \beta & \kappa \beta & \beta & -\beta(\kappa + 2) \\ \end{array} \right], \end{align}\]where the order of (e.g. DNA) states for both rows and columns is A, C, G, T. The matrix of transition probabilities can be obtained by exponentiating the product of \({\bf Q}\) and time \(t\):
\[\begin{align} {\bf P} &= e^{{\bf Q}t}. \end{align}\]The product of the base rate \(\beta\) and time \(t\) can be determined from the edge length parameter \(v\) and transition/transversion rate ratio parameter \(\kappa\). The edge length is the product of the overall substitution rate (\(0.25 (8 \beta + 4 \kappa \beta)\)) and time (\(t\)), yielding
\[\begin{align} v &= 2 \beta t + \kappa \beta t \\ \beta t &= \frac{v}{2 + \kappa} \end{align}\]The transition probabilities may be used to simulate data for one of the two sequences given the other sequence. The state for each site in the starting sequence is drawn from the stationary distribution \(\pi_A=\pi_C=\pi_G=\pi_T=0.25\) implied by \({\bf Q}\).
Given simulated sequence data \({\bf D}\), the joint posterior distribution \(p(v, \kappa|{\bf D})\) was sampled using MCMC, with likelihoods computed using the K80 model and using Exponential priors with mean equal to 50 for both \(v\) and \(\kappa\). The posterior sample (10,000 points sampled 100 iterations apart) after a burn-in of 1000 iterations was saved tab-delimited text file k80-samples.txt. In the file there are four columns, iter (MCMC-iteration), log-kernel (log posterior kernel), edgelen (sampled edge length), and kappa (sampled kappa parameter).
In this example we will compute an estimate of the marginal likelihood under this model using the LoRaD method.
First, load the lorad package.
Read in the file containing the posterior sample.
Next, we must create a named vector to associate each column name in
k80samples
with a column specification. Here are the
possible column specifications:
All columns labeled posterior will be summed to create the log posterior kernel (sum of log likelihood and log joint prior).
Here are the column specifications appropriate for this example.
Column | Specification |
---|---|
iter | iteration |
log.kernel | posterior |
edgelen | positive |
kappa | positive |
# Create a named vector holding the column specifications
colspec <- c("iter"="iteration", "log.kernel"="posterior", "edgelen"="positive", "kappa"="positive")
The LoRaD requires all parameters to be unconstrained in their
support. Thus, the positive
column specification for both
edgelen
and kappa
results in a log
transformation prior to application of the LoRaD method.
To run the LoRaD function we need to specify the training fraction, the training sample selection method, and the coverage fraction. The training fraction and the coverage fraction must be between 0 and 1. The training sample selection method can be left (the first part of the sample), right (the second part of the sample), or random (a random part of the sample).
For this example we used 0.5 for the training fraction, 0.1 for the coverage fraction, and “left” for the training sample selection method.
results <- lorad_estimate(k80samples, colspec, 0.5, "left", 0.1)
lorad_summary(results)
#> This is lorad 0.0.1.0
#> Parameter sample comprises 10000 sampled points
#> Each sample point is a vector of 2 parameter values
#>
#> Training fraction is 0.5
#> Coverage specified is 0.1
#>
#> Partitioning samples into training and estimation:
#> Sample size is 10000
#> Training sample size is 5000
#> Estimation sample size 5000
#>
#> Processing training sample...
#> Lowest radial distance is 0.464997772
#> Log Delta -2.278161285
#>
#> Processing estimation sample...
#> Number of samples used is 510
#> Nominal coverage specified is 0.100000
#> Realized coverage is 0.102000
#> Log marginal likelihood is -460.822385
#>
For comparison, the two log marginal likelihood estimates reported by Wang et al. (2023) were -460.82239 (LoRaD method) and -460.86154 (Steppingstone method).