Part 4: Checking for convergence

Back to Part 3

We now illustrate the concept of burn-in. We will run one chain with a low starting value for d, and another with a high starting value.

dk.mcmc.l <- mcmcf(0.01, 20, 1e4, .12, 180)
dk.mcmc.h <- mcmcf(0.4, 20, 1e4, .12, 180)

plot(dk.mcmc.l$d, xlim = c(1,200), ylim = c(0,0.4), ty = "l")
lines(dk.mcmc.h$d, col="red")

# We use the low chain to calculate the mean
# and sd of d. We could have used the high chain
# as well.
mean.d <- mean(dk.mcmc.l$d)
sd.d <- sd(dk.mcmc.l$d)
abline(h = mean.d + 2 * c(-sd.d, sd.d), lty = 2)

The horizontal dashed lines indicate approximately the 95% CI. Notice how the chains move from either the high or low starting values towards the stationary phase (the area within the dashed lines). The area before the chain reaches stationarity is the burn-in.
Figure9
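In practice, the burn-in samples are discarded before posterior summaries are computed. A minimal sketch (the cut-off of 1,000 iterations below is an arbitrary choice for illustration; inspect the trace plot to choose a sensible value):

# discard the first 1,000 samples as burn-in (arbitrary cut-off)
# before computing posterior summaries:
burnin <- 1000
mean(dk.mcmc.l$d[-(1:burnin)])
mean(dk.mcmc.h$d[-(1:burnin)])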

The simplest way to check for convergence is to run the MCMC chain several times (at least twice) with over-dispersed starting values. Then one can compare the outputs of the various chains and check how similar they are. If the chains have converged to the target posterior distribution, then they will have similar (almost identical) histograms, posterior means, 95% credibility intervals, etc.

We now run the efficient and inefficient chains again, but with different starting values (i.e. different init.d and init.k).

# Efficient chain (good proposal step sizes)
dk.mcmc.b <- mcmcf(0.05, 5, 1e4, .12, 180)
# Inefficient chain (bad proposal step sizes)
dk.mcmc3.b <- mcmcf(0.05, 5, 1e4, 3, 5)

And we can now plot and compare the histograms:

par(mfrow=c(1,2))
bks <- seq(from=0, to=150, by=1)

hist(dk.mcmc.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

hist(dk.mcmc3.b$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(0, 0, 1, .5), las=1, xlab="kappa",
     xlim=c(0,100), ylim=c(0,.055))

hist(dk.mcmc3$k, prob=TRUE, breaks=bks, border=NA,
     col=rgb(.5, .5, .5, .5), add=TRUE)

Efficient chains converge quickly, and their histograms overlap well. Inefficient chains, on the other hand, need to be run for longer to converge; here they were not run long enough, so the histograms do not overlap as well. Your histograms may look slightly different due to the stochastic nature of the MCMC algorithm.

figure7

We can also calculate the posterior means:

# posterior means (similar for efficient chains):
mean(dk.mcmc$d); mean(dk.mcmc.b$d)
mean(dk.mcmc$k); mean(dk.mcmc.b$k)

# posterior means (not so similar for the inefficient chains):
mean(dk.mcmc3$d); mean(dk.mcmc3.b$d)
mean(dk.mcmc3$k); mean(dk.mcmc3.b$k)

The variance of the estimated posterior mean is the sample variance of the MCMC sample (computed ignoring the autocorrelation) divided by the effective sample size, that is, by the efficiency times the chain length. The standard error of the estimated posterior mean is the square root of this variance:
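\mathrm{SE} = \sqrt{\mathrm{Var}(x) / (N \times \mathrm{eff})},

where \mathrm{Var}(x) is the sample variance of the sampled values of a parameter, N is the chain length, and \mathrm{eff} is the efficiency calculated in Part 3. In R: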

# efficient chain, standard error of the means
sqrt(1/1e4 * var(dk.mcmc$d) / 0.23) # roughly 2.5e-4
sqrt(1/1e4 * var(dk.mcmc$k) / 0.20) # roughly 0.2

# inefficient chain, standard error of the means
sqrt(1/1e4 * var(dk.mcmc3$d) / 0.015) # roughly 9.7e-4
sqrt(1/1e4 * var(dk.mcmc3$k) / 0.003) # roughly 1.6

Thus, for the efficient chain on d, I got an estimated posterior mean of 0.105, which is within about ±2 × 2.5e-4 of the true posterior mean. Note that for the inefficient chains, the standard errors are about four times (for d) and eight times (for k) those of the efficient chains.

We can also plot densities (smoothed histograms) for the efficient and inefficient chains:

par(mfrow=c(1,2)); adj <- 1.5
# Efficient chains:
plot(density(dk.mcmc.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc$k, adj=adj), col="black")

# Inefficient chains:
plot(density(dk.mcmc3.b$k, adj=adj), col="blue", las=1,
     xlim=c(0, 100), ylim=c(0, .05), xaxs="i", yaxs="i")
lines(density(dk.mcmc3$k, adj=adj), col="black")

The resulting R plots are shown below.

figure8

A useful measure of convergence is the potential scale reduction statistic of Gelman and Rubin (also known as the Gelman-Rubin convergence diagnostic). The statistic compares the within-chain and among-chain variances for the posterior mean of a parameter. When the value of the statistic is close to 1, convergence has been achieved. Note that the diagnostic may fail (i.e. give a misleading reassurance of convergence) for strongly multi-modal posteriors; to guard against this it is important to run more than two chains with over-dispersed starting values. The R package coda, which provides MCMC diagnostics, has a function to calculate this diagnostic, gelman.diag. If you have the package installed, you may want to go ahead and calculate the diagnostic for the chains above.
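For example, a minimal sketch using the two chains with over-dispersed starting values run above (this assumes coda is installed, e.g. via install.packages("coda"); in practice you would discard the burn-in first):

library(coda)

# combine the two chains (low and high starting values) into an
# mcmc.list, with one mcmc object per chain:
chains <- mcmc.list(mcmc(cbind(d=dk.mcmc.l$d, k=dk.mcmc.l$k)),
                    mcmc(cbind(d=dk.mcmc.h$d, k=dk.mcmc.h$k)))

# potential scale reduction factors (values close to 1 suggest
# convergence):
gelman.diag(chains)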

Chapter 7 in Yang (2014, Molecular Evolution: A Statistical Approach) provides a good introduction to Bayesian computation including Bayesian MCMC, proposal densities, efficiency, convergence diagnostics, etc.

We hope you enjoyed the tutorial. If you have questions, comments or if you spot any errors, please post a message below or send us an email!

Fabricia, Mario and Ziheng.


Part 3: Efficiency of the MCMC chain

Back to Part 2

Values sampled in an MCMC chain are autocorrelated because new states are either the previous state or a modification of it. The efficiency of an MCMC chain is closely related to the autocorrelation. Intuitively, if the autocorrelation is high, the chain will be inefficient, i.e. we will need to run the chain for a long time to obtain a good approximation to the posterior distribution.

The efficiency of a chain is defined as

\mathrm{eff} = 1 / (1 + 2(r_1 + r_2 + r_3 + ...)),

where r_i is the correlation coefficient for lag i. We will now calculate the efficiency of the MCMC. To calculate the efficiency it is appropriate to obtain a long sample (say one million). R’s autocorrelation function, acf, can be used to calculate the correlation coefficients and plot them in the so-called autocorrelation plot.

# run a very long chain (1e6 generations take about
# 40s on my MacBook Air) to calculate efficiency
dk.mcmc2 <- mcmcf(0.2, 20, 1e6, .12, 180)

# R's acf function (for AutoCorrelation Function) 
par(mfrow=c(1,2))
acf(dk.mcmc2$d)
acf(dk.mcmc2$k)

# Define efficiency function
eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))

# the efficiencies are roughly 22% for d and 
# 20% for k:
# [1] 0.2255753 # mcmcf(0.2, 20, 1e7, .12, 180)
eff(acf(dk.mcmc2$d))
# [1] 0.2015054 # mcmcf(0.2, 20, 1e7, .12, 180)
eff(acf(dk.mcmc2$k))

The chain has an efficiency of roughly 22% for d and 20% for k. This means that a sample of size N from the MCMC is about as good as an independent sample of size N \times 0.2 (in terms of the variance of the estimate).

The plots for the autocorrelation are shown below:

figure3

Notice that the autocorrelation drops to zero after about 10-15 iterations for both d and k. That is, a sample observation obtained at iteration i is roughly uncorrelated with an observation obtained at iteration i + 15 (i.e. for a lag of 15).

Also notice that the autocorrelation is very high for small lags. For example, for a lag of 2, the autocorrelation is roughly 60%. This is because neighbouring states in the chain are highly similar. It is customary to ‘thin’ the chain, that is, to save states only every few iterations. This removes the highly correlated neighbouring states without losing too much information, and saves hard drive space. However, the estimates with the smallest variance (error) are obtained when all of the samples in the chain are used.
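For example, a minimal sketch of thinning the long chain from above (the thinning interval of 10 is an arbitrary choice for illustration):

# keep every 10th sample of the long chain:
thin <- 10
d.thin <- dk.mcmc2$d[seq(from=1, to=length(dk.mcmc2$d), by=thin)]
# the thinned sample shows much less autocorrelation:
acf(d.thin)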

A concept closely related to efficiency is the effective sample size (ESS) of the chain. It is defined as

\mathrm{ESS} = N \times \mathrm{eff},

where N is the total number of samples taken from the chain. Thus for the chain above, the ESS’s are

1e6 * 0.23 # ~230K
1e6 * 0.20 # ~200K

or about 200,000–230,000 effective samples. This means that an independent sample of size ~200K has about the same variance (error) for the parameter estimates (in this case the means of d and k) as our autocorrelated MCMC sample of size 1 million. An MCMC chain with efficiency around 20% is a very efficient chain. We will see examples of inefficient chains next.

Inefficient chains:

Inefficient chains are those that do not ‘mix’ well. That is, parameter states are highly autocorrelated even when they are many iterations apart, and thus the efficiency and ESS tend to be low. Inefficient chains usually result from poor proposal densities, or from complex posteriors with highly correlated parameters in over-parameterised models.

To illustrate inefficient chains, we will run our MCMC again using a proposal density whose step size is too large for d, and another whose step size is too small for k.

# The window width for the d proposal density is too large,
# while it is too small for k
dk.mcmc3 <- mcmcf(0.2, 20, 1e4, 3, 5)

par(mfrow=c(1,2))
# because proposal width for d is too large,
# chain gets stuck at same values of d:
plot(dk.mcmc3$d, ty='l', main="Trace of d", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="d")

# whereas proposal width for k is too small,
# so chain moves slowly:
plot(dk.mcmc3$k, ty='l', main="Trace of k", cex.main=2.0,
     cex.lab=1.5, cex.axis=1.5, ylab="k")

The trace plots for the inefficient chain are shown below. In the case of d, the proposal density is too wide: newly proposed values of d are so far from the current state that they often fall outside the area of high probability density in the posterior, and thus most proposals are rejected. This is evident in the trace plot, which has many horizontal sections where the chain is stuck at the same value of d for many iterations. In the case of k, the proposal density is too narrow, and so the chain moves slowly by making baby-sized steps.

figure4

Inefficient chains are characterised by acceptance proportions that are either too low or too high. For d the acceptance proportion (check your R output) was about 3% (i.e. over 97% of proposals were rejected), while for k the acceptance proportion was over 90% (i.e. the vast majority of proposals were accepted). As a rule of thumb, for bell-shaped posterior densities, one should aim for acceptance proportions close to 30%, which lead to optimal mixing, large efficiency and large ESS. Try tweaking the values of w.d and w.k and running the MCMC again until you achieve acceptance proportions close to 30% for both parameters. Then check the trace plots. Finding the optimal step sizes is known as ‘fine-tuning’ the chain. Many MCMC programs use automatic fine-tuning algorithms and the user does not normally need to worry about this.
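For example, a rough sketch of manual fine-tuning of w.d (the widths tried below are guesses, not recommended values; check the acceptance proportions printed by mcmcf and keep the width that gives roughly 30%):

# try a few proposal widths for d, keeping w.k fixed at 180:
for (w in c(0.05, 0.1, 0.2, 0.4)) {
  cat("w.d =", w, "\n")
  tmp <- mcmcf(0.2, 20, 1e4, w, 180)
}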

We will now calculate eff for the inefficient chains. Because the chain is so autocorrelated, it is best to do a very long run (1e7 iterations would be ideal, but this takes over 10 min on my computer).

dk.mcmc4 <- mcmcf(0.2, 20, 1e6, 3, 5)

# Efficiencies are roughly 1.5% for d, and 0.35% for k:
# [1] 0.01530385  # mcmcf(0.2, 20, 1e7, 3, 5)
eff(acf(dk.mcmc4$d, lag.max=2e3))

# [1] 0.003493112 # mcmcf(0.2, 20, 1e7, 3, 5)
eff(acf(dk.mcmc4$k, lag.max=2e3))

The efficiencies are very low, about 1.5% for d and 0.35% for k. This means ESSs of around 15,000 and 3,500 for a chain of size 1 million, which is very inefficient. The autocorrelation plots are shown below. Notice that the correlations only drop to zero after about 200 iterations (lags) in the case of d, and after over 1,000 iterations in the case of k.

figure5

Finally, we plot the traces for efficient (part 2) and inefficient chains.

par(mfrow=c(2,2))

plot(dk.mcmc$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, efficient chain", xlab='',
     ylab="Distance, d", cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$d, ty='l', las=1, ylim=c(.05,.2),
     main="Trace of d, inefficient chain", xlab='',
     ylab='', cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, efficient chain",
     xlab='', ylab="ts/tv ratio, k",
     cex.main=2.0, cex.lab=1.5)
plot(dk.mcmc3$k, ty='l', las=1, ylim=c(0,100),
     main="Trace of k, inefficient chain",
     xlab='', ylab='', cex.main=2.0, cex.lab=1.5)

figure6

In this part we explored efficiency and effective sample size, and their relation to the acceptance proportion. These concepts are important when assessing whether the chain is mixing well, given that the chain has converged to the stationary distribution and is sampling from the high-probability region of the posterior. In the next part, we look at MCMC diagnostics, that is, how to tell whether the chain has converged to the posterior.

Part 4: Checking for convergence

Part 2: Markov chain Monte Carlo (MCMC)

Back to Part 1

We now aim to obtain the posterior distribution of d and k by MCMC sampling. The draft MCMC algorithm is:

  1. Set initial states for d and k.
  2. Propose a new state d^* (from an appropriate proposal density).
  3. Accept the proposal with probability
    \alpha=\min\Big(1, \frac{p(d^*)p(k)p(D|d^*)}{p(d)p(k)p(D|d)}\Big).
    If the proposal is accepted set d=d^*, otherwise d=d.
  4. Repeat 2-3 for k.
  5. Save current values of d and k.
  6. Go to step 2.

To avoid numerical problems (i.e. underflows or overflows) it is best to work with the logarithm of the unnormalised posterior. This is defined as the sum of the log-prior and the log-likelihood. The R code for the log-unnormalised posterior is

# function returning the logarithm of the unscaled posterior:
# f(d) * f(k) * f(D|d,k)
# by default we set the priors as: 
# f(d) = Gamma(d | 2, 20) and f(k) = Gamma(k | 2, .1)
ulnPf <- function(d, k, n=948, ns=84, nv=6, a.d=2, b.d=20, 
        a.k=2, b.k=.1) {
  # The normalizing constant in the prior densities can 
  # be ignored:
  lnpriord <- (a.d - 1)*log(d) - b.d * d
  lnpriork <- (a.k - 1)*log(k) - b.k * k
  
  # log-Likelihood (K80 model):
  expd1 <- exp(-4*d/(k+2));
  expd2 <- exp(-2*d*(k+1)/(k+2))
  p0 <- .25 + .25 * expd1 + .5 * expd2
  p1 <- .25 + .25 * expd1 - .5 * expd2
  p2 <- .25 - .25 * expd1
  lnL <- ((n - ns - nv) * log(p0/4) + 
         ns * log(p1/4) + nv * log(p2/4))
  
  # Return unnormalised posterior:
  return (lnpriord + lnpriork + lnL)
}

Note that the joint prior on d and k is the product of two independent gamma densities. Recall that the gamma density is

\mathrm{Gamma}(x|a,b) = \frac{b^a}{\Gamma(a)}x^{a-1}e^{-bx},

where \Gamma is the complete gamma function. The factor b^a/\Gamma(a) is a constant (i.e. it does not depend on d or k), and so it cancels out in the calculation of the posterior ratio in step 3 of the MCMC; it can therefore be ignored in the calculation of the log-prior in ulnPf. Ignoring the constant speeds up the code. The calculation of the log-likelihood in ulnPf involves only two calls to the exp function, and so it is slightly faster than the code in Part 1, which makes five exp calls (for the K80 log-likelihood equation please see Part 1).
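As a quick sanity check that ignoring the constant is harmless (a sketch, assuming the function k80.lnL from Part 1 is still in your workspace), the difference in log-posterior between any two points is unchanged because the constants cancel:

# difference of unnormalised log-posteriors at two arbitrary points:
ulnPf(0.1, 20) - ulnPf(0.2, 30)

# the same difference computed with the full log-priors
# (dgamma includes the b^a / Gamma(a) constant):
(dgamma(0.1, 2, 20, log=TRUE) + dgamma(20, 2, .1, log=TRUE) + k80.lnL(0.1, 20)) -
  (dgamma(0.2, 2, 20, log=TRUE) + dgamma(30, 2, .1, log=TRUE) + k80.lnL(0.2, 30))

# the two differences should match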

We will now write the R code implementing the MCMC algorithm above. In the code, d and k hold the current values, dprop and kprop hold the proposed values (d^* and k^*), and ulnP and ulnPprop hold the corresponding logarithms of the unnormalised posterior. Note that if lnalpha > 0, we know \alpha = 1, so the proposal is accepted; the code calculates the acceptance probability \alpha only when lnalpha < 0. The corresponding R function for the MCMC algorithm is:

mcmcf <- function(init.d, init.k, N, w.d, w.k) {
  # init.d and init.k are the initial states
  # N is the number of MCMC iterations.
  # w.d is the width of the sliding-window proposal for d.
  # w.k is the width of the sliding-window proposal for k.
  
  # We keep the visited states (d, k) in sample.d and sample.k 
  # for easy plotting. In practical MCMC applications, these 
  # are usually written into a file.
  sample.d <- sample.k <- numeric(N+1)

  # STEP 1: Initialise the MCMC chain
  d <- init.d;  sample.d[1] <- init.d
  k <- init.k;  sample.k[1] <- init.k
  ulnP <- ulnPf(d, k)
  acc.d <- 0;  acc.k <- 0 # number of acceptances
  
  for (i in 1:N) {
    # STEP 2: Propose new state d* 
    # we use a uniform sliding window of width w with reflection
    # to propose new values d* and k* 
    # propose d* and accept/reject the proposal
    dprop <- d + runif(1, -w.d/2, w.d/2)
    # reflect if dprop is negative
    if (dprop < 0) dprop <- -dprop
    
    ulnPprop <- ulnPf(dprop, k)
    lnalpha <- ulnPprop - ulnP

    # STEP 3: Accept or reject the proposal
    # if ru < alpha accept proposed d*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)){
      d <- dprop;  ulnP <- ulnPprop; 
      acc.d  <- acc.d + 1
    }
    # else reject and stay where we are (so that nothing needs 
    # to be done).
    
    # STEP 4: Repeat steps 2-3 for k
    # propose k* and accept/reject the proposal
    kprop <- k + runif(1, -w.k/2, w.k/2)
    # reflect if kprop is negative
    if (kprop < 0) kprop <- -kprop
    
    ulnPprop <- ulnPf(d, kprop)
    lnalpha <- ulnPprop - ulnP
    # if ru < alpha accept proposed k*:
    if (lnalpha > 0 || runif(1) < exp(lnalpha)){
      k <- kprop;  ulnP <- ulnPprop
      acc.k  <- acc.k + 1
    }
    # else reject and stay where we are.
    
    # STEP 5: Save chain state
    sample.d[i+1] <- d;  sample.k[i+1] <- k
  }
  
  # print out the proportion of times
  # the proposals were accepted
  print("Acceptance proportions (d, k):")
  print(c(acc.d/N, acc.k/N))
  
  # return vector of d and k visited during MCMC
  
  return (list(d=sample.d, k=sample.k))
}

Note that as we use ulnP to store the log unnormalised posterior for the current state, we make only one call to the ulnPf function (to calculate the log unnormalised posterior for the newly proposed state) in each of the two proposals. This halves the amount of computation compared with the alternative strategy of calculating the log unnormalised posterior for both the current and proposed states, which would require two calls to ulnPf per proposal. Note also that in the syntax “if (a || b)” used in the two acceptance tests, the expression “a || b” evaluates to TRUE as soon as a is TRUE, in which case b is not evaluated, saving one call to the exp function. Note that the algorithm is stochastic, so repeated runs will give different results. The function system.time in R is useful to measure the time an R expression takes to evaluate. For example, running the MCMC algorithm above with 10,000 iterations takes about 0.3 seconds on a 2.2 GHz MacBook Air.

# Test run-time:
system.time(mcmcf(0.2, 20, 1e4, .12, 180)) # about 0.3s
# Run again and save MCMC output:
dk.mcmc <- mcmcf(0.2, 20, 1e4, .12, 180)

We now plot the ‘traces’ of the parameters. A trace plot is a plot of parameter state vs. MCMC iteration. Trace plots allow us to easily visualise the MCMC chain. They can also show us if there are convergence or mixing problems with the chain (we’ll see this in Part 3):

par(mfrow=c(1,3))

# trace plot of d
plot(dk.mcmc$d, ty='l', xlab="Iteration", ylab="d", 
     main="Trace of d")

# trace plot of k
plot(dk.mcmc$k, ty='l', xlab="Iteration", ylab="k", 
     main="Trace of k")

# We can also plot the joint sample of d and k
# (points sampled from posterior surface)
plot(dk.mcmc$d, dk.mcmc$k, pch='.', xlab="d", ylab="k", 
     main="Joint of d and k")

Trace plots for the distance, d, and kappa, k, and the plot of the joint sample of d and k from the posterior surface.

Figure2

The trace plots show that the chain has mixed well (i.e., it has explored the parameter space well). The joint sample of d and k matches the plot of the posterior surface from Part 1. In Part 3 we will see examples of chains that have poor mixing. Chains with poor mixing may be the product of poorly chosen proposal densities or bad proposal step sizes. For example, try running the MCMC chain again after modifying the step sizes (the widths) of the proposal densities, w.d and w.k, as in the sketch below.
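A minimal sketch (the widths below are arbitrary values for experimentation, not recommendations):

# re-run with different, arbitrarily chosen proposal widths and
# compare the traces with those above:
dk.mcmc.w <- mcmcf(0.2, 20, 1e4, 1, 20)
par(mfrow=c(1,2))
plot(dk.mcmc.w$d, ty='l', xlab="Iteration", ylab="d", main="Trace of d")
plot(dk.mcmc.w$k, ty='l', xlab="Iteration", ylab="k", main="Trace of k")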

Part 3: Efficiency of the MCMC chain

Part 1: Introduction

Back to Preamble

In this part we introduce the data, and write the appropriate R code to calculate the likelihood, the prior and the (un-normalised) posterior distribution. To avoid numerical problems, we will work with the logarithm of the densities.

The data are a pairwise alignment of the 12S rRNA gene of human and orangutan. The alignment has 948 base pairs (bp) and 90 differences (84 transitions and 6 transversions). See Table 1.3 on p. 7 of Yang (2014) “Molecular Evolution: A Statistical Approach” for details.

To represent the data, we use the following R code:

n <- 948 # length of alignment in bp
ns <- 84 # total number of transitions (23+30+10+21)
nv <- 6  # total number of transversions (1+0+2+0+2+1+0+0)

The log-likelihood function, \log f(D|d,k), for data D = (n_s, n_v), using Kimura’s (1980) substitution model [see p.8 in Yang (2014)] is

\log f(D|d,k) = (n - n_s - n_v) \log(p_0/4) + n_s \log(p_1/4) + n_v \log(p_2/4) ,

where

p_0 = 1/4 + 1/4 \times e^{-4d/(k+2)}  + 1/2 \times e^{-2d(k+1)/(k+2)} ,
p_1 = 1/4 + 1/4 \times e^{-4d/(k+2)}  - 1/2 \times e^{-2d(k+1)/(k+2)} ,
p_2 = 1/4 - 1/4 \times e^{-4d/(k+2)} .

Note that the likelihood depends only on two parameters, the distance d and the ts/tv ratio k. All the other variables are part of the data. The corresponding R code is

# Kimura's (1980) likelihood for two sequences
k80.lnL <- function(d, k, n=948, ns=84, nv=6){

  p0 <- .25 + .25 * exp(-4*d/(k+2)) + 
        .5 * exp(-2*d*(k+1)/(k+2))

  p1 <- .25 + .25 * exp(-4*d/(k+2)) - 
        .5 * exp(-2*d*(k+1)/(k+2))

  p2 <- .25 - .25 * exp(-4*d/(k+2))

  return ((n - ns - nv) * log(p0/4) +
          ns * log(p1/4) + nv * log(p2/4))
}

The posterior distribution is given by

f(d,k|D) = \frac{1}{z} f(d) f(k) f(D|d,k),

where f(d) and f(k) are the marginal priors on d and k, and z is the normalising constant, given by

z=\int\int f(d)f(k)f(D|d,k)\,\mathrm{d}d\,\mathrm{d}k.

We set the marginal priors to be gamma distributions: f(d) = \mathrm{Gamma}(d | 2, 20) and f(k) = \mathrm{Gamma}(k | 2, 0.1). The prior mean of d is 2/20 = 0.1, and the prior mean of k is 2/0.1 = 20. The prior densities reflect the biologist’s prior information about the model parameters.
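As a quick visual check of these prior choices (a sketch; the plotting ranges below are arbitrary):

par(mfrow=c(1,2))
# marginal prior of d, Gamma(2, 20), with its mean 2/20 = 0.1:
curve(dgamma(x, 2, 20), from=0, to=0.5, xlab="distance, d",
      ylab="prior density")
abline(v=2/20, lty=2)
# marginal prior of k, Gamma(2, 0.1), with its mean 2/0.1 = 20:
curve(dgamma(x, 2, .1), from=0, to=150, xlab="kappa, k",
      ylab="prior density")
abline(v=2/.1, lty=2)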

The double integral in z cannot be calculated analytically. It may be calculated numerically (for example by using Gaussian quadrature) but this is cumbersome. The dimension of integration in z is usually the same as the number of parameters in the model. Thus a model with three parameters will involve a triple integral, and so on. Numerical methods are notoriously slow and inaccurate for calculating integrals with more than two dimensions. In fact, the intractability of the integrals made the Bayesian method unsuitable for practical data analysis for a long time. The development of the MCMC algorithm (in which calculation of z is avoided), together with the increase in power of modern computers, has led to an explosion of practical applications of the Bayesian method during the past two decades.

We will ignore calculation of z here, and will work with the un-normalised posterior instead:

\log (f(d)\times f(k)\times f(D|d,k))

We can now plot the likelihood, prior, and posterior surfaces. First we set up a 100 ⨉ 100 grid of d and k points over which the surfaces will be plotted:

dim <- 100  # dimension for the plot
d.v <- seq(from=0, to=0.3, len=dim) # vector of d values
k.v <- seq(from=0, to=100, len=dim) # vector of k values
dk <- expand.grid(d=d.v, k=k.v)

par(mfrow=c(1, 3))

We now plot the three surfaces:

# Prior surface, f(d)f(k)
Pri <- matrix(dgamma(dk$d, 2, 20) * dgamma(dk$k, 2, .1),
       ncol=dim)

image(d.v, k.v, -Pri, las=1, col=heat.colors(50),
      main="Prior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pri, nlev=10, drawlab=FALSE, add=TRUE)

# Likelihood surface, f(D|d,k)
lnL <- matrix(k80.lnL(d=dk$d, k=dk$k), ncol=dim)

# for numerical reasons, we scale the likelihood to be 1
# at the maximum, i.e. we subtract max(lnL)
L <- exp(lnL - max(lnL))  

image(d.v, k.v, -L, las=1, col=heat.colors(50),
      main="Likelihood", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, L, nlev=10,
        drawlab=FALSE, add=TRUE) # creates a contour plot

# Unscaled posterior surface, f(d)f(k)f(D|d,k)
Pos <- Pri * L

image(d.v, k.v, -Pos, las=1, col=heat.colors(50),
      main="Posterior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pos, nlev=10,
        drawlab=FALSE, add=TRUE)

The generated plots are shown below.

figure1
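Incidentally, the grid also gives a crude sense of the numerical integration of z mentioned above (a rough sketch only: the grid is truncated at d ≤ 0.3 and k ≤ 100, and the likelihood was rescaled by exp(-max(lnL)), so this approximates z only up to that constant factor):

# crude Riemann-sum approximation of the (rescaled) normalising
# constant over the plotting grid:
cell <- (d.v[2] - d.v[1]) * (k.v[2] - k.v[1])
sum(Pos) * cell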

In Part 2 we show an MCMC algorithm to sample from the posterior surface.

Part 2: Markov chain Monte Carlo (MCMC)

Tutorial: Bayesian MCMC using R

Bayesian MCMC methods have become incredibly popular in recent times as they allow the implementation of arbitrarily complex models for various statistical inference problems. In molecular phylogenetics, MCMC has been used to estimate species phylogenies, species divergence times, and species delimitation under the multi-species coalescent, among many other applications. However, many key concepts and issues of MCMC appear to be arcane to the average scientist. The purpose of this tutorial is thus to open the MCMC black box, and provide a step-by-step guide on how to write an MCMC program and explore the main concepts and issues involved.

This tutorial focuses on writing a Bayesian MCMC algorithm to estimate the molecular distance between two species (d) and the transition/transversion (ts/tv) ratio (k) under Kimura’s 1980 nucleotide substitution model. There are two unknown parameters (d and k), and the data are counts from a trinomial distribution (a multinomial with three categories). We assign gamma priors to d and k to estimate them. We illustrate concepts such as proposal densities, burn-in, convergence and mixing of the MCMC, autocorrelation and effective sample size. To follow this tutorial, knowledge of the K80 model of nucleotide substitution is useful (see Ziheng Yang’s 2014 book “Molecular Evolution: A Statistical Approach”, Oxford University Press) but not essential. We include all mathematical details necessary for this tutorial. The tutorial reproduces the analysis for the figures in our review paper:

A biologist’s guide to Bayesian phylogenetic analysis
Nascimento FF, dos Reis M and Yang Z (2017) Nature Ecology and Evolution, 1:1446–1454.

The tutorial is divided into 4 parts:

Part 1: Introduction
Part 2: Markov chain Monte Carlo (MCMC)
Part 3: Efficiency of the MCMC chain
Part 4: Checking for convergence

The complete R script can be downloaded from github.com/thednainus/Bayesian_tutorial.