In this part we introduce the data, and write the appropriate R code to calculate the likelihood, the prior and the (un-normalised) posterior distribution. To avoid numerical problems, we will work with the logarithm of the densities.

The data are a pairwise alignment of the 12S rRNA gene of human and orangutang. The alignment has 948 base pairs (bp) and 90 differences (84 transitions and 6 transversions). See Table 1.3 in p.7 of Yang (2014) “Molecular Evolution: A Statistical Approach” for details.

To represent the data, we use the following R code:

n <- 948 # length of alignment in bp ns <- 84 # total number of transitions (23+30+10+21) nv <- 6 # total number of transversions (1+0+2+0+2+1+0+0)

The log-likelihood function, , for data , using Kimura’s (1980) substitution model [see p.8 in Yang (2014)] is

,

where

,

,

.

Note that the likelihood depends only on two parameters, the distance and the ts/tv ratio . All the other variables are part of the data. The corresponding R code is

# Kimura's (1980) likelihood for two sequences k80.lnL <- function(d, k, n=948, ns=84, nv=6){ p0 <- .25 + .25 * exp(-4*d/(k+2)) + .5 * exp(-2*d*(k+1)/(k+2)) p1 <- .25 + .25 * exp(-4*d/(k+2)) - .5 * exp(-2*d*(k+1)/(k+2)) p2 <- .25 - .25 * exp(-4*d/(k+2)) return ((n - ns - nv) * log(p0/4) + ns * log(p1/4) + nv * log(p2/4)) }

The posterior distribution is given by

,

where and are the marginal priors on and , and is the normalising constant, given by

.

We set the marginal priors to be gamma distributions: and . The prior mean of is 2/20 = 0.1, and the prior mean of is 2/0.1 = 20. The prior densities reflect the biologist’s prior information about the model parameters.

The double integral in cannot be calculated analytically. It may be calculated numerically (for example by using Gaussian quadrature) but this is cumbersome. The dimension of integration in is usually the same as the number of parameters in the model. Thus a model with three parameters will involve a triple integral, and so on. Numerical methods are notoriously slow and inaccurate for calculating integrals with more than two dimensions. In fact, the intractability of the integrals made the Bayesian method unsuitable for practical data analysis for a long time. The development of the MCMC algorithm (in which calculation of is avoided), together with the increase in power of modern computers, has led to an explosion of practical applications of the Bayesian method during the past two decades.

We will ignore calculation of here, and will work with the un-normalised posterior instead:

We can now plot the likelihood, prior, and posterior surfaces. First we set up a 100 ⨉ 100 grid of and points over which the surfaces will be plotted:

dim <- 100 # dimension for the plot d.v <- seq(from=0, to=0.3, len=dim) # vector of d values k.v <- seq(from=0, to=100, len=dim) # vector of k values dk <- expand.grid(d=d.v, k=k.v) par(mfrow=c(1, 3))

We now plot the three surfaces:

# Prior surface, f(D)f(k) Pri <- matrix(dgamma(dk$d, 2, 20) * dgamma(dk$k, 2, .1), ncol=dim) image(d.v, k.v, -Pri, las=1, col=heat.colors(50), main="Prior", xlab="distance, d", ylab="kappa, k", cex.main=2.0, cex.lab=1.5, cex.axis=1.5) contour(d.v, k.v, Pri, nlev=10, drawlab=FALSE, add=TRUE) # Likelihood surface, f(D|d,k) lnL <- matrix(k80.lnL(d=dk$d, k=dk$k), ncol=dim) # for numerical reasons, we scale the likelihood to be 1 # at the maximum, i.e. we subtract max(lnL) L <- exp(lnL - max(lnL)) image(d.v, k.v, -L, las=1, col=heat.colors(50), main="Likelihood", xlab="distance, d", ylab="kappa, k", cex.main=2.0, cex.lab=1.5, cex.axis=1.5) contour(d.v, k.v, L, nlev=10, drawlab=FALSE, add=TRUE) # creates a contour plot # Unscaled posterior surface, f(d)f(k)f(D|d,k) Pos <- Pri * L image(d.v, k.v, -Pos, las=1, col=heat.colors(50), main="Posterior", xlab="distance, d", ylab="kappa, k", cex.main=2.0, cex.lab=1.5, cex.axis=1.5) contour(d.v, k.v, Pos, nlev=10, drawlab=FALSE, add=TRUE)

The generated plots are shown below.

In part 2 we show an MCMC algorithm to sample from the posterior surface.

Pingback: Part 2: Markov chain Monte Carlo (MCMC) | The DNA in Us

Pingback: Tutorial: Bayesian MCMC phylogenetics using R | The DNA in Us