Part 1: Introduction

Back to Preamble

In this part we introduce the data, and write the appropriate R code to calculate the likelihood, the prior and the (un-normalised) posterior distribution. To avoid numerical problems, we will work with the logarithm of the densities.

The data are a pairwise alignment of the 12S rRNA gene of human and orangutan. The alignment is 948 base pairs (bp) long and contains 90 differences (84 transitions and 6 transversions). See Table 1.3 on p. 7 of Yang (2014) “Molecular Evolution: A Statistical Approach” for details.

To represent the data, we use the following R code:

n <- 948 # length of alignment in bp
ns <- 84 # total number of transitions (23+30+10+21)
nv <- 6  # total number of transversions (1+0+2+0+2+1+0+0)

The log-likelihood function, \log f(D|d,k), for the data D = (n_s, n_v) under Kimura’s (1980) substitution model [see p. 8 of Yang (2014)] is

\log f(D|d,k) = (n - n_s - n_v) \log(p_0/4) + n_s \log(p_1/4) + n_v \log(p_2/4) ,

where

p_0 = 1/4 + 1/4 \times e^{-4d/(k+2)}  + 1/2 \times e^{-2d(k+1)/(k+2)} ,
p_1 = 1/4 + 1/4 \times e^{-4d/(k+2)}  - 1/2 \times e^{-2d(k+1)/(k+2)} ,
p_2 = 1/4 - 1/4 \times e^{-4d/(k+2)} .

Note that the likelihood depends on only two parameters, the distance d and the transition/transversion (ts/tv) ratio k. All the other variables are part of the data. The corresponding R code is

# Kimura's (1980) likelihood for two sequences
k80.lnL <- function(d, k, n=948, ns=84, nv=6){

  p0 <- .25 + .25 * exp(-4*d/(k+2)) + 
        .5 * exp(-2*d*(k+1)/(k+2))

  p1 <- .25 + .25 * exp(-4*d/(k+2)) - 
        .5 * exp(-2*d*(k+1)/(k+2))

  p2 <- .25 - .25 * exp(-4*d/(k+2))

  return ((n - ns - nv) * log(p0/4) +
          ns * log(p1/4) + nv * log(p2/4))
}

The posterior distribution is given by

f(d,k|D) = \frac{1}{z} f(d) f(k) f(D|d,k),

where f(d) and f(k) are the marginal priors on d and k, and z is the normalising constant, given by

z=\int\int f(d)f(k)f(D|d,k)\,\mathrm{d}d\,\mathrm{d}k.

We set the marginal priors to be gamma distributions: f(d) = \mathrm{Gamma}(d | 2, 20) and f(k) = \mathrm{Gamma}(k | 2, 0.1). The prior mean of d is 2/20 = 0.1, and the prior mean of k is 2/0.1 = 20. The prior densities reflect the biologist’s prior information about the model parameters.
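
As a quick sanity check of the (shape, rate) parameterisation used by R’s gamma functions, the prior means can be recovered by simulation. This check is ours and is not needed for the analysis:

# sanity check: means of Gamma(2, 20) and Gamma(2, 0.1)
mean(rgamma(1e6, 2, 20))   # should be close to 2/20 = 0.1
mean(rgamma(1e6, 2, 0.1))  # should be close to 2/0.1 = 20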

The double integral in z cannot be calculated analytically. It may be calculated numerically (for example by using Gaussian quadrature) but this is cumbersome. The dimension of integration in z is usually the same as the number of parameters in the model. Thus a model with three parameters will involve a triple integral, and so on. Numerical methods are notoriously slow and inaccurate for calculating integrals with more than two dimensions. In fact, the intractability of the integrals made the Bayesian method unsuitable for practical data analysis for a long time. The development of the MCMC algorithm (in which calculation of z is avoided), together with the increase in power of modern computers, has led to an explosion of practical applications of the Bayesian method during the past two decades.
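
For this two-parameter example a crude grid approximation of z is still feasible. The sketch below uses a simple Riemann sum rather than Gaussian quadrature, working on the log scale to avoid numerical underflow; the grid limits and spacing are arbitrary choices of ours:

# crude grid approximation of log(z) (illustration only)
d.g <- seq(1e-3, 1, len=500)
k.g <- seq(1e-3, 200, len=500)
g <- expand.grid(d=d.g, k=k.g)
lnf <- dgamma(g$d, 2, 20, log=TRUE) + dgamma(g$k, 2, .1, log=TRUE) +
       k80.lnL(g$d, g$k)
# subtract max(lnf) before exponentiating to avoid underflow
lnz <- max(lnf) +
       log(sum(exp(lnf - max(lnf))) * (d.g[2]-d.g[1]) * (k.g[2]-k.g[1]))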

We will ignore calculation of z here, and will work with the un-normalised posterior instead:

\log (f(d)\times f(k)\times f(D|d,k)).
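
As an aside, the un-normalised log-posterior can be wrapped in a single R function (the name ulnP below is our own choice; the function is not needed for the plots that follow):

# un-normalised log-posterior: log prior + log-likelihood
ulnP <- function(d, k) {
  dgamma(d, 2, 20, log=TRUE) + dgamma(k, 2, .1, log=TRUE) + k80.lnL(d, k)
}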

We can now plot the likelihood, prior, and posterior surfaces. First we set up a 100 ⨉ 100 grid of d and k points over which the surfaces will be plotted:

dim <- 100  # dimension for the plot
d.v <- seq(from=0, to=0.3, len=dim) # vector of d values
k.v <- seq(from=0, to=100, len=dim) # vector of k values
dk <- expand.grid(d=d.v, k=k.v)

par(mfrow=c(1, 3))

We now plot the three surfaces:

# Prior surface, f(d)f(k)
Pri <- matrix(dgamma(dk$d, 2, 20) * dgamma(dk$k, 2, .1),
       ncol=dim)

# note: -Pri is plotted so that high densities map to the red end
# of the heat.colors palette
image(d.v, k.v, -Pri, las=1, col=heat.colors(50),
      main="Prior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pri, nlev=10, drawlab=FALSE, add=TRUE)

# Likelihood surface, f(D|d,k)
lnL <- matrix(k80.lnL(d=dk$d, k=dk$k), ncol=dim)

# for numerical stability, we scale the likelihood so that it equals 1
# at its maximum, i.e. we subtract max(lnL) before exponentiating
L <- exp(lnL - max(lnL))  

image(d.v, k.v, -L, las=1, col=heat.colors(50),
      main="Likelihood", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, L, nlev=10,
        drawlab=FALSE, add=TRUE) # creates a contour plot

# Unscaled posterior surface, f(d)f(k)f(D|d,k)
Pos <- Pri * L

image(d.v, k.v, -Pos, las=1, col=heat.colors(50),
      main="Posterior", xlab="distance, d",
      ylab="kappa, k", cex.main=2.0,
      cex.lab=1.5, cex.axis=1.5)

contour(d.v, k.v, Pos, nlev=10,
        drawlab=FALSE, add=TRUE)
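
As a quick sanity check, the grid point at which the un-normalised posterior is highest gives a rough location of the posterior mode:

# grid point with the highest un-normalised posterior density
dk[which.max(Pos), ]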

The generated plots are shown below.

[Figure 1: prior, likelihood and posterior surfaces as functions of the distance, d, and the ts/tv ratio, k.]

In part 2 we show an MCMC algorithm to sample from the posterior surface.

Part 2: Markov chain Monte Carlo (MCMC)
