Models of DNA evolution


A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of likelihood of a tree and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences.

Introduction

These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions.
Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states. If we are given a starting state at one position, the model's Q matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate-matrix to probability matrix are described in the mathematics of substitution models section of the substitution model page. By expressing models in terms of the instantaneous rates of change we can avoid estimating a large numbers of parameters for each branch on a phylogenetic tree.
The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites.

DNA evolution as a continuous-time Markov chain

Continuous-time Markov chains

Continuous-time Markov chains have the usual transition matrices
which are, in addition, parameterized by time,. Specifically, if are the states, then the transition matrix
Example: We would like to model the substitution process in DNA sequences in a continuous-time fashion. The corresponding transition matrices will look like:
where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities and the top-right and bottom-left 2 × 2 blocks corresponds to transversion probabilities.
Assumption: If at some time, the Markov chain is in state, then the probability that at time, it will be in state depends only upon, and. This then allows us to write that probability as.
Theorem: Continuous-time transition matrices satisfy:
Note: There is here a possible confusion between two meanings of the word transition. In the context of Markov chains, transition is the general term for the change between two states. In the context of nucleotide changes in DNA sequences, transition is a specific term for the exchange between either the two purines or the two pyrimidines . By contrast, an exchange between one purine and one pyrimidine is called a transversion.

Deriving the dynamics of substitution

Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian independent, identically distributed and that the process is constant over time. For a particular site, let
be the set of possible states for the site, and
their respective probabilities at time. For two distinct, let be the transition rate from state to state. Similarly, for any, let the total rate of change from be
The changes in the probability distribution for small increments of time are given by
In other words,, the frequency of 's at time is equal to the frequency at time minus the frequency of the lost 's plus the frequency of the newly created 's.
Similarly for the probabilities, and. These equations can be written compactly as
where
is known as the rate matrix. Note that, by definition, the sum of the entries in each row of is equal to zero. It follows that
For a stationary process, where does not depend on time t, this differential equation can be solved. First,
where denotes the exponential of the matrix. As a result,

Ergodicity

If the Markov chain is irreducible, i.e. if it is always possible to go from a state to a state , then it is also ergodic. As a result, it has a unique stationary distribution, where corresponds to the proportion of time spent in state after the Markov chain has run for an infinite amount of time. In DNA evolution, under the assumption of a common process for each site, the stationary frequencies correspond to equilibrium base compositions. Indeed, note that since the stationary distribution satisfies, we see that when the current distribution is the stationary distribution we have
In other words, the frequencies of do not change.

Time reversibility

Definition: A stationary Markov process is time reversible if the amount of change from state to is equal to the amount of change from to,. This means that:
Not all stationary processes are reversible, however, most commonly used DNA evolution models assume time reversibility, which is considered to be a reasonable assumption.
Under the time reversibility assumption, let, then it is easy to see that:
Definition The symmetric term is called the exchangeability between states and. In other words, is the fraction of the frequency of state that is the result of transitions from state to state.
Corollary The 12 off-diagonal entries of the rate matrix, can be completely determined by 9 numbers; these are: 6 exchangeability terms and 3 stationary frequencies,.

Scaling of branch lengths

By comparing extant sequences, one can determine the amount of sequence divergence. This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences. The simple count of differences between sequences will often underestimate the number of substitution because of multiple hits. Trying to estimate the exact number of changes that have occurred is difficult, and usually not necessary. Instead, branch lengths in phylogenetic analyses are usually expressed in the expected number of changes per site. The path length is the product of the duration of the path in time and the mean rate of substitutions. While their product can be estimated, the rate and time are not identifiable from sequence divergence.
The descriptions of rate matrices on this page accurately reflect the relative magnitude of different substitutions, but these rate matrices are not scaled such that a branch length of 1 yields one expected change. This scaling can be accomplished by multiplying every element of the matrix by the same factor, or simply by scaling the branch lengths. If we use the β to denote the scaling factor, and ν to denote the branch length measured in the expected number of substitutions per site then βν is used in the transition probability formulae below in place of μt. Note that ν is a parameter to be estimated from data, and is referred to as the branch length, while β is simply a number that can be calculated from the rate matrix.
The value of β can be found by forcing the expected rate of flux of states to 1. The diagonal entries of the rate-matrix represent -1 times the rate of leaving each state. For [|time-reversible] models, we know the equilibrium state frequencies. Thus we can find the expected rate of change by calculating the sum of flux out of each state weighted by the proportion of sites that are expected to be in that class. Setting β to be the reciprocal of this sum will guarantee that scaled process has an expected flux of 1:
For example, in the Jukes–Cantor, the scaling factor would be 4/ because the rate of leaving each state is 3μ/4.

Most common models of DNA evolution

JC69 model (Jukes and Cantor 1969)

JC69, the Jukes and Cantor 1969 model, is the simplest substitution model. There are several assumptions. It assumes equal base frequencies and equal mutation rates. The only parameter of this model is therefore, the overall substitution rate. As previously mentioned, this variable becomes a constant when we normalize the mean-rate to 1.
When branch length,, is measured in the expected number of changes per site then:
It is worth noticing that what stands for sum of any column of matrix multiplied by time and thus means expected number of substitutions in time for each particular site when the rate of substitution equals.
Given the proportion of sites that differ between the two sequences the Jukes–Cantor estimate of the evolutionary distance between two sequences is given by
The in this formula is frequently referred to as the -distance. It is a sufficient statistic for calculating the Jukes–Cantor distance correction, but is not sufficient for the calculation of the evolutionary distance under the more complex models that follow.