Benchmarking phylogenetic algorithms often required the availability of a dataset of true evolutionary histories. As these true histories are hardly known, we rely instead on simulated datasets. In this series of post, I will present simple algorithms for the simulation of phylogenetic trees depicting the evolutionary history of a gene family. This first part will focus on the simulation of species tree under either a pure birth or a birth-death process.
Simulations of phylogenetic trees are often conducted under a constant rate of evolution (molecular clock). The molecular clock can later be relaxed to allow variation in rates accross branches. In fact, the algorithms presented here will only consider elapsed time. Therefore, branch lengths will only represent time (no substitution rate).
Ultrametric binary tree with a pure birth model (Yule process).
With a pure birth model, the rate of only one event (birth of two new nodes from a parent one) is required. This model can be used for the simulation of species trees where the birth rate can be seen as the speciation rate. From a time \( t_0 \) corresponding to the starting time of evolution with a single ancestral species, new branches leading to new species are generated. This process is also referred to as the Yule process [1].
This is a really simple and quite unrealistic model since it entirely ignore the dynamics of evolution. At any time \( t_i \) there are \( n_i \) number of individuals in the population, and each individual is able to independently give birth to an offspring at constant rate \( \sigma \). At most one birth can occur within a time interval \( ]t_i, t_i + \Delta t[ \) and after a birth has happened, the parent and child evolve independently.
For a straightforward implementation (code below), we start with a pool of extant species of size one. After a waiting time which depends upon the speciation rate \( \sigma \), this single ancestral species is replaced by its two children in the pool of extant species. The process is then repeated several times, with a random extant species, until a stopping criterion is reached. The most common stopping criteria are :
- Total evolution time
- Total number of extant species (number of leaves)
See birth_only_tree at the end of the post for an implementation in python that use ETE and can be easily adapted to your need.
And an example of output is :
(((T1:0.523488,T2:0.523488)1:0.422493,(T3:0.79018,(T4:0.547927,T5:0.547927)1:0.242252)1:0.155801)1:0.294638,(T6:1.12469,T7:1.12469)1:0.115934);
As you can see, the simulated tree is perfectly ultrametric.
Simulating a tree in a birth-death model
In a birth-death model, lineage can go extinct. Therefore, a rate for death events should additionally be be considered. Similar to the Yule process, in a small interval of time \( ]t_i, t_i + \Delta t[ \) , only one of the three following transitions are possible: one death, one birth or no event. The birth-death model is one of the most used for the simulation of species trees. Implementation of this model (birth_death_tree ) follows a structure similar to the one above. The most important difference is to ensure that extinct species cannot be selected to give birth and should not have their branch lengths updated either.
And an example of output is :
(((T1:1.59667,T2:1.59667)1:0.63352,(T3:1.66694,T4:1.66694)1:0.563248)1:1.12677,(:1.27368,:0.239961)1:0.268062);
The lost nodes are shown in gray dashed line. You can see that the corresponding lineage are at a shorter distance from the root compared to extant species.
In the second part of this post, I will show how to simulate a gene family evolution with duplication, loss and transfer events, given a species tree.
Implementation
References
[1] Yule G., “A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S.”, Philosophical Transactions of the Royal Society of London, Series B, vol. 213, pp. 21–84, 1924
