Benchmarking phylogenetic algorithms often required the availability of a dataset of true evolutionary histories. As these true histories are hardly known, we rely instead on simulated datasets. In this series of post, I will present simple algorithms for the simulation of phylogenetic trees depicting the evolutionary history of a gene family. This first part will focus on the simulation of species tree
under either a pure birth
or a birth-death
process.
Simulations of phylogenetic trees are often conducted under a constant rate of evolution (molecular clock). The molecular clock can later be relaxed to allow variation in rates accross branches. In fact, the algorithms presented here will only consider elapsed time. Therefore, branch lengths will only represent time (no substitution rate).
Ultrametric binary tree with a pure birth model (Yule process).
With a pure birth model, the rate of only one event (birth of two new nodes from a parent one) is required. This model can be used for the simulation of species trees where the birth rate can be seen as the speciation rate. From a time
This is a really simple and quite unrealistic model since it entirely ignore the dynamics of evolution. At any time
For a straightforward implementation (code below), we start with a pool of extant species of size one. After a waiting time which depends upon the speciation rate
- Total evolution time
- Total number of extant species (number of leaves)
See birth_only_tree at the end of the post for an implementation in python that use ETE and can be easily adapted to your need.
And an example of output is :
(((T1:0.523488,T2:0.523488)1:0.422493,(T3:0.79018,(T4:0.547927,T5:0.547927)1:0.242252)1:0.155801)1:0.294638,(T6:1.12469,T7:1.12469)1:0.115934);
As you can see, the simulated tree is perfectly ultrametric.
Simulating a tree in a birth-death model
In a birth-death model, lineage can go extinct. Therefore, a rate for death events should additionally be be considered. Similar to the Yule process, in a small interval of time one death
, one birth
or no event
. The birth-death model is one of the most used for the simulation of species trees. Implementation of this model (birth_death_tree ) follows a structure similar to the one above. The most important difference is to ensure that extinct species cannot be selected to give birth and should not have their branch lengths updated either.
And an example of output is :
(((T1:1.59667,T2:1.59667)1:0.63352,(T3:1.66694,T4:1.66694)1:0.563248)1:1.12677,(:1.27368,:0.239961)1:0.268062);
The lost nodes are shown in gray dashed line. You can see that the corresponding lineage are at a shorter distance from the root compared to extant species.
In the second part of this post, I will show how to simulate a gene family evolution with duplication, loss and transfer events, given a species tree.
Implementation
#!/usr/bin/env python | |
from ete3 import Tree | |
import random | |
def delete_single_child_internal(t): | |
"""Utility function that removes internal nodes | |
with a single child from tree""" | |
for node in t.traverse("postorder"): | |
if(not node.is_leaf() and len(node.get_children()) < 2): | |
node.delete() | |
if len(t.get_children()) == 1: | |
t = t.children[0] | |
t.up = None | |
def birth_only_tree(birth, nsize=10, max_time=None): | |
"""Generates a uniform-rate birth-only tree. | |
Arguments: | |
- ``birth`` : birth rate | |
- ``nsize`` : desired number of leaves | |
- ``max_time`` : maximum allowed time for evolution | |
""" | |
done = False | |
total_time = 0 | |
# create initial root node and set length to 0 | |
tree = Tree() | |
tree.dist = 0.0 | |
# check if a stopping condition is provided | |
if not (nsize or max_time): | |
raise ValueError('A stopping criterion is required') | |
while True: | |
# get the current list of extant species | |
leaf_nodes = tree.get_leaves() | |
# get required waiting time before next speciation | |
wtime = random.expovariate(len(leaf_nodes) / birth) | |
# check if stopping criterion is reached then | |
# stop loop if yes | |
if len(leaf_nodes) >= nsize or (max_time and total_time + wtime >= max_time): | |
done = True | |
# update the length of all current leaves | |
# while not exceeding the maximum evolution time | |
max_limited_time = min( | |
wtime, (max_time or total_time + wtime) - total_time) | |
for leaf in leaf_nodes: | |
leaf.dist += max_limited_time | |
if done: | |
break | |
# update total time of evolution | |
total_time += max_limited_time | |
# add a new node to a randomly chosen leaf. | |
node = random.choice(leaf_nodes) | |
c1 = Tree() | |
c2 = Tree() | |
node.add_child(c1) | |
node.add_child(c2) | |
c1.dist = 0.0 | |
c2.dist = 0.0 | |
# Label leaves here and also update | |
# the last branch lengths | |
leaf_nodes = tree.get_leaves() | |
leaf_compteur = 1 | |
for (ind, node) in enumerate(leaf_nodes): | |
node.name = 'T%d' % leaf_compteur | |
leaf_compteur += 1 | |
return tree | |
def birth_death_tree(birth, death, nsize=10, max_time=None, remlosses=True, r=True): | |
"""Generates a birth-death tree. | |
Arguments: | |
- ``birth`` : birth rate | |
- ``death`` : death rate | |
- ``nsize`` : desired number of leaves | |
- ``max_time`` : maximum time of evolution | |
- ``remlosses`` : whether lost leaves (extinct taxa) should be pruned from tree | |
- ``r`` : repeat until success | |
""" | |
# initialize tree with root node | |
tree = Tree() | |
tree.add_features(extinct=False) | |
tree.dist = 0.0 | |
done = False | |
# get current list of leaves | |
leaf_nodes = tree.get_leaves() | |
curr_num_leaves = len(leaf_nodes) | |
total_time = 0 | |
died = set([]) | |
# total event rate to compute waiting time | |
event_rate = float(birth + death) | |
while True: | |
# waiting time based on event_rate | |
wtime = random.expovariate(event_rate) | |
total_time += wtime | |
for leaf in leaf_nodes: | |
# extinct leaves cannot update their branches length | |
if not leaf.extinct: | |
leaf.dist += wtime | |
if curr_num_leaves >= nsize: | |
done = True | |
if done: | |
break | |
# if event occurs within time constraints | |
if max_time is None or total_time <= max_time: | |
# select node at random, then find chance it died or give birth | |
# (speciation) | |
node = random.choice(leaf_nodes) | |
eprob = random.random() | |
leaf_nodes.remove(node) | |
curr_num_leaves -= 1 | |
# birth event (speciation) | |
if eprob < birth / event_rate: | |
child1 = Tree() | |
child1.dist = 0 | |
child1.add_features(extinct=False) | |
child2 = Tree() | |
child2.dist = 0 | |
child2.add_features(extinct=False) | |
node.add_child(child1) | |
node.add_child(child2) | |
leaf_nodes.append(child1) | |
leaf_nodes.append(child2) | |
# update add two new leave | |
# (remember that parent was removed) | |
curr_num_leaves += 2 | |
else: | |
# death of the chosen node | |
if curr_num_leaves > 0: | |
node.extinct = True | |
died.add(node) | |
else: | |
if not r: | |
raise ValueError( | |
"All lineage went extinct, please retry") | |
# Restart the simulation because the tree has gone | |
# extinct | |
tree = Tree() | |
leaf_nodes = tree.get_leaves() | |
tree.add_features(extinct=False) | |
tree.dist = 0.0 | |
curr_num_leaves = 1 | |
died = set([]) | |
total_time = 0 | |
# this should always hold true | |
assert curr_num_leaves == len(leaf_nodes) | |
if remlosses: | |
# prune lost leaves from tree | |
leaves = set(tree.get_leaves()) - died | |
tree.prune(leaves) | |
# remove all non binary nodes | |
delete_single_child_internal(tree) | |
leaf_nodes = tree.get_leaves() | |
leaf_compteur = 1 | |
for ind, node in enumerate(leaf_nodes): | |
# label only extant leaves | |
if not node.extinct: | |
# node.dist += wtime | |
node.name = "T%d" % leaf_compteur | |
leaf_compteur += 1 | |
return tree | |
if __name__ == '__main__': | |
yuletree = birth_only_tree(1.0, nsize=10) | |
bdtree = birth_death_tree(1.0, 0.5, nsize=10) |
References
[1] Yule G., “A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S.”, Philosophical Transactions of the Royal Society of London, Series B, vol. 213, pp. 21–84, 1924