Posts How to simulate a phylogenetic tree ? (part 1)
Post
Cancel

How to simulate a phylogenetic tree ? (part 1)

Benchmarking phylogenetic algorithms often required the availability of a dataset of true evolutionary histories. As these true histories are hardly known, we rely instead on simulated datasets. In this series of post, I will present simple algorithms for the simulation of phylogenetic trees depicting the evolutionary history of a gene family. This first part will focus on the simulation of species tree under either a pure birth or a birth-death process.

Simulations of phylogenetic trees are often conducted under a constant rate of evolution (molecular clock). The molecular clock can later be relaxed to allow variation in rates accross branches. In fact, the algorithms presented here will only consider elapsed time. Therefore, branch lengths will only represent time (no substitution rate).

Ultrametric binary tree with a pure birth model (Yule process).

With a pure birth model, the rate of only one event (birth of two new nodes from a parent one) is required. This model can be used for the simulation of species trees where the birth rate can be seen as the speciation rate. From a time t0 corresponding to the starting time of evolution with a single ancestral species, new branches leading to new species are generated. This process is also referred to as the Yule process [1].

This is a really simple and quite unrealistic model since it entirely ignore the dynamics of evolution. At any time ti there are ni number of individuals in the population, and each individual is able to independently give birth to an offspring at constant rate σ. At most one birth can occur within a time interval ]ti,ti+Δt[ and after a birth has happened, the parent and child evolve independently.

For a straightforward implementation (code below), we start with a pool of extant species of size one. After a waiting time which depends upon the speciation rate σ, this single ancestral species is replaced by its two children in the pool of extant species. The process is then repeated several times, with a random extant species, until a stopping criterion is reached. The most common stopping criteria are :

  • Total evolution time
  • Total number of extant species (number of leaves)

See birth_only_tree at the end of the post for an implementation in python that use ETE and can be easily adapted to your need.

And an example of output is :

(((T1:0.523488,T2:0.523488)1:0.422493,(T3:0.79018,(T4:0.547927,T5:0.547927)1:0.242252)1:0.155801)1:0.294638,(T6:1.12469,T7:1.12469)1:0.115934);

Pure Yule tree

As you can see, the simulated tree is perfectly ultrametric.

Simulating a tree in a birth-death model

In a birth-death model, lineage can go extinct. Therefore, a rate for death events should additionally be be considered. Similar to the Yule process, in a small interval of time ]ti,ti+Δt[ , only one of the three following transitions are possible: one death, one birth or no event. The birth-death model is one of the most used for the simulation of species trees. Implementation of this model (birth_death_tree ) follows a structure similar to the one above. The most important difference is to ensure that extinct species cannot be selected to give birth and should not have their branch lengths updated either.

And an example of output is :

(((T1:1.59667,T2:1.59667)1:0.63352,(T3:1.66694,T4:1.66694)1:0.563248)1:1.12677,(:1.27368,:0.239961)1:0.268062);

Birth-death tree

The lost nodes are shown in gray dashed line. You can see that the corresponding lineage are at a shorter distance from the root compared to extant species.

In the second part of this post, I will show how to simulate a gene family evolution with duplication, loss and transfer events, given a species tree.

Implementation

#!/usr/bin/env python
from ete3 import Tree
import random
def delete_single_child_internal(t):
"""Utility function that removes internal nodes
with a single child from tree"""
for node in t.traverse("postorder"):
if(not node.is_leaf() and len(node.get_children()) < 2):
node.delete()
if len(t.get_children()) == 1:
t = t.children[0]
t.up = None
def birth_only_tree(birth, nsize=10, max_time=None):
"""Generates a uniform-rate birth-only tree.
Arguments:
- ``birth`` : birth rate
- ``nsize`` : desired number of leaves
- ``max_time`` : maximum allowed time for evolution
"""
done = False
total_time = 0
# create initial root node and set length to 0
tree = Tree()
tree.dist = 0.0
# check if a stopping condition is provided
if not (nsize or max_time):
raise ValueError('A stopping criterion is required')
while True:
# get the current list of extant species
leaf_nodes = tree.get_leaves()
# get required waiting time before next speciation
wtime = random.expovariate(len(leaf_nodes) / birth)
# check if stopping criterion is reached then
# stop loop if yes
if len(leaf_nodes) >= nsize or (max_time and total_time + wtime >= max_time):
done = True
# update the length of all current leaves
# while not exceeding the maximum evolution time
max_limited_time = min(
wtime, (max_time or total_time + wtime) - total_time)
for leaf in leaf_nodes:
leaf.dist += max_limited_time
if done:
break
# update total time of evolution
total_time += max_limited_time
# add a new node to a randomly chosen leaf.
node = random.choice(leaf_nodes)
c1 = Tree()
c2 = Tree()
node.add_child(c1)
node.add_child(c2)
c1.dist = 0.0
c2.dist = 0.0
# Label leaves here and also update
# the last branch lengths
leaf_nodes = tree.get_leaves()
leaf_compteur = 1
for (ind, node) in enumerate(leaf_nodes):
node.name = 'T%d' % leaf_compteur
leaf_compteur += 1
return tree
def birth_death_tree(birth, death, nsize=10, max_time=None, remlosses=True, r=True):
"""Generates a birth-death tree.
Arguments:
- ``birth`` : birth rate
- ``death`` : death rate
- ``nsize`` : desired number of leaves
- ``max_time`` : maximum time of evolution
- ``remlosses`` : whether lost leaves (extinct taxa) should be pruned from tree
- ``r`` : repeat until success
"""
# initialize tree with root node
tree = Tree()
tree.add_features(extinct=False)
tree.dist = 0.0
done = False
# get current list of leaves
leaf_nodes = tree.get_leaves()
curr_num_leaves = len(leaf_nodes)
total_time = 0
died = set([])
# total event rate to compute waiting time
event_rate = float(birth + death)
while True:
# waiting time based on event_rate
wtime = random.expovariate(event_rate)
total_time += wtime
for leaf in leaf_nodes:
# extinct leaves cannot update their branches length
if not leaf.extinct:
leaf.dist += wtime
if curr_num_leaves >= nsize:
done = True
if done:
break
# if event occurs within time constraints
if max_time is None or total_time <= max_time:
# select node at random, then find chance it died or give birth
# (speciation)
node = random.choice(leaf_nodes)
eprob = random.random()
leaf_nodes.remove(node)
curr_num_leaves -= 1
# birth event (speciation)
if eprob < birth / event_rate:
child1 = Tree()
child1.dist = 0
child1.add_features(extinct=False)
child2 = Tree()
child2.dist = 0
child2.add_features(extinct=False)
node.add_child(child1)
node.add_child(child2)
leaf_nodes.append(child1)
leaf_nodes.append(child2)
# update add two new leave
# (remember that parent was removed)
curr_num_leaves += 2
else:
# death of the chosen node
if curr_num_leaves > 0:
node.extinct = True
died.add(node)
else:
if not r:
raise ValueError(
"All lineage went extinct, please retry")
# Restart the simulation because the tree has gone
# extinct
tree = Tree()
leaf_nodes = tree.get_leaves()
tree.add_features(extinct=False)
tree.dist = 0.0
curr_num_leaves = 1
died = set([])
total_time = 0
# this should always hold true
assert curr_num_leaves == len(leaf_nodes)
if remlosses:
# prune lost leaves from tree
leaves = set(tree.get_leaves()) - died
tree.prune(leaves)
# remove all non binary nodes
delete_single_child_internal(tree)
leaf_nodes = tree.get_leaves()
leaf_compteur = 1
for ind, node in enumerate(leaf_nodes):
# label only extant leaves
if not node.extinct:
# node.dist += wtime
node.name = "T%d" % leaf_compteur
leaf_compteur += 1
return tree
if __name__ == '__main__':
yuletree = birth_only_tree(1.0, nsize=10)
bdtree = birth_death_tree(1.0, 0.5, nsize=10)
view raw tree_simul.py hosted with ❤ by GitHub

References

[1] Yule G., “A mathematical theory of evolution, based on the conclusions of Dr. J.C. Willis, F.R.S.”, Philosophical Transactions of the Royal Society of London, Series B, vol. 213, pp. 21–84, 1924

This post is licensed under CC BY 4.0 by the author.