（ウイルスゲノム組換えのベイズ推定： DNA部 部分 配 列 の 間 の ト ポ ロ ジ ー 距 離 と そ の 分 布 ）

Leonardo de Oliveira Martins

レオナルドドリベラマルチン

Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Agricultural and Environmental Biology of the Graduate School of Agricultural and Life Sciences of the The University of Tokyo

October 2008

Leonardo de Oliveira Martins Academic Advisor: Hirohisa Kishino

This thesis is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License (http://creativecommons.org/licenses/by-sa/3.0/).

Abstract Recombinant DNA sequences can not be represented by a single topology since recombinant segments support distinct evolutionary histories. Existing methods for recombination detection can handle only a limited number of taxa, constraining the recombination analysis to cases where the phylogeny can be assumed known for the parental sequences. If the analysis is conducted independently for each putative recombinant sequence, potential recombinations between them are neglected. We newly developed a distance measure between unrooted topologies that closely resembles the number of recombinations. By introducing a prior distribution on these recombination distances, a Bayesian hierarchical model was devised to detect phylogenetic inconsistencies due to recombinations. On simulated datasets, our method correctly detected recombination break-points and the number of recombination events for each break-point. The procedure is robust to rate and transition:transversion heterogeneities for simulations with and without recombination. Applying the procedure to a genomic HIV-1 dataset, we found evidence for recombination hot-spots and intra-subtype recombination.

習うより慣れろ

Acknowledgements

These have been interesting five long years in Japan where I met many people who became part of me, without forgetting those I brought in memory with me. I would like to thank the friendship of Max, Jay, Edson and Issamu from the dorm, and Honda and Mahendra from the lab, which I know will last for long. Having good friends like them made even my worst days not only bearable but amusing, and thaught me many things beyond Science. Our paths wil cross again. My time here also gave me the opportunity to meet real professors, who by example helped me understanding how great Science is done in Japan. Among them I must not forget Professor Jun-Ichi Miyazaki from the Univeristy of Yamanashi, Professors Teruaki Watabe and Yasuhiro Kitazoe from Kochi University, Professor Kazuyuki Tanabe from Osaka University and Professor Masami Hasegawa from the Institute of Statistical Mathematics. I also acknowledge the efforts of the members of the review board, namely Professor Yukio Shirako, Professor Toru Shimada, Professor Yasuhiro Omori and Professor Yasushi Takano, who improved my thesis with diligence and respect. Professor Takano, together with Professors Hiroshi Omori and Tae-Kun Seo followed every step of my research, sharing the same lab, and in the process taught me the noblest meaning of the word senpai (先 先輩). お 世 話 に な り ま し た . Professors Peter Waddell and Ziheng Yang, by visiting our lab and providing me with uncountable pleasant discussions had a fundamental importance in my work. The one who made these friendships possible, my learning enjoyable and my research rewarding is Professor Hirohisa Kishino, my mentor. It is very easy to say good things about him, anyone who ever met him realized to be before a great man. But I prefer to dedicate to him my future work: I believe that the only way to properly appreciate his accomplishments is to work harder, disseminating the clarity he sparkles in us. こ れ か ら も 宜 し く お 願 い 致 し ま す . Our lab also have an angel, Sasaki-san, who appears to dissipate all the problems Science is yet to find a solution. It is a pleasure, a privilege and an honour to share my days with my beloved wife, Reiko, to whom I thank for the unconditional support and immovable strength in making me a better man. If I was satisfied with my freedom, in being enslaved by my will I am complete.

Contents 1

2

3

4

Introduction

8

1.1

Dendrograms, topologies, trees . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9

1.2

Recombination detection methods . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.2.1

Sliding window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.2.2

Bayesian model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

14

1.3

Recombination hot-spots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

16

1.4

The coalescent approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

Recombination in Retroviruses

20

2.1

The mechanism of strand transfer in HIV-1 . . . . . . . . . . . . . . . . . . . . .

21

2.2

Observable recombinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.3

HIV-1 subtypes and Circulation Recombinant Forms (CRFs) . . . . . . . . . . .

24

2.4

HIV-1 recombination in South America . . . . . . . . . . . . . . . . . . . . . . .

28

Recombination Distance Between Topologies

34

3.1

Recombination distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

37

3.2

Approximation to dSPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

3.3

Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

Bayesian Methods and Phylogenetics

51

4.1

Markov chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

4.1.1

Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

4.1.2

Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

4.1.3

Metropolis-Hastings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

55

Bayesian phylogenetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

56

4.2

3

4.3

4.4

5

7

56

4.3.1

Likelihood of one site . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

4.3.2

Marginalizing over individual branch lengths . . . . . . . . . . . . . . .

59

4.3.3

Segmentation model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

60

Sampling from the posterior distribution . . . . . . . . . . . . . . . . . . . . . .

63

4.4.1

Update of continuous variables . . . . . . . . . . . . . . . . . . . . . . . .

65

4.4.2

Update of topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

4.4.3

Improving convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . .

68

4.4.4

Sampling stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

69

Results

73

5.1

Convergence diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

73

5.1.1

Basic statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

74

5.1.2

Autocorrelation function . . . . . . . . . . . . . . . . . . . . . . . . . . . .

75

5.1.3

Scaled regeneration quantiles . . . . . . . . . . . . . . . . . . . . . . . . .

77

5.1.4

Potential scale reduction factor . . . . . . . . . . . . . . . . . . . . . . . .

79

5.2

Summarizing the distribution of topologies . . . . . . . . . . . . . . . . . . . . .

80

5.3

Recombination detection on simulated sequences . . . . . . . . . . . . . . . . .

82

5.3.1

Simulation scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

5.3.2

Quality of posterior distributions . . . . . . . . . . . . . . . . . . . . . . .

88

5.3.3

Sensitivity and robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.4 6

Bayesian model for recombination . . . . . . . . . . . . . . . . . . . . . . . . . .

Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Detection of hot-spots in HIV-1

109

6.1

The dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

6.2

MCMC parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.3

Posterior distribution of SPR distances . . . . . . . . . . . . . . . . . . . . . . . . 111

Conclusions

120

List of Figures 1.1

Examples of graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

10

1.2

Rooting a tree by outgroup inclusion . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.3

Example of dendrogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

1.4

Example of rip output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15

1.5

Procedure of DualBrothers program . . . . . . . . . . . . . . . . . . . . . . .

16

1.6

Example of dualbrothers output . . . . . . . . . . . . . . . . . . . . . . . . .

17

1.7

Example of coalescent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

2.1

HIV-1 genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

2.2

Structure of HIV-1 virion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

21

2.3

Reverse transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

23

2.4

Replication cycle of HIV-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

25

2.5

Single ancestor hypothesis for CRFs . . . . . . . . . . . . . . . . . . . . . . . . .

26

2.6

HIV-1 Circulating Recombinant Forms (CRFs) . . . . . . . . . . . . . . . . . . .

27

2.7

HIV-1 subtype distribution

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

28

2.8

Alternative hypotheses for CRFs . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

2.9

Possible quartet topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

30

2.10 DualBrothers HIV-1 genomic analysis . . . . . . . . . . . . . . . . . . . . . . .

31

2.11 Hierarchical clustering for all analyzed putative recombinant sequences . . . .

32

2.12 Hierarchical clustering of elected putative recombinant sequences . . . . . . . .

33

3.1

Example of phylogenetic incongruence . . . . . . . . . . . . . . . . . . . . . . .

35

3.2

Recombination example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

3.3

Subtree Prune-Regraft operation . . . . . . . . . . . . . . . . . . . . . . . . . . .

38

3.4

Bipartitions induced by a topology . . . . . . . . . . . . . . . . . . . . . . . . . .

41

5

3.5

Failure of current distances on estimating the number of recombinations . . . .

43

3.6

Leaf compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

45

3.7

Edge mapping used in SPR calculation . . . . . . . . . . . . . . . . . . . . . . . .

46

3.8

SPR simulation cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48

3.9

Density plot of recombination distance measures . . . . . . . . . . . . . . . . . .

49

3.10 Density plot of overall distance measures performance . . . . . . . . . . . . . .

50

4.1

Bayesian posterior distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

4.2

Example of rejection sampling technique . . . . . . . . . . . . . . . . . . . . . .

54

4.3

DNA alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

57

4.4

Segmentation of DNA alignment . . . . . . . . . . . . . . . . . . . . . . . . . . .

61

4.5

Model parameters

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

61

4.6

Comparison between cold and heated chains . . . . . . . . . . . . . . . . . . . .

64

4.7

Addition, removal and shift of a break-point . . . . . . . . . . . . . . . . . . . .

67

4.8

Addition and removal of break-point within a mini-sampler . . . . . . . . . . .

70

4.9

Input file for biomc2 software . . . . . . . . . . . . . . . . . . . . . . . . . . . .

71

4.10 Output of biomc2 software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

72

5.1

Example of the autocorrelation function . . . . . . . . . . . . . . . . . . . . . . .

76

5.2

Example of Scaled Regeneration Quantile plot . . . . . . . . . . . . . . . . . . .

78

5.3

Trees used in simulated alignments for 8 taxa . . . . . . . . . . . . . . . . . . . .

85

5.4

Trees used in simulated alignments for 12 taxa . . . . . . . . . . . . . . . . . . .

86

5.5

Independent analysis on 12 taxa . . . . . . . . . . . . . . . . . . . . . . . . . . .

87

5.6

Trees used in simulated alignments for 16 taxa . . . . . . . . . . . . . . . . . . .

88

5.7

Bayesian estimated distance for one dataset with 8 sequences . . . . . . . . . .

89

5.8

Summary of the distribution of topologies for a dataset on 8 sequences . . . . .

90

5.9

Bayesian estimated distance for one dataset with 12 sequences . . . . . . . . . .

91

5.10 Summary of the distribution of topologies for a dataset of 12 sequences . . . . .

92

5.11 Bayesian estimated distance for one dataset with 16 sequences . . . . . . . . . .

93

5.12 Summary of the distribution of topologies for a dataset on 16 sequences . . . .

95

5.13 Distribution of the modal number of recombinations and break-points . . . . .

96

5.14 Distribution of average distances over 100 replicates for 8 taxa . . . . . . . . . .

97

5.15 Distribution of average distances over 100 replicates for 12 taxa . . . . . . . . .

98

5.16 Distribution of average distances over 400 replicates for 16 taxa . . . . . . . . .

99

5.17 Topology reconstruction accuracy for simulations of 8 sequences . . . . . . . . . 100 5.18 Topology reconstruction accuracy for simulations of 12 sequences . . . . . . . . 101 5.19 Topology reconstruction accuracy for simulations of 16 sequences . . . . . . . . 101 5.20 Summary of the distribution of topologies for a dataset with no recombination

102

5.21 Posterior distribution of average substitution rate for a dataset with rate heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 5.22 Scale reduction factor of log [ P(θ | X )] . . . . . . . . . . . . . . . . . . . . . . . . 104 5.23 Scale reduction factor of the number of recombinations . . . . . . . . . . . . . . 105 5.24 SRQ plot for number of recombinations . . . . . . . . . . . . . . . . . . . . . . . 106 5.25 SRQ analyses for break-point locations . . . . . . . . . . . . . . . . . . . . . . . . 107 5.26 SRQ analyses for initial topologies . . . . . . . . . . . . . . . . . . . . . . . . . . 108 6.1

distribution of topology distances for HIV-1 dataset . . . . . . . . . . . . . . . . 112

6.2

Posterior distribution of the number of recombinations and break-points for HIV-1 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.3

MAP topologies for HIV-1 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 115

6.4

MAP topologies for HIV-1 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.5

Summary of the distribution of topologies for HIV-1 dataset . . . . . . . . . . . 117

6.6

Autocorrelation analysis for HIV-1 samples . . . . . . . . . . . . . . . . . . . . . 118

6.7

Potential scale reduction factor for HIV-1 dataset sampling . . . . . . . . . . . . 119

Chapter 1 Introduction Genomic regions participating recombination events will support distinct topologies, and phylogenetic analyses should incorporate this heterogeneity. Recombination can be therefore detected by comparing inconsistency in topologies between adjacent segments, taking account of uncertainty in the phylogenetic inference [1, 2]. Many techniques are based on sliding window procedures that compare the topology of one segment against neighbouring segments or the whole alignment. This comparison may be based on the phenetic distance [3], likelihood [4, 5] or posterior distribution [6] of the topologies. These methods are sensitive to ancestral recombination events and moderate contribution of recombination [2]. Variation in the selective pressure should be considered when estimating recombination events, since it may also lead to conflicting spatial phylogenetic signal. The most accurate detection methods [2] thus incorporate this variation, and use Hidden Markov Models [7, 8] or explicit topology and rate heterogeneity modelling [9] under a Bayesian approach. These procedures can not reliably estimate the history of recombination events when the number of taxa becomes moderate, due to the large degree of freedom on topologies. All current methods depend, in varying extent, on prior knowledge of the parental sequences - sequences whose ancestral relationships are assumed known - sometimes restricting the analyses to one putative recombinant. They also fail to provide a quantitative measure of recombination between segments, essential to address the existence and importance of recombination hot-spots. We propose an algorithm to calculate the minimum number of recombinations required for resolving the difference between two topologies, managing the degree of freedom on

8

1.1. DENDROGRAMS, TOPOLOGIES, TREES

the topologies to an appropriate extent. This distance does not make any assumption about parental or putative sequences and does not depend on rooting of the phylogenies. Our hierarchical Bayes model is based on dividing the alignment columns into arbitrarily small segments, which incorporates the prior distribution on distances between topologies of neighboring segments. By extracting topological inconsistencies which are directly related with recombinations it becomes feasible, then, to obtain well resolved topologies even for short segments, since these topologies are constrained by neighboring topologies. At the same time the posterior distribution of a distance between segments can be directly interpreted as the number of recombinations between these segments.

1.1

Dendrograms, topologies, trees

In Graph Theory a tree is a connected acyclic graph [10]. A graph is a finite nonempty set of nodes and a collection of unordered pairs of nodes called edges, which connects them. A tree is a connected graph since there is a path (a set of edges) between every pair of nodes. It is acyclic since this path and the nodes it traverses are unique for every pair of nodes. We shall assume further that there is at most one edge connecting any pair of nodes. If we define then the degree of a node as being the number of nodes connected to it by an edge, a phylogenetic tree is a tree with at most one node with degree 2. Phylogenetic trees are usually weighted graphs, in the sense that there is a value associated to each edge. Edges are commonly called branches in phylogenetic context, and these weights are the branch lengths, representing the (evolutionary) distance between nodes connected by them. Furthermore in a phylogenetic tree nodes of degree 1 (leaves or tips) have a special meaning, and represent the extant data (genes or species). Only the leaf labels are important since they contain information about the genes or taxa, and should be unique. The other nodes are called internal nodes. A directed graph is a graph where the edges represent ordered node pairs. A rooted phylogenetic tree is a directed tree. In a rooted tree there is an ancestrality relation between nodes, and the node ancestral to all other nodes is called the root - which can have degree 2. If we can not define which is the common ancestor then we have an unrooted phylogenetic tree. When all internal nodes have degree at most 3 the tree is said to be bifurcating, or binary, otherwise it is multifurcating, or polytomic. In Figure 1.1 we have examples of the forementioned structures. Every edge splits a tree into two subtrees, which are naturally rooted at the nodes 9

1.1. DENDROGRAMS, TOPOLOGIES, TREES

connected by the edge. We shall use the words tree and topology interchangeably, but when we refer to topologies it is implicit that we do not take the branch lengths into account. n1

n2 e4

n1

e1 e2

e3

n3

n4 e5

e6

e1 e2

n2

n3

e3

sp3

root

root

n4

sp4

e5

e4 n2

e4

e1

e6

e1 e2

e2

e1

n6

n5

n6

n1

n5

n1

n2

n2 n1

e7

n7

(a) connected cyclic graph

e3 e4

e7

n7

(b) connected acyclic graph

sp1

sp2

e5 e6 sp3

sp4

e3 e4

e7 sp5

(c) multifurcating rooted phylogenetic tree

sp1

sp2

e5 sp3

e6 sp4

(d) bifurcating rooted phylogenetic tree

e2

e3

sp1

sp2

(e) bifurcating unrooted phylogenetic tree

Figure 1.1: Examples of different types of graphs. Nodes are shown as circles and edges are represented by lines connecting the nodes. For phylogenetic trees the extant data are represented in gray. It is appealing to work with rooted topologies, since they bear a direct connection with time. For the case of recombination, we could be able to detect when, in the past, the recombination event took place. But a rooted topology depends on the placement of the root node, and for this usually we need to define a taxon (or group of taxa) which form the outgroup. By defining the outgroup and, by exclusion, the ingroup, the root node will be found in the edge connecting the subtrees spanned by them (Figure 1.2). Recombination introduces variation into the topological structure, in a way that will become clear in Chapter 3. With this variation the subtrees may loose meaning, and so the assumption of an outgroup may obstruct the detection of recombination. Polytomic trees indicate uncertainty about the relations between nodes, but are somewhat artificial in a biological context. They also increase the number of possible topologies for a given number of taxa, and can be replaced by their binary counterparts without loss of generality in the Bayesian context. And more importantly, multifurcating topologies may not have information about the recombination events. So we will be interested only in binary unrooted trees in the current analysis. Dendrograms are also trees in a graph-theoretical sense, but shall not be confused with phylogenetic trees since they have no relation with ancestral relationships between taxa a priori. There are exceptions, like the UPGMA method [11], but we will refer to dendrograms

10

1.1. DENDROGRAMS, TOPOLOGIES, TREES

root

root

n0

sp1

n0

n1

n2

sp2

sp3

n1

sp4

out

sp1

sp2

n2

sp3

n3

sp4

out1

out2

Figure 1.2: How to find the root of a tree based on outgroup information. Supposing we want to find the root of the topology depicted in 1.1e, and we know that the leaves labeled out (for single outgroup) or out1 and out2 are the most distantly related, the root placement is defined by the resulting topology. The root may be the node labeled root or n0, depending on the inclusion of the outgroup data into the analysis. when we want to make clear phylogenetic conclusions should not be drawn. A dendrogram is a two-dimensional diagram representing the fusions or divisions made at each stage of a Hierarchical Clustering [12], and their graphical representation is identical to rooted trees. Hierarchical clustering is a classification technique where a set of objects are iteratively merged or splitted based on dissimilarities between them. For example, suppose we have a set of objects 1, 2, 3, 4, 5 and we can represent the dissimiarity (distance) between them in matrix form: 1 1

2

3

4

5

0

2 2 D = 3 3 4 4 5 7

0 5

0

7

8

0

8

7

6

0

Then an agglomerative clustering of these objects can be conducted by merging the most similar groups (clusters) into a new group, starting with each object belonging to a distinct group. Agglomerative methods differ in how to define distance between groups. In our example we define that the distance between groups is that of the most distant pair of objects belonging to them (complete linkage or furthest neighbor method) [12]. Then the clustering

11

1.1. DENDROGRAMS, TOPOLOGIES, TREES

proceeds as follows, where the smallest distance at each step is highlighted in bold: 1 1

2

3

4

5

0 2 2 3 3 4 4 5 7

0 5

0

7

8

0

8

7

6

0

⇒ (1, 2)

(1, 2) 3 4 5 (1, 2) 3 4 5

0 5

0

7

8

0

8

7

6

0

⇒ ((1, 2), 3)

((1, 2), 3) 4 5 ((1, 2), 3) 4 5

0 8

0

8

6

0

⇒ (4, 5)

((1, 2), 3) (4, 5) ((1, 2), 3) (4, 5)

0 8

0

⇒ (((1, 2), 3), (4, 5))

In this example the clusters found at each stage are represented by the elements inside the parenthesis (where each element may be a cluster itself). This parenthetic representation, the basis of the dendrograms, is also used to represent phylogenetic trees by most, if not all, phylogenetic programs. In Figure 1.3 we show the dendrogram corresponding to this clustering, where the order of the internal nodes reflects the order of merging in the above procedure.

12

1.2. RECOMBINATION DETECTION METHODS

●

5

●

4

●

2

●

1

●

3

●

● ●

●

8

6

4 distance

2

0

Figure 1.3: Dendrogram representing the clustering of an arbitrary set of objects numbered from 1 to 5 (described in the text). The distances between nodes (clusters) on the scale represent the furthest neighbour distances.

1.2

Recombination detection methods

There are several procedures to detect recombination with varying degrees of success, where the most accurate methods are the most restrictive on the number of sequences and assumptions. In the study of viral recombination, the most widely used procedures are based on a sliding window approach, while the most powerful procedure is a Bayesian modelling on the number of recombination break-points. We will briefly review them below.

1.2.1

Sliding window

A sliding window is a technique where only a fraction of the sites (columns in the alignment) are analysed, and this analysis is conducted exhaustively for all column segments composing the alignment. The two arbitrary parameters in this kind of analysis are the window length (how many sites are analysed per batch) and the step size (how many sites to move the window forward after each batch). If you have a query sequence that is a putative recombinant between a set of known parental background sequences, the simplest analysis that can be conducted at each step is to calculate the distance between the query sequence and each of the background sequences. The query sequence is said to be involved in recombination if different windows (regions) support a different parental sequence as the more similar ones. This is the procedure implemented in the program rip, available as a web service at the Los

13

1.2. RECOMBINATION DETECTION METHODS

Alamos HIV Sequence Database 1 . The background sequences are the consensus sequences of the HIV-1 subtypes (assumed not to be recombinant themselves), and this scanning procedure is necessary condition for a sequence to be considered a recombinant form of HIV-1. The measure of confidence is a heuristic based on the difference between the distance to the closest parental and the distance to the second closest one. A more elaborate approach is to conduct a phylogenetic analysis within each window, and then recombination is inferred wherever the phylogenetic signal is contradictory between nearby windows. The most common phylogenetic analysed conducted inside a genomic window are the bootstrap analysis [5] and Bayesian inference [13]. Notice that despite the latter uses a Bayesian procedure as part of its heuristic, it can not be considered a Bayesian method since there is no modelling of recombination. On Figure 1.4 we show the output of the program rip, designed specifically to handle HIV-1 genomic sequences. The problem with these approaches is that only one putative recombinant can be analysed at once, and thus recombinations between the query sequences pass unnoticed. Another problem is the arbitrariness of the window size, since larger windows have more phylogenetic signal at the cost of decreased power in detecting incongruences. These procedures can not quantify the phylogenetic incongruences, thus hindering the detection of recombination hot-spots.

1.2.2

Bayesian model

The DualBrothers software is a dual multiple change-point model for recombination detection among aligned nucleotide sequences [9]. A multiple change-point procedure consists in assuming that there exists an unknown number K of non-overlapping segments that start at and stop before change-points. The parameter is assumed to be constant inside a segment and varying between segments. The model implemented in DualBrothers assumes that one change-point process is used to model spatial variation of tree topologies and the other, independent of the first, for spatial variation of substitution process parameters (namely, the average substitution rate per site and the transition:transversion ratio). In Figure 1.5 we see an example of the decoupling between the topology and substitution parameters variation for this model. They assume that the number of change-points for each process follows a Poisson distribution, and that the evolutionary process is according to the Hasegawa1

http://www.hiv.lanl.gov/content/hiv-db/RIP3/RIP.html

14

1.2. RECOMBINATION DETECTION METHODS

Figure 1.4: Example of the program rip output, showing the similarity between each background (parental) sequence and a query sequence (labeled 11 cpx.NG.94.NG3670b). The x-axis represent the site and the y-axis represents the similarity (complement of the distance). Each background sequence is compared independently from the others, and on the top we have the inferred mosaic structure. The window size was fixed at 400 base pairs (bp) at a step size of one nucleotide per batch. Kishino-Yano (HKY) model [14]. The posterior distribution of the change-point locations and parameters is then simulated through reversible-jump Markov chain Monte Carlo, since the number of parameters is variable, depending on the number of change-points. To reduce the number of possible topologies, this method is limited to the case where there is only one putative recombinant sequence, and the others are considered “parental” sequences - whose phylogeny is known without error. In fact their current implementation allows one to incorporate more than one recombinant sequence, but the convergence of the method is not guaranteed. In Chapter 4 we will se more details about these concepts, since our procedure follows similar ideas. One main difference is that in our model the number of parameters is kept constant, relaxing the change-point model. Another difference is that we do not impose any restriction on parental and recombinant sequences. We borrowed their clever idea of working with the average branch length per site, as will become clear in Chapter 4. The output of 15

1.3. RECOMBINATION HOT-SPOTS

Figure 1.5: Dual multiple change-point model, as implemented in DualBrothers software: the columns in the DNA alignment are decoupled into two independent change-point processes. In the substitution processes distinct segments (represented by different colors) present distinct evolutionary model parameters, and in the recombination process distinct segments support distinct topologies. DualBrothers over a datset of recombinant sequences is shown in Figure 1.6, where each panel is the posterior distribution of each parameter along the alignment. Each row of the figure is a putative recombinant sequence from HIV-1 viruses isolated in South America2 against a reference dataset composed of two subtype F sequences, two subtype B sequences and one subtype C sequence used as outgroup. The fragment used was the env gene, and in this figure we can detect the alignment regions with an average substitution rate higher than on other fragments, the homogeneity of the transition:transversion ratio over the alignment, and more importantly the recombination break-points detected as a change in most frequent topologies.

1.3

Recombination hot-spots

Recombination hot-spot detection so far have relied on spatially close recombination breakpoints. Even if we assume the existence of these neighboring break-points, this depends on 2 sequences

´ kindly provided by Dr. Elcio de Souza Leal from the Federal University of S˜ao Paulo

16

1.3. RECOMBINATION HOT-SPOTS

Figure 1.6: Example of DualBrothers output for the env gene of recombinant HIV-1 sequences from South America. The panel on the left shows the sitewise posterior distribution of each competing topology, where blue lines support subtype F ancestrality, red lines support B ancestrality and black lines are unresolved segments (since the recombinant clusters with the outgroup subtype C). Two distinct parental sequences from each subtype were used, resulting in different color shades. The middle panel represents the transition:transvertion ration, while the right panel represents the average substitution rate per site. an arbitrary value of proximity, since short segments will not have phylogenetic signal to distinguish between competing topologies. Thus detection procedures that consider all sequences at once will tend to cluster nearby break-points into a single one. The two detection procedures described above assume that only one “query” sequence (the putative recombinant) is analysed, and thus if we have several putative recombinants they are analysed independently. As we will see, independent analysis can not distinguish between ancestral and recurrent recombination, rendering it unreliable for hot-spots inference unless we assume phylogenetic independence between the putative recombinants [15]. If there is a strong structural restriction on some recombination break-point, a hot-spot would be better modelled by

17

1.4. THE COALESCENT APPROACH

one recombination break-point rather than several nearby break-points. In Chapter 2 we will see how recombination hot-spots occur in HIV populations, and in Chapter 3 we will propose a solution for the quantification of these hot-spots.

1.4

The coalescent approach

The coalescent process describes the ancestral tree of a sample of genes from a large population using population genetic data (arbitrary genomic units called loci, composed of distinct alleles). As a population genetic framework, it considers the interaction of the genealogical, genetic and demographic processes of the population [16, 17]. The genealogical process refers to linkage disequilibrium, the correlation between alleles carried at different positions in the genome for a population. The genetic process is ruled by the mutation and recombination rates, while the demographic history relates to the effective population size, the total number of individuals in an idealized population - also defined as the minimum number of individuals necessary to explain the observed variability. Ancestral lines coalesce when two ancestors of the sample share a common ancestor, and the coalescence times (for some convenient scale) are described as independent exponential distributions whose rates depend on the number of ancestors in the sample. In Figure 1.7 we have one such coalescent tree. The quantities of interest in coalescent-based approaches are the population recombination and mutation rates, dependent on the effective population size [18, 19].

Figure 1.7: Example of coalescent tree for 5 individuals (s1, · · · , s5) with coalescent times ti (i = 2, · · · , 5) describing the amount of time, in an arbitrary scale, when the sample had i ancestors.

18

1.4. THE COALESCENT APPROACH

Recombination will break the correlation between loci being dependent on the distance between them, and for very frequent recombination events the loci can be assumed to be independent (linkage equilibrium). This means that no pair of adjacent loci share the same phylogeny, and we should not expect any phylogenetic signal to be preserved. When recombination is pervasive along the genome then coalescent models can be used to estimate recombination based on the haplotype structure (alleles on the same chromosome) structure, as is the case for human populations ([20–22]). In this case good genetic markers are the single nucleotide polymorphisms (SNPs), single base changes in a DNA sequence thought to have very low mutation rates in humans [23]. In the coalescent context recombination hot-spots are defined as regions where the recombination rate is higher than the local background rate ([24, 25]). These methods can be applied to HIV datasets to estimate the recombination rate, at the cost of being unable to pinpoint the recombination events or the phylogenetic histories ([18, 26]). It is worth noticing that coalescent methods treat the coalescent trees as a nuisance parameter 3 , since they deal with quantities that should not depend on one particular topology, and on the other hand we are interested in describing these phylogenetic relationships between taxa. Coalescent models for recombination also assume that the recombination rate is higher than the mutation rate (otherwise recurrent mutations would mimic the effect of linkage equilibrium) and our model assumes that recombination is moderate enough so that some phylogenetic signal is preserved.

3a

nuisance parameter is one that we do not want to make inferences about. Mathematically, if we have parameters θ1 and θ2 where θ2 is a nuisance parameter, then we integrate it out to have an inference P(θ1 ) about θ1 : P ( θ1 ) =

Z

P(θ1 , θ2 )dθ2

19

Chapter 2 Recombination in Retroviruses In HIV-1, the reverse transcriptase switches RNA templates on average 3 times per replication cycle, yielding an average of about one recombinational strand transfer event per 3000 base pairs [27–29]. On macrophages, it may achieve 10 crossovers per replication [30]. A similar rate is also found in HIV-2 [31] and murine leukemia viruses [32]. Recombination also have been found to play a role in severe acute respiratory syndrome coronaviruses [33], hepatitis B and E [34, 35], enteroviruses [36], varicella-zoster [37, 38] and other primate lentiviruses [39]. It was observed that recombinations lead to emergence of the resistant mutants to multiple drugs [40, 41]. There is also evidence that the emergence of resistant strains may lead to an increase in the template switching frequency [42]. The recombinations enable favorable mutations that occur in different individuals to be combined into the same lineage [43]. Also, they may increase the chance that mutant-free individuals arise among the population of individuals with deleterious mutant genes. The advantage of recombinations has long studied both in theory [44–47] and in natural populations [48–50]. The theory predicts that, in the absence of recombination, individuals are doomed to an accumulation of deleterious mutations, a phenomenon known as the “Muller’s ratchet” [51]. Recent evidence implies that the extent of the advantage or disadvantage of the recombinations depends on the environment and its selection pressure. A similar type of genetic exchange in RNA viruses is called reassortment, where whole RNA molecules constituents of the segmented viral genome are swapped between individuals. Reassortment is responsible for antigenic shift in influenza A viruses [52].

20

2.1. THE MECHANISM OF STRAND TRANSFER IN HIV-1

2.1

The mechanism of strand transfer in HIV-1

Figure 2.1: Schematic representation of location of genes encoded by an HIV-1 RNA molecule. Almost all genes are subject to post-transcriptional modification. HIV-1 viruses possess two copies of its genomic material in the form of RNA. Each RNA molecule has around 10kB (kilobases) and encodes for the core structural proteins (gag), envelope proteins (env) and viral enzimes (pol). It also codes for auxiliary proteins, using the three reading frames (Figure 2.1). This RNA will be transcribed into DNA by a process called reverse transcription, where this viral DNA (called provirus) will become integrated into the host genome. There are no specific integration sites [53], but there are preferencial integration targets for genes active after retroviral infection [54, 55]. The provirus then uses the host replication machinery to produce new viruses capable of invading other cells. These free viruses coated in the host’s bilipid membrane are called virions (Figure 2.2).

Figure 2.2: Structure of HIV-1 virion. On the rightmost part we have the genes responsible for each protein. The envelope proteins are surrounded by a bilayer lipid membrane (bilipid layer) sequestered from the host cell. A simplified model for reverse transcription in HIV-1 (and other retroviruses) is repre21

2.2. OBSERVABLE RECOMBINATIONS

sented in Figure 2.3a. In reverse transcription, the nascent minus-stranded DNA has to dissociate from the 5’ end of the RNA template and reanneal to a repeat sequence at the 3’ end of the RNA [56]. This strand transfer occurs again when the minus-stranded proviral DNA serves as template for the transcription of the plus-strand DNA (step 5 in Figure 2.3). The viral reverse transcriptase (RT), responsible for transcription of RNA into proviral DNA, possesses a relatively low affinity for its template RNA and a low processivity as a result of this obligatory template jump. Crossover occurs then by template switching between sister templates, taking advantage of this low affinity, as represented by Figure 2.3b. As RT synthesizes DNA on one RNA template the RNase H activity hydrolyzes the template RNA behind the polymerase, allowing the nascent DNA to base pair with a second RNA template. Other factors that appear to facilitate this recombinational strand transfer event (crossover) are single-stranded regions of the alternative RNA template, such as loops, that are potentially more accessible for base pairing [57]. One mechanism for this influence of RNA folding in crossover rate is the hairpin-mediated strand transfer model for copy choice by RTs. In this model the transient unfolding of the nascent DNA favors its annealing with the alternative template RNA, increasing the chance of template switching [58]. Another mechanism is the decrease in polymerization rate by the hairpin structures, thus favoring a jump between templates[59]. These mechanisms help explain the existence of crossover hot-spots in HIV-1 by finding regions with folded structures[60, 61]. Some experiments, on the other hand, did not find evidence for crossover hot-spots [30]. It has been hypothesised that the dimerisation initiation signal (DIS) region of the HIV-1 gag gene is involved in the crossover rate [62], where sequences presenting different DIS sequences are likely to have crossover rates lower than sequences displaying the same pattern. This region may indeed have an increased crossover rate [63].

2.2

Observable recombinations

The mechanism described above does not take into consideration the difference between the sister templates. In fact, in the experiments reported above the two RNA copies were different so that we can distinguish between them [30, 57, 59, 60]. Thus if both copies are identical the crossover frequency may be higher than reported, despite this is irrelevant for our purposes. We should then make a distinction between crossover, the physical strand transfer 22

2.2. OBSERVABLE RECOMBINATIONS

(a) Simplified model 1. RNA primer binds to prime-binding site at 5’ end of template RNA and reverse trancription begins; 2. First template jump occurs where recognition point for nascent minus-stranded DNA is the R (for “repeat”) region at 3’ end of template RNA. RNA primer is degraded; 3. As transcription proceeds, the RNAse-H portion of the reverse transcriptase degrades the template RNA; 4. Transcription of plus-stranded DNA begins using minusstranded DNA as template; 5. Second template jump occurs to the 5’ end of the minus strand DNA.

(b) Template-switching mechanism Stages 1 and 2 are like in panel a. 3. As one template RNA is degraded, the chance of annealing between the nascent DNA and the other template increases. At the same time regions closer to the transcription site may anneal with the alternative RNA template. 4. Eventually the RT may switch template and start copying from the other RNA template. Despite only one crossover is represented here, The synthesis of plus-stranded DNA proceeds as usual. 5. Obligatory template-switch happens as in the simplified model, with the remark that the repeat regions (represented by “R”) should be similar in both templates, as represented by the annealing between minus and plus DNA strands.

Figure 2.3: Reverse transcription in HIV-1. Dark lines represent the RNA templates while light lines represent the nascent DNA. The lines are out of scale. The numbers from 1 to 5 depict the temporal order of events while the black arrows indicate the direction of transcription. Vertical black lines show the annealing (pairing) between templates and nascent DNA. The simplified model neglects the presence of another RNA template (represented by the green lines in panel b). event, and the fraction of crossing overs that we can observe. However when we use the word “recombination” we will usually be referring to the fraction of observed crossing overs that was mantained in the population. In other words, for us recombination is the product of crossing over and selective pressure. To make an analogy with the case of mutation, the relation between crossover and recombination rates is equivalent to the relation between the mutation rate (frequency of misreplicated bases per replication cycle) [64, 65] and substitution rate (fraction of mutations that led to viable individuals in a population) [11, 66]. For 23

2.3. HIV-1 SUBTYPES AND CIRCULATION RECOMBINANT FORMS (CRFS)

the case of recombination we have one complicating factor that has no analogous in the mutation case, the copackaging distribution [32, 67]. The copackaging distribution refers to the probability that two homologous RNA templates will be packaged into the same virion as a function of the difference between them. For HIV-1 viruses it was shown that homodimers and heterodimers are copackaged in Hardy-Weinberg proportions [32]. In Figure 2.4 we have a diagram of replication in HIV-1, showing the main routes through which the two RNA templates may have distinct origins. Reinfection of a cell is as frequent as the first infection in all stages of infection, ruling out the hypothesis that HIV-1 infection may provide benefit in terms of immunizing against reinfection [68]. Approximately 10% of the sequenced HIV-1 represent mosaic patterns between different subtypes (from data in Figure 2.7). It has been hypothesized that the randomness in copackaging of heterozygous virions is responsible for the high rate of observed recombination, as compared with other retroviruses [32].

2.3

HIV-1 subtypes and Circulation Recombinant Forms (CRFs)

The most distantly related HIV groups are the type 1 and type 2, hereafter called HIV-1 and HIV-2, respectively. HIV-2 is constrained to Africa and some countries in Europe1 , and appears to have an infectivity lower than HIV-1 [69]. Compared with HIV-1, the long terminal repeat (LTR) region of HIV-2 is larger, the positioning of some spliced genes differ, there is no vpu analog gene in HIV-2 but we observe a gene homologous to vpr with function redundancy called vpx [70]. We will focus our attention to HIV-1, where recombination has been rampant. Within HIV-1 three major groups are defined: group M (for “main”), group O (for “outlier”) and group N (for “new” or “neither”) [71]. This definition is based on phylogenetic studies, since sequences belonging to the same group will always cluster together. Using this same phylogenetic classification, group M is further divided into subtypes, going from subtype A to subtype K. There is no subtype I (capital “i”) to avoid confusion with the roman number I (one). This classification considers only the phylogenetic pattern, and thus sequences phenotypically distinct may be included in the same subtype (like the serologically distinct subtype B sequences from Brazil) while sequences belonging to different subtypes may be biologically identical. 1 as

of December, 2006

24

2.3. HIV-1 SUBTYPES AND CIRCULATION RECOMBINANT FORMS (CRFS)

(a) Usual replication cycle: The virion infects the host cell, its genome is integrated into the host’s genome, and new virions are released.

(b) Reinfection of same cell: A host cell harbouring a provirus is reinfected by another virus. Once new particles from both proviruses are being produced, their genetic material can be copackaged in the budding process.

(c) Syncytium-inducing HIV-1: virionproducing infected cells express viral envelope proteins in their surface. Such cells may fuse with other similar cells (lymphocytes) in the same way as a virion does. If both fused cells contains proviruses then the genetic material from distinct proviruses may be copackaged into the same virion.

Figure 2.4: Schematic representation of replication in HIV-1 showing the possible routes to template diversification. Red arrows represent the order of events and black events represent the movement of particles. The ovals represent the (nucleated) cells while the black, blue and red lines represent the genetic material. Syncytium means a multinucleated cell. Historically, this classification was conducted based on a small genomic region (like the genes env or gag), and when nearly full genome sequences became available, subtype E was 25

2.3. HIV-1 SUBTYPES AND CIRCULATION RECOMBINANT FORMS (CRFS)

never found in regions other than LTR and the env gene, and the other fragments always clustered with subtype A. Thus it was assumed that these sequences were a recombinant between an ancestral subtype A and an ancestral (extinct) subtype E [72, 73]. Thus near fulllength sequences from three or more unrelated patients which cluster with different subtypes for different genomic regions are considered to be a circulating recombinant form, or CRF for short. Thus a CRF is believed to have a common origin, which means unrelated sequences presenting an identical recombination mosaic pattern are result of an ancestral recombination event that spread over the population. This hypothesis is represented in Figure 2.5, where it is clear the monophyletic origin of the recombinants belonging to this hypothetical CRF.

Q1

Q1

Q1

Q2

Q2

Q4

Q2

Q1

Q2

Q3

Q3

F.001 outgroup

Q4

Q4

F.002 B.002

B.002

Q3

F.001 outgroup

B.001

Q4

F.002

F.002

B.002 B.001

Q3

F.001 outgroup

outgroup

F.001

B.001

F.002

B.002 B.001

Figure 2.5: CRF hypothesis of single ancestor for recombinants with same mosaic pattern. If sequences from same CRF have a single ancestor, then they should form a monophyletic group. For such a case only the second and third recombination break-points (represented by vertical black lines) are observable. The putative recombinant sequences are designed Q1 to Q4. Blue segments represent ancestrality to a subtype F, while red segments support an ancestor from subtype B. The number of CRFs is increasing exponentially in the last few years, where some of them are a mosaic between previously reported CRFs and some differ by the positioning of the break-points. A list with all currently available CRFs is given in Figure 2.6, based on the HIV-1 Sequence Database at Los Alamos national Laboratory (http://hiv.lanl.gov/). There are currently 34 described CRFs, but some of them are lacking in Figure 2.6 since they are still unpublished or under scrutiny. In Figure 2.7 we have a panel with the distribution of sequences deposited at the HIV-1 Sequence Database at Los Alamos national Laboratory (http://hiv.lanl.gov/), where we can see the contribution of CRFs to the panorama of HIV-1 infections around the world. We need to stress the fact that this figure is not related with HIV-1 incidence, given that these numbers are 26

2.3. HIV-1 SUBTYPES AND CIRCULATION RECOMBINANT FORMS (CRFS)

Figure 2.6: HIV-1 Circulating Recombinant Forms (CRFs) reported at Los Alamos Database as of November, 2006. For each CRF we have a diagram with the genome divided in reading frames, and the colors represent the mosaic structure of the composing parental subtypes. related to sequences deposited in the database, and so may not provide a good estimate of the actual subtypes’ distributions. For example, the regions with the largest number of deposited sequences are North America and Europe, surpassing Africa. So we may speculate that this graphic is also infuenced by access to genome sequencing (countries with large economical inequities are under-represented) and social participation in prevention programs (participative social groups are better represented), besides research bias (novel HIV-1 variants are more attractive than well-studied ones). For this reason we limit ourselves to the observation that CRFs are found worldwide, and their distribution qualitatively differ between geographical regions.

27

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

Africa Europe Caribbean Middle−East Oceania Former USSR North America South America Central America

B CRF15_01B CRF07_BC CRF08_BC N CRF10_CD CRF13_CPX CRF12_BF CRF04_CPX CRF14_BG CRF03_AB CRF05_DF K CRF09_CPX J CRF11_CPX H O CRF06_CPX F G CRF02_AG D CRF01_AE A other C

Asia

Figure 2.7: Distribution of sequences deposited at Los Alamos Database as of November, 2006, classified by geographic region and subtype. In this bidimensional plot the blue boxes represent the relative number of sequences deposited, where darker boxes mean higher number of sequences. The dendrogram on the left corner is a classification of regions based on their subtype distribution similiarity, where the maximum value is 54, 111 sequences, for North America. The dendrogram on the top corner is a classification of subtypes based on their geographic distribution similiarity, where the maximum value is 94, 583 sequences, for subtype B.

2.4

HIV-1 recombination in South America

South America hosts a great diversity of HIV-1 recombinant sequences [74], with special attention to Argentina where most sequences have a similar recombination pattern between B and F subtypes despite no “pure” subtype F virus has been found to date [75, 76]. This findings suggest that HIV-1 BF recombinants found in Argentina have a common recombinant ancestor whose F segments come from Brazil, one of the few countries where non-recombinant F subtypes can be found [76]. These conclusions are based on the recombination pattern inferred from independent analysis of each putative recombinant sequence against reference parental subtypes. On the other hand, it has been hypothesised that the dimerisation ini-

28

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

tiation signal (DIS) region of the HIV-1 gag gene is involved in the recombination rate [62], where sequences presenting different DIS sequences are likely to have observable recombination rates lower than sequences displaying the same pattern. Since subtypes B and F have distinct DIS sequences they may have a recombination restriction compared to intra-subtype recombinants - despite the recombination rate can be quite high even restricted by DIS compatibility [62]. ´ In collaboration with Dr. Elcio de Souza Leal, from the Federal University of S˜ao Paulo, in Brasil, we started the analysis of putative recombinant sequences from Brasil, Argentina and other South American countries. All of them were pre-analysed by bootscanning and determined to be variants of the subtype CRF12 BF. Our pilot experiment is the one depicted in Figure 1.6, where we made two observations: first, many regions have low phylogenetic signal; and second, some sequences (like B.AR.00.ARMS008) have a strong support for distinct parentals from the same subtype. This contradicts the CRF hypothesis of common ancestrality (represented in Figure 2.5). Other scenarios contradicting the common origin hypothesis behind the CRF classification are depicted in Figure 2.8. This findings led us to explore systematically near full genome sequences from these putative sequences. The procedure for choosing the recombinant sequences to be included in our analysis was thus by selecting sequences with the same recombination mosaic pattern, since in this case we can directly infer the monophyly of the recombinant sequences. Our initial dataset was a sample of aligned South-American BF recombinant sequences comprising 8402bp, including sequences from public databases and unpublished sequences, kindly ´ provided by Dr. Elcio Leal. We compared each putative recombinant sequence independently against reference subtypes F, B and C using the software DualBrothers [9]. We utilized one reference parental sequence from each subtype to increase the detection power, avoiding contradicting signals. The possible topologies using these parental sequences is described in Figure 2.9, where the same color scheme was used to describe the results and the putative recombinant sequences are labeled “query”. In this figure we observe that since the parental sequences form a triplet we can assume their topology is known without error - since there is only one possibility. For each putative recombinant, we can infer its mosaic pattern against each parental sequence after observing the sitewise posterior probability for each topology, as depicted in 29

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

Q1

Q1

Q1

Q2

Q2

Q4

Q2

Q3 Q3

Q3

F.001 outgroup

F.001 outgroup

outgroup

Q4

Q4 B.002

Q3

F.001

Q4

F.002

B.001

Q2

F.002

B.002

B.002 B.001

outgroup

F.001

B.001

Q1

B.002 F.002 B.001

F.002 Q1

Q1

Q3

Q2

Q2

Q4

Q4

Q3

Q4

Q3

F.001

Q3

F.001 outgroup

Q4

F.001

B.002

Q1 outgroup

B.001

Q2

Q1 outgroup Q2

F.002 F.002

B.002

B.002 B.001

outgroup

F.001

B.001

F.002

B.002 B.001

F.002

Figure 2.8: Scenarios contradicting the common origin for CRFs described in figure 2.5. The mosaic pattern is the same as in figure 2.5, but we can observe that more than one recombination event is necessary to explain the differences (de novo recombination). The putative recombinant sequences are designed Q1 to Q4. Blue segments represent ancestrality to a subtype F, while red segments support an ancestor from subtype B. the recombination break-points are represented by black vertical lines.

query

query

query

F

C

C

B

F

C

B

F

B

Figure 2.9: Possible quartet topologies for recombination analysis using DualBrothers. Figure 2.10 for a few sequences from the dataset. In the same figure we have also the sitewise posterior probability of recombination as inferred by the fraction of samples a recombination change-point was observed at the (left of the) site. Since our objective is to obtain a group of sequences with mosaic patterns as similar as possible, we employed a hyerarchical cluster analysis like described in Section 1.1. Each query sequence is an object, and the distance d(si , s j ) between a pair of sequences si and s j is the

30

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

Figure 2.10: DualBrothers analysis for 8 putative HIV-1 recombinant sequences. For each sequence we have the sitewise topology distribution (middle panel) with colors following Figure 2.9 (p. 30) and the posterior probability of recombination for each site (bottom). Using the information from the topology distribution along the alignment, we can draw the mosaic pattern, represented in the top panel for each sequence. Euclidean distance between their sitewise posterior probabilities: d ( si , s j ) =

h

8402

∑

f k (si , T1 ) − f k (s j , T1 )

2

∑

f k (si , T2 ) − f k (s j , T2 )

2

∑

f k (si , T3 ) − f k (s j , T3 )

k =1 8402

+

k =1 8402

+

2 i1/2

k =1

31

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

where f k (s, T ) is the posterior probability of topology T at site k for sequence s. The complete linkage clustering for all analyzed sequences is represented in Figure 2.11. From these sequences we elected 8 having a similar mosaic pattern, to be included in the analysis using our newly developed procedure (described in the next chapters). These elected sequences are represented in Figure 2.12, and in fact are the same as those in Figure 2.10. Our final dataset consisted of these 8 arbitrarily chosen sequences and 3 reference subtype sequences, comprising 11 sequences in total.

Figure 2.11: Hierarchical clustering for all analyzed putative recombinant sequences. The mosaic patterns follow figure 2.10, and the method employed is a complete linkage clustering on Euclidean distances between the posterior topologies.

32

2.4. HIV-1 RECOMBINATION IN SOUTH AMERICA

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 12_BF.AR.97.A32879

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 12_BF.UY.99.URTR35

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| BF.AR.99.ARMA029

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 12_BF.AR.99.ARMA159

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| BF.AR.99.A027

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| BF.AR.99.ARMA097

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| CH12

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| BF.AR.99.A047

Figure 2.12: Hierarchical clustering of elected putative recombinant sequences, based on large cluster of figure 2.11.

33

Chapter 3 Recombination Distance Between Topologies The effect of recombination on a sequence aligment is that each segment involved in recombination will support a distinct topology. One example is described in figure 3.1, where the alignment is in fact composed by three topologies. It means that the phylogenetic relationships between the seven composing taxa can not be represented satisfactorily by one phylogenetic tree, and any attempt to do so may be neglecting important information about them. More than that, any subsequent analysis that depends on the phylogenetic history between them may give spurious results. Methods for recombination detection aim at finding the recombination break-points. In figure 3.1, these are between sites 200 and 201, and between sites 400 and 401. We will assume that a break-point always refer to the position immediately after a site, so that we can refer to the break-points in figure 3.1 as sites 200 and 400. But sometimes it is not enough to detect the location of the break-points, we want to quantify this phylogenetic difference between neighboring segments. We have seen in Chapter 2 (p. 20) how the distribution of crossover points may be biased toward some regions (hairpin structures, etc.). These prefered sites of crossover result in recombination hot-spots. Even in the absence of crossover hot-spots, we can still observe regions with higher recombination rates as a product of selective pressure on those regions. To use once more the analogy with the substitution rate, heterogeneity in the substitution rate where we observe regions with more substitutions than others may represent a higher error rate of the replicase, or may be the product of a specific selective scenario (e.g., positive selection). To address the existence

34

2

2

7

1

7

1

1

2

7 3

3

6

5

3 4 5

6

4

6

4 5

Figure 3.1: Example of recombinant alignment, as inferred by the phylogenetic incongruence between segments. The horizontal bar represents the alignment, and the topologies represent the true evolutionary history for the given alignment segment - divided into different colors. Each region of 200bps support a different phylogeny, and methods for recombination detection are based on this difference. of recombination hot-spots, current methods for recombination detection rely on detecting contiguous regions over the alignment where the number of recombination break-points is unusually large. But this is neither feasible nor necessary. It is not feasible because short alignment segments will not have phylogenetic signal to distinguish between competing topologies. In other words, it is very unlikely to find an alignment supporting different topologies for several contiguous segments of, e.g., 20bp, unless we obtain data from many taxa. Since in many recombination detection analyses we restrict our attention to quartets (four taxa) to alleviate the number of possible topologies, the chance of detecting recombination hot-spots becomes negligible. The arbitrariness on the definition of contiguous region also hinders the efficiency of the method. Besides that, the assumption that a recombination hot-spot may be represented by adjacent recombination break-points may not be true. In this case, again, current methods would fail to predict the recombination history. Fortunately, this assumption is not necessary because the topologies of neighboring segments have information about the location and number of recombinations. In figure 3.2 we show an example of how to interpret the recombination history based on adjacent topologies, using information from figure 3.1. According to figure 3.2a, the first recombination break-

35

point represents the transference of genetic material between the ancestor of sequence 5 and the ancestor of sequence 3, which is expressed in sequence 4. Sequence 4 is then said to be the recombinant sequence, and sequences 3 and 5 are iterpreted as the parental sequences, and we may refer to it as a recombination between sequences 3 and 5. This is not the only possible explanation but is the most economical one, and we usually aim at finding this minimum number of recombinations. For instance, we may say that the recombinants are sequences 3 and 6, but this would require two recombination events.

2

2

7

1

2

1

7

7

1

1

2

7

3

3

6

5

3 4 5

4

5

3

6

4

(a) First recombination break-point: one recombination

6

6

4 5

(b) Second recombination break-point: two recombinations

Figure 3.2: Example showing the relation between topologies and recombination for topologies in figure 3.1. Panel a represents the region between sites 1 and 400, where the first breakpoint appears. Panel b represents sites 201 to 600, harboring the second break-point. Notice that the region between sites 201 and 400 is considered twice. The choice of sequence 2 in panel b was arbitrary (sequences 1 or 7 could have been chosen without changing the conclusion of one recombination for the subtree). The classification between recombinant and parental sequences may not always be possible, as we see from figure 3.2b. In this case, we observe as before the recombination between sequences 3 and 5, evidenced by sequence 4. But we observe also a recombination involving sequences 1, 2 and 7. In this case it is impossible to tell which are the parentals and which is the recombinant - whichever we choose, the number of recombinations will be one. We realize then that there is at least one recombination between the first and second segments, and at least two recombinations between the second and third segments. This information alone indicates that the second break-point (at site 400) is, in fact, a recombination hot-spot. It is also worth noticing that, would we not have information about sequence 4, we would have 36

3.1. RECOMBINATION DISTANCE

detected only one recombination. Or, even worst, if we did not have sequence information about sequences 3 and 5, we might have defined sequence 4 as a parental sequence, and then when data from sequences 3 and 5 became available, we could label them as recombinants. This may have happened with the HIV-1 subtype classification [71]. That is why we will avoid this historical classification of parental and recombinants. In general, it is very hard to recognize the minimum number of recombinations between two topologies by eye, and a closed-form mathematical solution has not been found so far [77]. Thus our purpose is to devise a good approximation to this recombination distance between topologies.

3.1

Recombination distance

A recombination can be represented by an SPR move between rooted topologies, and thus the number of recombinations between neighboring sites can be estimated by the SPR distance between their underlying rooted trees [78, 79]. SPR operations on rooted topologies always have an equivalent on unrooted topologies [80] - replacing the root node by one extant taxa in the unrooted case [81]. Given a set S, a non-negative function dΘ : S × S → R is a metric on S if it satisfies the following properties [82]: • dΘ ( x, y) = 0 ⇔ x = y • dΘ ( x, y) = dΘ (y, x )

∀ x, y ∈ S

∀ x, y ∈ S

• dΘ ( x, y) + dΘ (y, z) ≥ dΘ ( x, z)

∀ x, y, z ∈ S

The recombination distance is related to the subtree prune-regraft distance [83], which counts the relocations of the subtrees required to make the revised topology consistent with the other topology. The SPR distance between unrooted topologies that we approximate by dSPR can be regarded as a lower bound on the number of recombinations between sites We write it dSPR since the subtree prune-regraft distance is a metric dΘ where Θ = SPR [77], and sometimes the term “SPR distance” is used. In figure 3.3 we have an example of one subtree prune-regraft (SPR) move. In this figure we observe that the recombinant subtree is composed by sequences 3 and 4, when comparing the topologies before and after the move. Two other related topology operations are the nearest-neighbor interchange (NNI) and tree 37

3.1. RECOMBINATION DISTANCE

bisection-reconnection (TBR), with SPR being a generalization of NNI and a special case of TBR [77, 84]. In the NNI move the regraft edge is always neighbour to the prune edge (separated by another edge, otherwise no move occurs). In the TBR move the prune edge is free to move around the pruned subtree. These moves are known as branch-swapping in topology optimization: when exploring the space of topologies a branch-swapping algorithm is applied to the current topology in order to produce a different (but related) topology. In Chapter 4 we will see how to use these operations to modify a tree. The dSPR , on the other hand, is related to the opposite problem: given two binary topologies over the same set of leaves, dSPR is the minimum number of SPR operations necessary to apply to one topology untill it agrees with the other. We are interested in the minimum number not only because it defines a metric [77], but because this will give us a conservative estimate on the number of recombinations between two segments. Any two topologies on N leaves are within a distance dSPR at most N − 3, and any topology on N leaves have exactly 2( N − 3)(2N − 7) topologies at one SPR apart (dSPR neighborhood) [77].

4 2

1

7

1 2

7

2

3

5

1

7

6

5

3 4

6

5

6

3

4

Figure 3.3: Subtree Prune-Regraft (SPR) operation on unrooted topologies. From left to right, the prune edge is depicted in red and the regraft edge in blue. Unfortunately, however, there is no straigthforward way of calculating dSPR other than exhaustive search for the general case [77, 85]. Recently a 5-approximation algorithm for the rooted SPR distance drSPR was developed, and theoretically can run in linear time. By 5approximation we mean that the true distance may by five times larger or five times smaller than the estimated one. This is a good approximation for large trees (hundreds of leaves), but 38

3.1. RECOMBINATION DISTANCE

since we work with much smaller topologies we need a better approximation, at least for a small number of recombinations. Besides, this approximation works with rooted topologies, which is an overestimation of the (unrooted) dSPR . It has been shown that the d TBR (TBR distance) is equivalent to the maximum agreement forest (MAF) distance [77]. The MAF distance counts the number of edges that should be removed from both trees so that the resulting forest (group of subtrees) is the same. Since the TBR operation is a generalization of the SPR operation, one would have an underestimate of the SPR distance - which is affordable. More than this, one TBR operation is equivalent to one or two SPR operations, and so we would have tigh bounds on dSPR by d TBR . Unfortunately there is no closed solution for the MAF problem as well, at best a 3-approximation for the rooted case [86], therefore we still have no solution for our problem. Heuristic approaches to calculate drSPR (rooted SPR distance) have been proposed in the context of horizontal gene transfer [80, 87–90]. These procedures were designed for handling large topologies with a limited number or type of recombinations. They can handle only rooted topologies and usually depend on a “real” topology, which is the main difference between horizontal gene transfer and viral recombination: for a moderate recombination frequency, you can detect the majority of genes that respect the “species” phylogeny and the other genes (a minority) will be dubbed “transferred material” or “recombinant”. With pervasive recombination like in viral populations there is no species phylogeny, not even by consensus, and thus again we can not utilize these procedures. The most widely used topology metric is the Robinson-Foulds distance, or simmetric difference [91]. This distance is based on the split decompositions of the two topologies, and is simply the number of edges that have no counterpart in the other topology. An unrooted binary tree T with N leaves will have 2N − 3 edges, where each edge ei , i = (1, · · · , 2N − 3) defines a partition of the leaves into two groups belonging to the opposite ends of the edge. Thus an edge ei will define a split (also called bipartition) B(ei ) of the leaves L such that

|Ω| = N 1 and B(ei ) = [e0 (ei )|e1 (ei )] 1 cardinality

s.t.

e0 (ei ) ∪ e1 (ei ) = Ω,

of Ω, meaning the number of elements in the set

39

e0 ( e i ) ∩ e1 ( e i ) = ∅

3.1. RECOMBINATION DISTANCE

where e0 (ei ) and e1 (ei ) are the leaves separated by edge ei , and Ω represents the whole set of leaves. Then a tree T can be represented by its split set B( T ) = { B(e1 ), · · · , B(e2N −3 )}. If we denote by A1 \ A2 the elements of set A1 which are not elements of A2 , then the RobinsonFoulds distance d RF ( T, T 0 ) between topologies T and T 0 can be defined as d RF ( T, T 0 ) = | B( T ) \ B( T 0 )| + | B( T 0 ) \ B( T )|. In figure 3.4 we have an example showing the split decomposition of two topologies, where we can see that the Robinson-Foulds distance would be two. In this figure we only represent the internal edges (edges not connected to leaves) since the external edges (directly connected to the leaves) will always be in agreement for trees on the same leaf set. We see that in this case d RF will be a number between 0 and 2N − 6. Sometimes the normalized Robinson-Foulds distance d∗RF is used. It is defined as [92] d∗RF ( T, T 0 ) =

| B( T ) \ B( T 0 )| + | B( T 0 ) \ B( T )| | B( T )| + | B( T 0 )|

The attractiveness of the Robinson-foulds distance is that it works with rooted, unrooted, binary, multifurcating, and even topologies with different leaves. We see in figure 3.5 that if the topologies can be explained by one recombination, the Robinson-Foulds distance will in fact give us the number of branches the recombinant subtree traversed. Another promising measure comes from the maximum agreement subtree (MAST), a topology describing the information both original topologies have in common by the largest number of leaves [93, 94]. The algorithms usually proceed by finding the maximum number of leaves two topologies have in common (MAST size), and the leaves are found as an extension to the algorithm. This comes from the fact that the size of the MAST topology is unique, but the topology itself may not be [95]. This extension to find the leaves is usually a backtracking, going from the last calculated value and arbitrarily choosing the leaves amongst the possibilities that respect the MAST size. Equivalently, the cMAST (complement MAST) size is the smallest number of leaves that should be removed from both topologies to make them agree [96]. Both the MAST and the cMAST sizes are metrics, and we will refer to the cMAST size as the cMAST distance between the topologies, since smaller cMAST sizes imply in more similar topologies. It is thus appealing to use the cMAST distance as an estimate of

40

3.1. RECOMBINATION DISTANCE

Topology

bipartitions induced by topology

2

1

4

e1 e3 e2

5

5

4 3

1

1

3 6

e2

3

2

2

1

5 6

e1

6

2

3

5

4

4

e3

6

2 1

3

e’1

4 5

e’2

4

e’3

6

e’1

6

1

5

2

e’2

6

1

3 e’3

4 5

2

2

3

1

3

5

4

6

Figure 3.4: Bipartitions (splits) induced by topologies. The edges e2 and e20 are unique, and so the Robinson-Foulds distance between these topologies would be two. the number of recombinations between two topologies, since both rely on the removal of the minimum number of edges to achieve agreement. The basic algorithm works by a dynamic programming where the MAST size of two topologies T and T 0 is found by iteratively calculating the MAST sizes M(r, r 0 ) of their subtrees rooted at nodes r and r 0 , respectively, for all pairs r and r 0 . If r or r 0 are a single leaf, then M(r, r 0 ) will be one if the leaf is present in both subtrees and zero otherwise. If both are internal nodes, then we denote by r1 and r2 the descendants of r and by r10 and r20 the descendants of r 0 , as in the diagram r r0 @

@ @

r1

@ @ @

r2

r10

@ @

41

r20

3.2. APPROXIMATION TO DSPR

Assuming binary trees, we have that M (r, r 0 ) can be calculated by M(r, r 0 ) = max M(r1 , r10 ) + M(r2 , r20 ), M(r1 , r20 ) + M (r2 , r10 ), M(r, r10 ), M(r, r20 ), M(r1 , r 0 ), M(r2 , r 0 ) For unrooted trees we need to calculate M(r, r 0 ) for all possible orderings of r, r 0 , and for that we may think of r and r 0 as the edges and not the nodes. If we remember that the removal of an edge produces two subtrees rooted at the nodes connected by this edge, then using the diagram @ @ @

@ ra

r

@ @

rb

@

0 @ ra

r0

@

rb0 @

@

@ @ @

@ @

the MAST size of an unrooted topology can be calculated iteratively by M(r, r 0 ) = max M(r a , r 0a ) + M(rb , rb0 ), M(r a , rb0 ) + M (rb , r 0a ) The described procedure has computation time O( N 2 ) for two topologies on N leaves, but many algorithms have been developed for the rooted and unrooted cases [97–99]. Its widespread use may have been hampered by the difficulty in expanding the algorithm to an arbitrary number of topologies [95, 100]. As we see from figure 3.5 the cMAST distance suffers from the same defficiency as the Robinson-Foulds distance whenever the recombinant subtree is larger than the path it traverses.

3.2

Approximation to dSPR

The SPR distance is related to the minimum number of recombination events that took place between two trees [101]. It is possible to use the SPR distance between unrooted topologies as the minimum number of recombinations [83], with the remark that the unrooted dSPR will always be a lower bound of the rooted dSPR since the rooting imposes a time constraint on events [81, 102]. There is a heuristic algorithm implementing the unrooted version of the dSPR , but unfortunately with prohibitive computation time to incorporate in our Bayesian analysis [103]. 42

3.2. APPROXIMATION TO DSPR

b

d

b

d a

c

c

2 7

2

1

7

3 6

5

7

b

3 4

a

d

c

6

5

b

d

6

5

4 2

1 (a) Real recombination represented by pruned edge in red and regraft edge in blue

a 7

6

5

4 3

b

7

6

6

5

c

7 5

4

a

c

2

1

3 4

d

a

c

2

1

b

d a

3 1

4 2

3 1

(b) Robinson-Foulds distance estimate based on edges in disagreement (in red)

(c) MAST topology in black, cMAST leaves in red

Figure 3.5: Example of failure of current distances on estimating the number of recombinations. The true dSPR is one, where the recombinant subtree is ((a,b),(c,d)). Here, we develop a novel algorithm which calculates the approximate dSPR through a label compression technique where equal subtrees in both topologies are replaced by a new leaf [77]. Recalling that a split, or bipartition, is a description of which leaves become disconnected by removal of the edge it represents, a topology T can be uniquely represented by its split set B( T ) = { B(e1 ), · · · , B(e N −3 )} if we consider only its internal edges e1 , · · · , e N −3 [104]. Then for given two topologies T and T 0 we can classify its edges B( T ) and B( T 0 ) into equivalent BE ( T ), BE ( T 0 ) and nonequivalent edges BN ( T ), BN ( T 0 ). They represent the set of identical and distinct edges on both topologies, respectively, as BE ( T ) = BE ( T 0 ) = B ( T ) ∩ B ( T 0 )

43

3.2. APPROXIMATION TO DSPR

and BN ( T ) = B( T ) \ B( T 0 )

BN ( T 0 ) = B( T 0 ) \ B( T )

The number of nonequivalent edges between the topologies (| BN ( T )| + | BN ( T 0 )|) is their (unnormalized) Robinson-Foulds distance [91]. For binary trees, as is always the case in our study, we have also that | BN ( T )| = | BN ( T 0 )|. The label compression can then be accomplished by iteratively looking at the bipartitions in BE ( T ) where there is e0 (ei ) or e1 (ei ) with exactly two leaves, and then replacing all occurrences of these leaves by a new leaf, as shown in figure 3.6. Ties (when both |e0 (ei )| = 2 and |e1 (ei )| = 2) are broken by an arbitrary ordering of the leaves. That is, we search over BE ( T ) for a split BE (ei ) such that |e0 (ei )| = 2 or

|e1 (ei )| = 2, and if it is found we remove all the occurrences of one of the elements of e0 (ei ) (assuming |e0 (ei )| = 2) from BE ( T ), BN ( T ) and BN ( T 0 ) The topologies on figure 3.6 are the same as in figure 3.5 (p. 43), where other methods failed. If we know that T 0 can be explained by one subtree prune-and-regraft (SPR) on T, then BN ( T ) (equivalently BN ( T 0 )) is in fact the path traversed by the pruned subtree (recombinant taxa) on T (T 0 ) as we saw in figure 3.5. In this case for some mapping of edges in BN ( T ) to edges in BN ( T 0 ) we will find the pruned subtree amongst the disagreements in this mapping. This should be the smallest disagreement, after the label compression. Our approximation lies in assuming that the recombinant subtree is represented by one split - the split partition with smallest number of leaves, in fact. For B(ei ) ∈ BN ( T ) and B(e0j ) ∈ BN ( T 0 ) let us define a disagreement split Bd (ei , e0j ) =

[e0 (ei , e0j )|e1 (ei , e0j )] as the two possible set of leaves that should be removed from ei and e0j such that B(ei ) \ e0 (ei , e0j ) = B(e0j ) \ e0 (ei , e0j ) (or, equivalently, B(ei ) \ e1 (ei , e0j ) = B(e0j ) \ e1 (ei , e0j )). Formally, we have e0 (ei , e0j )

=

e0 (ei ) \ e0 (e0j )

∪

e0 (e0j ) \ e0 (ei )

=

e1 (ei ) \ e1 (e0j )

∪

e1 (e0j ) \ e1 (ei )

and e1 (ei , e0j ) = e0 (ei ) \ e1 (e0j ) ∪ e1 (e0j ) \ e0 (ei ) = e1 (ei ) \ e0 (e0j ) ∪ e0 (e0j ) \ e1 (ei ) After calculating the disagreement split between all pairs of edges, we elect the smallest 44

3.2. APPROXIMATION TO DSPR

b

d b

d

c

a

c

a 7 6

5 2 1

4

7 3

2 4

6

5

3 1

(a) Original topologies

1

3 3

a

4 a

6 4

1

5

5

6

(b) Leaf-compressed topologies

Figure 3.6: Example of leaf compression thecnique on two topologies. The subtrees depicted in color on top panel are equivalent on both topologies. They are then replaced by a new leaf, uniquely labelled. The dSPR between the leaf-compressed topologies is the same as between the original topologies. set of leaves found among them (that is, e0 (· , ·) or b1 (· , ·)), with ties broken by same leaf ordering described before. In other words, searching for the smallest subtree in the mapping is equivalent to choosing the smallest disagreement split min{|e0 (ei , e0j )|, |e1 (ei , e0j )|} amongst all possible edge pairs ei , e0j belonging to BN ( T ) and BN ( T 0 ). An example of such a mapping is given in figure 3.7. Thus once we find this split we remove the leaves it represents from both topologies and increment dSPR by one. The procedure of label compression and removal of smallest disagreement split is repeated untill all edges are in agreement between the topologies (BN ( T ) = BN ( T 0 ) = ∅), and the approximate recombination distance between the topologies will be the iteration count of the procedure. If we assume the comparison between splits can be done in linear time, then the algorithm 45

3.2. APPROXIMATION TO DSPR

1

3

3

4 a

a

6

1 4

6

5

(b) T 0

(a) T

4

1

4

a

e1

a

1

3

6

e1 e30

3

6

a

e3 e10

a

3

6

a

4

e2 e20 a

1

5

4

6

5

e2 e10

3

4

e3 e20

5

1

4

3

4

5

4

e2 e30 1

6

6

1

1

5

3

a

3

e3

5

1

6

5

1 4

6 1

5

3

e1 e20

4 e2

a

4

3

a

5

3

4

1

3

e1 e10

6

e30

a 6

a

1

a

6

5

e20

1

5

6

5

a

3

3

5

1

4

3

6

e10

3

4

1

a

5

4

5

6

a

e3 e30

6 5

Figure 3.7: Mapping between edges and search for smallest disagreement split between topologies T and T 0 (from figure 3.6), at top panel. At bottom panel, we have all pairwise comparisons between edges from both topologies, where the smallest disagreement (one leaf) is highlighted in red. It is in fact the recombinant “sequence” (after leaf compression).

46

3.3. RESULTS

has complexity O(dN 2 ) where d is the recombination distance between the topologies on N leaves, since both the leaf compression and the mapping procedures have complexity O( N 2 ) and the procedure is repeated d times. This is a major bound for the complexity time, since at each iteration the number of “active” leaves (considering the leaf compression) is reduced (thus N may be replaced by log( N )), but on the other hand the calculation of the cardinality of a split takes O( N ) time. We should bear in mind that in many cases there may be several recombination histories involving the minimum number of recombinations. This algorithm can be extended to report the recombinant sequences instead of the number of recombinations. We have tried several other ad-hoc procedures, including MAST distance on reduced trees, but the one presented here was empirically the most successful. One simple case where our procedure fails is when the smallest set of leaves has the two pruned subtrees, but our procedure counts it as one SPR - taking the number of leaves into account decreases the performance for many other cases.

3.3

Results

In order to check the performance of our recombination distance (dSPR ) algorithm, we simulated recombination through applying subtree prune-and-regraft (SPR) moves on a random topology [105] and then estimated the distance between the original and recombinant topologies. Simulating recombinant topologies is not flawless as well [87], since a sequence of SPR moves may contain cycles – moves where two or more SPRs cancel each other out – and thus the resulting topology may be explained by less than the actual number of applied SPR operations [106]. An example of such a scenario is found in figure 3.8. In our simulations we try to circumvent this problem by allowing branches to participate in only one SPR move and by simulating recombination on large phylogenies. Other strategy based on exploiting the recombination neighbourhood of topologies [106] gave similar results with a much higher computational burden for simulation. In Figure 3.9 we have a comparison between our dSPR approximation and other distance measures (namely the Robinson-Foulds distance and the cMAST distance) for topologies with 16, 32 and 64 leaves (the largest our implementation can handle). We observe that not only the order of the values is much more accurate for our method but the variance is smaller, with a slight overestimation. This overestimation may be result of the heuristic nature of 47

3.3. RESULTS

2

2

1

2

7

7

7

1

2

1

1

7 3

5

3 4

6

3

3

6

6

6

4

4 5

5

4 5

(a) T1 , original topology (b) T2 , one SPR from(c) T3 , one SPR from(d) T4 , one SPR from T1 T2 and two SPRsT3 but only one SPR from T1 from T1 since T4 = T2

Figure 3.8: Example where a series of SPR moves cancel each other out. From left to right, the SPR move follows the prune edge in red and the regraft edge in blue. Despite three SPR moves were applied, the distance between the original and the resulting topologies is only one. the algorithm, but we believe that the inability in simulating topologies with an exact SPR distance is the main issue [87]. Underestimated values are certainly a failure of the algorithm. The fraction of correctly estimated distances is displayed in figure 3.10 for tree sizes ranging from 8 to 64 leaves. The approximation is very good for small values of dSPR and we observe a performance decrease when dSPR increases. The same behaviour is observed for smaller trees, with the observation that the performance decreases faster in these cases - probably since the failure in simulating recombination becomes more evident. For simulations of one recombination, the procedure always gives the correct answer.

48

3.3. RESULTS

Figure 3.9: Density plot of the performance of recombination distance measures for trees with 16, 32 and 64 leaves. The first column shows results using the Robinson-Foulds distance, the second column using the MAST distance, and the third column shows our new approximation distance, dSPR . On the x-axis we have the number of applied SPR operations as a proxy of the “real” recombination distance, while in the y-axis we have the estimated distance using each of the estimators. The true values are represented by the red line. For each point in the x-axis we simulated 1000 topology pairs with the given distance, and the density of estimated values are represented by shades of blue, where darker points represent higher densities and black dots are outliers. This figure was obtained through a kernel density estimate over the original scatterplot [107].

49

3.3. RESULTS

8

12 16 24 32 48 64 tree size

4

recombination 6 8 10 12

14

16

SPR

2

2

4

recombination 6 8 10 12

14 2

4

recombination 6 8 10 12

16

MAST 0.2 0.4 0.6 0.8 1

14

16

Robinson−Foulds

8

12 16 24 32 48 64 tree size

8

12 16 24 32 48 64 tree size

Figure 3.10: Comparison of overall performance between recombination distance measures, for trees ranging from 8 to 64 taxa. The first column shows results using the Robinson-Foulds distance, the second column using the MAST distance, and the third column shows our new approximation distance, dSPR . On the x-axis we have the tree size (in number of leaves) and on the y-axis we have the number of applied SPR operations as a proxy of the “real” recombination distance. Each point represents the number of successes over 1000 calculations, and the colors represent this fraction, following the legend on the first panel.

50

Chapter 4 Bayesian Methods and Phylogenetics When we design an experiment or observe a phenomenon in a systematic way, we have believes about the behaviour or design of this object of study even before obtaining the relevant data. After the data collection, our believes may change to varying degrees depending on how strong were our prior believes and how strong is the new information contained in the data. This methodological reasoning is the basis of Bayesian inference, where all measurable quantities are assumed to follow a probabilistic distribution. Our believes before seeing the data are incorporated into prior distributions for the parameters θ governing the data. The distribution of the data given the parameters is defined by the likelihood function, which in fact is the likelihood of the parameters after observing the data. The effect of the data over our prior believes gives us the posterior distribution of the parameters, and we are usually interested in some function of these parameters. An attractive argument in favor of Bayesian procedures is that instead of having a single point estimate of the parameter of interest, we have its distribution. Other advantages include the possibility of exploiting arbitrarily complex models (within the limits of overparametrization and mixing [108]) and choosing the prior distributions to achieve a manageable level of abstraction. If we represent the data vector by X and the parameter vector by θ, then the posterior distribution can be calculated using Bayes’ theorem: P(θ | X) =

P(X | θ) P(θ) , P(X)

P(X) =

Z

P(X | θ) P(θ)dθ

51

(4.1)

4.1. MARKOV CHAIN MONTE CARLO

where P(X) is called the marginal likelihood of data X, since it is marginalized (integrated) over the parameters. As an example, suppose we want to infer the fairness of a coin in a coin tossing experiment by counting the number x of tails in n flips. If we call q the probability of tail in each experiment, then the likelihood of q will be given by a binomial distribution on x: n x P( x | q, n) = θ (1 − q ) n − x , x

x = 0, 1, · · · , n

since the total number of flips is decided beforehand. Our prior believes about q can be accomodated by a beta distribution, like P ( q | α1 , α2 ) =

Γ ( α 1 + α 2 ) α1 −1 q (1 − q ) α2 −1 , Γ ( α1 ) Γ ( α2 )

0 ≤ q ≤ 1.

For appropriate choices of α1 , α2 > 0. Then applying Bayes’s theorem 4.1 we have that the posterior distribution of q is given by [109] P(q | x, α1 , α2 , n) =

P( x | q, n) P(q | α1 , α2 ) Γ ( α1 + α2 + n ) = q α1 + x −1 (1 − q ) α2 + n − x −1 . P ( x | α1 , α2 , n ) Γ ( α1 + x ) Γ ( α2 + n − x )

which is again a beta distribution. In figure 4.1 we have three realizations of this experiment, where we have used a biased coin with tails probability equal to 0.8 and for simplicity we assume x took the typical value qn. The two panels represent different prior believes about the fairness of the coin. In panel a we can observe that an uninformative prior leads to a posterior distribution coincident with the likelihood, and panel b represents a strong belief in the fairness of the coin. We can then observe that as we accumulate data the influence of the prior becomes negligible, and in our example the unfairness of the coin becomes evident.

4.1

Markov chain Monte Carlo

In the above example, the marginal likelihood of the data P(X) could be calculated analytically, but this is not always the case. In multidimensional problems, as the number of parameters increase the integral becomes more difficult to calculate. In hierarchical models, where the prior parameters themselves are assumed to be random variables, this is most surely the case. In the above example, a hierarchical setting would be to assume that α1 and 52

4.1. MARKOV CHAIN MONTE CARLO

1.0 0.8

prob

0.2 0.0 0.4

0.6

0.8

1.0

0.0

0.4

q

n= 50

n= 50

0.6

0.8

1.0

0.6

0.8

1.0

0.6

0.8

1.0

0.8 0.6

prob

0.0

0.2

0.4

0.8 0.6 0.4 0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

q

q

n= 500

n= 500 0.8 0.6 0.2 0.0

0.0

0.2

0.4

prob

0.6

0.8

1.0

0.2

1.0

0.0

0.4

prob

0.2

q 1.0

0.2

1.0

0.0

prob

0.6

0.8 0.6 0.4 0.0

0.2

prob

n= 5 prior likelihood posterior

0.4

1.0

n= 5 prior likelihood posterior

0.0

0.2

0.4

0.6

0.8

1.0

0.0

q

0.2

0.4

q

(a) Uninformative prior Beta(1,1)

(b) Informative prior Beta(20,20)

Figure 4.1: Example of posterior distribution calculation for the biased coin tossing experiment. The probability of tails is 0.8, and the two panels represent different prior believes. The probabilities were standardized to fit into the same scale. α2 are not constants, but follow some probabilistic distribution. This distribution is referred to as the hyperprior distribution, and is governed by its hyperparameters. There is an umbrella of techniques very efficient at calculating multidimensional integrals and multidimensional optimization in general called Monte Carlo methods. The basic idea behind Monte Carlo methods is to observe the fraction of randomly sorted numbers displaying the desired property - e.g., within the bounds of some function. In Bayesian analysis we work with a class of Monte Carlo techniques called Markov chain Monte Carlo (MCMC), where each iteration of the sampler depends only on the previous state. The most commonly used MCMC algorithms are the Metropolis-Hastings algorithm and the Gibbs sampler, and the simplest case is the rejection sampling technique, which already incorporates the rea-

53

4.1. MARKOV CHAIN MONTE CARLO

soning behind all other procedures. The issue of when and how to sample the simulated parameters as an approximation to the posterior distribution is discussed in Chapter 5.

4.1.1

Rejection sampling

If we want to obtain a draw from the distribution P(θ | X ) or even the unnormalized distribution Pu (θ | X ) = P( X | θ ) P(θ ), this can be accomplished if we can sample from a probability density proportional to a function g(θ ). This envelope function g(θ ) must have a finite integral and the importance ratio Pu (θ | X )/g(θ ) must have a bound M. The rejection sampling procedure thus consists in sampling values θ from the distribution proportional to g(θ ) and accepting it with probability Pu (θ | X )/( Mg(θ )), untill a value is accepted. In figure 4.2 we have an example with the envelope function and the target distribution. In this example we simple chose a square envelope, from wich we sample a value uniformily between between 1 and 9. Notice that the sampling procedure would not be affected if we had chosen a square envelope function with another bound factor (smaller than one), and that

1

2

this particular choice of the envelope, despite simple, will not be very efficient.

0

2

4

θ

6

8

10

Figure 4.2: Example of the rejection sampling technique, where the target distribution Pu (θ | X ) is represented by the red curve and the envelope function g(θ ) is shown in blue. In this case the bounding factor is 1/2, and the scale is arbitrary.

4.1.2

Gibbs sampler

In multidimensional problems, when we can not sample directly from P(θ | X ) but we can sample from subvectors of θ conditional on the remaining subvectors, we can obtain a 54

4.1. MARKOV CHAIN MONTE CARLO

sample from P(θ | X ) by sampling over all subvectors in an arbitrary order. Each subvector θi , i = 1, · · · , d may be a single parameter or a group of parameters (e.g. following the same distribution) and if we represent the vector of parameters as θ = (θ1 , · · · , θd ) then, at iteration t each subvector θit is updated by sampling from the distribution t −1 P ( θi | θ − i , X)

where t −1 t −1 t −1 t t θ− i = ( θ 1 , · · · , θ i −1 , θ i +1 , · · · , θ d )

if we always update θi before θi+1 for all i ∈ (1, · · · , d − 1). Even when we can not sample from the conditional distribution, it is still possible to use the Gibbs sampler if we use some other technique to sample from each conditional.

4.1.3

Metropolis-Hastings

The Metropolis-Hastings algorithm is based on a jumping distribution (also called proposal distribution) q(θ ∗ | θ ) from which we sample a new state θ ∗ based on the current state θ. This new state will then be accepted with probability a(θ ∗ ) = min(1, A(θ ∗ )) where A(θ ∗ ) =

P( X | θ ∗ ) P(θ ∗ ) q(θ | θ ∗ ) . P( X | θ ) P(θ ) q(θ ∗ | θ )

(4.2)

The variable a(θ ∗ ) is called the acceptance ratio, and in A(θ ∗ ) the first term is the posterior ratio while the second term is the proposal ratio. The Metropolis algortihm is a special case where q(θ ∗ | θ ) = q(θ | θ ∗ ) and then the proposal ratio vanishes. The Metropolis-Hastings algorithm is sensitive to the choice of the proposal distribution, and two common approaches are to choose a proposal distribution as close as possible to the posterior distribution, or to employ a sensible random-walk on the parameter space. This technique is particularly useful since several common terms are cancelled out in the posterior ratio, and can be used in conjuction with the Gibbs sampler for difficult conditional distributions. Thus each parameter (or block of parameters) can be updated using a Gibbs sampler, where the sampling from the conditional distribution is carried out using a Metropolis-Hastings acceptance.

55

4.2. BAYESIAN PHYLOGENETICS

4.2

Bayesian phylogenetics

The primary parameters of interest in phylogenetic studies are related to the phylogenetic tree. Bayesian methods have been used successfully in testing competing phylogenies [110], detecting recombination between sequences as incongruence between phylogenies [7, 111, 112] or using the coalescent [113–115], incorporating uncertainty on sequence alignment [116], inference of demographic history [117–119] and relaxed molecular models [119–122], amongst others. The incorporation of phylogenetic studies into a fully Bayesian perspective started in 1996 1 [123–125], but it appeared as a valid alternative to other phylogenetic methods (as parsimony and maximum likelihood) only a few years ago [126]. One obstacle to its widespread use was the computational burden, which has been since then overcome by faster computers. But other problems still persist. One is related to the distribution of topologies, since it is hard to describe our prior believes about possible phylogenies as a distribution [127]. Another is to design an efficient sampling scheme from the posterior distribution, since in most cases or the sampled values will be strongly correlated - and for topologies even this correlation is hard to infer - or we may be exploiting only a fraction of reasonable topologies [128]. There are Bayesian strategies that cope with uncertainty in the number of variables (dimensions), and these are called trans-dimensional Markov chain Monte Carlo (MCMC) or reversible jump MCMC [129]. These methods ask for very efficient sampling schemes, since when moving across dimensions the corresponding distributions may be very different. A particularly interesting strategy is to run a ”mini-sampler” after trying a change in dimension [130], and we will see how this scheme works even when the dimension (number of parameters) is constant.

4.3

Bayesian model for recombination

In this section we describe our new phylogenetic model that takes recombination into account and in the next section we show how to sample efficiently from this distribution.

56

4.3. BAYESIAN MODEL FOR RECOMBINATION

Figure 4.3: Example of DNA alignment.

4.3.1

Likelihood of one site

Our primary data is the DNA sequence alignment of putative recombinants, like in figure 4.3. We use the standard evolutionary model where the nucleotide substitution process at a given site is described by a continuous-time Markov chain and a phylogenetic tree describing the ancestral relations between extant taxa [131]. The infinitesimal substitution probability matrix Q for the Markov chain follows the Hasegawa-Kishino-Yano (HKY) model [14]: T T

C Q= A G

C

A

G

1 − κπC − π R

κπC

πA

πG

κπ T

1 − κπ T − π R

πA

πG

πT

πC

1 − κπG − πY

κπG

πT

πC

κπ A

1 − κπ A − πY

(4.3)

where the parameter κ is the transition:transversion ratio, π = (π T , πC , π A , πG ) is the vector of equilibrium frequencies of each base and πY = πC + π T , π R = π A + πG . If we classify the nucleotides into purines and pyrimidines, the transition:transversion ratio incorporates the substitution bias between transitions (within the same class) and transversions (between different classes). We notice that this matrix satisfies the reversibility condition, since πi Qij = π j Q ji , and this is fundamental when working with unrooted topologies as we will see shortly. The HKY is the most complex model that can be solved analytically for arbitrary time intervals, and is a good compromise between accuracy and simplicity [132]. The probability 1 Despite

the publications [123, 124] appear later, they relate to the authors’ Ph.D. theses, dated 1996

57

4.3. BAYESIAN MODEL FOR RECOMBINATION

P(i | j, t, κ, π ) of going from state j to state i in time t is given by 4

P(i | j, t, κ, π ) =

∑ exp(tΨk )Zik Zkj−1

(4.4)

k =1

where Ψ is the matrix of eigenvalues and Z, Z−1 are the matrices of eigenvectors, given by

0

−1 Ψ= −πY − κπ R −π R − κπY

and

1 1/πY 0 πC /πY 1 1/πY 0 −π A /πY Z= 1 −1/π R πG /π R 0 1 −1/π R −π A /π R 0

;

Z−1

π πR πT 0 1 T πC π R πC 0 −1 = π A − πY π A 1 0 π G − πY π G − 1 0

The matrix Q in equation 4.3 is scaled to an overall rate of one, so that the time-scale is given in substituions. This means we usually work with Q0 = Q/u, where the overall rate u =

− ∑i πi Qii is, for the HKY model, u = 2 [π R πY + κ (π T πC + π A πG )] The probability P( X | T, t, θ M ) of observing a site pattern X for a column of the alignment given the unrooted phylogenetic tree T with branch lengths vector t under the evolutionary model with parameters θ M is the likelihood of ( T, t, θ M ). It can be found by recursion, if we observe that the partial likelihood Lr (z | t, θ M ) of state z = T, C, A, G for the subtree rooted at an internal node r depends only on the partial likelihoods L a ( x | t, θ M ) and Lb (y | t, θ M ) of its descendants, assuming a binary tree: Lr (z | t, θ M ) =

∑ P(x | z, ta , θ M ) La (x | t, θ M ) ∑ P(y | z, tb , θ M ) Lb (y | t, θ M )

x

y

for a subtree with branch lengths t a , tb ∈ t given as 58

(4.5)

4.3. BAYESIAN MODEL FOR RECOMBINATION

r @ @ tb @ @

ta

a b Thus we can calculate the partial likelihoods recursively starting at the leaves up to some arbitrary internal node [84]. This arbitrariness is only possible for time-reversible models. The states at the terminal nodes should fit to the observed data. If we denote the vector of partial likelihoods at a node r by Lr , then the partial likelihood at a leaf l is defined as

1 0 Ll = for base T 0 0

0 1 Ll = for base C 0 0

0 0 Ll = for base G 1 0

0 0 Ll = for base A 0 1

For an unrooted tree, the likelihood at a site may then be obtained by choosing arbitralily an edge [r, r0 ] and summing over the sites as P( X | T, t, θ M ) =

∑∑ x

y

π x Lr ( x | t, θ M ) P(y | x, tr , θ M ) Lr0 (y | t, θ M ) .

(4.6)

An example of such an edge is given below: a c @ ta @ @

tc @

tb

r

tr

r0 @

[email protected] d @ @

d b Since we are assuming the HKY model, θ M = (κ, π ).

4.3.2

Marginalizing over individual branch lengths

In our model, individual branch lengths are a nuisance parameter in the sense that we are not interested in their posterior distribution. Besides that t does not have meaning across regions supporting distinct topologies. But the largest advantage of this marginalization is that our model can then accomodate for rate heterogeneity across lineages, since we average out individual branch lengths. If we assume, like in [133], that all branch lengths t j follow an exponential distribution with mean µ and are independent among branches, we can integrate 59

4.3. BAYESIAN MODEL FOR RECOMBINATION

equation 4.4 over branch lengths: P( x | y, µ, κ, π ) =

Z

P( x | y, t j , κ, π ) P(t j | µ)dt j =

Z

P( x | y, t j , κ, π )1/µ exp(−t j /µ)dt j .

For the HKY model (and other time-reversible models in general) it can be calculated analytically through [133, 134] 4

P( x | y, µ, κ, π ) =

−1 Zik Zkj

∑ 1 − ψk µ .

k =1

This way the likelihood at a site does not depend on the branch length vector t. We will refer to µ as the average substitution rate for a site. The likelihood in equation 4.6 reduces to P( X | T, µ, κ, π ) =

∑∑ x

y

π x Lr ( x | µ, κ, π ) P(y | x, µ, κ, π ) Lr0 (y | µ, κ, π ) .

(4.7)

where the partial likelihoods (equation 4.5) are given by Lr (z | µ, κ, π ) =

∑ P(x | z, µ, κ, π ) La (x | µ, κ, π ) ∑ P(y | z, µ, κ, π ) Lb (y | µ, κ, π )

x

(4.8)

y

It is worth mentioning that the integration2 Et [ Q(t)] over individual branch lengths for a site is not the same as assuming a fixed branch length t∗ for all branches since, in general, does not exist t∗ such that Et [ Q(t)] = Q(t∗ ). In other words, this marginalization regards branches at each site as independent realizations from random variables. We can confirm this by noticing that in general we can not find t∗ such that 4

−1 Zik Zkj

4

∑ 1 − ψk µ = ∑ exp(t∗ ψk )Zik Zkj−1

k =1

4.3.3

k =1

Segmentation model

The whole alignment X is assumed to be decomposed into K consecutive segments, as represented in figure 4.4. Neighboring segments may have different topologies due to recombinations, but we assume that all sites within a segment share the same topology. Our 2 Our

marginalization of the transition matrix Q as a function of branch length t

60

4.3. BAYESIAN MODEL FOR RECOMBINATION

Figure 4.4: DNA alignment divided into K segments, represented by different colors. The only places where recombination or change in model parameters are allowed are represented by the black vertical lines (on the border between two segments). procedure estimates the recombinant regions as a subset of K and fixes the evolutionary parameters within a segment. Therefore these segments can be arbitrarily small and should represent all regions with potentially conflicting phylogenetic signal. Each segment has its own ratio κi (i = 1, . . . , K ) of transitions to transvertions and average substitution rate µi . The equilibrium frequencies of nucleotides π are shared among all segments. They are empirically calculated from observed frequencies in the alignment and we will supress it from the notation. We write X = ( X1 , · · · , XK ) where Xi = ( Xi1 , . . . , Xini ) is the vector of alignment positions belonging to segment i and denote the topology of the segment i by Ti . Since all

between-segment parameters

d1 λ1 w1

d2 λ2 w2

d3 λ3 w3

d K −2 λ K −2 w K −2

d K −1 λ K −1 w K −1

within-segment parameters

µ1 κ1 T1

µ2 κ2 T2

µ3 κ3 T3

µ4 κ4 T4

µ K −2 κ K −2 TK −2

µ K −1 κ K −1 TK −1

µK κK TK

Figure 4.5: Parameters used in our model for each segment and between segments. Segments are represented by different colors, where the border is shown by vertical black lines. parameters are shared within a segment we have that the likelihood of segment i is given by: ni

P( Xi | Ti , µi , κi ) =

∏ P(Xih | Ti , µi , κi ).

h =1

Despite all sites within a segment i share µi , the marginalization over branches is in fact applied to each branch and to each site separately. This way the model allows the branch lengths 61

4.3. BAYESIAN MODEL FOR RECOMBINATION

to vary among sites while fixing the tree topology. When a large number of sequences are analyzed, our model assumes that the average branch length is common among sites within a segment but allows variable rates among segments. By partitioning the alignment into short segments (e.g. less than 10 base pairs) our procedure takes account of rate heterogeneity among sites, with better modelling for smaller segments [135, 136]. The marginalization over individual branches and the assumption of indepence among segments should accomodate for rate heterogeneity among lineages and sites. In our hierarchical setting the transition:transversion ratios κi and average substitution rates µi are independent from each other and among the segments, and follow exponential distributions with means µ0 and κ0 respectively. Further, µ0 and κ0 follow exponential distributions with means M and K respectively. Namely, P(µi | µ0 ) = (1/µ0 ) e−µi /µ0

i = (1, · · · , K )

P(κi | κ0 ) = (1/κ0 ) e−κi /κ0

i = (1, · · · , K )

P(µ0 | M) = (1/M) e−µ0 /M ;

P(κ0 | K) = (1/K) e−κ0 /K ;

All topologies are equaly likely a priori, and we introduce a prior distribution for the recombination distance between the topologies of neighboring segments. This prior sets a penalty against inconsistencies of topologies which need several recombinations (SPR moves) to be resolved. Denoting the recombination distance (dSPR ) at breakpoint i by di , our prior distribution is described as a modified truncated Poisson: d ( wi +1)

P ( d i | λ i , wi , m ) =

e − λ i ( wi +1) λ i i

η ( λ i , w i , m ) d i ! ( wi +1)

.

Here, m

η ( λ i , wi , m ) =

∑

d =0

d ( wi +1)

e − λ i ( wi +1) λ i

d!(wi +1)

is the normalizing constant to account for the fact that any two topologies with N taxa cannot have an SPR distance larger than N − 3, and wi is the weight on the penalty. The distance di between topologies at segments i and i + 1 is calculated using the approximation dSPR described in Chapter 3 (p. 34). The Bayesian hierarchical model incorporates further hyper62

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

priors to account for the uncertainty on the strength of the penalty. That is, λi and wi follow gamma distributions whose hyper-parameters αλ , β λ , αw and β w are shared across segments: α

P ( λi | α λ , β λ ) =

β λλ αλ −1 − β λ λi λ e Γ(αλ ) i

P ( wi | α w , β w ) =

βαww αw −1 − β w wi w e Γ(αw ) i

i = (1, · · · , K − 1)

i = (1, · · · , K − 1)

where Γ(α) is the gamma function [108].

4.4

Sampling from the posterior distribution

If we represent the parameter vector by θ, then the posterior probability can be written as

P(θ | X ) ∝

h

K

∏ P(Xi | Ti , µi , κi ) P(µi | µ0 ) P(κi | κ0 )

i

i =1

×

h K −1

∏ P ( d i | λ i , wi , m ) P ( λ i | α λ , β λ ) P ( wi | α w , β w )

i

i =1

× P(µ0 | M) P(κ0 | K) This distribution is numerically simulated by Metropolis coupled Markov chain Monte Carlo (MC-MCMC) [128]. We employ a Metropolis-within-Gibbs sampler where all parameters are updated sequencially (systematic-scan) , and the acceptance probability ah (θi∗ ) of a candidate state θi∗ given its current state θi is given by ah (θi∗ ) = min(1, Ah (θi∗ )) where Ah (θi∗ )

=

P( Xi | θi∗ ) P(θi∗ )

h

q(θi | θi∗ )

[ P( Xi | θi ) P(θi )]h q(θi∗ | θi )

.

(4.9)

P(θi ) is a shorthand for the prior distribution of parameter θi and q(· | θi ) is the proposal distribution. The parameter h (0 < h ≤ 1) is the heat value of the chain, and states sampled from the cold chain (h = 1) form an approximation of the posterior distribution. We run two chains concurrently, one cold and one heated (0 < h2 < 1), such that swap of states between

63

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

them are accepted with probability a(h1 , h2 ) = min(1, A(h1 , h2 )) where A ( h1 , h2 ) =

P ( θ h1 | X )

h2

P ( θ h2 | X )

P ( θ h2 | X )

h2

P ( θ h1 | X )

h1 h1 .

(4.10)

Here θhi represents the parameter vector θ of chain hi . This heating scheme is the basis of the MC-MCMC, since in multidimensional problems the chain may get trapped in a local optimum. The advantage of a heated chain is demonstrated in figure 4.6, where observe that the heated chain is a “flatter version” of the posterior distribution, allowing new states to be accepted more often. If we don’t want to allow large jumps between states, which may be the case in multidimensional problems since the acceptance probability will be usually too low, then a single, cold chain may not explore all the parameter space. Once it finds a local optimum (distribution mode) it may take a long time until a state far from this mode (and maybe closer to another mode) is accepted. When the parameter is a topology, the situation is more complicated. Not only we don’t want to propose a new topology completely unrelated to the current one (since only a tiny fraction of the (2N − 5)!! = 1 × 3 × · · · × (2N − 5) possible topologies will have a significant likelihood) but also we can not. In Chapter 3 we saw the possible branch swapping algorithms that generate topological diversity: they comprise our random topology generators, and usually consecutive topologies will be highly correlated. Like for other, real-valued parameters, there is a trade-off between acceptance rates and mixing of the chain, and MC-MCMC is one way of improving the mixing without sacrificing the

0.0 0.2 0.4 0.6 0.8 1.0

P(x)

frequency of acceptance.

0

5

10

15

20

x

Figure 4.6: Comparison between cold (black) and heated (red) distributions. Despite the heated distribution is different from the posterior (cold), the gap between the peaks is narrower than in the original distribution, making jumps between states being accepted more often. The distributions were rescaled to fit between zero and one.

64

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

4.4.1

Update of continuous variables

For the continuous variables, namely µi , κi , µ0 , κ0 , λi , wi a uniform random variable u ∼ uniform(0, 1) is drawn and the candidate state is set as θi∗ = θi eξ θi (u−0.5) where ξ θi is a tuning parameter. This choice leads to a uniform on θi for e−ξ θi /2 < θi∗ /θi < eξ θi /2 . The proposal ratio for these cases is q(θi | θi∗ ) θi∗ = q(θi∗ | θi ) θi

since

q(θi∗ | θi ) =

1 θi∗ ξ θi

Thus for the update of the segment average substitution rate µi , i = 1, . . . , K the acceptance probability reduces to (dropping the subscript h) A(µi∗ )

P( Xi | Ti , µi∗ , κi ) P(µi∗ | θ0 ) q(µi | µi∗ ) = P( Xi | Ti , µi , κi ) P(µi | θ0 ) q(µi∗ | µi )

since all other terms cancel out. In the same way, we have that the per segment transition:transversion ratio κi , i = 1, . . . , K is updated in accordance with A(κi∗ )

P( Xi | Ti , µi , κi∗ ) P(κi∗ | κ0 ) q(κi | κi∗ ) = P( Xi | Ti , µi , κi ) P(κi | θ0 ) q(κi∗ | κi )

Their priors µ0 and κ0 are updated following h

i ∏iK=1 P(µi | µ0∗ ) P(µ0∗ | M) q(µ0 | µ∗ ) 0 i A(µ0∗ ) = h ∗ |µ ) K q ( µ 0 0 ∏i=1 P(µi | µ0 ) P(µ0 | M) and h

i ∏iK=1 P(κi | κ0∗ ) P(κ0∗ | K) q(κ0 | κ ∗ ) 0 i A(κ0∗ ) = h ∗ |κ ) K q ( κ 0 0 ∏i=1 P(κi | κ0 ) P(κ0 | K) The parameters λi , wi , i = 1, . . . , K − 1 from the dSPR distribution have acceptance probabilities given by A(λi∗ ) =

P(di | λi∗ , wi , m) P(λi∗ | αλ , β λ ) q(λi | λi∗ ) P(di | λi , wi , m) P(λi | αλ , β λ ) q(λi∗ | λi )

65

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

and A(wi∗ )

4.4.2

P(di | λi , wi∗ , m) P(wi∗ | αw , β w ) q(wi | wi∗ ) = P(di | λi , wi , m) P(wi | αw , β w ) q(wi∗ | wi )

Update of topologies

To update the topologies, we always consider a block of consecutive segments which share the same topology. Let j1 and j2 be two segments such that Ti = Ti+1 for all i ∈ ( j1 , . . . , j2 − 1). If we call this topology TB , then our proposal topology TB∗ will be accepted with probability h A( TB∗ ) = h

× ×

j ∏i2= j1

P( Xi | TB∗ , µi , κi )

i i

j ∏i2= j1 P( Xi | TB , µi , κi ) P(d∗j1 −1 | λ j1 −1 , w j1 −1 , m) P(d∗j2

| λ j2 , w j2 , m)

P(d j1 −1 | λ j1 −1 , w j1 −1 , m) P(d j2 | λ j2 , w j2 , m) q( TB | TB∗ ) ×L q( TB∗ | TB )

since all segments inside the block share the same topology. The constant L refers to the proposal ratio, which is usually one. In our model the number of parameters is constant (since even the topologies are distinct for every segment) but updating all segments independently would have a very poor mixing. We thus borrowed ideas from reversible-jump MCMC (rj-MCMC) [133, 137] to increase and decrease the number of recombination break-points, and to change its location. In changepoint models the frequency of addition and removal updates of a change-point should ensure detailed balance [137]. For us it is enough to set up the frequency of break-point addition f , f ≤ 0.5 as being equal to the frequency of break-point removal. With probability (1 − 2 f ) a change in break-point location (shift) is tried. If we have Ti = Ti+1 for all i ∈ (k1 , . . . , k2 − 1) with both dk1 −1 and dk2 non-zero, then

(k1 , · · · , k2 ) is the largest non-recombinant region between k1 and k2 . The removal of one recombination break-point is equivalent to chosing TB∗ to be equal to Tk1 −1 or Tk2 +1 (with equal probability). The addition of a break-point can be tried by setting j1 = k1 + 1, · · · , k2 or j2 = k1 , · · · , k2 − 1 using the above formula, with TB∗ different from the border topologies Tk1 −1 (if we chose j1 ) or Tk2 +1 (if we pick up j2 ). If the proposal topology TB∗ and the pertinent border are the same, this is equivalent to shifting the recombination break-point. In figure 4.7 66

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

we have a diagram exemplifying these three types of moves. Tk1 −1

T

...

T

T∗

Tk2 +1

Addition

6

Tk1 −1

...

T k1

T k2

Tk2 +1

Removal ?

Tk1 −1 Tk1 −1

Tk1 −1

Tk1 −1 Tk2 +1

...

T k1

T k2

Tk2 +1

Shift ?

Tk1 −1

T

...

T

Tk2 +1 Tk2 +1

Figure 4.7: Diagram of addition, removal and shift of break-point. The top panel describes the addition/removal, where the original configuration is on the middle. The panel at the bottom represents the shift of a break-point, where the original configuration is at the top. Thin vertical lines represent segments, while thick vertical lines represent break-points (where neighboring topologies differ). Topologies Tk1 to Tk2 are represented simply as T to stress the fact that they are identical. Only modified terms enter into the acceptance rate calculation (the other cancel out), and these are depicted in red. The sampling of the new break-point positions (for addition and shift moves) or which topology to copy from (for deletion move) is uniformly chosen from the possibilities. If addition and removal of recombination break-points are tried with equal probability then detailed balance of the chain is satisfied, as we mentioned before. The exception are thus the regions before the first and after the last recombination break-points, where the frequency of removal updates is twice as large as the frequency of addition updates. For these cases we set L = 2 when proposing a break-point addition and L = 1/2 when proposing a deletion. For break-point shift moves, q( TB | TB∗ )/q( TB∗ | TB ) = 1 since the move is symmetric. 67

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

When adding a break-point q( TB | TB∗ )/q( TB∗ | TB ) = (2N − 6) if the new topology is sampled by applying an NNI (as described in Chapter 3) on the current topology, and q( TB | TB∗ )/q( TB∗ | TB ) = (2( N − 3)(2N − 7)) if the proposal topology differ from the current topology by an SPR branch swapping. These numbers correspond to the neighborhood size of the moves [77], and the removal of a break-point is deterministic. Thus, if we represent by f NN I the frequency at which an NNI-type move is tried and by (1 − f NN I ) the frequency of SPR moves, the overall acceptance ratio is given by −1 q( TB | TB∗ ) (1 − f NN I ) f NN I + . = q( TB∗ | TB ) 2( N − 3) 2( N − 3)(2N − 7)

(4.11)

This break-points update scheme is performed in a symmetric scan (from first breakpoint to last and then back). To decrease autocorrelation between samples, at every iteration we try to update all segments belonging to a non-recombinant region by proposing a new topology TB∗ , and this procedure is repeated n T times. Here, as in the break-point addition update, the new topology is chosen by applying one SPR or NNI move at the current topology TB . The frequency at which each these moves occur can be set up in order to optimize the acceptance rate. It is worth noting that in proposing a new topoogy TB∗ to a group of contiguous segments sharing the same topology, the proposed topology may by chance be equal to some neighbour topology - thus changing the total number of break-points. We do not impose any restriction to this behaviour, since our procedure is not a rj-MCMC.

4.4.3

Improving convergence

The method described in the previous section worked well in several simulation scenarios. But one point that was not clear is related to the sampling of a new topology: by design, a proposal topology will tipically have dSPR = 1 to some neighbouring segment (like topologies T and T ∗ in the topmost diagram of figure 4.7). On the other hand, if we want to be able to detect recombination hot-spots, we should create an implementation efficient in sampling topologies with dSPR > 1 from neighboring topologies. This is similar to the problem faced in rj-MCMC when changing dimensions, where a move across dimensions customarily will sample a new state far from a posterior mode, and thus will most certainly be rejected [138]. Fortunately, there are strategies aiming at increasing the acceptance rate in these cases. Let

68

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

us assume the posterior distribution for m dimensions is given by Pm ( x ). When proposing a jump from state x in m dimensions to a state x 0 in n dimensions (with m < n), we follow this proposal by k fixed-dimension steps satisfying detailed balance with respect to a different distribution Pn∗ ( x ). Then we can calculate the fractionary part of the acceptance rate of the new state x ∗ as [130] A( x∗ ) =

Pn ( x ∗ ) Pn∗ ( x 0 ) q( x | x 0 ) Pn∗ ( x ∗ ) Pm ( x ) q( x 0 | x )

(4.12)

There is asymmetry in the transition, hence when proposing a jump from x ∗ in n dimensions to x in m dimensions, we first apply k fixed-dimension steps starting at x ∗ under distribution Pn∗ ( x ), ending at x 0 (still in n dimensions). Then we sample as usual to obtain x, and this state x will be accepted with probability given by the inverse of equation 4.12. For a suitable choice of P∗ ( x ) we may be able to move the chain to a state closer to a mode before deciding the acceptance/rejection. h A natural choice for P∗ ( x ) is a heated distribution, that is, P∗ ( x ) = P( x ) for 0 < h ≤ 1. The acceptance probability given by 4.12 has some very desirable properties: for a very bad choice of P∗ ( x ), the tipical value of x ∗ will be x 0 and so we recover the original acceptance probability; if P∗ ( x ) = P( x ) then the new state will always be accepted (provided k > 0, x ∗ was subjected already to an acceptance-rejection); the number of iterations within the minisampler is arbitrary; the method works even for fixed-dimension updates (as is the case in our recombination detection procedure). To see this we can simply replace x 0 by x and assume h P∗ ( x ) = P( x ) in equation 4.12, and note that it reduces to equation 4.10. It takes then the form of a heated “mini-sampler”, a serial3 version of an MC-MCMC. We use this minisampler strategy when updating topologies without changing the break-points, and when trying addition/removal of a break-point, as represented in figure 4.8.

4.4.4

Sampling stages

The initial state of the samplers are chosen independently based on heated chains with variable temperature, whose initial values are picked up randomly from the priors or set up to arbitrary values. When 0 < h < 1, as usually is the case in Metropolis coupled MCMC, the updates are accepted more often, which allows a better exploration of the parameter space. 3 as

opposed to parallel, as when running multiple chains

69

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

Tx

Tx

Tx

T∗

T∗

T∗

Removal

Addition

6

?

Tx

Tx

Tx

T0

T0

T0

Removal

Addition

6

?

Tx

Tx

Tx

Tx

Tx

Tx

Figure 4.8: Diagram representing the addition and removal of a break-point using the minisampler strategy. On the other hand, using h > 1 is more effective in finding a near-optimum state, at the cost of low convergence if the chain is attracted by a local peak. In our procedure, the posterior distribution simulation is divided in three stages: the warm-up stage, the burn-in stage and the main sampler. The warm-up stage consists of a simulated annealing with cooling schedule hn = h0 log(n + e) for iteration n and initial temperature h0 , where both chains are run independently. The parameter hn is the temperature of the chain at iteration n, and can accommodate overdispersion of initial values (setting h N 1 where N is the last iteration) and initial values constrained to a particular state (setting h N such that h0 > 1). In the burn-in stage both chains run independently at temperature h = 1. The number of iterations in the burn-in stage (burn-in length) should be set to a reasonable number so that at the end of the the chains can be safely assumed to have achieved stationarity. In the main sampler both chains run concurrently at distinct temperatures, where the swap between states is accepted following equation 4.10 (p. 64). In figure 4.9 we have the template input file for the software developed incorporating these algorithms, called biomc2 (biomc2 ), and in figure 4.10 we have a tipical output of the program showing the simulation progress, with acceptance rates and recombination estimates.

70

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

[ comments are in nexus format, anything between brackets ] [ Almost all parameters have default values that should not be that bad... The only necessary parameters are "treefile" and "seqfile" with the file names. ] [ tree file in nexus format (mrbayes, paup) just for initial state of topology, branch length ] treefile

= paup.tre

[ seq data in nexus format, INTERLEAVED, with or without charset defs ] seqfile

= paup.nexus

[ hyper-prior for rate, follows exp E(X) = rate ] rate

= 0.1

[ hyper-prior for transititon:transversion ratio, follows exp E(X) = kappa ] kappa

= 2.

[ lambda of the modified poisson _per_segment_, follows gamma ] [ order: alpha beta with E(X) = alpha/beta ] distance

= 1 128

[ penalty (w in the paper), follows gamma ] [ order: alpha beta with E(X) = alpha/beta ] penalty

= 1 1

[ if this value is set, the CHARSET info of the seq data file is discarded ] nsegments = 128 [ usually (n_taxa - 3), but can be smaller ] [ zero will give a model where distance is 0 (same topo) or 1 (recomb) ] maxdist [ [ [ [

= 5

ngen is total number of generations after warmup nsamples is how many samples should be drawn after burnin warmup is number of iterations per warm-up cycle (5 cycles in total) burnin is number of iterations per burn in cycle (5 cycles in total)

ngen nsamples warmup burnin

= = = =

] ] ] ]

100000 500 500 2000

[ if you want to simulate the posterior (= 0) or the prior (= 1) ] prior

= 0

[ --below are the MCMC jump components --- ] [ --- do not change without reading the methods --- ] [ larger values, lower acceptance; lower values, slower convergence. ] [ should be tuned so that acceptance prob. lies between 0.2 and 0.6 ] [ order: rate kappa lambda penalty (all real numbers) ] update [ [ [ [ [ [

proposal [ [ [ [

= 1 1 1 1

how many iterations before trying more agressive branch swapping and how many topology update cycles (no break-point change) per iteration. The standard branch swapping is NNI. Values of zero are valid and mean "never". See also "minisampler" below, since if

] ] ] ] ]

= 10 2

Final temperature for warm-up simulated annealing stage (for each of 5 cycles). ] The temperature at iteration n is given by b_n = b_0 log(n+e) where "b_0" is ] initial temperature calculated based on number of iterations "n" and "e" is ] Euler constant. ]

annealing = 0.8 [ [ [ [

Metropolis-coupled MCMC (one cold and one heated chain). ] this is the 1/kT for the heated chain and interval at wich swap between chains should be tried. values for the 1/kT should be between 0.5 and 0.999, while swap_inteval should be small (1˜10) order: <1/kT> (real number)

] ]

mc3 = 0.8 2 [ [ [ [ [ [

Al-Awadhi update: before accepting/rejecting a move, run a "mini-sampler" on heated chain. It depends on number of iterations before accepting/rejecting and 1/temperature for these cycles (smaller than one to avoid local traps, larger than one to fall into these traps ;) original behaviour (without Al-Alwadhi moves) can be set by "awadhi = 1. 0" order: <1/kT> (real number)

] ] ] ] ]

awadhi = 0.4 4 [ [ [ [ [ [ [ [ [

minisampler: like Al-Awadhi update, but without changing the number/location of recombination breakpoints - it may happen that by chance the new topology is equal to a neighbour. This minisampler can be though of as a serial version of a heated chain. So the total number of trials for recombinant segment is 2xAxB here 2 = (symmetric scan), A =

] ] ] ] ] ] ] ]

minisampler = 0.6 2 [ This is the BIOMC2 version 1.5 control file template. ]

Figure 4.9: Input file for our software, biomc2.

71

4.4. SAMPLING FROM THE POSTERIOR DISTRIBUTION

| biomc2 version 1.5 November 2006 | Recombination detection program based on segment-wise topology distance | Leonardo de Oliveira Martins and Hirohisa Kishino | | This program is protected by the GPL license [ 0.288086 ] [ 0.220215 ] [ 0.408691 ] [ 0.083008 ] initial topology likelihood : -1490.860245 [ 0.288086 ] [ 0.220215 ] [ 0.408691 ] [ 0.083008 ] initial topology likelihood : -1490.860245 number of segments = 128 ngen = 100000 burnin = 2000(x10) sample_interval (after burnin) = 200 warm-up = 500(x10) period, in number of iterations, between SPR moves (otherwise NNI): 10 - in Al-Awadhi updates (changing the break-point structure): temperature = 0.400000 x normal, mini-sampler size = 4 - in topology updates (without changing the break-point structure): temperature = 0.600000 x normal, mini-sampler size = 2 how many times, per iteration, to update topologies: 2 heated chain: heat factor 1/kT = 0.800000, swap interval = 2 the acceptance rates below are (number of accepted)/(number of tries); prop means (number of tries for this move)/(total number of tries for moves) WARM-UP STAGE (simulated annealing) [warm-up] iteration 500, 5.5300 secs, cooling scheme (1/temperature): 0.128617 - 0.800000 Tadd prop|Tremove prop|Tshift prop| Tcycle Theat Tno.bkp lambda pnalty rate ch 1: 0.09981 0.54|0.07468 0.31|0.07877 0.15| 0.25311 0.09460 0.95771 0.90764 0.90787 0.87319 ch 2: 0.10578 0.57|0.07250 0.30|0.03455 0.14| 0.26872 0.11629 0.95658 0.90748 0.90668 0.87875 . . . [warm-up] iteration 2500, 5.6100 secs, cooling scheme (1/temperature): 0.128617 - 0.800000 Tadd prop|Tremove prop|Tshift prop| Tcycle Theat Tno.bkp lambda pnalty rate ch 1: 0.12461 0.51|0.09716 0.34|0.04844 0.14| 0.26463 0.10825 0.95399 0.90942 0.90529 0.88187 ch 2: 0.12991 0.54|0.10900 0.31|0.06195 0.15| 0.28405 0.13845 0.93862 0.90745 0.90849 0.88138

rate.p kappa kappa.p swap 0.30000 0.91000 0.32400 0.00000 0.30400 0.90869 0.31200 0.00000

[3 3] [2 2]

rate.p kappa kappa.p swap 0.36800 0.91706 0.29200 0.00000 0.27600 0.91031 0.31200 0.00000

[3 3] [3 3]

Estimated time to completion : 20 m 22 sec BURN-IN STAGE (no swap between cold and heated chains): Sampling initial state for convergence statistics [burn-in] iteration 2000, 21.0700 secs Tadd prop|Tremove prop|Tshift prop| Tcycle Theat Tno.bkp lambda pnalty rate rate.p kappa ch 1: 0.04412 0.54|0.01454 0.31|0.02564 0.15| 0.19643 0.02444 0.98557 0.90713 0.90746 0.85844 0.26900 0.89845 ch 2: 0.04779 0.55|0.02111 0.30|0.02816 0.15| 0.19734 0.02460 0.98416 0.90699 0.90789 0.85780 0.27000 0.89659 . . . [burn-in] iteration 10000, 21.0900 secs Tadd prop|Tremove prop|Tshift prop| Tcycle Theat Tno.bkp lambda pnalty rate rate.p kappa ch 1: 0.04792 0.55|0.01921 0.30|0.02329 0.15| 0.19939 0.02732 0.98099 0.90710 0.90844 0.85789 0.30300 0.89622 ch 2: 0.05058 0.55|0.01868 0.30|0.03202 0.15| 0.20070 0.02867 0.98326 0.90856 0.91053 0.85591 0.28400 0.89677

kappa.p swap 0.26900 0.00000 0.29900 0.00000

[3 3] [2 2]

kappa.p swap 0.28900 0.00000 0.27500 0.00000

[3 3] [2 2]

Tno.bkp lambda pnalty rate rate.p kappa kappa.p swap 0.98049 0.90522 0.90630 0.85706 0.28000 0.89373 0.27720 0.14640 0.96846 0.90889 0.91056 0.87120 0.31320 0.90681 0.30320 0.14640

[3 3] [3 3]

Tno.bkp lambda pnalty rate rate.p kappa kappa.p swap 0.98264 0.90586 0.90615 0.85661 0.28480 0.89644 0.29040 0.13920 0.97066 0.90922 0.90914 0.87176 0.33040 0.90691 0.30000 0.13920

[3 3] [3 3]

Tno.bkp lambda pnalty rate rate.p kappa kappa.p swap 0.98542 0.90460 0.90649 0.85772 0.29760 0.89549 0.28400 0.15040 0.97115 0.90824 0.90980 0.87012 0.31120 0.90824 0.32560 0.15040

[3 3] [4 4]

Tno.bkp lambda pnalty rate rate.p kappa kappa.p swap 0.98224 0.90559 0.90626 0.85696 0.29080 0.89474 0.28440 0.12440 0.97239 0.90866 0.91006 0.87135 0.30600 0.90612 0.29920 0.12440

[3 3] [2 2]

Estimated time to completion : 17 m 31 sec MAIN SAMPLER (saving prior/posterior parameters/topologies to file) [main sampler] iteration 5000, remaining time : 17 m 1 sec Tadd prop|Tremove prop|Tshift prop| Tcycle Theat cold: 0.04610 0.56|0.01957 0.30|0.03156 0.15| 0.20311 0.02986 heat: 0.07535 0.55|0.03954 0.30|0.04292 0.15| 0.22847 0.05577 [main sampler] iteration 10000, remaining time : 16 m 9 sec Tadd prop|Tremove prop|Tshift prop| Tcycle Theat cold: 0.04702 0.56|0.02075 0.30|0.02520 0.15| 0.20171 0.02714 heat: 0.07652 0.55|0.03941 0.30|0.04097 0.15| 0.23053 0.05435 . . . [main sampler] iteration 95000, remaining time : 53 sec Tadd prop|Tremove prop|Tshift prop| Tcycle Theat cold: 0.04681 0.55|0.01944 0.30|0.02826 0.15| 0.19658 0.02462 heat: 0.07936 0.55|0.04123 0.30|0.03539 0.15| 0.22522 0.05385 [main sampler] iteration 100000, remaining time : 0 sec Tadd prop|Tremove prop|Tshift prop| Tcycle Theat cold: 0.04798 0.55|0.02241 0.30|0.02802 0.15| 0.20117 0.02804 heat: 0.08079 0.55|0.04197 0.30|0.03533 0.15| 0.22641 0.04992 [cold chain] freq 0.8577 | freq 0.5268 | freq 0.8939 | freq 0.7729 | freq 0.6092 | freq 0.2204 | freq 0.0321 | freq 0.0203 |

Scaled Regeneration coeff.var -0.000907 | coeff.var -0.003755 | coeff.var 0.000429 | coeff.var -0.000862 | coeff.var 0.001089 | coeff.var 0.004320 | coeff.var 0.021986 | coeff.var 0.014535 |

Quantile plot information [ Y=BX slope 1.00 | corr.coef 0.99997 | slope 1.00 | corr.coef 0.99957 | slope 1.00 | corr.coef 0.99997 | slope 0.99 | corr.coef 0.99990 | slope 1.01 | corr.coef 0.99991 | slope 1.02 | corr.coef 0.99900 | slope 1.07 | corr.coef 0.99353 | slope 1.02 | corr.coef 0.99433 |

where Y=Ti/TN and X=i/N ]: topology at segment 0 topology at segment 32 topology at segment 61 topology at segment 100 topology at segment 127 break-point 32 break-point 61 break-point 100

[heat chain] freq 0.7799 | freq 0.7287 | freq 0.8015 | freq 0.2092 | freq 0.2474 | freq 0.3918 |

Scaled Regeneration coeff.var -0.000362 | coeff.var -0.008247 | coeff.var -0.016851 | coeff.var 0.000994 | coeff.var -0.003834 | coeff.var 0.007153 |

Quantile plot information [ Y=BX slope 1.00 | corr.coef 0.99998 | slope 0.99 | corr.coef 0.99994 | slope 1.00 | corr.coef 0.99993 | slope 0.98 | corr.coef 0.99976 | slope 0.96 | corr.coef 0.99891 | slope 0.96 | corr.coef 0.99793 |

where Y=Ti/TN and X=i/N ]: topology at segment 0 topology at segment 31 topology at segment 62 topology at segment 127 break-point 31 break-point 62

Figure 4.10: Output of biomc2, showing the acceptance rates for each simulation stage. The last two numbers, in square brackets, are the number of recombinations and the number of break-points. The output was truncated to fit in one page.

72

Chapter 5 Results We tested our algorithm on simulated datasets and on real HIV-1 data. The simulated datasets emphasize the strengths of our method where most other methods would fail: short sequences with several conflicting topologies that hinder the assumption of parental sequences. We have also analysed a dataset without recombination but with rate heterogeneity, since this scenario may lead to conflicting phylogenies. Another simulation allowed the comparison between our approach to recombination hot-spots and the assumptions made by other methods. Before presenting the results, we will make a brief review of statistics used in assessing simulation convergence and some statistics that we employed to summarise the posterior distribution of topologies for each segment.

5.1

Convergence diagnostics

Convergence diagnostics are statistical analyses set up to assess the proximity between sample values drawn during an MCMC simulation and the desired target1 (posterior) distribution [140]. If we design an ergodic and reversible transition rule for updating parameters θ with respect to a target distribution, it is guaranteed that once the chain achieves stationarity it will remain stationary at its target distribution [141]. This explains why we need a burnin period where all samples are discarded: we need to ensure the chain reaches stationarity 1 The

term “target” is preferred in the literature because many MCMC methods do not have a probabilistic interpretation [139]

73

5.1. CONVERGENCE DIAGNOSTICS

(with respect to the target distribution) before sampling. Loosely speaking ergodicity means the chain may go from any state to any other state in a finite number of steps (irreducible), in a non-deterministic way (aperiodic), and the reversibility condition is similar to the DNA substitution matrix case, where the transition matrix comes from the Metropolis acceptance function. The above statement about stationarity says that any sample will eventually be from the target distribution, but it does not imply that consecutive samples will. Consecutive MCMC samples will be correlated and so will fail to provide a good approximation to the target distribution. For a Metropolis-Hastings algorithm we can see that this is true since very low acceptance rates mean that the chain is not moving, while very high acceptance rates may indicate the chain is exploring only a tiny fraction of the target distribution (or good mixing, to be positive). For Gibbs samplers, all updates are accepted but we only update a fraction of the parameters at a time - if we can update all parameters at once then we don’t need an MCMC since we are already sampling directly from the posterior distribution. Thus we need to cope with the choice of thinning intervals, in order to reduce the correlation between samples [108]. The larger the thinning (interval between samples), the less correlated the samples, but the definition of “large” here is subjective and sometimes implies in being prohibitively time consuming. In some cases we can transform the variable of interest to improve the normality of the distribution. If the parameter is approximately normal, optimal acceptance rates are known for random-walk samplers [141] and some convergence monitoring statistics have a good power [142]. This is not possible in general, and discrete variables (like phylogenies) are an example. On the other hand, the transformed variable φ = log [ P(θ | X )] has been suggested as a general diagnosis benchmark [142], and since the target distribution is dominated by the likelihood of the topologies, it may be a good proxy for the distribution of topologies.

5.1.1

Basic statistics

There are two quantities that always should be observed, even though they may not give us information about convergence: the raw plots of the parameters and the acceptance rates. Both are related with the success of the sampler in proposing and accepting new states. The raw plot gives us information about the stability of the samples, and no trend should be observed. A raw plot is the time series of the parameter of interest for each sampled value.

74

5.1. CONVERGENCE DIAGNOSTICS

The acceptance rates are the fraction of the number of accepted updates over total number of tried updates for each proposal, and are an estimative of ah (θi∗ ) (equation 4.9 on page 63). As we saw above, this numbers alone can not provide information about convergence and their optimal values depend on the problem at hand. Nevertheless, the stability of raw plots and acceptance rates altogether with reasonable values for the acceptance rates are a necessary but not sufficient condition for a good sampling procedure.

5.1.2

Autocorrelation function

A natural statistic to check for the mixing of the chain is the autocorrelation function of the time series induced by the variable of interest [141]. The sample autocorrelation function ρˆ (h) at lag (interval) h is defined as [143] ρˆ (h) = γˆ (h)/γˆ (0) where γˆ (h) is the sample autocovariance function of the time series ( x1 , · · · , xn ) at lag h defined by 1 n−h γˆ (h) = ∑ ( x j+h − x¯ )( x j − x¯ ), 0 ≤ h < n. n j =1 Here x¯ is the sample mean x¯ = 1/n ∑nj=1 x j . Values of ρˆ (h) > 0 indicate that points h steps apart tend to lie on the same side of the mean, while values of ρˆ (h) < 0 reflect a tendency for observations at lag h to lie on opposite sides of the mean. Periodicity in the original time series will be mantained in the autocorrelation, and the stronger the trend in the time series the slower the decay of the autocorrelation function [143]. The asymptotic distribution of ρ(h) shows that ρ(h) ∼ N(0, n−1 ) ∀h > h0 where h0 is the lag when the series is no longer correlated. But we will see next that some MCMC sampling schemes present a periodicity, called regeneration, and so ideally we should observe small values for ρˆ (h) followed by peaks in a periodic manner. In Figure 5.1 we have three examples of the sample autocorrelation function. In the first 75

5.1. CONVERGENCE DIAGNOSTICS

example the points are iid, and thus uncorrelated. The second example is equivalent to a random walk, and thus correlated. In the last example we observe that xi and xi+4 are independent (they don’t have random deviates z j in common), and the autocorrelation function

0.4 0.0

ACF

0.8

0 1 2 3 4 −2

white noise

captures this property.

0

100

200

300

400

500

0

5

10

15

20

15

20

15

20

0.6 0.2 −0.2

ACF

1.0

0.0 0.5 1.0 −1.0

AR(1)

Series Lag x

0

100

200

300

400

500

0

5

10

0.6

ACF

−0.2

0.2

1 −1 −3

MA(3)

3

1.0

Series Lag x

0

100

200

300

400

500

0

5

10

Lag

Figure 5.1: Examples of time series (on the left) with corresponding autocorrelation functions (ACF, on the right). Assume zi , i = (1, · · · , 500) follow a standard normal distribution N(0, 1). The top panel shows the series xi = zi . The middle panel shows xi = 0.8xi−1 + 0.2zi with x0 = 0, an autoregressive model of order 1. The bottom panel shows xi = zi + 0.3zi+1 − 0.3zi+2 + 0.3zi+3 , a moving average model of order 3. The confidence limits in the ACF plot are represented as blue dashed lines and assume a white noise input. For our purposes, an autocorrelation plot is a visual interpretation of the parameter time series, in smaller scale – since h n if we want to obtain reasonable sample sizes. Nevertheless, it has been shown that for a reversible Markov chain we should have ρ(2m) > 0 for all m > 0 and furthermore (ρ(2m) + ρ(2m + 1)) should be strictly positive and a strictly decreasing function of m for all m > 0 [144]. This applies to consecutive iterations, but in our sampling procedure one iteration has in fact several steps (to decrease correlation) so we won’t observe this periodicity. Besides that we will show results mainly for sampled values, wich are already thinned.

76

5.1. CONVERGENCE DIAGNOSTICS

5.1.3

Scaled regeneration quantiles

Another convergence diagnostic is based on the regeneration times of the Markov chain. Regeneration times Xi , i = (0, · · · , n) occur each time a Markov chain returns to the same arbitrary state, and at each Xi the future of the process is independent of the past and identically distributed [145]. Thus we can follow the tour lengths Ni = Xi+1 − Xi and the approximation to a renewal process estimated by the coefficient of variation CV(N) of its average value N. The coefficient of variation can be estimated by c (N) = CV

n −1

∑

Ni − N

2

/ nN

2

i =0

c (N) it is useful to examine the scaled regeneration quantile In addition to computing CV (SRQ) plot of Xi /Xn against i/n. By the asymptotic uniform distribution of regenerations predicted by the renewal theory, this plot should should be close to a straight line through the origin with unity slope [145]. If some tours are larger than others it is an indication that the observation period is not long enough for the sampler to have reached equilibrium. The drawback of this procedure is that it depends on following the recurrence of the chain to a particular state, which means simultaneous regeneration of all parameters. But it was shown that even when we observe the recurrence times of one parameter in particular the SRQ line segments still can be interpreted as the ratio of the estimated probability of the recurrent state based on the entire chain over the estimated probability based on the tour length [124]. Instead of plotting the SRQ values, we can calculate the Pearson correlation coefficient ρ(Xi /Xn , i/n) between them, where ρ( x, y) =

∑nk=0 ( xk − x )(yk − y)

(∑nk=0 ( xk − x )2 ∑nk=0 (yk − y)2 )

1/2

In Figure 5.2 we have a few examples of SRQ plots with their estimated coefficients of variation and correlation coefficients. A natural parameter to follow up is the topology, and this statistic has been applied in Bayesian phylogenetic studies since then [111, 124, 133]. In these studies the topology elected to follow up is usually the maximum a posteriori (MAP) topology, which means the topological “mode”, and so it can be defined after running the sampler. In our algorithm, on the other 77

1.0 0.8 0.6

Xi/Xn

0.2

0.4

0.6 0.4 0.2

Xi/Xn

0.8

1.0

5.1. CONVERGENCE DIAGNOSTICS

0.0

0.2

0.4

0.6

0.8

correlation= 0.9937 coef var= 0.0111

0.0

0.0

correlation= 0.9993 coef var= 0.003 1.0

0.0

0.2

0.4

0.8

1.0

1.0 0.8 0.6

Xi/Xn

0.2

0.4

0.8 0.6 0.4 0.2

Xi/Xn

0.6

i/n

1.0

i/n

0.0

0.2

0.4

0.6

0.8

correlation= 0.9915 coef var= 0.0414

0.0

0.0

correlation= 0.9914 coef var= 0.0221 1.0

i/n

0.0

0.2

0.4

0.6

0.8

1.0

i/n

Figure 5.2: Examples of Scaled Regeneration Quantile plots, with estimated correlation coefficients and coefficients of variation. hand, we have a large collection of trees at each iteration (one per segment), and so we decide which topologies to follow before sampling. We follow the regeneration times of all distinct topologies on the segments immediately before the break-points sampled at the last iteration of the burn-in period (initial state of main sampler), together with the topology at the first and last segments. In this setting a bad choice of segments (and topologies) to follow up will lead to erratic regeneration times. If the chain achieved convergence before we start sampling (as it should) most values will have reasonable regeneration times. This is a more stringent assumption than looking at the best supported values (MAP estimates), but has the advantage that we don’t need to analyse the output a posteriori, since it is an online algorithm2 . For example, let us suppose that in the initial state (after the warm-up and burn-in stages) there is only one break-point, and this break-point is located between segments 10 and 11. 2 is

calculated as the simulation proceeds, as opposed to algorithms where the full data is required

78

5.1. CONVERGENCE DIAGNOSTICS

Let us assume further that this alignment is composed of 20 segments. We then store the topology of segments 1, 10 and 20 at this state, and check whenever the topology at these segments returns to the stored one. Furthermore, we store the number 10, and follow up the times at which there is a break-point between segments 10 and 11 - irrespective of the topology at segment 10. This is a worst-case scenario, since regions around break-points present high variability and low phylogenetic support, as we will see shortly.

5.1.4

Potential scale reduction factor

The convergence procedures seen so far depend only on one MCMC chain, and despite they are useful at detecting the mixing of the chain around an optimum (posterior mode), it may fail if the method is exploiting only a fraction of the potentially multi-modal distribution. In this case we need to run several independent chains at overdispersed starting points, and check if the chains converge to the same distribution after reaching apparent stationarity [108, 142]. This comparison can be done visually (plotting the distribution of the quantities of interest) or through some non-parametric statistic like the Kolmogorov-Smirnov test [146]. A test more specific to MCMC simulations was devised based on an Analysis of Variance (ANOVA) between chains and within a chain, which we will see now [142, 147]. For a random variable x with posterior distribution having (true) mean µ and variance σ2 , we denote by xij the i-th of the n iterations of chain j, over m independent chains. We can calculate the between-chain variance B/n and the within-chain variance W by 1 B/n = m−1

m

∑ (x¯.j − x¯.. )2

j =1

and 1 W= m ( n − 1)

m

n

∑ ∑ (xij − x¯.j )2

j =1 i =1

where x¯.j is the average x for chain j and x¯.. is the overall mean of x. The comparison of pooled ˆ 2 where and within-chain inferences is expressed as a scale reduction factor R = V/σ m ( n − 1 )W + ( m + 1 ) B Vˆ = . mn

79

5.2. SUMMARIZING THE DISTRIBUTION OF TOPOLOGIES

since we only have an underestimated σ2 given by W, the convergence diagnostics is given ˆ by the potential scale reduction factor (PSRF) Rˆ = V/W [142, 147]. When the PSRF is close to one, we may conclude that each of the chains is close to the target (posterior) distribution with regards to the parameter x. Due to random fluctuations in Rˆ it is advised to plot its value over several batches of the chain, to observe graphically its convergence. It is implicitly assumed that the marginal distribution of x follows a Normal distribution, and the assumption is made explicit when we incorporate a correction factor (d + 3)/(d + 1) where d is the degrees of freedom for the student-t approximation (estimated by the method ˆ var c (Vˆ ). An alternative interpretation of Rˆ is as a squared ratio of of moments and d ≈ 2V/ interval lengths, which do not depend on the Normal assumption. These interval lenghts can be the inter-quantile ranges or the empirical 100(1 − α)% interval for some parameter x. The statistic then becomes Rˆ =

5.2

interval length of pooled chains average length of within-chain intervals

Summarizing the distribution of topologies

The algorithm will produce the posterior distribution of topologies for each segment, together with the posterior distribution of several other quantities. If we denote by Tij the topology sampled at iteration i for segment j where i ∈ (1, · · · , Niter ) and j ∈ (1, · · · , Nseg ), then the posterior distribution of topologies at segment j can be represented by the vector T.j = ( T1j , · · · , TNiter j ). Equivalently, the topology configuration at iteration i is given by the vector Ti. = ( Ti1 , · · · , TiNseg ). In the same way we have the distribution of distances between segments dij = dSPR ( Tij , Ti( j+1) ) for j ∈ (1, · · · , Nseg − 1). Inferences about the location of break-points can be done directly by looking at dij . It also provides information about the total number of recombinations niSPR and break-points niCOP for each sample (iteration) i, if we observe that these quantities can be estimated by Nseg −1

niSPR

=

∑

dij

j =1

80

5.2. SUMMARIZING THE DISTRIBUTION OF TOPOLOGIES

and Nseg −1

niCOP

=

∑

I (dij )

j =1

where I (d) is the indicator function that is one if d > 0 and zero otherwise. The superscript COP stands for “crossing over points”. Based on the distribution of these quantities we can obtain point estimates like the maximum a posteriori (MAP), mean or median values, as well as create confidence intervals. Given the number of break-points we can estimate their locations by looking at the distribution of distances d.j at each segment j (defined in analogous manner as for the topologies). Another approach, which is useful when we want to reconstruct the topologies for a given non-recombinant segment, is to make inferences based on the distributions T.j of topologies. We can summarise this distribution by looking at the frequencies at which each distinct topology appears, as in a histogram. Thus for each segment j it is enough to observe the set of distinct topologies tkj and their total counts f (tkj ). Here k ∈ (1, · · · , n j ) and n j ≤ Niter is the number of distinct topologies in segment j. Notice that since we work with the total counts, ∑k f (tkj ) = Niter . The MAP topology estimate for this segment will be the topology t∗j such that f (t∗j ) > f (tkj )∀tkj 6= t∗j . Thus we will have a potential recombination break-point whenever t∗j 6= t∗j+1 . When we analyse each segment independently (like we are doing here) we have an overestimation of break-points since we do not respect the sampling structure – given by the distribution of nSPR and nCOP , as we saw before. Other quantities that we are interested not only because they provide information about the recombination process but also because they describe the reliability of the estimates are the number of distinct topologies n j , the support of the MAP topology s(t∗j ) and the second most frequent topology s(t∗∗ j ), the mean distance between segments d ( j, j + 1) and a measure e j equivalent to the entropy of the segment j. The second most frequent topology is defined in analogous manner as the MAP (“first” most frequent) topology, and the support of a tree T is simply its frequency f ( T )/Niter . The mean distance is defined as the weighted average distance between segments, namely nj

d( j, j + 1) =

n ( j +1)

∑k1 =1 ∑k2 =1 w j (k1 , k2 )dSPR (tk1 j , tk2 ( j+1) ) nj

n ( j +1)

∑ k 1 =1 ∑ k 2 =1 w j ( k 1 , k 2 ) 81

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

where w j ( k 1 , k 2 ) = f ( t k 1 j ) f ( t k 2 ( j +1) ) We define the entropy e j of a segment j as nj

ej = −

∑

k =1

f (tkj ) log nj

f (tkj ) nj

! .

This entropy will have an arbitrary scale, but the smaller the value the sharper the distribution. It is worth noticing that this is not actually an entropy in the information-theoretic sense, since the Shannon entropy Ej would be in fact nj

Ej = −

∑

k =1

5.3

f (tkj ) log Niter

f (tkj ) Niter

! .

Recombination detection on simulated sequences

The datasets We simulated sequences along a defined evolutionary model (tree and parameters) using the package PAML [148]. Each simulated dataset defines a non-recombinant segment, and the whole alignment is represented by the concatenation of these segments. If the topologies differ between segments the concatenation will model the effect of recombination. We simulated datasets with 8, 12 and 16 taxa using the HKY model (π T = 0.1, πC = 0.2, π A = 0.3, πG = 0.4), where each branch length was drawn from a uniform distribution between 0.2 and 1 and then rescaled. In the simulations of 8 and 12 taxa, we simulated 100 replicate datasets for each scenario, and in the 16 taxa simulation we examined 400 replicates divided in two sets of 200 alignments, as we will see. For the models with substitution rate (represented by the scaling parameter on total branch length of the tree) and transition:transversion ratio heterogeneity the values used were the same for all replicates. It is worth noticing that a total branch length of l substitutions will induce an average branch length of l/(2N − 3) substitutions per branch since a binary unrooted tree has 2N − 3 branches. This total branch length is the expected number of nucleotide changes per site, and is the parameter that we control when simulating 82

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

the alignments. It is a rate multiplier over individual branches, since our simulated topologies have distinct branch lengths. The model The hyperparameters for the average substitution rate M and for the transition:transversion rates K used in all MCMC analyses were the same: M = 0.1 and K = 2. The hyperpriors for the recombination distance were set up in a similar manner for all scenarios, with αλ = 1, β λ = s where s is the number of segments assumed for the scenario, and αw = β w = 1. This choice of parameters assumes a prior belief of one recombination on the dataset, but we checked that different choices do not affect the posterior distribution (results not shown). In our parametrization, if a variable X follows a gamma distribution with parameters α and β, then its mean value will be α/β and its variance α/( β)2 . The number of segments varied between simulation scenarios, where each segment ranged from 2 to 10 base pairs. The sampler The tuning parameters ξ θi were fixed at the value 1, but higher values can be set up at the cost of slightly lower acceptance rates. The temperature of the heated chain was chosen to be 0.8, and at every two iterations a chain swap was attempted. At every iteration, after updating the break-point locations, the sampler updated the topologies for each segment twice. When updating the break-point locations we try to add or remove a break-point with equal probability of 0.4, and with probability 0.2 we try to shift the location (to left or right, with equal probability). For each chain, when updating the break-point locations a mini-sampler of length 4 (number of update trials) with temperature equal to 0.4 the original temperature was run. This means that for the heated chain the temperature when in the mini-sampler was equal to 0.32. In the same way when updating only the topologies (but not the break-point locations) we ran a mini-sampler of length 5 of temperature 0.6. The sampler tried NNI updates 90% of the time and SPR moves on the remaining 10% of the trials. The initial topology configuration was set up to one arbitrary tree spanning all segments (thus di = 0 for all segments). The initial average rate for each segment was set to the Poissoncorrected distance when there is enough signal within the segment. If the segment is not informative the algorithm defaults to the hyper-parameter value, as it does for the transi-

83

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

tion:transversion ratio. This initial configuration is thus subject to a warm-up stage composed of five cycles of simulated annealing, where the temperature changed logarithmically from 0.15 to 0.8 on both chains (one of which will be heated in the sampling stage). Since the final temperature is still smaller than one, this setup allows the chains to explore the parameter space freely. Typical simulated annealing schemes, on the other hand, use an initial temperature larger than one so that mostly updates in the direction of the modes are accepted.

5.3.1

Simulation scenarios

Simulation of 8 taxa alignments Our first dataset was simulated for 8 sequences according to Figure 5.3, where each nonrecombinant region is composed of 64 base pairs (bp). The whole alignment has then 256bp. For this simulation we fixed κ = 1.4 and the branch lengths were rescaled so that each site had on average one substitution. Notice that the individual branch lengths differ for each topology and between topologies. The total number of recombinations is 3, since we have only one recombination per break-point. We simulated 100 replicate alignments under this scenario, and in the Bayesian analysis we considered each segment as composed of 2bp, giving a total of 128 segments. These segments have no relation with the region sizes used in the alignment simulation, since when using our procedure to detect recombination we assume that we have no information about the underlying process generating the sequences. Simulation of 12 taxa alignments For 12 taxa simulation, each non-recombinant segment supports not only a different topology, but also distinct evolutionary parameters. The average rate of each site (sum of branch lengths on the topology) was scaled to be between 1 and 4, and κ was set to a random number between 1 and 2 for each 32bp region, independent of the topology. Each segment is composed of 128bp following topologies displayed in Figure 5.4, composing an aligment of 512bp. In the same figure we can observe the expected number of nucleotide changes per site (total branch length) and the transition:transversion ratio for each region. For this simulation, each break-point bears (at least) two recombination events, giving a total of six recombinations where some sequences are involved more than once in recombination. For this scenario 100 independent replicates were simulated and for each dataset we sample from the posterior 84

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

s7 s7

s9

s4

s5 s9s6 s6

s4

s7 s7

s5

s2

s6 s5

s2

s6 s7 s7

s9 s9

s4 s4 s9 s9

s5

s2 s2

s1

s1

s3 s3

s9 s9 s6 s6

s5

s4 s4 s2 s2 s1

s1 s3

1-64 (1)

s5s5

s5

s1

s3 s1 s3

s3

s4 s4 s2

65-128 (1)

s1

129-192 (1)

s6 s6 s1 s3

s3

s7 s7

193-256

Figure 5.3: Trees used in simulating alignments for 8 taxa. The number ranges in gray are the alignment columns following each tree. The number of recombinations between each topology and the next are given in parenthesis. The recombinant subtrees – one possible recombination scenario – are shown in red. The branch lengths correspond to the relative amount of evolution between nodes. distribution assuming segments of 4bp (thus we have 128 segments). In this scenario we may be unable to reconstruct the phylogenetic history if we analyse each putative recombinant sequence independently, as is routinely done in HIV-1 subtyping. The first issue is to assert which sequences are recombinant and which sequences are the parentals. Assuming this task was accomplished successfully, and even without knowing the underlying phylogenies we managed to detect that sequences s3, s5, s6, s7, s8 and s9 are the recombinants, we might attempt to analyse each one of them against the parental sequences. This is not the only possible recombination scenario as we observe that the first break-point, for instance, also accepts sequence s11 (or s7) as recombinant, an then s8 might be classified as a parental3 . Unfortunately, even if we know the putative recombinants and use a infinitely precise algorithm to infer the mosaic structures independently, we may not be able to reconstruct the phylogenetic history relating these sequences, as seen in Figure 5.5. According to this figure, we might conclude that s8 is not in fact a recombinant, since the topology is the same for all regions. More than this, we would not detect the first recombination breakpoint, and only one sequence would have information about the second break-point. And at the third break-point four out of six sequences would have information about it. Then one might conclude that this is a strong hot-spot, or might count it as one ancestral recombination 3 this

arbitrariness may be solved by rooting the phylogeny, in which case we are assuming that the outgroup is non-recombinant

85

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

s11

s8 s11 s7 s6s8 s11 s7 s6

s5

s8

s8 s7

s5 s3

s3

s2

s9 s5

s4 s10 s3 s10

s5

s2 s2

s6 s3

s3

s5

s7 s7

s4 s4

s10 s10 s4 s4

s1 s1 s9 s9 s12s12

s3 s1 s1 s12 s12

257-384 (2)

s5 s5 s3

s1 s7 s7 s1 s12 s6 s6 s12

385-512

1.0

2.0

3.0

4.0

129-256 (2)

s9

s10 s10

s2s2 s6

s4 s4

1-128 (2)

s2 s2

s3s6

s10 s10

s6

s9s9 s1 s12 s12s1

s8 s11 s8 s11

s7

s2 s5

s4

s8s11 s9 s11 s8 s9

s11

0

128

256

384

512

site

Figure 5.4: Trees used in simulating alignments for 12 taxa (top panel) and model parameters used in simulations (bottom panel). The number ranges in gray are the alignment columns following each tree. The number of recombinations between each topology and the next are given in parenthesis. One possible scenario of recombinant subtrees are represented by red edges. The branch lengths correspond to the relative amount of evolution between nodes, summed over all nodes. The panel in the bottom shows the total branch length per site (in blue) and the transition:transversion ratio (in red). between sequences s11 and the ancestor of sequences s12 and s1. Simulation of 16 taxa alignments In the simulation with 16 taxa, we modelled a scenario where we have several adjacent break-points, with only one recombination per break-point – implicit assumption of other methods when used to estimate the existence of hot-spots. This simulation scenario os an alignment of 500bp with recombination break-points located at sites 245, 255 and 265, following Figure 5.6. We simulated 400 replicate alignments under this scenario with rates sampled between 2 and 5 and transition:transversion ratio sampled between 1 and 4 for each 5bp region. From these 400 replicates we analysed 200 alignments assuming 5bp segments (100 segments in total) and 200 alignments assuming segments of 10bp (50 segments). Here we have only 10bp supporting the intermediate topologies. In this situation most 86

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

s8

s11

s11 s7 s11

s6

s8

s11 s11

s2

s5 s4

s4 s10 s3 s10

s2 s2 s6 s3

s3

s10 s10

s10 s10

s4 s4

s3

s9 s1 s12 s12 s1

s8 s6 s6

s3 s1 s1 s12 s12

s9 s12s12

1-128 (2)

129-256 (2)

257-384 (2)

s11

s11 s9 s8 s11

s11

s11 s7 s11

s8

s10 s10

s5 s4 s4 s4

s5 s3

s7

s6

s5

385-512

s8

s2

s5 s4

s4 s10 s3 s10

s2 s2 s6

s3

s6

s4 s4

s6

s9 s12s12

1-128 (2)

129-256 (2)

257-384 (2)

s11

s8s11 s9 s11 s8

s11

s8 s11 s7 s6s8 s11

s8

s8 s7

s5 s3 s2

s5 s5

s4

s2 s2 s6

s3

1-128 (2)

385-512

s11 s9 s8 s11

s4 s10 s3 s10

s8

s11 s11

s6

s2 s2 s6

s4 s4

s3

257-384 (2)

s8

s4 s4

257-384 (2)

s11

s8s11 s9 s11 s9

s5

s2 s5

s4

s4 s10 s3 s10

s2 s2 s6

s3

385-512

s7 s7 s6

s1 s1 s12 s12

385-512

s8 s11 s11

s9s9 s1 s12 s12s1

1-128 (2)

s2 s2 s10 s10

s2s2

s6

s10 s10 s4 s4

s9 s9 s12s12

129-256 (2)

s9

s7

s10 s10 s1 s1

s9

s3 s1 s1 s12 s12

s4 s4 s5

s9

s4 s4

s6

s10 s10

s2s2

s7

s3

s1 s1 s12 s12

s7 s7

s6

s10 s10

s11

s11 s7

s10 s10

s7

s2 s2

s10 s10

s4 s4 s3 s1 s1 s12 s12

s11 s11

129-256 (2)

s5

s5

385-512

s8

s9 s12s12

s11

s2

s2s2

s1 s1 s12 s12

s7 s6

s9 s5

s2

1-128 (2)

s2 s2

s9 s12s12

129-256 (2)

s11

s9 s1 s12 s12 s1

s7

s10 s10 s1 s1

s8

s7 s8

s6

257-384 (2)

s1 s1

s1 s1 s12 s12

s8

s10 s10 s4 s4

s9 s1 s12 s12 s1

s7 s6 s6

129-256 (2)

s5

s9

s2

s4 s10 s3 s10

s3

s4

s3 s1 s1 s12 s12

s3 s1 s1 s12 s12

s4 s4 s5 s5

s7 s2

s5

s4 s4

s4 s4

s4 s4

s1 s1 s9 s1 s12 s12 s1

s10 s10

s7

s4 s4

s10 s10

s8

s10 s10

s2s2

s6

s10 s10

s11

s11 s7 s11 s7

s10 s10

s2s2 s6

s10 s10

s6

s3

s7

s9 s12s12

s5

s7

s6

s2 s2

s9 s1 s12 s12 s1

s8

s11 s11 s2 s2

s5

s5 s6

1-128 (2)

s9 s2

s4 s10 s3 s10

s2 s2 s5

s1 s1

s1 s1 s12 s12

s7

s3

s9 s5

s2

s4 s4

s4 s4

s1 s1

s11 s11

s5 s3 s2

s2s2

s8

s7

s7

s3s6

s8s11 s9 s11

s11

s8

s5

s2 s2

s5

s2

s11

s11 s7 s11

s6

s9

s3

s5

s8

s7

s5 s3

s8s11 s9 s11

s11

s8

257-384 (2)

s3 s1 s1 s12 s12

s4 s4 s5 s7 s6

s1 s1 s12 s12

385-512

Figure 5.5: Output of an idealized independent analysis of the simulation on 12 taxa. The original topologies relating all taxa is shown in Figure 5.4. If we assume prior knowledge of the putative recombinants and an efficient recombination detection procedure, then the output topologies for each non-recombinant segment are show inside the panels, where each panel represents the independent analysis on each putative recombinant. The detectable recombination events are show in red, and undetectable recombinations (no topology change) are shown in blue. methods would fail, since an independent analysis of each putative recombinant (assuming we could fix them a priori) could overestimate the number of recombination and random fluctuation of the break-point positions might underestimate this quantity even for methods that incorporate all data into the analysis but neglect the recombination distance. The overestimation may happen when we neglect the possibility of ancestral recombination, and thus all recombinants will share the same break-point but as we accumulate more data about these (ancestrally related) recombinants we may be mislead to conclude the existence of a hot-spot. One example is the third break-point in figure 5.3, which we might infer as a recombination hot-spot since two sequences (s2 and s4) present the same location – or nearby locations, if we take noise into consideration. On the other hand, even if we analyse all taxa at once, the underestimation is evident in our simulations with 12 and 16 taxa if we do not make a dis-

87

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

s16 s16

s6 s6

s14

s13 s13

s2s2

s16 s14 s16 s14

s2s2 s13 s13

s4 s4s7 s7

s13s13

s5s5

s15 s15

s5 s5

s15 s15

s11 s11

s3s3s11 s11

s7 s7

s3s3 s11

s4 s4

s11s12 s3 s3s11 s12

s7 s7

246-255 (1)

256-265 (1)

266-500

1

2

3

4

5

1-245 (1)

s5s5 s4 s4

s4 s4 s7 s7

s1

s15 s15

s8s8

s5s5

s8s8 s13 s13 s2 s2

s2 s2

s10 s10 s9s9

s15 s15 s3 s3

s10 s10 s9

s8s8

s1s1

s8s8

s14 s14 s6 s6 s16 s16

s10 s10 s9 s9

s12

s10 s10 s9s9 s12 s12 s1 s1

s12 s12 s1 s1

s6 s6 s14 s16 s14

s6 s6

0

125

250

375

500

site

Figure 5.6: Trees and parameters used in simulating alignments for 16 taxa, mimicking a hotspot as a region with large number of recombination break-points. The number ranges in gray are the alignment columns following each tree. The number of recombinations between each topology and the next are given in parenthesis. The recombinant subtrees are shown in red. The total branch lengths reported correspond to the relative amount of evolution for each site region. Notice the very short segments supporting the second and third topologies The panel in the bottom shows the total branch length per site (in blue) and the transition:transversion ratio (in red) used in simulations. tinction between number of recombinations (per break-point and over all break-points) and number of break-points.

5.3.2

Quality of posterior distributions

Posterior distributions for one replicate dataset First we show in Figure 5.7 (top panel) the distribution of the mean distances between segments for one replicate of the 8 taxa recombinant dataset. This figure suggests three recombination points with 20-40 bp wide credibility intervals around the true positions at sites 64, 128 and 192. The credibility intervals can be found by summing up the mean distance values around the peaks. The precision is especially high for the recombination in the central

88

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

region of this particular alignment, and the modal value of 3 for the number of recombinations

0.4

0.8

0.0

0.2

0.3

average distance

0.1

MAP distance

0.0

0.4

0.5

(and break-points) is in fact the true value.

0

64

128

192

256

site

Figure 5.7: Posterior recombination estimates for one simulated alignment with 8 sequences. On the top we have the average estimated recombination distance between segments, and on the bottom we have the punctual estimate based on neighboring MAP topologies. As discussed previously, another way of describing the posterior distribution of recombinations is by looking at the distribution of topologies for each segment. The plot of the distances between the MAP topologies, shown in Figure 5.7 (bottom panel) estimates the distances between segments to be 1 for each break-point in this simulation, and the MAP topologies correctly reconstruct the original ones. The point estimates represented by MAP topologies, despite useful for describing the topologies, may not provide reliable estimates of the break-points themselves due to the high topology variability for segments around recombination break-points. This can be seen in Figure 5.8, where we show statistics summarising the information contained in the distribution of topologies for the same replicate alignment. We observe that for regions around the estimated recombination break-points the difference in frequency between the two most frequent topologies is small, and thus the MAP topology may be not as reliable as in cases where its frequency is much higher than any other sampled topology. In the same figure we also observe that the number of trees (and the entropy) varies along the alignment and this, together with the support values, might be good indicators about the reliability of each segment. If we observe the posterior distribution of the mean distances between segments for one 89

0.5 7 −58.6 0.50 0.04

support

0.95

−109.4

entropy

−7.7

3

number of trees

12

0.0

distance

1.0

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

32

64

96

128

segment

Figure 5.8: Summary statistics based on the posterior distribution of topologies per segment for one dataset simulated over 8 taxa. The topmost panel shows the average distance between segments in gray and the distance between MAP topologies in black. The second panel, from top to bottom, shows the total number of distinct topologies while in the third one we plotted the entropy per segment. The panel at the bottom shows the frequency of the most frequent (MAP) topology in blue and the second most frequent topology in red for each segment. simulation of 12 sequences (Figure 5.9, top) we can infer the number of recombination hotspots as 3, like in the simulation with 8 taxa. But for 12 taxa the credibility intervals are narrower, and if we sum up the individual mean distances we will obtain a total of 6 recombinations (based on segment-wise mean distances). These figures are further supported if we observe the distribution of niSPR and niCOP , whose median and modal values are 6 and 3, respectively (result not shown). We thus can infer the three recombination break-points are in fact recombination hot-spots, since there are two recombination events taking place at each break-point. An alternative way of calculating the credibility intervals would be to observe the distribution of break-point locations given the modal number of break-points. That

90

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

is, instead of using all samples, we incorporate only the samples where the total number of break-points is equal to the modal value. We do not attempt this here since it suffices to look at the range of segments where the mean posterior distances are larger than zero. If instead we look at the MAP topologies for each segment we obtain similar results, as shown in the

0.5 1.0

2.0 0.0

1.0

average distance MAP distance

0.0

1.5

bottom panel of Figure 5.9.

0

128

256

384

512

site

Figure 5.9: Posterior recombination estimates for one simulated alignment with 12 sequences. On the top we have the average estimated recombination distance between segments, and on the bottom we have the punctual estimate based on neighboring MAP topologies. A more detailed description of the posterior distribution of topologies is summarised in Figure 5.10. In this figure we can see that some regions have more phylogenetic signal than others, as indicated by the number of trees and entropy (that takes into account the number and frequency of topologies). Recombinant regions are detected by flatter topology distributions (smaller difference between support values for MAP and second most frequent topologies). As one may be interested in doing further analyses using this dataset, the information described in figure 5.10 can be used to elect non-recombinant regions with strong phylogenetic signal. In this example, one might find that the region between segments 16 and 31 (from sites 61 to 124) would be a good candidate, since the posterior number of distinct tpoplogies is the smallest (only two) and the posterior frequency of the MAP topology (higher than 0.95) is much higher that other topologies. As mentioned before, for the simulated alignments of 16 taxa, we assumed segments of 5bp (100 segments) and 10bp (50 segments) when sampling from the posterior distribution. 91

1 7 −238.9 0.50 0.02

support

0.98

−448.8

entropy

−29.0

2

number of trees

12

0

distance

2

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

32

64

96

128

segment

Figure 5.10: Summary statistics based on the posterior distribution of topologies per segment for one dataset simulated over 12 taxa. The panels represent, from top to bottom: the average distance between segments in gray and the distance between MAP topologies in black; the total number of distinct topologies; the entropy per segment; the frequency of the most frequent (MAP) topology in blue and the second most frequent topology in red for each segment. For the 10bp segmentation the break-points lie within the segments, to mimic a situation that we may find routinely in practice. Specially for large alignments, it is advisable to do a preliminary analysis with a few segments (maybe 10 or 50), and then if we detect some recombinational signal, we may proceed to a (computationally intensive) analysis increasing the number of segments. In this situation we may misplace the recombination locations, and as an extreme case we may have several break-points lying within one segment. Thus even though we will not detect these particular break-point locations, we may still detect the presence of recombination if neighboring segments have such a signal. It is worth noticing that for sufficiently large segments the effect of intra-segment recombination will be that of conflicting phylogenetic signals in the resulting posterior distribution of topologies for this segment 92

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

[149]. The posterior distribution of topology distances between segments for such a scenario where we assumed 10bp segments is represented in Figure 5.11. The procedure detected only one recombination break-point, but with a recombination distance of 3. This means that for this dataset despite we could not reconstruct the exact break-point locations, we still can detect the existence of a recombination hot-spot. This is not a drawback of algorithm, since the stochastic error present in only 10bp segments (those supporting the intermediate topologies) affects not only the phylogenetic reconstruction but also the alignment simulation. In a nutshell, there is no procedure capable of distinguishing between the true and other competing topologies for the intermediate segments supporting alternative phylogenies4 . Thus our procedure used the phylogenetic signal existent in the other segments to infer the minimum number of recombinations necessary to explain the topological differences between them. Thus we may say that our procedure tends to cluster several recombination break-points to-

1.5 0.0

average distance MAP distance

3.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

gether whenever the alignemtn does not have enough information about each break-point.

0

125

250

375

500

site

Figure 5.11: Posterior recombination estimates for one simulated alignment with 16 sequences, assuming 50 segments. On the top we have the average estimated recombination distance between segments, and on the bottom we have the punctual estimate based on neighboring MAP topologies. 4 The skeptic reader is invited to simulate 10bp alignments and use the state-of-the-art programs like MrBayes

or Beast to reconstruct the topologies, and check for the posterior support of the true topology against other topologies

93

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

The posterior distribution of topologies, represented in Figure 5.12, shows a very strong support for the MAP topologies, which are in fact the true ones (for the given segments). We also observe a loss of phylogenetic signal in the central region of the alignment, where we have conflicting signals. In this segmentation of 10bp regions, we are missing the true break-points on purpose, but we can see that even in doing so we still detect the footprint of recombination. In fact for the alignments where we assumed a segmentation model of 5bp segments we observed a similar result, as we will see in a moment. For all scenarios we have seen so far, we have observed a lower signal in the borders of the alignment (the first and last segments). This is due to the fact that these regions do not have as much information about recombination as the central regions, since they are constrained by neighboring segments. Estimation averaged over all datasets In the previous section we described only one replicate dataset, since this is the situation we will find in practice. But it is important to repeat the analysis for several datasets and observe the average behaviour of the algorithm. Then we can have an idea about how useful the procedure is in practice, since there is random error associated with the alignment simulation procedure as well. The first question is about how often the procedure will detect correctly the number of recombinations and the number of recombination break-points. For each dataset we may use the median or modal values as point estimates and then observe the distribution, over all datasets, of these quantities. In our case we work with the mode of niSPR and niCOP for each dataset, and can be detected as the peak in the histogram. Despite we don’t show here, we obtained very smimilar results using the median of these quantities. In Figure 5.13 we show the resulting distribution of these modal values for 100 datasets on 8 and 12 taxa each, and for 200 datasets on 16 taxa assuming 50 and 100 segments each. For all simulation scenarios, we correctly infer the number of recombinations in more than 80% of the simulation replicates. We detect the true number of recombination break-points in 75% of the simulations on 8 taxa and in 92% of the cases for 12 taxa. For 16 taxa we obtain a similar distribution irrespective of the number of segments, and in most cases it detected only one recombination break-point incorporating the recombination signal from the three recombination events. After checking the accuracy in detecting the number of recombinations, we are interested

94

1.5 5 −556.5 0.5 0.0

support

1.0

−1059.7

entropy

−53.3

1

number of trees

10

0.0

distance

3.0

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

10

20

30

40

50

segment

Figure 5.12: Summary statistics based on the posterior distribution of topologies per segment for one dataset simulated over 16 taxa, assuming 50 segments. The topmost panel shows the average distance between segments in gray and the distance between MAP topologies in black. The second panel, from top to bottom, shows the total number of distinct topologies while in the third one we plotted the entropy per segment. The panel at the bottom shows the frequency of the most frequent (MAP) topology in blue and the second most frequent topology in red for each segment. in observing the overall performance of the algorithm in detecting the break-points themselves. To accomplish this, we need point estimates for the posterior distribution of breakpoint locations from each dataset. We decided to use the mean distance between segments as the point estimate, with one value per segment. Thus if we include all datasets we will obtain for each segment5 the distribution of mean distances over all datasets. The “distribution” is over datasets, and the “mean distance” is over replicates in one dataset. This segment-wise distribution can be summarised by its quantiles (minimum, maximum, median, inter-quantile ranges) or by its average value. We present both, where the median is the 50%-quantile, and 5 the

“distance” of a segment is in fact the distance between the segment and the next

95

frequency

0

0

20

40

40 20

frequency

60

60

80

80

100

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

1

2

3

4

3

4

(a) 8 taxa

5

6

7

(b) 12 taxa

100 segments

0

0

50

100

frequency

100 50

frequency

150

150

50 segments

1

2

3

4

5

6

7

1

2

3

4

5

6

7

(c) 16 taxa

Figure 5.13: Histograms showing the distribution of the MAP estimates of the number of recombinations (in black) and number of break-points (in red) over 100 replicates (for 8 and 12 taxa simulations) and 200 replicates (for 16 taxa). the 95% inter-quantile range goes from the 2.5%-quantile to the 97.5%-quantile. Using the distance between MAP topologies we obtained a similar result, and here we will report only the results using the average distance between segments. For the 100 simulations over 8 taxa, we obtain a distribution per segment following Figure 5.14, where we observe that the average distances correctly reconstructed the true locations in most cases. The peaks for the average, median and 97.5%-quantile values of the mean distance correspond to the true values, and their values vanish for regions away from the break-points. Here, again, we can calculate the credibility intervals for the overall performance in detecting the break-point locations by summing up the average values (blue line) around the peaks. 96

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

This will give us a value of two within 32bp-wide regions around the true break-points, with

0.0

mean distance 0.2 0.4

0.6

a slightly larger value for the third recombination break-point at site 192.

0

64

128

192

256

Figure 5.14: Distribution of average recombination distances over 100 replicate datasets for 8 taxa simulations. For each segment (represented by the horizontal axis, the position in the alignment) we plot the distribution over replicates of the average distance per replicate. The gray bars are the 95% inter-quantile range and the white dots are the median values for the mean distance. The blue line is the mean value over simulated datasets of the average recombination distance per segment. The triangles represent the true recombination breakpoints. Despite for each replicate we obtain the posterior distribution of distances between segments, here we only consider the point estimate represented by the average distance, thus obtaining one value per segment per replicate. We observe a similar behaviour for the 100 alignments simulated over 12 taxa, like shown in Figure 5.15. In this case the credibility intervals are much narrower, since the alignment is twice as large compared with the 8 taxa simulations (more phylogenetic signal). The regions comprising the credibility interval will give us a recombination distance of two for each break-point. In this figure we further observe that for the segments just before the true recombination break-points even the 2.5%-quantile values are greater than zero, indicating that in more than 97 out of 100 simulations some recombination signal was detected for these regions. For the simulations of 16 taxa, we have two segmentation models (prior on the number of segments), and in Figure 5.16 we show the distribution of average distance per segment for both segmentations. For each segmentation we simulated 200 alignments, and from this figure we see that the result did not depend on the number of segments. The credibility intervals cover the range where the real recombinations took place, and the sum of their average values will recover the true value of 3 recombinations. We should bear in mind that wide credibility 97

0.0

mean distance 0.7 1.3

2.0

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

128

256 site

384

512

Figure 5.15: Distribution of average recombination distances over 100 replicate datasets for 12 taxa simulations. For each segment we show the distribution of the average distances, as described by the 95% inter-quantile range (gray bars), median (white dots) and mean values over simulated datasets. The triangles represent the true recombination break-points. intervals are not only a result of resolution of the recombination detection procedure, but also a product of the random noise in the simulation process. Given that the procedure performs well in detecting the number and location of the recombinations, we may be interested in knowing if the reconstructed topologies (as given by a point estimate like the MAP value) are similar to the true ones. Since we know the true topologies for each segment (for each site, in fact) we can estimate, for each dataset, if the MAP topology for a given segment is equal to the true topology used in the alignment simulation. Thus if we include information from all replicate datasets we can estimate the fraction of datasets where the true topology was correctly reconstructed, for each segment. We could extend this calculation to quantify the difference between the estimated and true topologies for the cases where they disagree, but the present results are enough to show the improvement our procedure offers. In Figure 5.17 we present the topology reconstruction accuracy over all 100 replicates, where we can see that for the non-recombinant segments the true topology was reconstucted in most cases by the MAP topologies. For the regions around the recombination break-points we have a lower resolution resulting from the conflicting phylogenetic signal (and subsequent low precision of the exact break-point locations). To compare the improvement in accuracy against traditional Bayesian methods that do not take the recombination distance into account, we also observed the posterior distribu-

98

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0.0

mean distance 1.0 2.1

3.1

50 segments

0

100

200

300

400

500

site

0

mean distance 1 2

3

100 segments

0

100

200

300

400

500

site

Figure 5.16: Distribution of mean recombination distances over 400 replicates for simulations of 16 taxa. The distribution over replicates of the average recombination distance per replicate is represented by the 95% inter-quantile range (gray bars), median values (white dots) and the mean values (blue line) for each segment. The triangles represent the true recombination break-points. The top panel represents 200 replicates where we assumed the alignment to be composed of 50 segments (of 10bp each), and in the bottom panel we summarise the information for 200 datasets divided into 100 segments (of 5bp). tion assuming independence of the recombinant regions. To accomplish this we simulated another 100 alignment datasets under the exact same model and estimated the posterior distribution of topologies assuming the recombination break-points are known without error using the software MrBayes [126, 150]. This means that we analysed each 64bp site region independently, and calculated the fraction of the replicate datasets where the MAP topology is equivalent to the true topology. For each dataset, we ran the software MrBayes for 5 × 106 generations, sampling 100 times after discarding the first half of the iterations as burn-in, using one cold chain and three heated chains. The results are shown as red horizontal lines in Figure 5.17, where we observe that our procedure was more accurate than an independent

99

0.4

reconstruction accuracy 0.5 0.6 0.7

0.8

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

64

128 site

192

256

Figure 5.17: Phylogenetic reconstruction accuracy per segment over 100 replicate alignments of 8 sequences. The blue points represent the fraction of datasets where the MAP topology is equal to the true one, per segment, using our procedure. The red lines represent the same quantity if we assume that the segments are independent, and that we know the break-point locations without error. The independent analysis was conducted using MrBayes software for each non-recombinant segment over 100 replicate datasets. analysis for almost all regions. This is because an independent analysis does take neighboring segments into consideration, and thus the distribution of topologies will inflate the number of recombinations between regions – many sampled topologies will require a large number of recombinations to explain the topological differences from neighboring regions. The accuracy of our procedure in reconstructing the true topology for the 12 taxa alignments can be seen in Figure 5.18. We observe a high accuracy for almost all segments, and in this case even regions close to the recombination break-points have high predictive power. The only concern would be the last recombinant region, where the procedure reconstructed the true topology in a little bit more than half the replicates. This seems to be an artifact of the simulation scenario, since if we invert the order of the topologies in the alignment simulation we have much higher accuracy (results not shown). It is worth reminding that the MAP topology is not a good statistic in several situations, since it argued that much of the information contained in the posterior distribution of topologies is lost. A preferred alternative is to calculate a consensus topology, a multifurcating topology where to each edge we associate the fraction of topologies containing the split it spans [151]. We do not attempt this here since consensus topologies may have information not contained in any tree, and the inference of recombination would be hindered by the polytomic topologies. For the datasets of 16 sequences we calculated the accuracy only for the datasets where

100

0.8 0.7 0.6 0.5 0.4

reconstruction accuracy

0.9

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

0

128

256

384

512

site

Figure 5.18: Phylogenetic reconstruction accuracy per segment over 100 replicate alignments of 12 sequences. The blue points represent the fraction of datasets where the MAP topology is equal to the true one, per segment, using our procedure. The vertical gray lines represent the last segment before each recombination break-point. 100 segments were assumed, since for the 50 segments simulations we can not reconstruct the true topologies (some segments will span two topologies). The results are shown in Figure 5.19, where we observe a very good resolution result of the high information signal from the

0.0

reconstruction accuracy 0.2 0.4 0.6

0.8

alignment borders.

0

100

200

300

400

500

site

Figure 5.19: Phylogenetic reconstruction accuracy per segment over 200 replicate alignments of 16 sequences, assuming 100 segments of 5bp. The blue points represent the fraction of datasets where the MAP topology is equal to the true one, per segment, using our procedure. The vertical gray lines represent the last segment before the real recombination break-points.

5.3.3

Sensitivity and robustness

We have already seen how the procedure can handle rate heterogeneity among branches for one site since the simualted datasets have individual branches for each site but the Bayesian model integrate them out. The procedure is also robust to model mispecification, like in the 101

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

posterior sampling of 16 taxa assuming 50 segments. but even in the absence of recombination, spatially structured model heterogeneity can lead to apparent recombination [9]. To check against this possibility we simulated 256bp alignments composed of 32bp segments sharing the same topology but with rate heterogeneity over 8 taxa. The average substitution rate per site was set to 0.6 for almost all segments with exception of sites 129 to 192 where the average rate was fixed at 4.8 substitutions per site. κ was set to 1.4 for all sites, and the

0 2 −268.4 0.5 0.0

support

1.0

−460.5

entropy

−76.3

1

number of trees

4

distance

topology is the first tree in Figure 5.3.

0

32

64

96

128

segment

Figure 5.20: Summary statistics based on the posterior distribution of topologies per segment for one dataset simulated over 8 taxa, with no recombination but with rate heterogeneity. The topmost panel shows the average distance between segments in gray and the distance between MAP topologies in black. The second panel, from top to bottom, shows the total number of distinct topologies while in the third one we plotted the entropy per segment. The panel at the bottom shows the frequency of the most frequent (MAP) topology in blue and the second most frequent topology in red for each segment. Recombination detection was not affected by spatial rate heterogeneity, as can be seen from Figure 5.20. This figure describes the posterior distribution of topologies for one sim102

5.3. RECOMBINATION DETECTION ON SIMULATED SEQUENCES

ulated dataset. In this figure we observe that no recombination signal was found, and that the MAP topology has posterior support close to one, being equal to the true topology. Since our procedure integrates out individual substitution rates over branches, our parameter µi is the average substitution rate per branch. Thus an average of 0.6 substitutions per site over 8 taxa implies µi = 0.046 since we have 2 × 8 − 3 = 13 branches. The distribution of average substitution rates (averaged over branches) for each segment along one simulated alignment is displayed in Figure 5.21. The inter-quantile ranges are quite wide owing to the independence assumption over segments, but the overall distributions incorporate the true values in a satisfactory manner. We have in fact simulated 100 replicate datasets under this rate heterogeneity model, and for all datasets no signal of recombination appeared. The correct topology was reconstructed in 88% of the replicates. Our primary interest is not the estimation of individual site rates, but our method seems to be robust to model heterogeneity. In our model the independence of rates between segments accomodates this heterogeneity while avoiding

0.0

0.2

0.4

rate 0.6

0.8

1.0

1.2

overparametrization due to individual branches.

0

64

128 site

192

256

Figure 5.21: Posterior distribution of the average substitution rate for one replicate dataset with 8 taxa with rate heterogeneity. The blue bars are the 95% inter-quantile ranges, and the red dots are the median values of the average substitution rate per segment. The true values are represented by the black line.

103

5.4. CONVERGENCE

5.4

Convergence

First we have in Figures 5.22 and 5.23 the scaled reduction factor for two datasets on 8 taxa, simulated as described previously. In figure 5.22 we observe log[ P(θ | X )], the unnormalized posterior probability in log scale, and in figure 5.23 we observe the total number of recombinations, niSPR , based on two independent chains for each dataset. We plot the values of the scaled reduction factor over distinct segments of the simulation since we are interested in observing the stabilization of its value. Even running the chains for a period shorter than those used in the posterior estimation of break-points described before, we observe the stabilization of the chains for both parameters at a scale reduction factor close to one. Note that for

−1950

−1800

this calculation at least two independent and overdispersed chains are necessary.

0

100

200

300

400

500

300

400

500

300

400

500

300

400

500

1.00

shrink factor 1.10

Iterations

0

100

200

−2100 −2000 −1900

last iteration in chain

0

100

200

shrink factor 1.00 1.04 1.08

Iterations

0

100

200 last iteration in chain

Figure 5.22: Scale reduction factor of log [ P(θ | X )] for two simulated alignments of 8 taxa (the two top panels for one alignment and the two bottom panels for the other). Two independent chains were run for 105 generations, with a sampling interval of 200 generations, after a burnin period of 2 × 104 generations. For each alignment, the top panel is the raw trace file while the bottom panel is the potential scale reduction factor calculated over batches of length 20. The confidence intervals assume normality.

104

2

3

4

5

6

5.4. CONVERGENCE

0

100

200

300

400

500

300

400

500

300

400

500

300

400

500

1.00

shrink factor 1.10

Iterations

0

100

200

2.0

3.0

4.0

5.0

last iteration in chain

0

100

200

1.0

shrink factor 1.2

Iterations

0

100

200 last iteration in chain

Figure 5.23: Scale reduction factor of the total number of recombinations for two simulated alignments of 8 taxa (the two top panels for one alignment and the two bottom panels for the other). Two independent chains were run for 105 generations, with a sampling interval of 200 generations, after a burnin period of 2 × 104 generations. For each alignment, the top panel is the raw trace file while the bottom panel is the potential scale reduction factor calculated over batches of length 10. The confidence intervals assume normality. If we have only one posterior sampling, we may use instead the scaled regeneration quantile (SRQ) statistics. The traditional way of using the SRQ plots is by conditioning the observations on the modal value, which means observing the regeneration times a parameter returns to its modal value. We have in Figure 5.24 the SRQ plot for all replicate datasets, where we observed the times at which the total number of recombinations was equal to the modal value. Interpreting these graphics is somewhat arbitrary since the more visited a state is the smaller the apparent departure from the straight line, but since in all simulations the posterior sample size is the same, we can compare the simulation scenarios. The scenario with the higher variability in convergence was the simulation of 16 taxa assuming 100bp segments, and the scenario with fastest convergence was the simulation of 12 taxa. There does not seem to have a trend of slow convergence towards simulations that missed the true num-

105

5.4. CONVERGENCE

ber of recombinations. This may indicate that the procedure was converging, even though to a different configuration, wich suggests that this was the recombiation information contained

1.0 0.8 0.6 Ti/Tn 0.4 0.2 0.0

0.0

0.2

0.4

Ti/Tn

0.6

0.8

1.0

into the datasets.

0.0

0.2

0.4

0.6

0.8

1.0

0.0

0.2

0.4

i/n

(a) 8 taxa

0.8

1.0

(b) 12 taxa

0.8 0.6 0.4 0.2 0.0

0.0

0.2

0.4

Ti/Tn

0.6

0.8

1.0

100 segments

1.0

50 segments

Ti/Tn

0.6 i/n

0.0

0.2

0.4

0.6

0.8

1.0

i/n

0.0

0.2

0.4

0.6

0.8

1.0

i/n

(c) 16 taxa

Figure 5.24: Scaled regeneration quantile (SRQ) plots for the total number of recombinations, following MAP values, for simulations of 8, 12 and 16 sequence datasets. Each line corresponds to one replicate simulation, where the black lines represent simulations where the modal number of recombinations inferred correctly the number of recombinations (three recombinations for 8 and 16 taxa, and six recombinations for 12 taxa). The red lines represent datasets whose MAP number of recombinations was different from the true one. For the 16 taxa datasets, the left panel represent Bayesian inferences if the alignment is divided into 50 segments (of 10bps), and in the right panel inferences were made assuming 100 segments of 5bps.

106

60

80

60

80

segment

40

60

80

100

40

60

80

100

0

segment

Frequency

150

0.8

1.0

200

40

0

20

10

20

30

10

20

30

segment

40

60

80

100

80

100

80

100

0.0

0.4

0.8

segment posterior frequency

0

SRQ coeff variation

1.0 0.8 0.6

20

30

40

0

20

segment

0.4 0

20

0.8

100

0.6

100

Frequency 10

0.4

posterior frequency 20

0 50

100 200 300 0

segment

0.8 0

SRQ coeff variation

0.95

40

100

segment

0.85

20

80

0.4

100

segment

0

60

0.4

SRQ coeff variation

40

60

segment SRQ coeff variation

40

40

0.2

40

0.8

20

0.0

posterior frequency

0.8 0.4

20

0.0

0 0

segment

0.0 0

1.0

0.4

100

0.0

80

0.9

1.0

60

0.8

0

80

Frequency

120 40

0.7

SRQ coeff variation

0

40

Frequency

30 20

Frequency

10

20

segment posterior frequency

0.6

SRQ coeff variation

0 0

SRQ coeff variation

Frequency

0.4 0.5 0.6 0.7 0.8 0.9 1.0

16 taxa (100 seg)

0.0

1.00

100 200 300

0.95

0.8

0.90

SRQ coeff variation

0.6

0.85

16 taxa (50 seg)

0

50

150

12 taxa

0

20

40

Frequency

8 taxa

0

Frequency

5.4. CONVERGENCE

0

20

40

60

segment

Figure 5.25: Scaled regeneration quantile analyses for locations of break-points, for all replicates. The initial configurations (initial state of the sampler after the burn-in period) were followed up for regeneration times, and here we report the results of all replicate datasets based on the coefficient of variation calculated from the SRQ plot. Each column represents one scenario, where 100 replicates were pooled for 8 and 12 taxa, and 200 datasets were generated for 16 taxa (assuming 50 and 100 segments). The first row shows the distribution (histogram) of the SRQ coefficients of variation for all replicates. The second row shows the distribution of segments chosen to be followed up, and represents the distribution of inital states of the sampler. The third row shows the scatterplot of initial configuration against posterior frequency of such initial configuration, where the posterior frequency is the ratio between regenerations and iterations. The last row is a scatterplot between initial configuration and coefficient of variation of the regeneration times.

107

5.4. CONVERGENCE

20

40

60

80

segment

100

0

20

40

60

80

100

40

10

20

30

segment

40

0.8 0.4

50

0

50

20

40

60

80

100

80

100

segment 0.8

30

0.4

20

segment

0

segment

10

0.0

0

0.0

posterior frequency

0.8 0.4

100

SRQ coeff variation

80

0.8

60

segment

0.4

40

0.0

20

16 taxa (100 seg)

0.0

posterior frequency

0.8 0

SRQ coeff variation

0.8 0.4 0.0 0

0.4

100

SRQ coeff variation

80

0.8

60

segment

0.4

40

0.0

20

16 taxa (50 seg)

0.0

posterior frequency

0.8 0.4 0

SRQ coeff variation

12 taxa

0.0

posterior frequency

8 taxa

0

20

40

60

segment

Figure 5.26: Scaled regeneration quantile analyses for initial configuration of topologies (at a break-point region), for all replicates. The topologies before the first iteration of the sampler were followed up for regeneration times, and here we report the results of all replicate datasets based on the coefficient of variation calculated from the SRQ plot. Each column represents one scenario, where 100 replicates were pooled for 8 and 12 taxa, and 200 datasets were generated for 16 taxa (assuming 50 and 100 segments). The top panels show the scatterplot of initial configuration (topologies before each break-point) against posterior frequency of such initial configuration, where the posterior frequency is the fraction of the time the sampler visited the same topology for that segment. The panels at the bottom show the scatterplot between initial configuration and coefficient of variation of the regeneration times, where a coefficient of variation equal to zero means that the sampler regenerated less than six times.

108

Chapter 6 Detection of hot-spots in HIV-1 Recombinant HIV-1 variants that spread epidemically throughout a population of unrelated individuals are designated CRFs (circulating recombinant forms), and genomes of CRF viruses are mosaics comprised by regions derived from two or more distinct parental subtypes (Chapter 2 (p. 20)). Although in South American countries subtype B remains the most prevalent clade in the HIV-1 infection, there is a great variety of different BF recombinants (result of recombination between subtypes B and F) co-circulating in these countries [74–76]. In particular, in Argentina BF recombinants have reached elevated prevalence and seem to spread with enormous rate of dissemination [76, 152]. It has been suggested that BF viruses circulating in Argentina emerged from a single BF ancestor originating in Brazil [74]. These recombinants are routinely detected by phylogenetic methods based on a local sequence similarity between the putative recombinant and all possible parentals [3, 5]. Therefore, the regional distribution and the interaction among these recombinant forms might reflect a complex and an intricate network of interconnected host transmissions. In this context, it is expected that recombinations among HIV-1 BF variants will occur frequently and this events are currently been neglected by methods that exploit only the mosaic structure parentage structure along the sequence - of DNA sequences. By this reason we explored in more details the pattern of recombination in BF viruses from South American countries. Structural factors may be responsible for a variable crossover rate along the viral genome [57, 59], and help explain the existence of crossover hotspots in HIV-1, defined as regions where the template switching occurs preferencially [60, 61]. Recombinations detected by our

109

6.1. THE DATASET

procedure are a fraction of crossover events mantained by selection, and may be or not related with these crossover hotspots.

6.1

The dataset

To validate our procedure with experimental DNA sequences we analyzed near full-length HIV-1 genomes. First we selected BF recombinant sequences with similar mosaic pattern. These sequences where selected from an alignment of South-American BF recombinant sequences comprising 8402bp; we compared each one independently against reference subtypes F, B and C using the software DualBrothers [9]. Our final dataset consisted of 8 BF recombinant sequences with similar mosaic pattern plus 3 reference subtype sequences. Notice that by repeating the analysis for each putative recombinant we assume that the parental sequences are not involved in recombination, which is not necessary in our method. This analysis is described in Section 2.4 (p. 28), and the final 8 putative recombinant sequences are those in figures 2.10 and 2.12. We also included in the analysis three supposedly nonrecombinant sequences (“parentals”), from subtypes F, B and C. These are routinely used as reference parental subtypes. The putative recombinant sequences belong to the circulating recombinant form CRF12 BF, and in Table 6.1 we have a description of each sequence, including the “parental” subtypes. Table 6.1: Description of the 11 HIV-1 sequences used in the recombination detection analysis. The “code” field refers to the coding used when plotting the topologies. Code

Sequence name

Accession number

Year of sampling

Subtype

Country

1 2 3 4 5 6 7 8 F B C

CH12 12 BF.AR.99.ARMA159 12 BF.UY.99.URTR35 12 BF.AR.97.A32879 BF.AR.99.ARMA029 BF.AR.99.ARMA097 BF.AR.99.A027 BF.AR.99.A047 F1.BR.89.BZ126 B.BR.89.BZ167 C.92.BR025

AY536235 AF385936 AF385935 AF408629 AY037283 AY037280 AF332867 AF408627 AY173957 AY173956 U52953

2001 1999 1999 1997 1999 1999 1999 1999 1989 1989 1992

CRF 12 CRF 12 CRF 12 CRF 12 CRF 12 CRF 12 CRF 12 CRF 12 F1 B C

Chile Argentina Uruguay Argentina Argentina Argentina Argentina Argentina Brasil Brasil Brasil

110

6.2. MCMC PARAMETERS

6.2

MCMC parameters

We ran the sampler for 104 iterations as a warm-up, and then ran for 5 × 105 iterations sampling 1000 times, where the initial states for the warm-up were chosen based on five cycles of 500 iterations of simulated annealing with final temperature of 1.2 (initial temperature of 0.2). In this analysis we assumed 10bp segments, and since the genomic alignments are composed of 8402bp, we have 840 segments. This procedure was repeated for two independent chains to access convergence from overdispersed starting points, and the results reported here are based on the pooled chains. The starting point (initial state) was in fact the same, but the simulated annealing stage will overdisperse theses states. The convergence was acessed by visual inspection of the time series of the samples, posterior distribution and the scaled reduction factor [142] for the (unscaled) posterior probability, number of breakpoints and total estimated number of SPR moves. Each run took approximately 24 hours to complete.

6.3

Posterior distribution of SPR distances

In figure 6.1 we have the support for recombination based on dSPR estimated by our method and the posterior probability of recombination estimated by DualBrothers program. The results indicate that regions with higher probability of recombination (as indicated by DualBrothers, upper panel of figure 6.1) were also detected by dSPR (figure 6.1,lower panel). Therefore, both methods agreed in identifying recombination along HIV-1 sequences. However, our method detected much more phylogenetic heterogeneity, undetected by the independent recombination analysis we conducted with DualBrothers, indicating that despite the change in topology, the grouping of the recombinants among the parentals is preserved for these recombinations. This comes from the fact that in our analysis with DualBrothers we neglected the correlation between the recombinants, as is usually done when estimating the mosaic structure. We recall that while DualBrothers only gives us information about the probability of having a recombination break-point, our procedure provides us with the distribution of the total number of recombinations over a region. Our result with the proposed Bayesian hierarchical method may be indicating ongoing recombinations among CRF 12 viruses. The posterior distribution on the number of recombination breakpoints ranged between 30 and 47 with mode (and median) 37, while the sum 111

probability of recombination 0.0 0.1 0.2

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

DualBrothers

gag

vif

tat

tat

vpu

pol

nef

rev env

vpr rev

1000

2000

3000

4000

5000

6000

7000

8000

mean distance 1.0 1.5 2.0

2.5

0

0.0

0.5

biomc2

gag

vif

tat

tat

vpu

pol

nef

rev env

vpr rev

0

1000

2000

3000

4000

5000

6000

7000

8000

Figure 6.1: Posterior distribution of topology distances between segments for HIV-1 dataset, assuming 840 segments of 10bp. We show in the top panel the posterior probability of having a recombination break-point based on the DualBrothers software. On the bottom panel we have the mean distance per segment (distance between the segment and the next one) as inferred by our procedure, implemented in the software biomc2. Since the analysis using DualBrothers was conducted independently for each putative recombinant, this probability of break-point is given by the sum over individual distributions. of dSPR over the genome had a credibility interval of 55-77 SPR events with mode equal to 65 (figure 6.2. This finding supports the existence of recombination hotspots, since there are breakpoints harbouring more than one recombination event. Examples include the beginning and the end of the pol gene and at the tat/rev genes (lower panel of figure 6.1). The prior µ0 for the average branch length was around 0.25, and since we have 11 sequences (19 branches)

112

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

we have an expected substitution at every two sites, compatible with the values used in the simulations. number of recombinations 200 150 100

Frequency

100

0

0

50

50

Frequency

150

number of breakpoints

30

35

40

45

55

60

65

70

75

Figure 6.2: Posterior distribution of the total number of recombination break-points (on the left) and of the total number of recombinations (on the right) for the HIV-1 dataset, over 1000 samples for each chain. The pooled samples from two independent (converging) chains were used. The distances between MAP topologies for each segment indicate 49 and 52 breakpoints (for each independently sampled chain), an overestimation compared with sampled distances. This overestimation can be explained if we look at figure 6.5, where is shown the support (posterior frequency) for the two most frequent topologies for each segment along the alignment. There is virtually no difference in frequency between the MAP topology (most frequent) and some other topology after site 7500 of the alignment. Actually this figure also shows us which are the regions with higher phylogenetic signal (for instance, between sites 4400 and 4900) and regions where we the posterior distribution of trees is flatter and less reliable (like the region around site 2000 or after 7500). The accuracy of our recombination detection method is confirmed by observing the MAP topologies, represented by figures 6.3 and 6.4 (one figure for each independent chain). Breakpoints detected by DualBrothers indicate, in fact, inter-subtype recombinations according to our algorithm (looking at the clustering of the recombinant sequences with the parental ones). The agreement between recombination detection methods is further confirmed by observing these MAP topologies: regions affected by break-points detected by DualBrothers indicate, in fact, inter-subtype recombinations according to our procedure (regions where the positioning of B, F and C subtypes change). Despite the reported distances are just an approximation to 113

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

the recombination distance, they should not be inflated if the real recombination distances were one for all segments. Looking at the statistics based on the distribution of posterior topologies shown in Figure 6.5, we observe that the region in the end of the alignment (comprising genes nef, rev. tat and part of env) has conflicting phylogenetic signal even accounting for recombination. The convergence analysis is summarised in figures 6.6 and 6.7.

114

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

C C7 7F FB B5 54 43 3 22 88 66 11

1

2090

4390

5900

7500

7820

B BC CF F8 86 62 24 4 77 55 33 11 B BC CF F8 83 37 76 6 44 55 22 11 B BC C6 6F F8 84 42 2 33 77 55 11 B BC C5 53 38 82 26 6 77 44 FF 11 F F8 87 75 56 63 32 2 44 CC BB 11

B BC C2 2F F7 75 54 4 33 88 66 11

160

2190

5030

5970

7510

7870

F FC CB B8 86 67 75 5 22 44 33 11 B BC C3 36 64 47 78 8 55 22 FF 11 B BC C6 67 75 52 2F F 88 33 44 11 B BC C5 53 36 68 82 2 44 77 FF 11 B B5 5C CF F8 87 76 6 33 22 44 11

B BC C4 4F F8 83 37 7 22 55 66 11

380

2430

5120

6080

7520

7890

F FC C6 6B B8 83 37 7 55 22 44 11 B B2 27 78 84 46 63 3 55 CC FF 11 B BC CF F8 87 76 62 2 55 33 44 11 B BC C5 53 38 82 26 6 77 44 FF 11 B B5 5C C8 87 73 32 2 44 66 FF 11

B BC C6 6F F8 87 72 2 33 44 55 11

900

2870

5350

6580

7570

8050

B BC C4 48 83 3F F7 7 55 22 66 11 B B8 86 64 47 72 23 3 55 CC FF 11 B BC CF F8 87 75 54 4 22 33 66 11 B BC C3 35 56 68 82 2 44 77 FF 11 C C8 8F F3 32 27 76 6 44 BB 55 11

F FC C3 37 74 42 26 6 55 88 BB 11

1620

3240

5360

6890

7580

8060

B BC C2 26 6F F7 75 5 88 33 44 11 B BC C3 38 87 72 26 6 44 55 FF 11 B BC C8 8F F7 75 54 4 22 33 66 11 B B7 75 52 24 43 38 8 66 CC FF 11 F F8 83 3C CB B5 52 2 77 66 44 11

B BC CF F3 37 74 42 2 66 55 88 11

1920

3290

5480

7440

7620

8110

B BC CF F7 75 52 28 8 44 33 66 11 C C4 4F F7 76 63 32 2 55 88 BB 11 B BC C8 8F F7 75 54 4 22 66 33 11 B B7 75 52 24 43 3C C FF 66 88 11 B BC C3 38 8F F5 52 2 77 66 44 11

B BC CF F8 86 62 27 7 44 55 33 11

1970

3600

5620

7460

7680

8330

B BC CF F7 76 68 85 5 22 44 33 11 B BC C8 85 5F F7 72 2 66 44 33 11 B BC CF F7 75 54 42 2 88 66 33 11 B BC CF F6 67 73 32 2 44 88 55 11 B BC C3 36 6F F8 85 5 44 22 77 11

B BC CF F8 86 62 24 4 77 55 33 11

2040

3780

5780

7470

7690

B BC C3 35 54 46 6F F 88 77 22 11 B BC C4 4F F2 28 83 3 66 77 55 11 B BC C5 53 38 82 26 6 77 44 FF 11 B BC CF F7 73 32 26 6 44 88 55 11 B BC C6 67 7F F3 38 8 55 22 44 11

B BC CF F8 86 62 27 7 44 55 33 11

2050

4180

5790

7480

B BC CF F5 54 46 67 7 88 33 22 11 B BC C4 4F F2 26 68 8 33 77 55 11 B BC C5 53 36 68 82 2 44 77 FF 11 B BC CF F6 67 73 32 2 44 88 55 11

7760

8370

Figure 6.3: MAP topologies for HIV-1 dataset for one chain, arbitrarily rooted at subtype C (for visualization purposes only, the tree are in fact unrooted). Below each tree we indicate the first base where it occurs, and we report all distinct MAP topologies over the segments. The leaves correspond to the following taxa: (1) CH12, (2) 12 BF.AR.99.ARMA159, (3) 12 BF.UY.99.URTR35, (4) 12 BF.AR.97.A32879, (5) BF.AR.99.ARMA029, (6) BF.AR.99.ARMA097, (7) BF.AR.99.A027 and (8) BF.AR.99.A047.

115

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

C C7 7F FB B5 54 43 3 22 88 66 11

1

2360

5030

5980

7460

8110

F FC CB B7 75 52 24 4 88 66 33 11 B BC C3 36 64 47 78 8 55 22 FF 11 B BC C6 6F F2 28 83 3 77 55 44 11 B BC C3 38 86 67 75 5 22 44 FF 11 B BC C3 38 8F F5 52 2 77 66 44 11

B BC C2 2F F7 75 54 4 33 88 66 11

160

2500

5120

6000

7470

8330

F FC C6 6B B8 83 37 7 55 22 44 11 B B2 27 78 84 46 63 3 55 CC FF 11 B BC C6 67 75 52 2F F 88 33 44 11 B BC C5 53 38 82 26 6 77 44 FF 11 B BC C6 63 3F F5 58 8 22 44 77 11

B BC C4 4F F8 83 37 7 22 55 66 11

380

2870

5270

6080

7580

8370

B BC C4 48 83 3F F7 7 55 22 66 11 B B2 28 87 74 45 53 3 66 CC FF 11 B BC CF F8 87 76 65 5 22 33 44 11 B B7 75 52 24 43 38 8 66 CC FF 11 B BC C6 67 7F F8 83 3 22 55 44 11

B BC C6 6F F8 87 72 2 33 44 55 11

900

3290

5350

6090

7630

8380

B BC CF F7 75 52 26 6 88 33 44 11 B BC C3 38 87 72 26 6 44 55 FF 11 B BC CF F8 87 76 62 2 55 33 44 11 B B7 75 52 24 43 3C C FF 66 88 11 B BC C6 67 7F F8 83 3 55 22 44 11

F FC C3 37 74 42 26 6 55 88 BB 11

1620

3300

5480

6100

7690

B BC CF F7 75 52 28 8 44 33 66 11 C C4 4F F8 85 57 76 6 22 33 BB 11 B BC CF F8 87 76 65 5 22 33 44 11 B BC C5 5F F7 72 23 3 66 44 88 11 B BC C6 67 7F F8 83 3 22 55 44 11

B BC CF F8 86 62 24 4 77 55 33 11

1940

3560

5620

6320

7880

B BC CF F7 78 85 52 2 66 44 33 11 B BC C3 3F F8 85 57 7 66 22 44 11 B BC CF F8 87 76 65 5 22 33 44 11 B B5 5C CF F7 73 32 2 66 44 88 11

F FC CB B8 86 67 75 5 22 44 33 11

2190

3770

5780

6560

8050

B BC C3 35 54 46 6F F 88 77 22 11 B BC C4 4F F2 28 83 3 66 77 55 11 B BC CF F8 87 76 62 2 55 33 44 11 C C8 8F F3 32 27 76 6 44 BB 55 11

F FC CB B7 75 52 24 4 88 66 33 11

2230

4180

5880

6570

8060

B BC CF F5 54 46 67 7 88 33 22 11 B BC C4 4F F2 26 68 8 33 77 55 11 B BC CF F8 87 75 54 4 22 33 66 11 F F8 83 3C CB B5 52 2 77 66 44 11

F FC CB B8 86 67 75 5 22 44 33 11

2240

4390

5900

6890

B BC CF F8 83 37 76 6 44 55 22 11 B BC C6 6F F8 84 42 2 33 77 55 11 B BC C8 8F F7 75 54 4 22 33 66 11 C C8 8F F3 32 27 76 6 44 BB 55 11

8100

8390

Figure 6.4: MAP topologies for HIV-1 dataset for another chain, arbitrarily rooted at subtype C Below each tree we indicate the first base where it occurs, and we report all distinct MAP topologies over the segments. The correspondence between tips labels at the tree and the sequences is the same as in Figure 6.3.

116

2 0

distance

4

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

2000

4000

6000

8000

0

2000

4000

6000

8000

256 −18.5 0.29 0.02

support

0.56

−51.1

entropy

14.0

33

number of trees

480

0

segment

Figure 6.5: Summary statistics based on the posterior distribution of topologies per segment for HIV-1 dataset. The topmost panel shows the average distance between segments in gray and the distance between MAP topologies in black. The next two panels are the total number of distinct topologies and the entropy per segment. The panel at the bottom shows the frequency of the most frequent (MAP) topology in blue and the second most frequent topology in red for each segment. All values shown are the average between two independent chains.

117

log P(theta/X)

0

10

20

30

40

50

ACF 0.0 0.2 0.4 0.6 0.8 1.0

ACF 0.0 0.2 0.4 0.6 0.8 1.0

ACF 0.0 0.2 0.4 0.6 0.8 1.0

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

nSPR

0

10

20

30

40

50

10

20

30 Lag

10

20

40

50

30

40

50

30

40

50

ACF 0.0 0.2 0.4 0.6 0.8 1.0

Lag

ACF 0.0 0.2 0.4 0.6 0.8 1.0 0

0

Lag

ACF 0.0 0.2 0.4 0.6 0.8 1.0

Lag

nCOP

0

10

20

30 Lag

40

50

0

10

20 Lag

Figure 6.6: Autocorrelation analysis for MCMC sampling from HIV-1 dataset. The top and bottom panels refer to two independent chains, where we studied the correlation of the logarithm of the posterior probability (left), number of recombinations (center) and number of recombination break-points (right) along the posterior sampling. The lag corresponds to the sampled values, and thus a lag of one refers to iterations 50 steps apart.

118

6.3. POSTERIOR DISTRIBUTION OF SPR DISTANCES

1.00

−35600

−35200

shrink factor 1.05 1.10

1.15

−34800

log P(theta/X)

0

200

400

600

800

1000

800

1000

800

1000

0

200

400 600 last iteration in chain

800

1000

0

200

400 600 last iteration in chain

800

1000

0

200

400 600 last iteration in chain

800

1000

Iterations

55

1.0

60

1.2

65

70

shrink factor 1.4 1.6

75

1.8

nSPR

0

200

400

600 Iterations

30

1.00

35

40

shrink factor 1.10 1.20

45

1.30

nCOP

0

200

400

600 Iterations

Figure 6.7: Potential scale reduction factor (labelled “shrink factor” in the figure) for the unnormalized posterior probability (in log scale), number of recombinations and number of breeak-points based on two independent, overdispersed chains. On the left we have the trace plots (raw values) and on the right we have the reduction factor over batches of increment 20. The iterations refer to the sampled (thinned) values and not to the original number of iterations. Thus consecutive “iterations” referred to in the figure are in fact 500 iterations apart.

119

Chapter 7 Conclusions A variety of distinct methods have been developed to detect recombination [153]. They can be broadly classified into two classes, depending on the relative contributions of the recombinational and mutational processes [154]: the population genetic approach and the phylogenetic approach [19]. The population genetic approach uses the information of the linkage disequilibrium among segregating sites, assuming ubiquitous recombination. The linkage disequilibrium depends not only on the recombination rate between the sites but also on the the population history. Recombination rate and the population history are then estimated by introducing the ancestral recombination graphs (ARGs) as nuisance parameters (i.e., the population histories are averaged over all possible particular recombination scenarios) [18, 23, 25, 113]. The population genetic approach is efficient when recombination is pervasive along the genome, disrupting the phylogenetic signal. In this context recombination hotspots can be detected as regions where the recombination rate is higher than the local background rate [24, 25]. When the recombination rate is moderate compared with mutation rate, the sequences may be decomposed into a few segments that have specific phylogenetic histories. Instead of treating the recombination history as a nuisance parameter, the phylogenetic approach estimates the breakpoints and the phylogeny of the segments, assuming that some phylogenetic structure is preserved. Many techniques are based on sliding window procedures that compare the topology of one segment against neighboring segments or the whole alignment. This comparison may be based on the phenetic distance [3, 5], likelihood [4] or posterior distribu-

120

tion [6] of the topologies for each arbitrary segment. Hidden Markov Models [7, 8] regard topologies at sites as hidden states, where the transition probability penalizes the inconsistency of topology between neighboring sites. Bayesian change point models [9, 15] identify recombination breakpoints and differentiated substitution rates as change points of topologies and evolutionary rate parameters. While these Bayesian procedures have a sound statistical background, they can not reliably estimate the history of recombination events when the number of taxa increase, due to the large degree of freedom on topologies. So far inference of recombination in the phylogenetic approach has been restricted to the presence or absence of recombination break-points between sites, and detection of recombination hot-spots relied on unusual clustering patterns of these break-points along the genome. This is not likely to happen since short segments may not have enough phylogenetic signal to discriminate between competing topologies. By modelling the recombination distance between segments we penalize recombination scenarios where neighboring regions can only be explained by an excessive number of recombinations. With this correlation between sites even a single break-point has information about the minimum number of recombinations between the segments it comprises. This not only has a biological support but also makes the topology sampling problem computationally tractable. Our Bayesian hierarchical procedure not only detects the recombination breakpoints but also quantifies the disagreement between the trees. It therefore provides information regarding regions where recombinations occur frequently. The chance of correctly inferring the true tree is also higher than using other Bayesian procedures that neglect the similarity between trees on neighboring regions. Assuming a model of independent rates for each site and averaging over individual branch lengths proved to be useful in distinguishing recombination from non-random rate heterogeneity. Distinguishing one ancestral recombination (shared among many sequences) from a recombination hotspot (many recombinations rising independently) can be difficult [15]. The robustness of our procedure comes from the fact that a breakpoint cannot be pinpointed with arbitrary precision, and the prior on the SPR distance accommodates this compromise. The amount of recombination over a region can, therefore, be quantified regardless of the number of breakpoints just by looking at the sum of over this region. Credibility intervals can be constructed in the same way, by including all potential breakpoints (from larger to smaller posterior values), whose accumulated sum lies below some 121

threshold. For example, the 95% credibility interval for Y breakpoints (where Y is the posterior mean of the total number of breakpoints) can be found by summing up the posterior frequencies of recombination for each segment, where these frequencies are given by the number of samples in which the segment had a distance larger than zero. If the sum is conducted for segments ordered from larger to smaller posterior frequencies, the credibility interval is composed by all segments such that the sum is smaller than 0.95 times Y. The same reasoning can be applied to the inference of recombination cold spots, regions where recombination might lead to disruption of protein function. Applying our method to the HIV-1 dataset we detected a higher number of recombination break-points than that detected when parental sequences are assumed. The average of two recombinations per break-point is indeed an indication of existence of hot-spots. A scenario of one ancestral recombination giving rise to the diversity of a new recombinant subtype assumes that irrespective of intra-subtype recombination these recombinants should share a most recent common ancestor along all non-recombinant regions. Our results do not support a common ancestral origin for these recombinant sequences, at least for the chosen reference parental sequences, since the putative recombinants do not form a monophyletic group among segments. We conclude with the final remark that even for datasets displaying an identical recombination mosaic pattern, it is imperative to check for phylogenetic incongruences within the dataset, and do not rely on the breakpoints defined by the mosaic.

122

Bibliography [1] Worobey M, Holmes EC (1999) Evolutionary aspects of recombination in RNA viruses. J Gen Virol 80 ( Pt 10):2535–43. [2] Chan CX, Beiko RG, Ragan MA (2006) Detecting recombination in evolving nucleotide sequences. BMC Bioinformatics 7:412. [3] Weiller GF (1998) Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Mol Biol Evol 15:326–35. [4] Grassly NC, Holmes EC (1997) A likelihood method for the detection of selection and recombination using nucleotide sequences. Mol Biol Evol 14:239–47. [5] Salminen MO, Carr JK, Burke DS, McCutchan FE (1995) Identification of breakpoints in intergenotypic recombinants of HIV type 1 by bootscanning. AIDS Res Hum Retroviruses 11:1423–5. [6] Husmeier D, Wright F (2001) Probabilistic divergence measures for detecting interspecies recombination. Bioinformatics 17 Suppl 1:S123–31. [7] Husmeier D, Wright F (2001) Detection of recombination in DNA multiple alignments with hidden Markov models. J Comput Biol 8:401–27. [8] Husmeier D (2005) Discriminating between rate heterogeneity and interspecific recombination in DNA sequence alignments with phylogenetic factorial hidden Markov models. Bioinformatics 21 Suppl 2:ii166–ii172. [9] Minin VN, Dorman KS, Fang F, Suchard MA (2005) Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics 21:3034–42. [10] Balakrishnan V (1995) Network optimization. Chapman & Hall New York. [11] Swofford D, Olsen G, Waddell P, Hillis D, Hillis D, et al. (1996) Molecular Systematics. Sinauer Associates. [12] Everitt BS, Landau S, Leese M (2001) Cluster Analysis. Arnold Publishers. [13] Paraskevis D, Deforche K, Lemey P, Magiorkinis G, Hatzakis A, et al. (2005) SlidingBayes: exploring recombination using a sliding window approach based on Bayesian phylogenetic inference. Bioinformatics 21:1274–5. [14] Hasegawa M, Kishino H, Yano T (1985) Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 22:160–74. [15] Minin VN, Dorman KS, Fang F, Suchard MA (2006) Phylogenetic Mapping of Recombination Hot-Spots in HIV via Spatially Smoothed Change-Point Processes. Genetics :in press, genetics.106.066258doi:10.1534/genetics.106.066258. 123

BIBLIOGRAPHY

[16] Fearnhead P, Donnelly P (2002) Approximate likelihood methods for estimating local recombination rates. Journal of the Royal Statistical Society Series B(Statistical Methodology) 64:657–680. [17] Li N, Stephens M (2003) Modeling Linkage Disequilibrium and Identifying Recombination Hotspots Using Single-Nucleotide Polymorphism Data. Genetics 165:2213–2233. [18] McVean G, Awadalla P, Fearnhead P (2002) A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160:1231–41. [19] Awadalla P (2003) The evolutionary genomics of pathogen recombination. Nature Reviews Genetics 4:50–60. [20] Crawford D, Bhangale T, Li N, Hellenthal G, Rieder M, et al. (2004) Evidence for substantial fine-scale variation in recombination rates across the human genome. Nature Genetics 36:700–706. [21] McVean G, Myers S, Hunt S, Deloukas P, Bentley D, et al. (2004) The fine-scale structure of recombination rate variation in the human genome. Science 304:581–4. [22] Myers S, Bottolo L, Freeman C, McVean G, Donnelly P (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science 310:321–4. [23] Nielsen R (2000) Estimation of Population Parameters and Recombination Rates From Single Nucleotide Polymorphisms. Genetics 154:931–942. [24] Wiuf C, Posada D (2003) A Coalescent Model of Recombination Hotspots. Genetics 164:407–417. [25] Fearnhead P, Harding RM, Schneider JA, Myers S, Donnelly P (2004) Application of coalescent methods to reveal fine-scale rate variation and recombination hotspots. Genetics 167:2067–81. [26] Shriner D, Rodrigo AG, Nickle DC, Mullins JI (2004) Pervasive genomic recombination of HIV-1 in vivo. Genetics 167:1573–83. [27] Jetzt AE, Yu H, Klarmann GJ, Ron Y, Preston BD, et al. (2000) High rate of recombination throughout the human immunodeficiency virus type 1 genome. J Virol 74:1234–40. [28] Zhuang J, Jetzt AE, Sun G, Yu H, Klarmann G, et al. (2002) Human immunodeficiency virus type 1 recombination: rate, fidelity, and putative hot spots. J Virol 76:11273–82. [29] Chen J, Rhodes TD, Hu WS (2005) Comparison of the genetic recombination rates of human immunodeficiency virus type 1 in macrophages and t cells. J Virol 79:9337–40. [30] Levy DN, Aldrovandi GM, Kutsch O, Shaw GM (2004) Dynamics of HIV-1 recombination in its natural target cells. Proc Natl Acad Sci U S A 101:4204–9. [31] Mukherjee S, Lee HLR, Ron Y, Dougherty JP (2006) Proviral progeny of heterodimeric virions reveal a high crossover rate for human immunodeficiency virus type 2. J Virol 80:12402–7. [32] Flynn JA, An W, King SR, Telesnitsky A (2004) Nonrandom Dimerization of Murine Leukemia Virus Genomic RNAs. J Virol 78:12129–12139. [33] Stavrinides J, Guttman DS (2004) Mosaic evolution of the severe acute respiratory syndrome coronavirus. J Virol 78:76–82. 124

BIBLIOGRAPHY

[34] Simmonds P, Midgley S (2005) Recombination in the genesis and evolution of hepatitis B virus genotypes. J Virol 79:15467–76. [35] van Cuyck H, Fan J, Robertson DL, Roques P (2005) Evidence of recombination between divergent hepatitis E viruses. J Virol 79:9306–14. [36] Simmonds P, Welch J (2006) Frequency and dynamics of recombination within different species of human enteroviruses. J Virol 80:483–93. ¨ T, Sammons S, Schmid DS, et al. (2006) Complete[37] Norberg P, Liljeqvist JA, Bergstrom genome phylogenetic approach to varicella-zoster virus evolution: genetic divergence and evidence for recombination. J Virol 80:9569–76. [38] Peters GA, Tyler SD, Grose C, Severini A, Gray MJ, et al. (2006) A full-genome phylogenetic analysis of varicella-zoster virus reveals a novel origin of replication-based genotyping scheme and evidence of recombination between major circulating clades. J Virol 80:9850–60. [39] Chen J, Powell D, Hu WS (2006) High frequency of genetic recombination is a common feature of primate lentivirus replication. J Virol 80:9651–8. [40] Moutouh L, Corbeil J, Richman DD (1996) Recombination leads to the rapid emergence of HIV-1 dually resistant mutants under selective drug pressure. Proc Natl Acad Sci U S A 93:6106–11. [41] Kellam P, Larder BA (1995) Retroviral recombination can lead to linkage of reverse transcriptase mutations that confer increased zidovudine resistance. J Virol 69:669–74. [42] Nikolenko GN, Svarovskaia ES, Delviks KA, Pathak VK (2004) Antiretroviral Drug Resistance Mutations in Human Immunodeficiency Virus Type 1 Reverse Transcriptase Increase Template-Switching Frequency. J Virol 78:8761–8770. [43] Muller H (1932) Some Genetic Aspects of Sex. The American Naturalist 66:118–138. [44] Crow J, Kimura M (1965) Evolution in Sexual and Asexual Populations. The American Naturalist 99:439–450. [45] Maynard Smith J (1971) What use is sex? J Theor Biol 30:319–335. [46] Felsenstein J (1974) The evolutionary advantage of recombination. Genetics 78:737–756. ¨ [47] Bretscher M, Althaus C, Muller V, Bonhoeffer S (2004) Recombination in HIV and the evolution of drug resistance: for better or for worse? BioEssays 26:180–188. [48] Keightley PD, Otto SP (2006) Interference among deleterious mutations favours sex and recombination in finite populations. Nature 443:89–92. [49] Bonhoeffer S, Chappey C, Parkin NT, Whitcomb JM, Petropoulos CJ (2004) Evidence for positive epistasis in HIV-1. Science 306:1547–50. [50] Sanju´an R, Moya A, Elena SF (2004) The contribution of epistasis to the architecture of fitness in an rna virus. Proc Natl Acad Sci U S A 101:15376–9. [51] Muller H (1964) The relation of recombination to mutational advance. Mutation Research 106:2–9.

125

BIBLIOGRAPHY

[52] Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al. (2005) Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses. PLoS Biol 3:e300. [53] Stevens S, Griffith J (1996) Sequence analysis of the human DNA flanking sites of human immunodeficiency virus type 1 integration. J Virol 70:6459–6462. ¨ [54] Schroder A, Shinn P, Chen H, Berry C, Ecker J, et al. (2002) HIV-1 Integration in the Human Genome Favors Active Genes and Local Hotspots. Cell 110:521–529. [55] Crise B, Li Y, Yuan C, Morcock DR, Whitby D, et al. (2005) Simian Immunodeficiency Virus Integration Preference Is Similar to That of Human Immunodeficiency Virus Type 1. J Virol 79:12199–12204. [56] Temin HM (1993) Retrovirus variation and reverse transcription: abnormal strand transfers result in retrovirus genetic variation. Proc Natl Acad Sci U S A 90:6900–3. [57] Konstantinova P, de Haan P, Das AT, Berkhout B (2006) Hairpin-induced trna-mediated (hitme) recombination in HIV-1. Nucleic Acids Res 34:2206–18. [58] Moumen A, Polomack L, Unge T, Veron M, Buc H, et al. (2003) Evidence for a Mechanism of Recombination during Reverse Transcription Dependent on the Structure of the Acceptor RNA. Journal of Biological Chemistry 278:15973–15982. [59] Lanciault C, Champoux JJ (2006) Pausing during reverse transcription increases the rate of retroviral recombination. J Virol 80:2483–94. [60] Galetto R, Moumen A, Giacomoni V, V´eron M, Charneau P, et al. (2004) The structure of HIV-1 genomic RNA in the gp120 gene determines a recombination hot spot in vivo. J Biol Chem 279:36625–32. [61] Baird HA, Galetto R, Gao Y, Simon-Loriere E, Abreha M, et al. (2006) Sequence determinants of breakpoint location during hiv-1 intersubtype recombination. Nucleic Acids Res 34:5203–16. [62] Chin MPS, Rhodes TD, Chen J, Fu W, Hu WS (2005) Identification of a major restriction in HIV-1 intersubtype recombination. Proc Natl Acad Sci U S A 102:9002–7. [63] Dykes C, Balakrishnan M, Planelles V, Zhu Y, Bambara RA, et al. (2004) Identification of a preferred region for recombination and mutation in HIV-1 gag. Virology 326:262–79. [64] Mansky LM, Temin HM (1995) Lower in vivo mutation rate of human immunodeficiency virus type 1 than that predicted from the fidelity of purified reverse transcriptase. J Virol 69:5087–94. [65] Mansky L (1998) Retrovirus mutation rates and their role in genetic variation. Journal of General Virology 79:1337–1345. [66] Nei M, Kumar S (2000) Molecular Evolution and Phylogenetics. Oxford University Press, USA. [67] Onafuwa A, An W, Robson ND, Telesnitsky A (2003) Human Immunodeficiency Virus Type 1 Genetic Recombination Is More Frequent Than That of Moloney Murine Leukemia Virus despite Similar Template Switching Rates. J Virol 77:4577–4587.

126

BIBLIOGRAPHY

[68] Chohan B, Lavreys L, Rainwater SMJ, Overbaugh J (2005) Evidence for frequent reinfection with human immunodeficiency virus type 1 of a different subtype. J Virol 79:10701–8. [69] Marchant D, Neil SJD, McKnight A (2006) Human immunodeficiency virus types 1 and 2 have different replication kinetics in human primary macrophage culture. J Gen Virol 87:411–418. doi:10.1099/vir.0.81391-0. [70] Leitner T, Foley B, Hahn B, Marx P, McCutchan F, et al. (2005) HIV Sequence Compendium 2005. Technical Report LA-UR-06-0680, Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, Group T-10, Mail Stop K710 , Los Alamos, New Mexico 87545 U.S.A. URL http://hiv.lanl.gov. [71] Rodrigo AG, Learn GH, editors (2001) Computational And Evolutionary Analysis of HIV Molecular Sequences. Kluwer Academic Publishers. [72] Gao F, Robertson D, Morrison S, Hui H, Craig S, et al. (1996) The heterosexual human immunodeficiency virus type 1 epidemic in Thailand is caused by an intersubtype (A/E) recombinant of African origin. J Virol 70:7013–7029. [73] Carr J, Salminen M, Koch C, Gotte D, Artenstein A, et al. (1996) Full-length sequence and mosaic structure of a human immunodeficiency virus type 1 isolate from Thailand. J Virol 70:5935–5943. [74] Sierra M, Thomson MM, R´ıos M, Casado G, de Castro RO, et al. (2005) The analysis of near full-length genome sequences of human immunodeficiency virus type 1 BF intersubtype recombinant viruses from Chile, Venezuela and Spain reveals their relationship to diverse lineages of recombinant viruses related to CRF12 BF. Infect Genet Evol 5:209–17. [75] Carr JK, Avila M, Carrillo MG, Salomon H, Hierholzer J, et al. (2001) Diverse BF recombinants have spread widely since the introduction of HIV-1 into south america. AIDS 15:F41–7. [76] Thomson MM, Delgado E, Herrero I, Villahermosa ML, de Parga EV, et al. (2002) Diversity of mosaic structures and common ancestry of human immunodeficiency virus type 1 BF intersubtype recombinant viruses from Argentina revealed by analysis of near full-length genome sequences. J Gen Virol 83:107–19. [77] Allen B, Steel M (2001) Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5:1–15. [78] Song YS, Hein J (2005) Constructing minimal ancestral recombination graphs. J Comput Biol 12:147–69. [79] Jotun Hein MHSCW (2005) Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press. URL http://books.google.com/ books?id=QBC_SFOamksC. [80] Beiko RG, Hamilton N (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6:15. [81] Song Y (2003) On the Combinatorics of Rooted Binary Phylogenetic Trees. Annals of Combinatorics 7:365–379.

127

BIBLIOGRAPHY

[82] Torsello A, Hidovic D, Pelillo M (2004) Four metrics for efficiently comparing attributed trees. Pattern Recognition, 2004 ICPR 2004 Proceedings of the 17th International Conference on 2:467–470. [83] Hein J (1990) Reconstructing evolution of sequences subject to recombination using parsimony. Math Biosci 98:185–200. [84] Felsenstein J (2004) Inferring phylogenies. Sinauer Associates Sunderland, Mass., USA. [85] Nakhleh L, Warnow T, Linder CR, John KS (2005) Reconstructing reticulate evolution in species-theory and practice. J Comput Biol 12:796–811. [86] Chataigner F (2005) Approximating the Maximum Agreement Forest on k trees. Information processing letters 93:239–244. [87] Nakhleh L, Ruths D, Wang L (2005) RIATA-HGT: A Fast and Accurate Heuristic for Reconstructing Horizontal Gene Transfer. Lect Notes Comput Sci 3595:84–93. [88] Hallett M, Lagergren J (2001) Efficient algorithms for lateral gene transfer problems. Proc Fifth Ann Intl Conf Comput Biol :149–156. [89] Beiko RG, Harlow TJ, Ragan MA (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102:14332–7. [90] Ge F, Wang LS, Kim J (2005) The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol 3:e316. [91] Robinson D, Foulds L (1981) Comparison of phylogenetic trees. Math Biosci 53:131–147. [92] Lin Y, Hsu T (2003) Efficient Algorithms for Descendent Subtrees Comparison of Phylogenetic Trees with Applications to Co-evolutionary Classifications in Bacterial Genome. Lecture Notes in Computer Science 2906:339–351. [93] Steel M, Warnow T (1993) Kaikoura tree theorems: Computing the maximum agreement subtree. Information Proc Letters 48:77–82. [94] Goddard W, Kubicka E, Kubicki G, McMorris F (1994) The agreement metric for labeled binary trees. Math Biosci 123:215–26. [95] Berger-Wolf T (2004) Online Consensus and Agreement of Phylogenetic Trees. Proceedings of the 4th International Workshop on Algorithms in Bioinformatics :350–361. [96] Berry V, Guillemot S, Nicolas F, Paul C (2005) On the Approximation of Computing Evolutionary Trees. Lect Notes Comput Sci 3595:115–125. doi:10.1007/11533719. [97] Farach M, Thorup M (1994) Optimal evolutionary tree comparison by sparse dynamic programming. Foundations of Computer Science, 1994 Proceedings, 35th Annual Symposium on :770–779. [98] Cole R, Farach-Colton M, Hariharan R, Przytycka T, Thorup M (2001) An O(nlog n) algorithm for the maximum agreement subtree problem for binary trees. SIAM Journal on Computing 30:1385–1404. [99] Lam T, Sung W, Ting H (1996) Computing the Unrooted Maximum Agreement Subtree in Sub-quadratic Time. Proceedings of the 5th Scandinavian Workshop on Algorithm Theory :124–135. 128

BIBLIOGRAPHY

[100] Farach M, Przytycka T, Thorup M (1995) On the Agreement of Many Trees. Information Processing Letters 55:297–301. [101] Song Y, Hein J (2003) Parsimonious Reconstruction of Sequence Evolution and Haplotype Blocks: Finding the Minimum Number of Recombination Events. Algorithms in Bioinformatics: Third International Workshop, WABI 2003, Budapest, Hungary, September 15-20, 2003: Proceedings :287–302. [102] Hein J (1993) A heuristic method to reconstruct the history of sequences subject to recombination. Journal of Molecular Evolution 36:396–405. [103] andFrank Dehne GH, Rau-Chaplin A, Blouin C (2008) Spr distance computation for unrooted trees. Evolutionary Bioinformatics 2008 4:17–27. [104] Bryant D, Moulton V (2004) Neighbor-net: an agglomerative method for the construction of phylogenetic networks. Mol Biol Evol 21:255–65. [105] Hordijk W, Gascuel O (2005) Improving the efficiency of SPR moves in phylogenetic tree search methods based on maximum likelihood. Bioinformatics 21:4338–47. [106] Suchard MA (2005) Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics 170:419–31. [107] Gentleman R, Biocore (2006) geneplotter: Grapics related functions for Bioconductor. R package version 1.12.0. [108] Gelman A, Carlin JB, Stern HS, and DBR (2004) Bayesian Data Analysis. Chapman & Hall/CRC, 2nd edition. [109] O’Hagan A (1994) Kendall’s Advanced theory of Statistics, volume 2B:Bayesian Inference. Arnold Publishers. [110] Suchard MA, Weiss RE, Sinsheimer JS (2005) Models for estimating Bayes factors with applications to phylogeny and tests of monophyly. Biometrics 61:665–73. [111] Suchard MA, Weiss RE, Dorman KS, Sinsheimer JS (2002) Oh brother, where art thou? a Bayes factor test for recombination with uncertain heritage. Syst Biol 51:715–28. [112] Husmeier D, McGuire G (2002) Detecting recombination with mcmc. Bioinformatics 18 Suppl 1:S345–53. [113] Kuhner MK, Yamato J, Felsenstein J (2000) Maximum likelihood estimation of recombination rates from population data. Genetics 156:1393–401. [114] Wilson DJ, McVean G (2006) Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172:1411–25. [115] Kuhner MK, Smith LP (2007) Comparing likelihood and bayesian coalescent estimation of population parameters. Genetics 175:155–165. doi:10.1534. [116] Redelings B, Suchard M (2005) Joint Bayesian Estimation of Alignment and Phylogeny. Systematic Biology 54:401–418. [117] Opgen-Rhein R, Fahrmeir L, Strimmer K (2005) Inference of demographic history from genealogical trees using reversible jump Markov chain Monte Carlo. BMC Evolutionary Biology 5:6–6. 129

BIBLIOGRAPHY

[118] Drummond A, Rambaut A, Shapiro B, Pybus O (2005) Bayesian Coalescent Inference of Past Population Dynamics from Molecular Sequences. Molecular Biology and Evolution 22:1185–1192. [119] Drummond A, Ho S, Phillips M, Rambaut A (2006) Relaxed phylogenetics and dating with confidence. PLoS Biol 4:e88. [120] Thorne JL, Kishino H, Painter IS (1998) Estimating the rate of evolution of the rate of molecular evolution. Mol Biol Evol 15:1647–57. [121] Kishino H, Thorne JL, Bruno WJ (2001) Performance of a divergence time estimation method under a probabilistic model of rate evolution. Mol Biol Evol 18:352–61. [122] Thorne JL, Kishino H (2002) Divergence time and evolutionary rate estimation with multilocus data. Syst Biol 51:689–702. [123] Mau B, Newton MA, Larget B (1999) Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55:1–12. [124] Li S, Pearl D, Doss H (2000) Phylogenetic Tree Construction Using Markov Chain Monte Carlo. J Am Stat Assoc 95:493–508. [125] Rannala B, Yang Z (1996) Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. J Mol Evol 43:304–11. [126] Huelsenbeck JP, Ronquist F (2001) MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17:754–5. [127] Huelsenbeck JP, Larget B, Miller RE, Ronquist F (2002) Potential applications and pitfalls of Bayesian inference of phylogeny. Syst Biol 51:673–88. [128] Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F (2004) Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20:407– 15. [129] Green P (2003) Trans-dimensional Markov chain Monte Carlo. Stochastic Systems :179–198.

Highly Structured

[130] Al-Awadhi F, Hurn M, Jennison C (2004) Improving the acceptance rate of reversible jump MCMC proposals. Statistics and Probability Letters 69:189–198. [131] Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–76. [132] Yang Z (1994) Estimating the pattern of nucleotide substitution. J Mol Evol 39:105–11. [133] Suchard M, Weiss R, Dorman K, Sinsheimer J (2003) Inferring spatial phylogenetic variation along nucleotide sequences: A multiple changepoint model. J Am Stat Assoc 98:427–438. [134] Waddell PJ, Steel MA (1997) General time-reversible distances with unequal rates across sites: mixing gamma and inverse gaussian distributions with invariant sites. Mol Phylogenet Evol 8:398–414. [135] Yang Z (1994) Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. J Mol Evol 39:306–14. 130

BIBLIOGRAPHY

[136] Yang Z (1995) A space-time process model for the evolution of DNA sequences. Genetics 139:993–1005. [137] Dimatteo I, Genovese C, Kass R (2001) Bayesian curve-fitting with free-knot splines. Biometrika 88:1055–1071. [138] Al-Awadhi F, Jennison C, Hurn M (2004) Statistical image analysis for a confocal microscopy two-dimensional section of cartilage growth. Journal of the Royal Statistical Society Series C(Applied Statistics) 53:31–49. [139] Kou S, Zhou Q, Wong W (2006) Equi-Energy Sampler with Applications in Statistical Inference and Statistical Mechanics. Annals of Statistics 34:1581–1619. [140] Brooks S, Giudici P (2000) MCMC convergence assessment via two-way ANOVA. Journal of Computational and Graphical Statistics 9:266–285. [141] Liu J (2001) Monte Carlo Strategies in Scientific Computing. Springer. [142] Gelman A, Rubin D (1992) Inference from Iterative Simulation Using Multiple Sequences. Statistical Science 7:457–472. [143] Brockwell P, Davis R (1991) Time series: theory and methods. Springer Series in Statistics. Springer-Verlag New York, Inc. New York, NY, USA, 2nd edition. [144] Geyer C (1992) Practical Markov Chain Monte Carlo. Statistical Science 7:473–483. [145] Mykland P, Tierney L, Yu B (1995) Regeneration in Markov Chain Samplers. Journal of the American Statistical Association 90:233–241. [146] Massey Jr F (1951) The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association 46:68–78. [147] Brooks S, Gelman A (1998) General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics 7:434–455. [148] Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. CABIOS 13:555–556. [149] Ronquist F, Larget B, Huelsenbeck JP, Kadane JB, Simon D, et al. (2006) Comment on “phylogenetic mcmc algorithms are misleading on mixtures of trees”. Science 312:367; author reply 367. [150] Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–4. [151] Bryant D (2003) A classification of consensus methods for phylogenetics. Bioconsensus (M Janowitz, FJ Lapointe, F McMorris, B Mirkin and F Roberts, eds) DIMACS-AMS, Providence :1–21. [152] Aulicino PC, Holmes EC, Rocco C, Mangano A, Sen L (2007) Extremely rapid spread of Human Immunodeficiency Virus Type 1 BF recombinants in Argentina. J Virol 81:427– 429. doi:10.1128/JVI.01403-06. [153] Posada D (2002) Evaluation of methods for detecting recombination from dna sequences: Empirical data. Mol Biol Evol 19:708–717. [154] Posada D, Crandall KA, Holmes EC (2002) Recombination in evolutionary genomics. Annu Rev Genet 36:75–97. 131