Pattern Recognition in Computational Molecular Biology


CHAPTER 23

DIVERSE CONSIDERATIONS FOR SUCCESSFUL PHYLOGENETIC TREE RECONSTRUCTION: IMPACTS FROM MODEL MISSPECIFICATION, RECOMBINATION, HOMOPLASY, AND PATTERN RECOGNITION

Diego Mallo1, Agustín Sánchez-Cobos2, and Miguel Arenas2,3

1 Department of Biochemistry, Genetics and Immunology, University of Vigo, Vigo, Spain
2 Bioinformatics Unit, Centre for Molecular Biology “Severo Ochoa” (CSIC), Madrid, Spain
3 Institute of Molecular Pathology and Immunology, University of Porto (IPATIMUP), Porto, Portugal

Pattern Recognition in Computational Molecular Biology: Techniques and Approaches, First Edition. Edited by Mourad Elloumi, Costas S. Iliopoulos, Jason T. L. Wang, and Albert Y. Zomaya. © 2016 John Wiley & Sons, Inc. Published 2016 by John Wiley & Sons, Inc.


23.1 INTRODUCTION

Phylogenetic tree reconstruction provides an analysis of the evolutionary relationships among genetic sequences. In population genetics and molecular evolution, these relationships help to understand processes such as species history and speciation [18, 62], the demographic history of populations [20, 67], the evolution of genes and protein families [15, 57], the emergence of new protein functions [73, 102], coevolution [42], or comparative genomics [79]. Moreover, phylogenetic trees can be used to detect signatures of selection [49] or to perform ancestral sequence reconstruction [104]. These applications have resulted in an increasing number of studies on phylogenetic reconstruction (Figure 23.1).

Nowadays, phylogenetic tree reconstruction is straightforward for any researcher given the current variety of user-friendly software. Although the first phylogenetic programs ran on the command line, current frameworks often implement a Graphical User Interface (GUI) with which nonexperts can easily perform phylogenetic reconstructions [50, 95]. However, such attractive simplicity may sometimes generate incorrect phylogenetic reconstructions because a variety of processes are ignored. For example, it is known that substitution model misspecification, ignored recombination, or homoplasy can result in incorrect phylogenies [10, 56, 87, 91]. Other evolutionary processes such as Gene Duplication and Loss (GDL), Incomplete Lineage Sorting (ILS), Horizontal Gene Transfer (HGT), and gene flow between species can also bias species tree reconstruction by making gene and species histories discordant [32, 52, 62]. Briefly, GDL describes how a piece of genetic material (locus) is copied to a different place in the genome (duplication) or erased (loss).
It is a well-known source of genetic variation (from Ohno’s seminal book [71] to recent reviews [112]) and is usually described as the main evolutionary force driving gene family evolution. ILS (also known as deep coalescence), in contrast, describes a polymorphism present in an ancestral population that is retained across at least two speciation events and subsequently sorts in a way that is incongruent with the species phylogeny. Finally, HGT describes transfers of genetic material (one or a few loci) from one species to another.

Since phylogenetic software always outputs a phylogeny, one could believe that such a phylogeny is correct when it may not be. Therefore, it is important to remember the different evolutionary aspects that can affect phylogenetic tree reconstruction and how we can account for them. In this chapter, we describe the influences of diverse evolutionary phenomena on phylogenetic tree reconstruction and, when possible, we provide strategies to consider such phenomena. From this perspective, we finally discuss the future of phylogenetic tree reconstruction frameworks.

23.2 OVERVIEW ON METHODS AND FRAMEWORKS FOR PHYLOGENETIC TREE RECONSTRUCTION

The phylogenetic reconstruction of a given genetic sample can be performed at both the gene and the species levels, hence studying two different biological histories. Note that gene and species trees may differ owing to evolutionary processes such as GDL, ILS, or HGT [62]. As a consequence, phylogenetic methods for gene tree reconstruction are based on approaches different from those applied for species tree reconstruction [58]. In this section, we briefly describe the commonly used methods and frameworks to perform phylogenetic tree reconstruction.

Figure 23.1 The number of articles per year with the term “phylogenetic reconstruction” in the title or the abstract, as measured by PubMed (search conducted on 18 February 2014). Axes: year (1970–2014) and number of publications (0–240).

23.2.1 Inferring Gene Trees

Three main approaches are currently applied to reconstruct phylogenetic trees, namely, distance-based methods (e.g., Neighbor-Joining (NJ) [85]), Maximum Likelihood (ML) methods [34], and Bayesian methods [45]. The goal of distance-based methods is a fast phylogenetic reconstruction from large amounts of data. ML methods are slower than distance-based methods but can generate much more accurate inferences owing to the consideration of a substitution model of evolution and ML optimizations. Finally, Bayesian methods differ from the previous methods in taking an integrative approach, estimating distributions of trees instead of point estimates. Bayesian methods can also incorporate complex and flexible models such as demographics, relaxed molecular clocks, and longitudinal sampling [31]. Owing to their complexity, these methods usually require a much higher computational cost to yield reliable estimations (reaching convergence between different Markov Chain Monte Carlo (MCMC) runs). In return, Bayesian methods may generate more information and sometimes provide more realistic reconstructions [31, 46]. For further details about these three approaches, we direct the reader to References [29, 36, 69, 103].

A number of programs have been developed to infer phylogenetic trees from nucleotide, codon, and amino acid sequences. Table 23.1 shows an updated list of the most commonly used programs. Concerning distance-based phylogenetic inferences, we would recommend, from our practical experience, the programs HyPhy [50], SplitsTree [47], and MEGA [95], which support a variety of formats for the input Multiple Sequence Alignment (MSA) and are user-friendly. Concerning ML inferences, RAxML [92] is recommended for dealing with large amounts of data and PhyML [41] is one of the most widely used programs; both implement the useful option of user-specified transition matrices. MEGA is also commonly used owing to its user-friendly GUI. Finally, Bayesian phylogenetic estimations are frequently performed with MrBayes [84], which allows for different substitution models along the sequences, and with BEAST [31], which may account for demographics and relaxed rates of evolution. Owing to the flexibility of Bayesian approaches, both programs can also be used to infer species trees under the multispecies coalescent model [82] (see Section 23.2.2). Note that we have only described the most well-established programs. Many more phylogenetic tree reconstruction programs exist, and new ones continuously emerge with the implementation of new capabilities (see Reference [22]).

23.2.2 Inferring Species Trees

Although the explicit differences between species trees and gene trees were already noted some decades ago [38, 72, 94], the species tree reconstruction paradigm has only recently blossomed, becoming one of the hottest topics in phylogenomics [24, 58, 108]. In Figure 23.2, the trees are composed of four species (A, B, C, and D) and four gene copies (A0, B0, C0, and D0). Figure 23.2 also shows the following: (i) a gene duplication (square, GDL) and a gene loss (cross, GDL) in species A; (ii) a gene transfer (arrow, HGT) from C0 that replaces the original gene copy of D0; (iii) a deep coalescence (circle) showing the role of ILS. Dashed lines indicate lost lineages, due either to replacement by a transfer or to a loss.

Table 23.1 An updated list of the most commonly used phylogenetic tree reconstruction programs available to date

Program | Approach | Substitution models | GUI | Source | Reference (last version)
SplitsTree | Distance matrix | – | Yes | http://www.splitstree.org/ | [47]
HyPhy | Distance matrix | Nucleotide (All^a), codon (NG^b), amino acid (Dayhoff, JTT) | Yes | http://hyphy.org/w/index.php/Main_Page | [50]
MEGA | Distance matrix, ML | Nucleotide (All^a), codon (NG^b), amino acid (LG, Dayhoff, JTT, WAG, Mt, Rt) | Yes | http://www.megasoftware.net/ | [95]
Mesquite | Distance matrix, ML | Nucleotide (JC, K2P, F81, F84) | Yes | http://mesquiteproject.org/mesquite/mesquite.html | [64]
Phylip | Distance matrix, ML | Nucleotide (JC, … , HKY), amino acid (JTT, Dayhoff, PAM) | No | http://evolution.genetics.washington.edu/phylip.html | [35]
PhyML | ML | Nucleotide (All^a), amino acid (All^c) | No | http://www.atgc-montpellier.fr/phyml/ | [41]
CodonPhyML | ML | Nucleotide (All^a), codon (GY94, MG94, empirical models), amino acid (All^c) | No | http://sourceforge.net/projects/codonphyml/ | [37]
RAxML | ML | Nucleotide (All^a), amino acid (All^c) | Yes | http://sco.h-its.org/exelixis/web/software/raxml/index.html | [92]
MrBayes | Bayesian | Nucleotide (All^a), amino acid (All^c) | No | http://mrbayes.sourceforge.net/ | [84]
BEAST | Bayesian | Nucleotide (HKY, GTR), amino acid (Blosum62, Dayhoff, JTT, mtRev, cpRev, WAG) | Yes | http://beast.bio.ed.ac.uk/Main_Page | [31]

For each program, we indicate the underlying approach (distance matrix, ML, or Bayesian), the kind of implemented substitution models (nucleotide, codon, or amino acid), the availability of a GUI, the link, and the reference. Details about nucleotide, codon, and amino acid substitution models can be found in References [78, 3, 105], respectively. Although many more software packages exist, here we have selected, from our point of view, the most commonly used and user-friendly programs.
^a All: indicates that all common nucleotide substitution models (those included in jModelTest2 [28]: JC, … , GTR) are implemented.
^b NG: Nei and Gojobori codon model [70].
^c All: indicates that all common amino acid substitution models (those included in ProtTest [1]: Blosum62, … , WAG) are implemented.

Figure 23.2 An illustrative example of a gene tree (thin lines) inside a species tree (gray tree in the background) and the effect of the three main evolutionary processes that can generate discordance between them. Labels: species A, B, C, and D; gene copies A0, B0, C0, and D0. Legend: loss (GDL), duplication (GDL), transfer (HGT), and deep coalescence (ILS).

Despite the complexity of the main approaches to infer species trees, we can roughly classify them into three categories:

(i) The supermatrix approach is based on the concatenation of different gene alignments followed by a global gene tree reconstruction that generates a “supergene” phylogeny. It assumes that the discordance between gene trees is the result of minor discordant phylogenetic signals, which cancel out when all the information is used at once. Under this assumption, the reconstructed supergene phylogeny represents the species tree.

(ii) The supertree approach considers each gene alignment independently. Gene trees are reconstructed independently with any common phylogenetic tree reconstruction method and are subsequently used to reconstruct the species phylogeny. This last step can be performed either by simply minimizing the gene tree discordance (process-oblivious methods) or by considering a model that takes into account at least one evolutionary process (process-aware or model-based methods). Examples of the process-oblivious approach are the consensus-based methods, Bayesian concordance (e.g., BUCKy [53]), MRP [21, 80], and RF supertrees [17, 26]. Model-based methods usually consider the evolutionary processes leading to species tree/gene tree incongruence, that is, HGT, GDL, and ILS, in different combinations. Reconciliation-based methods (e.g., iGTP [25] and SPRSupertrees [98]), distance-based methods (e.g., STAR and STEAC [59]), and pseudo-likelihood-based methods (e.g., STELLS [101] and MP-EST [60]) are examples of process-aware/model-based methods. Recently, our laboratory developed a novel Bayesian method that lies between these two categories (process-oblivious and process-aware/model-based), integrating error-based and reconciliation-based distances [66]. In general, supertree methods are much faster than the supermatrix or fully probabilistic approaches.

(iii) Fully probabilistic models are the most sophisticated and are usually implemented in Bayesian frameworks. They integrate both gene tree and species tree estimation. The computer programs BEST [58] and *BEAST [43] consider the effect of ILS by implementing the multispecies coalescent model, while PHYLDOG [23] considers GDL using a birth–death model. In general, these fully probabilistic approaches are much more comprehensive but also more time consuming than the previous approaches. Another drawback is that they sometimes have convergence problems when tackling data sets with a large number of taxa.

The selection of a species tree reconstruction method to analyze a biological data set is a difficult task, mainly due to the variety of underlying assumptions considered in the different methods. In general, probabilistic models can be recommended beforehand, as long as the model can handle the evolutionary processes present in the data. Nevertheless, there is an important trade-off between computational cost, data set size, and accuracy. Species tree reconstruction is a very active research topic where the emergence of new methods and comprehensive benchmarks will probably lead to more realistic species tree estimations.

In the following sections, we describe the impact of different evolutionary phenomena on phylogenetic tree reconstruction and we provide alternative methodologies, when possible, to account for such phenomena.
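The concatenation step of the supermatrix approach described in (i) can be sketched as follows. This is only an illustration under stated assumptions: the function name `concatenate_alignments`, the dictionary-based alignment representation, and the toy sequences are hypothetical and are not taken from any of the programs cited above. Species lacking a gene are padded with a missing-data symbol, which is one common convention rather than the only possible one.

```python
from typing import Dict, List


def concatenate_alignments(genes: List[Dict[str, str]], missing: str = "-") -> Dict[str, str]:
    """Concatenate per-gene alignments (species -> aligned sequence) into a supermatrix."""
    # Collect every species that appears in at least one gene alignment.
    species = sorted({name for gene in genes for name in gene})
    supermatrix = {name: "" for name in species}
    for gene in genes:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in a gene alignment must have the same length")
        width = lengths.pop()
        for name in species:
            # Species without this gene are padded with the missing-data symbol.
            supermatrix[name] += gene.get(name, missing * width)
    return supermatrix


if __name__ == "__main__":
    # Toy data (hypothetical): two genes sampled in species A-D.
    gene1 = {"A": "ACGT", "B": "ACGA", "C": "ACGG", "D": "ATGG"}
    gene2 = {"A": "TTC", "B": "TTA", "D": "CTA"}  # no copy sampled for species C
    for name, seq in concatenate_alignments([gene1, gene2]).items():
        print(name, seq)
```

The resulting supermatrix would then be passed to any gene tree reconstruction program; the sketch deliberately stops before that step because it depends on the chosen software.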

23.3 INFLUENCE OF SUBSTITUTION MODEL MISSPECIFICATION ON PHYLOGENETIC TREE RECONSTRUCTION

Nucleotide, codon, or amino acid substitution models of evolution are fundamental in phylogenetic inference because the substitution model may affect the estimates of topology, substitution rates, bootstrap values, posterior probabilities, or the molecular clock [56, 111]. In fact, a simulation study by Lemmon and Moriarty [56] pointed out that models simpler than the true model can increase phylogenetic reconstruction biases. As a consequence, researchers tend to apply the most complex models, that is, the GTR substitution model at the nucleotide level, but this overparameterization may increase the variance and noise of the estimates and also generate temporal inconsistencies [93]. Therefore, the researcher should first apply a statistical selection of best-fit substitution models using programs such as jModelTest2 [28] (at the nucleotide level) or ProtTest [1] (at the amino acid level), bearing in mind that different genomic regions can evolve under different substitution models [5, 8, 13]. The selected model can then be used to perform the phylogenetic reconstruction. Even with this precaution, commonly used substitution models can be unrealistic owing to assumptions such as site-independent evolution [99]. Site-dependent substitution models can better mimic the real evolutionary process [15, 39], but they have not yet been implemented in phylogenetic reconstruction methods because of their complexity: computing the likelihood function from site-specific and branch-specific transition matrices is not straightforward, although significant advances have recently been made in this area [16].

23.4 INFLUENCE OF RECOMBINATION ON PHYLOGENETIC TREE RECONSTRUCTION

Genetic recombination, like other processes where genetic material is exchanged, is a fundamental evolutionary process in a variety of organisms, especially in viruses and bacteria [61, 76, 83]. Unfortunately, recombination can seriously bias phylogenetic tree reconstruction. As an example, Schierup and Hein [87] simulated DNA sequence evolution under the coalescent with recombination [44]. They found that ignoring recombination biases the inferred phylogenetic trees toward longer terminal branches, shorter times to the Most Recent Common Ancestor (MRCA), and incorrect topologies for both distance-based and ML methods [87]. In addition, it can lead to loss of the molecular clock, apparent homoplasies, and overestimation of the substitution rate heterogeneity [77, 87, 88]. Real data have also shown these phylogenetic incongruences [33, 100]. Analyses derived from this kind of incorrect phylogeny can also be seriously affected, for example, Ancestral Sequence Reconstruction (ASR) [6, 11] or the estimation of positively selected sites [4, 9, 12].

The evolutionary history of a sample with ancestral recombination events can be represented as a phylogenetic network, usually called an Ancestral Recombination Graph (ARG) [7, 40], rather than as a single phylogenetic tree, because each recombinant fragment (or partition) can have a particular evolutionary history (Figure 23.3). As a consequence, there are two phylogenetic reconstruction methodologies accounting for recombination:

(i) Phylogenetic network reconstruction [68] by using programs such as SplitsTree. An example of this methodology applied to real data is described in Reference [14].

(ii) Phylogenetic tree reconstruction for each recombinant fragment. In the first step, recombination breakpoints are detected [65]. Then, a phylogenetic tree reconstruction can be performed for each recombinant fragment. Examples of this methodology applied to real data are described in References [74, 75].

Both methodologies are valid and the choice depends on the evolutionary question to be answered. A phylogenetic network provides a complete visualization of clades and phylogenetic relationships (see Reference [48]), while methods such as ASR, or the detection of molecular adaptation, often require a tree for each recombinant fragment [4, 11]. In Figure 23.3, the ARG is based on two recombination events with breakpoints at positions 50 and 100, which result in three recombinant fragments (1–50, 51–100, and 101–300). Note that each recombinant fragment may have an evolutionary history that differs from the evolutionary history of the other recombinant fragments. Dashed lines indicate branches derived from recombination events.
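The second methodology, one tree per recombinant fragment, requires partitioning the alignment at the detected breakpoints. A minimal sketch of that partitioning step is given below, assuming that breakpoints are already known and that a breakpoint at position b means that one fragment ends at site b and the next starts at site b + 1 (1-based), as in the fragments 1–50, 51–100, and 101–300 of Figure 23.3. The function name `split_by_breakpoints` and the toy alignment are hypothetical; breakpoint detection itself [65] is not shown.

```python
from typing import Dict, List, Tuple

# A fragment is ((first_site, last_site), sub-alignment), with 1-based inclusive sites.
Fragment = Tuple[Tuple[int, int], Dict[str, str]]


def split_by_breakpoints(alignment: Dict[str, str], breakpoints: List[int]) -> List[Fragment]:
    """Split an alignment (name -> aligned sequence) into recombinant fragments."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in the alignment must have the same length")
    length = lengths.pop()
    cuts = sorted(set(breakpoints))
    if any(b <= 0 or b >= length for b in cuts):
        raise ValueError("breakpoints must fall strictly inside the alignment")
    bounds = [0] + cuts + [length]
    fragments = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Python slices are 0-based and end-exclusive, so [start:end] covers sites start+1..end.
        sub = {name: seq[start:end] for name, seq in alignment.items()}
        fragments.append(((start + 1, end), sub))
    return fragments


if __name__ == "__main__":
    # Toy alignment (hypothetical) of 300 sites with breakpoints at positions 50 and 100.
    toy = {"seq1": "A" * 300, "seq2": "C" * 300}
    for (first, last), sub in split_by_breakpoints(toy, [50, 100]):
        print(f"fragment {first}-{last}: {len(sub['seq1'])} sites")
```

Each returned sub-alignment could then be analyzed separately with any of the gene tree programs in Table 23.1.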

Figure 23.3 An illustrative example of an ARG and its corresponding embedded trees. Panels: ARG; tree for fragment 1−50; tree for fragment 51−100; tree for fragment 101−300.

23.5 INFLUENCE OF DIVERSE EVOLUTIONARY PROCESSES ON SPECIES TREE RECONSTRUCTION

Since most species tree reconstruction methods entail a gene tree reconstruction step, every evolutionary process that is able to mislead gene tree estimation can also, to a greater or lesser extent, affect species tree reconstruction. Simulation studies by Bayzid and Warnow [19] have in fact shown that species tree reconstruction accuracy is highly dependent upon gene tree accuracy. Moreover, these authors also suggested that the most accurate species tree methods owe part of their advantage to estimating gene trees much better than common gene tree estimation methods. However, ILS-based distance methods are not strongly influenced by intralocus recombination, as discussed by Lanier and Knowles [52]. In particular, these authors showed that tree lengths, and therefore the intensity of ILS, are the most important factors determining the accuracy of the reconstruction when no evolutionary processes other than ILS and intralocus recombination are acting. In fact, recombination has to occur in unsorted lineages to affect species tree reconstruction under ILS-based models, and the effects of ILS itself are more misleading than those of recombination. Interestingly, sampling parameters, that is, the number of loci and individuals, may properly control the species tree reconstruction accuracy and reduce the misleading effect of recombination.

Gene flow may generate important biases in species tree inference through different migration models and migration patterns. Regarding topological accuracy, Eckert and Carstens [32] observed that supertree methods are robust to historical gene flow models, in contrast to the supermatrix approach. However, the stepping-stone model and, more importantly, the n-island model strongly misled both species tree reconstruction approaches, especially under high migration rates. In addition, Leaché et al. [55] showed, by using both tree-based and fully probabilistic Bayesian methods, that both the migration model and the migration pattern could bias the species tree inference. For example, these authors showed that gene flow between sister species increases both the probability of estimating the true species tree and the posterior probabilities of the clade where migration events take place. On the other hand, migration between nonsister species can strongly bias species tree reconstruction and even increase the posterior probabilities of wrongly estimated clades. Regarding other species tree reconstruction parameters, gene flow can generate overcompression, that is, underestimation of the branch lengths of the species tree, and dilatation, that is, overestimation of the population size, as a function of the gene flow intensity [55].

Hybridization can also lead to gene tree incongruence and, as a consequence, mislead species tree reconstruction methods. However, there are several species tree methods to detect hybridization [51, 107] and to estimate species trees in the presence of both ILS and hybridization events, that is, species networks (see Reference [106]).

The most recognized processes able to generate gene tree discordance, that is, GDL, HGT, and ILS, have never been jointly modeled by any species tree reconstruction program, and therefore, the effect of unmodeled processes should also be explored for every kind of method. Unfortunately, there is a general lack of studies about the robustness of species tree methodologies to model misspecification, although there are some exceptions. For example, PHYLDOG, a GDL-based probabilistic approach, can accurately reconstruct species trees under moderate levels of ILS, although it overestimates the number of duplications and losses [23]. Concerning the effect of HGT, ILS-based fully probabilistic methods showed a robust behavior when HGT was randomly distributed along the trees, although the accuracy drops when HGT is concentrated in a specific branch of the species tree [27].

Apart from the lack of integrative models for those evolutionary processes, GDL, HGT, and ILS can bias species tree inference even when they are properly modeled. ILS not only generates gene tree incongruence but also misleads species tree reconstruction, even when it is explicitly modeled [19, 54, 63]. This pattern is not completely shared with GDL and HGT methods, which conversely obtain information from the modeled events, that is, duplications, losses, or transfers. Thus, both high and low rates can mislead the estimation [89].

In summary, common evolutionary processes, such as GDL, HGT, and ILS, together with migration and hybridization, can bias species tree inferences, although the performance of the methods differs. Nevertheless, additional simulation studies are still needed to further analyze the effect of these processes on the broad range of species tree reconstruction methods.
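Topological accuracy and gene tree/species tree discordance, as discussed in this section, are often quantified with the Robinson–Foulds (RF) distance, which counts the bipartitions present in one tree but not in the other. The sketch below is an illustrative implementation under simple assumptions: trees are written as nested Python tuples of leaf names and treated as unrooted, both trees share the same leaf set, and the function names are hypothetical. It is not the procedure used in any of the cited studies.

```python
def leaf_set(tree):
    """Return the set of leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the nontrivial bipartitions of a tree, each stored as the side without the anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed reference leaf, so each bipartition has one canonical form
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits (single leaves) and the root, which separates nothing.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Number of bipartitions found in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("both trees must contain the same leaves")
    return len(splits(tree1) ^ splits(tree2))


if __name__ == "__main__":
    species_tree = (("A", "B"), ("C", "D"))     # hypothetical species tree
    concordant_gene = (("A", "B"), ("C", "D"))  # gene tree matching the species tree
    discordant_gene = (("A", "C"), ("B", "D"))  # gene tree with a different topology
    print(robinson_foulds(species_tree, concordant_gene))
    print(robinson_foulds(species_tree, discordant_gene))
```

An RF distance of zero means identical unrooted topologies; larger values indicate more conflicting bipartitions, but the measure does not by itself reveal which process (GDL, HGT, ILS, or gene flow) caused the conflict.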


23.6 INFLUENCE OF HOMOPLASY ON PHYLOGENETIC TREE RECONSTRUCTION: THE GOALS OF PATTERN RECOGNITION

The recurrence of similarity in evolution may generate homoplasy [86], where sequences are similar but are not derived from a common ancestor. Homoplasy can be generated by a variety of evolutionary processes such as recurrent mutation, gene flow, or recombination. Phylogenetic tree reconstruction methods assume a common ancestor and, as a consequence, homoplasy can lead to phylogenetic uncertainty [91]. It is therefore fundamental to search for homology in the data before building a phylogenetic tree, for example, by using computer tools such as BLAST [2]. This consideration is related to the use of pattern recognition techniques [96] to reconstruct phylogenetic histories through cluster analysis [90]. In particular, a phylogeny can be reconstructed by following a particular pattern that can be recognized from the data [97, 109]. For example, one could directly apply a clustering algorithm and build a distance-based phylogenetic tree [97], or specify a particular pattern, for example, genes, domains, or sites, and then reconstruct a tree based on such a pattern [81, 113]. Pattern recognition can also be useful to study the influence of different phenomena on phylogenetic inferences. For example, phylogenetic histories can be reconstructed for different local patterns collected from the same raw data, such as genome-wide data, so that a phylogeny is identified for each pattern [110].

23.7 CONCLUDING REMARKS

In recent years, phylogenetic reconstruction frameworks have experienced a sharp increase in new methods and technical advances, for example, parallelization on multiprocessors, GUI environments, or specific tools for specific analyses. However, the influence of the nature of the data on phylogenetic tree reconstruction is sometimes forgotten. For example, phylogenies are sometimes reconstructed under the GTR substitution model without further statistical analysis (see the article by Sumner et al. [93]), or recombination is not considered when inferring phylogenetic trees [30]. Here, we suggest that the consideration of evolutionary phenomena, for example, through the alternative methodologies described above, is fundamental for successfully building phylogenies.

In our opinion, future phylogenetic frameworks should implement new models (e.g., site-dependent substitution models) and technical advances, but should also “internally” evaluate the data before performing a phylogenetic reconstruction, for example, as we suggest in the following. First, a homology search through pattern recognition methods would provide important information about the ancestral history, that is, homology [91]. Note that a phylogenetic tree should not be inferred when homology is lacking [36]. Second, a substitution model selection step could be used to choose the substitution model that best fits the data; the fit of the selected model could then be evaluated against a given threshold, for example, a likelihood score, below which the tree should not be computed (e.g., for data lacking evolutionary information). Third, recombination breakpoints could also be internally detected [65] and, in the presence of recombination, a tree should be estimated for each recombinant fragment. In the case of species tree reconstruction, an extra gene tree discordance analysis could be used to select a proper species tree method to perform the final estimation step.

Altogether, we believe that a tree should not be provided by a phylogenetic framework if the evolutionary history of the data cannot actually be explained or supported by such a tree from a biological perspective. This strategy (i.e., a framework that analyzes the data before computing a phylogeny) would probably increase computing times, but it could help to avoid incorrect phylogenies. In summary, one should bear in mind that several evolutionary processes might impact the reconstructed phylogenies, and we consequently recommend being cautious and cross-checking the inferences.
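The model selection step suggested above can be illustrated with the Akaike Information Criterion, AIC = 2k − 2 ln L, which is one of the criteria typically offered by model selection programs. The sketch below is only an illustration: the function names are hypothetical, the log-likelihood values are invented toy numbers rather than results from real data, and k counts only the free substitution parameters of each model (0 for JC, 4 for HKY, 8 for GTR), on the assumption that the remaining parameters are shared across the compared models.

```python
from typing import Dict, List, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: lower values indicate a better fit/complexity balance."""
    return 2.0 * n_params - 2.0 * log_likelihood


def rank_models(fits: Dict[str, Tuple[float, int]]) -> List[Tuple[str, float, float]]:
    """Rank models by AIC; returns (name, AIC, difference to the best AIC), best first."""
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores.values())
    ranked = [(name, score, score - best) for name, score in scores.items()]
    return sorted(ranked, key=lambda row: row[1])


if __name__ == "__main__":
    # Invented log-likelihoods for illustration only: model -> (lnL, free parameters).
    toy_fits = {
        "JC": (-2500.0, 0),
        "HKY": (-2450.0, 4),
        "GTR": (-2448.5, 8),
    }
    for name, score, delta in rank_models(toy_fits):
        print(f"{name}: AIC = {score:.1f}, delta = {delta:.1f}")
```

In this toy case the most parameter-rich model has the highest likelihood but not the lowest AIC, which mirrors the point made in Section 23.3 that the most complex model is not automatically the best choice.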

ACKNOWLEDGMENTS

This study was supported by the Spanish Government through the “Juan de la Cierva” fellowship “JCI-2011-10452” to MA and through the “FPI” fellowship “BES-2010-031014” at the University of Vigo to DM. MA also wants to acknowledge the EMBO fellowship “ASTF 367-2013” and the Portuguese Government with the FCT Starting Grant IF/00955/2014. We thank Leonardo de Oliveira Martins for helpful comments about some methodological aspects of species trees. We thank an anonymous reviewer for detailed comments. We also want to thank the Editors for the invitation to contribute this chapter.

REFERENCES

1. Abascal F, Zardoya R, Posada D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 2005;21:2104–2105.
2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.
3. Anisimova M, Kosiol C. Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 2009;26:255–271.
4. Anisimova M, Nielsen R, Yang Z. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 2003;164:1229–1236.
5. Arbiza L, Patricio M, Dopazo H, Posada D. Genome-wide heterogeneity of nucleotide substitution model fit. Genome Biol Evol 2011;3:896–908.
6. Arenas M. Computer programs and methodologies for the simulation of DNA sequence data with recombination. Front Genet 2013a;4:9.
7. Arenas M. The importance and application of the ancestral recombination graph. Front Genet 2013b;4:206.


8. Arenas M. Advances in computer simulation of genome evolution: toward more realistic evolutionary genomics analysis by approximate Bayesian computation. J Mol Evol 2015;80:189–192.
9. Arenas M, Posada D. Coalescent simulation of intracodon recombination. Genetics 2010;184:429–437.
10. Arenas M, Posada D. Computational design of centralized HIV-1 genes. Curr HIV Res 2010;8:613–621.
11. Arenas M, Posada D. The effect of recombination on the reconstruction of ancestral sequences. Genetics 2010;184:1133–1139.
12. Arenas M, Posada D. The influence of recombination on the estimation of selection from coding sequence alignments. In: Fares MA, editor. Natural Selection: Methods and Applications. Boca Raton: CRC Press/Taylor & Francis; 2014.
13. Arenas M, Posada D. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 2014;31:1295–1301.
14. Arenas M, Patricio M, Posada D, Valiente G. Characterization of phylogenetic networks with NetTest. BMC Bioinform 2010;11:268.
15. Arenas M, Dos Santos HG, Posada D, Bastolla U. Protein evolution along phylogenetic histories under structurally constrained substitution models. Bioinformatics 2013;29:3020–3028.
16. Arenas M, Sanchez-Cobos A, Bastolla U. Maximum likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol 2015;32:2195–2207.
17. Bansal MS, Burleigh JG, Eulenstein O, Fernandez-Baca D. Robinson–Foulds supertrees. Algorithms Mol Biol 2010;5:18.
18. Barraclough TG, Nee S. Phylogenetics and speciation. Trends Ecol Evol 2001;16:391–399.
19. Bayzid MS, Warnow T. Naive binning improves phylogenomic analyses. Bioinformatics 2013;29:2277–2284.
20. Benguigui M, Arenas M. Spatial and temporal simulation of human evolution. Methods, frameworks and applications. Curr Genom 2014;15:245–255.
21. Bininda-Emonds OR. The evolution of supertrees. Trends Ecol Evol 2004;19:315–322.
22. Blair C, Murphy RW. Recent trends in molecular phylogenetic analysis: where to next? J Hered 2011;102:130–138.
23. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res 2012;23:323–330.
24. Capella-Gutierrez S, Kauff F, Gabaldon T. A phylogenomics approach for selecting robust sets of phylogenetic markers. Nucleic Acids Res 2014;42:e54.
25. Chaudhary R, Bansal MS, Wehe A, Fernandez-Baca D, Eulenstein O. iGTP: a software package for large-scale gene tree parsimony analysis. BMC Bioinform 2010;11:574.
26. Chaudhary R, Burleigh JG, Fernandez-Baca D. Inferring species trees from incongruent multi-copy gene trees using the Robinson–Foulds distance. Algorithms Mol Biol 2013;8:28.
27. Chung Y, Ane C. Comparing two Bayesian methods for gene tree/species tree reconstruction: simulations with incomplete lineage sorting and horizontal gene transfer. Syst Biol 2011;60:261–275.

28. Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 2012;9:772.
29. Delsuc F, Brinkmann H, Philippe H. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 2005;6:361–375.
30. Doria-Rose NA, Learn GH, Rodrigo AG, Nickle DC, Li F, Mahalanabis M, Hensel MT, McLaughlin S, Edmonson PF, Montefiori D, Barnett SW, Haigwood NL, Mullins JI. Human immunodeficiency virus type 1 subtype B ancestral envelope protein is functional and elicits neutralizing antibodies in rabbits similar to those elicited by a circulating subtype B envelope. J Virol 2005;79:11214–11224.
31. Drummond AJ, Suchard MA, Xie D, Rambaut A. Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 2012;29:1969–1973.
32. Eckert AJ, Carstens BC. Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow. Mol Phylogenet Evol 2008;49:832–842.
33. Feil EJ, Holmes EC, Bessen DE, Chan M-S, Day NPJ, Enright MC, Goldstein R, Hood DW, Kalia A, Moore CE, Zhou J, Spratt BG. Recombination within natural populations of pathogenic bacteria: Short-term empirical estimates and long-term phylogenetic consequences. Proc Natl Acad Sci USA 2001;98:182–187.
34. Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol 1981;17:368–376.
35. Felsenstein J. PHYLIP (Phylogeny Inference Package). 3.5c ed. Seattle: Department of Genetics, University of Washington; 1993.
36. Felsenstein J. Inferring Phylogenies. Sunderland, MA: Sinauer Associates; 2004.
37. Gil M, Zanetti MS, Zoller S, Anisimova M. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models. Mol Biol Evol 2013;30:1270–1280.
38. Goodman M, Czelusniak J, Moore G, Romero-Herrera A, Matsuda G. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst Zool 1979;28:132–163.
39. Grahnen JA, Nandakumar P, Kubelka J, Liberles DA. Biophysical and structural considerations for protein sequence evolution. BMC Evol Biol 2011;11:361.
40. Griffiths RC, Marjoram P. An ancestral recombination graph. In: Donnelly P, Tavaré S, editors. Progress in Population Genetics and Human Evolution. Berlin: Springer-Verlag; 1997.
41. Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 2010;59:307–321.
42. Hafner MS, Nadler SA. Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988;332:258–259.
43. Heled J, Bryant D, Drummond AJ. Simulating gene trees under the multispecies coalescent and time-dependent migration. BMC Evol Biol 2013;13:44.
44. Hudson RR. Properties of a neutral allele model with intragenic recombination. Theor Popul Biol 1983;23:183–201.
45. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogeny. Bioinformatics 2001;17:754–755.
46. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Bayesian inference of phylogeny and its impact on evolutionary biology. Science 2001;294:2310–2314.

47. Huson DH. SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 1998;14:68–73.
48. Huson DH, Bryant D. Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 2006;23:254–267.
49. Kosakovsky Pond SL, Frost SD. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol 2005;22:1208–1222.
50. Kosakovsky Pond SL, Frost SD, Muse SV. HYPHY: hypothesis testing using phylogenies. Bioinformatics 2005;21:676–679.
51. Kubatko LS. Identifying hybridization events in the presence of coalescence via model selection. Syst Biol 2009;58:478–488.
52. Lanier HC, Knowles LL. Is recombination a problem for species-tree analyses? Syst Biol 2012;61:691–701.
53. Larget BR, Kotha SK, Dewey CN, Ane C. BUCKy: gene tree/species tree reconciliation with Bayesian concordance analysis. Bioinformatics 2010;26:2910–2911.
54. Leaché AD, Rannala B. The accuracy of species tree estimation under simulation: a comparison of methods. Syst Biol 2011;60:126–137.
55. Leaché AD, Harris RB, Rannala B, Yang Z. The influence of gene flow on species tree estimation: a simulation study. Syst Biol 2014;63:17–30.
56. Lemmon AR, Moriarty EC. The importance of proper model assumption in Bayesian phylogenetics. Syst Biol 2004;53:265–277.
57. Lijavetzky D, Carbonero P, Vicente-Carbajosa J. Genome-wide comparative phylogenetic analysis of the rice and Arabidopsis Dof gene families. BMC Evol Biol 2003;3:17.
58. Liu L. BEST: Bayesian estimation of species trees under the coalescent model. Bioinformatics 2008;24:2542–2543.
59. Liu L, Yu L, Kubatko L, Pearl DK, Edwards SV. Coalescent methods for estimating phylogenetic trees. Mol Phylogenet Evol 2009;53:320–328.
60. Liu L, Yu L, Edwards SV. A maximum pseudo-likelihood approach for estimating species trees under the coalescent model. BMC Evol Biol 2010;10:302.
61. Lopes JS, Arenas M, Posada D, Beaumont MA. Coestimation of recombination, substitution and molecular adaptation rates by approximate Bayesian computation. Heredity 2014;112:255–264.
62. Maddison W. Gene trees in species trees. Syst Biol 1997;46:523–536.
63. Maddison WP, Knowles LL. Inferring phylogeny despite incomplete lineage sorting. Syst Biol 2006;55:21–30.
64. Maddison WP, Maddison DR. Mesquite: a modular system for evolutionary analysis. Version 2.73. 2010. http://mesquiteproject.org.
65. Martin DP, Lemey P, Posada D. Analysing recombination in nucleotide sequences. Mol Ecol Resour 2011;11:943–955.
66. Martins LD, Mallo D, Posada D. A Bayesian Supertree Model for Genome-Wide Species Tree Reconstruction. Syst Biol 2014. doi: 10.1093/sysbio/syu082.
67. Mona S, Ray N, Arenas M, Excoffier L. Genetic consequences of habitat fragmentation during a range expansion. Heredity 2014;112:291–299.
68. Morrison DA. Networks in phylogenetic analysis: new tools for population biology. Int J Parasitol 2005;35:567–582.

69. Nakhleh L. Computational approaches to species phylogeny inference and gene tree reconciliation. Trends Ecol Evol 2013;28:719–728.
70. Nei M, Gojobori T. Simple method for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 1986;3:418–426.
71. Ohno S. Evolution by Gene Duplication. Berlin: Springer-Verlag; 1970.
72. Pamilo P, Nei M. Relationships between gene trees and species trees. Mol Biol Evol 1988;5:568–583.
73. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999;96:4285–4288.
74. Perez-Losada M, Posada D, Arenas M, Jobes DV, Sinangil F, Berman PW, Crandall KA. Ethnic differences in the adaptation rate of HIV gp120 from a vaccine trial. Retrovirology 2009;6:67.
75. Perez-Losada M, Jobes DV, Sinangil F, Crandall KA, Arenas M, Posada D, Berman PW. Phylodynamics of HIV-1 from a phase III AIDS vaccine trial in Bangkok, Thailand. PLoS One 2011;6:e16902.
76. Perez-Losada M, Arenas M, Galan JC, Palero F, Gonzalez-Candelas F. Recombination in viruses: Mechanisms, methods of study, and evolutionary consequences. Infect Genet Evol 2015;30C:296–307.
77. Posada D. Unveiling the molecular clock in the presence of recombination. Mol Biol Evol 2001;18:1976–1978.
78. Posada D. Selecting models of evolution. In: Vandamme A, Salemi M, editors. The Phylogenetic Handbook. Cambridge, UK: Cambridge University Press; 2003.
79. Postlethwait JH, Woods IG, Ngo-Hazelett P, Yan YL, Kelly PD, Chu F, Huang H, Hill-Force A, Talbot WS. Zebrafish comparative genomics and the origins of vertebrate chromosomes. Genome Res 2000;10:1890–1902.
80. Ragan MA. Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1992;1:53–58.
81. Rajendran KV, Zhang J, Liu S, Peatman E, Kucuktas H, Wang X, Liu H, Wood T, Terhune J, Liu Z. Pathogen recognition receptors in channel catfish: II. Identification, phylogeny and expression of retinoic acid-inducible gene I (RIG-I)-like receptors (RLRs). Dev Comp Immunol 2012;37:381–389.
82. Rannala B, Yang Z. Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 2003;164:1645–1656.
83. Robertson DL, Sharp PM, McCutchan FE, Hahn BH. Recombination in HIV-1. Nature 1995;374:124–126.
84. Ronquist F, Teslenko M, Van Der Mark P, Ayres DL, Darling A, Hohna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 2012;61:539–542.
85. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987;4:406–425.
86. Sanderson M, Hufford L. Homoplasy: The Recurrence of Similarity in Evolution. New York: Academic Press; 1996.
87. Schierup MH, Hein J. Consequences of recombination on traditional phylogenetic analysis. Genetics 2000;156:879–891.

88. Schierup MH, Hein J. Recombination and the molecular clock. Mol Biol Evol 2000;17:1578–1579.
89. Sennblad B, Lagergren J. Probabilistic orthology analysis. Syst Biol 2009;58:411–424.
90. Sharaf MA, Kowalski BR, Weinstein B. Construction of phylogenetic trees by pattern recognition procedures. Z Naturforsch C 1980;35:508–513.
91. Smouse P. To tree or not to tree. Mol Ecol 1998;7:399–412.
92. Stamatakis A, Aberer AJ, Goll C, Smith SA, Berger SA, Izquierdo-Carrasco F. RAxML-Light: a tool for computing terabyte phylogenies. Bioinformatics 2012;28:2064–2066.
93. Sumner JG, Jarvis PD, Fernandez-Sanchez J, Kaine BT, Woodhams MD, Holland BR. Is the general time-reversible model bad for molecular phylogenetics? Syst Biol 2012;61:1069–1074.
94. Takahata N. Gene genealogy in three related populations: consistency probability between gene and population trees. Genetics 1989;122:957–966.
95. Tamura K, Stecher G, Peterson D, Filipski A, Kumar S. MEGA6: molecular evolutionary genetics analysis version 6.0. Mol Biol Evol 2013;30:2725–2729.
96. Vogt NB, Knutsen H. SIMCA pattern recognition classification of five infauna taxonomic groups using non-polar compounds analysed by high resolution gas chromatography. Marine Ecol 1985;26:145–156.
97. Wei K. Stratophenetic tracing of phylogeny using SIMCA pattern recognition technique: a case study of the late Neogene planktic foraminifera Globoconella clade. Paleobiology 1994;20:52–65.
98. Whidden C, Zeh N, Beiko RG. Supertrees Based on the Subtree Prune-and-Regraft Distance. Syst Biol 2014;63:566–581.
99. Wilke CO. Bringing molecules back into molecular evolution. PLoS Comput Biol 2012;8:e1002572.
100. Worobey M, Holmes EC. Evolutionary aspects of recombination in RNA viruses. J Gen Virol 1999;80:2535–2543.
101. Wu Y. Coalescent-based species tree inference from gene tree topologies under incomplete lineage sorting by maximum likelihood. Evolution 2012;66:763–775.
102. Yang Z. The power of phylogenetic comparison in revealing protein function. Proc Natl Acad Sci USA 2005;102:3179–3180.
103. Yang Z. Computational Molecular Evolution. Oxford, England: Oxford University Press; 2006.
104. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 2007;24:1586–1591.
105. Yang Z, Nielsen R, Hasegawa M. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol 1998;15:1600–1611.
106. Yu Y, Nakhleh L. Fast algorithms for reconciliation under hybridization and incomplete lineage sorting. arXiv:1212.1909; 2012.
107. Yu Y, Than C, Degnan JH, Nakhleh L. Coalescent histories on phylogenetic networks and detection of hybridization despite incomplete lineage sorting. Syst Biol 2011;60:138–149.
108. Yu Y, Ristic N, Nakhleh L. Fast algorithms and heuristics for phylogenomics under ILS and hybridization. BMC Bioinform 2013;14(Suppl 15):S6.

109. Zahid MAH, Mittal A, Joshi RC. A pattern recognition-based approach for phylogenetic network construction with constrained recombination. Pattern Recogn 2006;39:2312–2322.
110. Zamani N, Russell P, Lantz H, Hoeppner MP, Meadows JR, Vijay N, Mauceli E, Di Palma F, Lindblad-Toh K, Jern P, Grabherr MG. Unsupervised genome-wide recognition of local relationship patterns. BMC Genomics 2013;14:347.
111. Zhang J. Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol Biol Evol 1999;16:868–875.
112. Zhang J. Evolution by gene duplication: an update. Trends Ecol Evol 2003;18:292–298.
113. Zhang J, Liu S, Rajendran KV, Sun L, Zhang Y, Sun F, Kucuktas H, Liu H, Liu Z. Pathogen recognition receptors in channel catfish: III phylogeny and expression analysis of Toll-like receptors. Dev Comp Immunol 2013;40:185–194.
