Coestimation of recombination, substitution and ...

Viewer
Transcript

Heredity (2014) 112, 255–264 & 2014 Macmillan Publishers Limited All rights reserved 0018-067X/14 www.nature.com/hdy

ORIGINAL ARTICLE

Coestimation of recombination, substitution and molecular adaptation rates by approximate Bayesian computation JS Lopes1,2,6, M Arenas3,4,6, D Posada3 and MA Beaumont1,5 The estimation of parameters in molecular evolution may be biased when some processes are not considered. For example, the estimation of selection at the molecular level using codon-substitution models can have an upward bias when recombination is ignored. Here we address the joint estimation of recombination, molecular adaptation and substitution rates from coding sequences using approximate Bayesian computation (ABC). We describe the implementation of a regression-based strategy for choosing subsets of summary statistics for coding data, and show that this approach can accurately infer recombination allowing for intracodon recombination breakpoints, molecular adaptation and codon substitution rates. We demonstrate that our ABC approach can outperform other analytical methods under a variety of evolutionary scenarios. We also show that although the choice of the codon-substitution model is important, our inferences are robust to a moderate degree of model misspecification. In addition, we demonstrate that our approach can accurately choose the evolutionary model that best fits the data, providing an alternative for when the use of full-likelihood methods is impracticable. Finally, we applied our ABC method to co-estimate recombination, substitution and molecular adaptation rates from 24 published human immunodeficiency virus 1 coding data sets. Heredity (2014) 112, 255–264; doi:10.1038/hdy.2013.101; published online 23 October 2013 Keywords: ABC; recombination; molecular adaptation; mutation; dN/dS; HIV-1

INTRODUCTION In nature, different evolutionary processes, such as recombination, substitution and selection, shape the genetic diversity of a population as generations go by. The traces left behind in the gene pool result from the joint action of all the evolutionary processes together, and, ideally, one would want to estimate these parameters simultaneously to avoid potential biases. For instance, ignoring the presence of recombination can bias the estimation of gene trees (Posada and Crandall, 2002) and derived inferences like growth rates (Schierup and Hein, 2000) or molecular adaptation (Anisimova et al., 2003; Scheffler et al., 2006; Arenas and Posada, 2010). As far as we know, only Wilson and McVean (2006) and Wilson et al. (2009) have proposed methods for the co-estimation of recombination and adaptation rates from coding sequences. The former used an approximation to the coalescent with recombination based on the product of approximating conditionals likelihood and the copying model of Li and Stephens (2003), which was implemented in the package OmegaMap using a Markov chain Monte Carlo framework. The latter introduced an approximate Bayesian computation (ABC) method to jointly infer recombination, adaptation and substitution rates from a particular data set of Campylobacter jejuni, but did not explore further the statistical method. Noticeably, both methods assume that recombination cannot occur within codons, which has been shown to bias the estimation of adaptation rates at

individual sites (Anisimova et al., 2003; Arenas and Posada, 2010). On the other hand, the effect of different selective regimes on estimates of the recombination rate has not been evaluated in detail. Additionally, an accurate choice of the underlining substitution models of evolution also has a significant role in molecular evolution studies (see Sullivan and Joyce, 2005 and references therein). Traditionally, model selection methods require the calculation of likelihood functions; however, for complex models these calculations may become prohibitive. In these cases, the ABC framework can provide a feasible way to choose between complex models (Beaumont, 2008; Cornuet et al., 2008). In this study we have developed an ABC method for inferring jointly rates of recombination, molecular adaptation and substitution from coding sequences, while accounting for intracodon recombination. We have investigated a large number of summary statistics that are potentially informative about these parameters, and used the mean square error to choose a subset for efficient computation. We have also tested the use of a new ABC approach for model choice and assessed its robustness to model misspecification. In addition, we have compared the performance of our ABC method with that of OmegaMap, PAML (Yang, 2007) and LAMARC (Kuhner, 2006). Finally, we applied our ABC method to analyze 24 published human immunodeficiency virus 1 (HIV-1) data sets.

1School of Animal and Microbial Sciences, University of Reading, Whiteknights, UK; 2Instituto Gulbenkian de Ciencia, Oeiras, Portugal; 3Departamento de Bioquı´mica, Gene´tica e Inmunologı´a, Universidad de Vigo, Vigo, Spain; 4Centre for Molecular Biology ‘Severo Ochoa’, Consejo Superior de Investigaciones Cientı´ficas (CSIC), Madrid, Spain and 5School of Mathematical Sciences and School of Biological Sciences, University of Bristol, University Walk, Bristol, UK 6These authors contributed equally to this work. Correspondence: Dr JS Lopes, Instituto Gulbenkian de Ciencia, Apartado 14, 2781-901 Oeiras, Portugal. E-mail: [email protected] Received 18 April 2013; revised 22 August 2013; accepted 17 September 2013; published online 23 October 2013

Coestimation of recombination and selection using ABC JS Lopes et al 256

MATERIALS AND METHODS ABC approach based on rejection/regression algorithm The aim of the ABC approach is to summarize typically high-dimensional data by a vector of summary statistics S and then obtain an estimate of the posterior distribution of the parameter vector X in the model of interest: PðX j SÞ ¼

PðS j XÞPðXÞ PðSÞ

ð1Þ

where P(X) is the prior distribution, P(S|X) is the likelihood function and P(S) R is the marginal likelihood, PðSÞ ¼ PðS j XÞPðXÞdX. In practice the likelihood function cannot easily be evaluated and the posterior distribution is estimated from simulations. Recently there has been a great increase in the application of ABC methods for problems in population genetics, and this has largely been driven by the availability of useful software packages (for example, ABCtoolbox package (Wegmann et al., 2009), ABC package (Csillery et al., 2012) or DIY-ABC (Cornuet et al., 2008)). These authors have developed a number of important enhancements to the original rejection and regression algorithms (reviewed in Beaumont (2010)). However, a recent study by Blum et al. (2013) has demonstrated that the original regression approach of Beaumont et al. (2002) typically performs well when compared with more recent developments, and currently there seems to be no ‘best’ method across all scenarios. In particular regression adjustment without any selection of summary statistics tends to outperform methods based on partial least squares. In the present study we use a rejection algorithm (Pritchard et al., 1999) and a regression algorithm (Beaumont et al., 2002). Although our method does use a technique for finding an ‘optimal’ subset of summary statistics, we demonstrate that this is motivated by its improved performance (as judged by mean square error) over regression adjustment based on all the summary statistics. In outline, for the rejection algorithm, trial values Xi are simulated from the prior P(X); data are simulated by running the model with Xi; these are summarized by Si and the real data are summarized by S ; for some distance metric d( ), the Xi for which d (S, S )od are accepted. These are then regarded as drawn from the posterior distribution. In the regression algorithm the points that are retained by the rejection algorithm are then adjusted as follows. A weighted linear regression—weighted as a function of d( )—gives an estimate of the posterior mean X^ as a linear function of S. On the assumption that only the posterior mean changes with S, and no higher ^ þ XðS ^ Þ should yield samples drawn moments, then the adjustment Xi XðSÞ approximately from the target posterior distribution. Explicit algorithms for these methods are given in Beaumont (2010), but see also Blum et al. (2013). Because our aim is to estimate recombination, adaptation and codon substitution rates from coding sequence alignments, the summary statistics to be used in ABC were chosen according to their correlation with these parameters.

Substitution models and prior distribution considered The simulated data were obtained using the backwards-in-time coalescentbased program NetRecodon (Arenas and Posada, 2010), which allows recombination breakpoints to occur between and within codons and permits the simulation sequences under a wide range of codon-substitution models and demographic scenarios. Notice that under GY94 substitution codon models, only one nucleotide may change per codon substitution event and, therefore, these models can be crossed with nucleotide models. On the other hand, note that recombination only occurs at the nucleotide level (Arenas and Posada, 2010). Here we simulated data under four codon-substitution models with increasing complexity levels: Model A: GY94 K80, where the transition over transversion ratio rate is assumed to be 2.0 (a common value in real data), and the codon frequencies are equally distributed. Model B: GY94 GTR þ Gnuc with parameter values typical of HIV-1 data sets (following Carvajal-Rodriguez et al. (2006)), particularly, using codon frequencies calculated from the nucleotide frequencies fA ¼ 0.35, fC ¼ 0.17, fG ¼ 0.23 and fT ¼ 0.25, transition rates RA–C ¼ 3.00, RA–G ¼ 5.00, RA–T ¼ 0.90, RC–G ¼ 1.30, RC–T ¼ 5.30 and RG–T ¼ 1.00, and alpha shape of the gamma distribution of 0.70. Heredity

Model C: GY94 GTR þ Gnuc þ Gcodon: model B with an additional gamma distribution for the non-synonymous/synonymous rate ratio (o) variation per codon. The alpha shape for the latter distribution was obtained from the work of Yang et al. (2000) on HIV-1 (M5 codon model, a ¼ 0.557). Model D: GY94 GTR3 þ Gnuc þ Gcodon: model C with different GTR substitution rates for each position of the codon calculated from the HIV-1 data set published by Van Rij et al. (2003), and using codon frequencies calculated from the nucleotide frequencies for each position in the codon fA1 ¼ 0.32, fC1 ¼ 0.23, fG1 ¼ 0.31, fT1 ¼ 0.15, fA2 ¼ 0.42, fC2 ¼ 0.18, fG2 ¼ 0.17, fT2 ¼ 0.23, fA3 ¼ 0.50, fC3 ¼ 0.14, fG3 ¼ 0.16 and fT3 ¼ 0.20, changing rates R(A C)1 ¼ 2.861, R(A G)1 ¼ 4.66, R(A T)1 ¼ 0.27, R(C G)1 ¼ 0.11, R(C T)1 ¼ 4.52, R(G T)1 ¼ 1.00, R(A C)2 ¼ 195.00, R(A G)2 ¼ 4186.73, R(A T)2 ¼ 583.60, R(C G)2 ¼ 287.81, R(C T)2 ¼ 3937.27 and R(G T)2 ¼ 1.00, R(A C)3 ¼ 1.44, R(A G)3 ¼ 6.45, R(A T)3 ¼ 1.20, R(C G)3 ¼ 1.13, R(C T)3 ¼ 9.54 and R(G T)3 ¼ 1.00, and alpha shape of the gamma distribution for rate variation among nucleotide sites of 0.028. Here GY94 K80 refers to the GY94 codon model (Goldman and Yang, 1994) crossed with the K80 nucleotide model and GY94 GTR þ G refers to that same codon model crossed with the GTR nucleotide model under a gamma distribution for rate variation among nucleotide sites ( þ Gnuc), or for o variation among codons ( þ Gcodon). The parameters of interest throughout the manuscript were: the scaled recombination rate r ¼ 4Nrl, where N is the effective (diploid) population size, r is the recombination rate per nucleotide and l is the number of nucleotides in the sequence; the non-synonymous/synonymous rate ratio o; and the scaled codon substitution rate y ¼ 4NmL, where m is the substitution rate per codon and L is the number of codons in the sequence. Data were simulated using values of these parameters sampled from different prior distributions. The scaled recombination rate r was sampled from a uniform prior distribution between 0 and 50 (note that r ¼ 50 corresponds to a very high recombination rate with 141 expected recombinant events in the neutral case when considering 15 sequences). In nature, the selective pressure in codon sequences is thought to be mainly purifying (for example, Hughes et al., 2003). Therefore, for o we choose a lognormal prior distribution with a location parameter of log10 of 0.3 and scale parameter of 0.4. In this way we obtain a prior distribution with only positive values, with a median 0.3 and with small probability of being higher than 1, which we believe represents well our prior beliefs about the distribution of this parameter in real data. The scaled codon substitution rate y was simulated under a uniform prior distribution between 50 and 300 (nucleotide diversity of between 5 and 20%). For each of the four scenarios, and using the priors described above, we generated 50 000 samples of n ¼ 15 sequences with L ¼ 333 codons and an effective population size N ¼ 1000. The four simulated data sets were used as reference tables (that is, very large tables of simulations to be reused many times in ABC (Cornuet et al., 2008)). Finally, we generated 1000 data sets for each scenario with values of r, o and y ranging from the entire extent of the prior, to be used as test data sets (that is, simulated real data for which the true values of the parameters are known).

Definition and choice of summary statistics We defined an initial pool of 26 summary statistics that were applied at the nucleotide, codon and amino-acid levels (Table 1). Three of the summary statistics consisted of fast statistical tests for detecting recombination: the maximum chi-squared method (w2, Smith, 1992), the neighbor similarity scope (NSS, Jakobsen and Easteal, 1996) and the pairwise homoplasy index (PHI, Bruen et al., 2006), available in the PhiTest software (Bruen et al., 2006). Other chosen summary statistics were the mean, standard deviation (s.d.), skewness (sk) and kurtosis (ku) of the nucleotide diversity (p), number of segregating sites (S) and heterozygosity (H): ! m X ni 2 1 Hi ¼ fj ð2Þ ni 1 j¼1 where ni is the number of total states at site i, m is the number of possible states and fj is the frequency of state j. Finally, we introduced a series of summary statistics, similar to the o ratio but faster to calculate, that consider information at the codon (cd) and amino-acid (aa) levels jointly (non-

Coestimation of recombination and selection using ABC JS Lopes et al 257

Table 1 Description of the 26 summary statistics used in this work Amino-acid level a

Codon level a

Nucleotide level a

Mean of p (aapav)

Mean of p (cdpav)

PHI

s.d. of p (aapsd) sk of p (aapsk)

s.d. of p (cdpsd) sk of p (cdpsk)

NSS w2

ku of p (aapku) Mean of H (aaHav)

ku of p (cdpku) Mean H (cdHav)

S (nucS)

s.d. of H (aaHsd) sk of H (aaHsk)

s.d. of H (cdHsd) sk of H (cdHsk)

ku of H (aaHku) S (aaS)

ku of H (cdHku) S (cdS)

Cp ¼

aapsk/(cdpsk–aapsk) (NSRsk)

statistics abbreviations are shown in parentheses.

synonymous ratio, NSR): p of aa p of cd p of aa

ð5Þ

Optimal tolerance and number of simulations

aapku/(cdpku–aapku) (NSRku)

NSR ¼

SSEP N þ 2p Z2

where P is the number of predictors, SSEP is the error sum of squares for the model with p predictors, Z2 is the residual mean square after regression on the complete set of predictors and N is the sample size. This procedure provides a ‘best’ set of summary statistics for each parameter considered (recombination, substitution rate and molecular adaptation). We then took the union of the ‘best’ set of summary statistics of each parameter in order to have one single set that enables us to infer parameters jointly and compare different models.

aapav/(cdpav–aapav) (NSRav) aaps.d./(cdpsd–aaps.d.) (NSRs.d.)

aSummary

between every parameter and different sets of summary statistics with the leaps package (Lumley and Miller, 2004) in R. This way we tested all possible combinations of groups of independent variables (the summary statistics) with respect to a single dependent variable (the parameters) while accounting for ‘over-fit’ of the regression model using the Mallow’s Cp index:

ð3Þ

the use of disequilibrium measures such as pairwise D, D0 or r2 was not considered because they require the comparison of all sites against each other, which would imply heavy computational costs. The motivation behind using summary statistics is to reduce the dimensionality of the data (D). Ideally one wants to use a set of statistics S ¼ (S1, S2, y, Sn) such that the posterior distribution P(X|S) is similar to the posterior distribution P(X|D). The key consideration is to choose enough summary statistics so that S is close to sufficient for the parameters of the model, while taking into account the increasing inaccuracy of estimation with the number of summary statistics (the so-called curse of dimensionality). The choice of the most suitable summary statistics is still problematic for ABC (Beaumont, 2008). Several attempts to provide a general methodology have been proposed, but there is as yet little consensus on the best approach (Blum et al., 2013). One of the most promising approaches is that of Fearnhead and Prangle (2012), in which, motivated by the need to minimize the expected mean squared error, they prove that the optimal summary statistic for any parameter is the posterior mean for that parameter. Although the posterior mean is unknown it can be estimated in a pilot set of simulations using local regression (as is indeed the aim of the regression adjustment method of Beaumont et al. (2002)). The criterion of Fearnhead and Prangle (2012) does not pertain to the posterior variance, only to the posterior mean. By contrast, our approach aims to minimize the mean integrated squared error (MISE), which is equal to the sum of the mean squared error and the posterior variance. Although it would be desirable in the long term to develop a principled proof-guided method along the lines of Fearnhead and Prangle (2012) we have taken a more empirical approach, following earlier authors (Veeramah et al., 2012), in which the performance of the ABC method for different combinations of summary statistics is evaluated through the use of test data sets in which the true parameter values are known. We aimed to find combinations that minimized the MISE for each parameter over a range of parameter values sampled from the prior: ! n 2 X 1X MISE ¼ fi f0j ð4Þ n i¼1 j where n is the number of sampled points from the posterior distribution, fi is the ith sampled point from the posterior distribution and f is the true value of parameter of the jth test data set. With a pool of 26 summary statistics, the set of all possible combinations is large. In order to reduce this number we chose an initial subset of summary statistics and the order that the remaining summary statistics were added to such subset. This was done by calculating a multiple correlation coefficient

Following Beaumont et al. (2002) we have chosen the tolerance by setting a proportion of simulations Pd to be accepted according to d(S, S ). We evaluated the use of different combinations of number of simulations and tolerance intervals by using test data sets. For each combination we used 1000 test data sets and calculated their average MISE for the three parameters of interest: rates of recombination, molecular adaptation and substitution (that is, MISErec , MISEomega and MISEmut ). In order to compare the different combinations in use using a single measurement of error, we devised a composite MISE value as: MISEtotal ¼

MISErec MISEomega MISEmut þ þ MISErec MISEomega MISEmut

ð6Þ

The set of the different conditions to be tested will have an average MISEtotal equal to the number of parameters to estimate (see for example Figure 2, where the average of the points of each panel is 3). For this reason, the composite value not only helps the comparison between different conditions, but also scales the error values to the number of parameters, allowing for a better comparison of the behavior of the different models.

RESULTS Experiment I. Selection of the preliminary subsets of summary statistics The initial experiment focused on choosing the most suitable summary statistics set for estimating recombination, adaptation and substitution rates. The first approach was to select a preliminary subset. Considering the four reference tables and using the Cp index, we obtained the most linearly correlated subsets of 1, 2, y, 6 summary statistics with each parameter for the four models (see Supplementary Tables S1 S4). This information was used to choose a preliminary subset of summary statistics and the order in which the remaining summary statistics were added to that subset. Thus, the preliminary subset was composed of eight summary statistics (PHI, NSS, w2, aapav, aaS, cdHav, cdS and nucS) and the order of addition of the remaining summary statistics was cdpav, cdps.d., aaHav, aaHs.d., cdpsk, NSRav, aaps.d., cdHs.d., aapsk, cdHsk, aaHsk, cdHku, aaHku, aapku, cdpku, NSRs.d., NSRsk and NSRku. Experiment II. Selection of the final set of summary statistics and accuracy of the ABC method For the selection of the final set of summary statistics, we applied the developed ABC method to analyze the 1000 test data sets of each model, using the points simulated under the appropriate model, and accepted the closest 0.2% (100 samples). The regression step was performed on each parameter at a time after applying a log transformation to ensure that the regression-adjustment would not project points outside the region of support of the model. The regression-adjusted parameters were then back-transformed. We used Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 258

each set of summary statistics to perform ABC on the test data sets as follows: taking a model at a time, we analyzed the 1000 test data sets and calculated their median MISE for each parameter (see Supplementary Tables S5 S8); whenever the added summary statistic decreased the MISE, it was added to the preliminary subset of summary statistics. Upon considering the results for the four models, we chose a final subset of 22 summary statistics (PHI, NSS, w2, aapav, aaps.d., aapsk, aapku, aaHav, aaHs.d., aaHsk, aaHku, aaS, cdpav, cdps.d., cdpsk, cdpku, cdHav, cdHs.d., cdHsk, cdHku, cdS, nucS). This subset contains all the suggested summary statistics but the NSRs, which are composite summary statistics calculated as nonlinear functions of aap and cdp. In Figure 1 we present the true values of the parameters that gave rise to the test data sets and their point estimated values (mode of the posterior distribution) using ABC with the final subset of summary statistics. Under the simpler cases (models A, B and C), ABC estimations were accurate, particularly for o and y. When considering the most complex scenario (model D), the correlation between the true and estimated values decreased, especially for o. Under this

model each codon can take a different value for o and each nucleotide position in the codon a different substitution scheme, and thus parameter estimation becomes more difficult. Overall, our results show that the chosen summary statistics can accurately co-estimate recombination, adaptation and substitution rates. As a side study presented in the Supplementary Information, we considered the 1000 test data sets from model A and performed ABC using two different table references with 50 000 simulations each— one table assumed no recombination (r ¼ 0) and the other no adaptation (o ¼ 1). Results suggest that unaccounted recombination leads to bias in adaptation (Supplementary Figure S1C). Interestingly, contrary to the two sets observed by Scheffler et al. (2006), the trend is for the presence of recombination to lead to an underestimation of global adaptation by using ABC. However, in our case we consider the overall impact of low and high recombination in different values of adaptation. Regarding the impact of ignoring adaptation, this seems to also lead to underestimation of recombination, especially for higher values of recombination (Supplementary Figure S1D).

Figure 1 Accuracy of ABC estimates of r, o and y rates. The full line indicates the estimated value (y axis) corresponding to each of the 1000 test data sets generated using models A–D under a given true value (x axis). Dotted lines indicate 95% credible intervals. Panels (a) to (l) show results for each parameter of each model as described in the panel’s title. Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 259

Experiment III. Finding the optimal tolerance and number of simulations Following the choice of the summary statistics, we studied the optimal tolerance interval and number of simulations. We performed ABC analysis under the same conditions as before and used the set of summary statistics proposed in Experiment II. This time we varied the total number of simulated data used and the tolerance interval. The analyses were performed using the full table references with 50 000 simulations and two subsets of it, 12 500 and 25 000 simulations, while the tolerance intervals were set to 50, 100, 200, 400 and 800 simulated samples, resulting in 15 different simulation scenarios. For each one, we calculated the average MISE for the ABC analysis of the 1000 data sets for each of the three parameters of interest and under models A D. In particular, we calculated the average MISE among all the analyses for each parameter separately (that is, MISErec , MISEomega and MISEmut ) and, following equation (6), we calculated MISEtotal . In Figure 2 we show MISEtotal as a function of tolerance for models A–D (see Supplementary Figures S2–S4 for results on the MISE of each of the three parameters). Here the MISE values increased considerably when using 25 000 or 12 500 simulations, suggesting that we should not decrease the number of simulated data below 50 000. Likewise, as we moved the tolerance interval down to 100 data sets the MISE decreased consistently, but for lower tolerance intervals the MISE was maintained and even increased. This suggests that a

tolerance of 100 data sets (when using either 12 500, 25 000 or 50 000 simulated data) provides a good sample of the posterior distribution. Experiment IV. Testing ABC robustness to model misspecification The previous experiments showed that our ABC implementation performs well under the correct model of evolution. As a follow-up, we investigated how robust this method was when the model of molecular evolution is misspecified. As model D is substantially more complex than others, we considered only models A, B and C in this experiment. Thus, we run ABC analyses on three sets (one for each model) of 1000 test data sets. For each set of test data and under the same conditions as in the previous experiments, we performed ABC with the reference tables of the other two models (that is, we used misspecified models for the analyses). When models B and C were misspecified, the results were similar to those obtained under the true model (Supplementary Figures S5 and S6). The analyses using simulated test data from A but assuming any of the other two models, or using simulated test data from B and C and assuming model A, all resulted in high MISEs (Supplementary Figures S5–S7). Indeed, model A is very different from models B and C: the former assumes a K80 nucleotide model, whereas the latter two assume a GTR þ G nucleotide model (see MATERIALS AND METHODS). Most importantly, model A assumes that all sites evolve under the same rate, whereas models B and C assume rate variation among nucleotides (B) and both nucleotides and codons (C). The weak

Figure 2 Effect of number of simulations and tolerance interval on accuracy of ABC using 1000 test data sets generated with models A–D. MISEtotal (y axis) is plotted against different tolerance intervals (x axis) in an analysis with different number of simulated data (12.5k (full line), 25k (dotted line), 50k (dashed line)). (a) Model A. (b) Model B. (c) Model C. (d) Model D. Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 260

impact of misspecification between models B and C suggests that, under this framework, rate variation among codons has a relatively mild effect on parameter estimations. In any case, before using ABC we recommend to test the codon model that best fits the data. The ABC framework also allows for model validation using prior and posterior predictive distributions (Ratmann et al., 2009) and for testing the best-fitting model out of a set of candidate models (Beaumont, 2008). The advantage of using ABC is that we can easily quantify the fit of each suggested model to the data, providing a useful alternative when the likelihood functions are impracticable. Experiment V. Choosing the appropriate model of evolution using ABC Following the experiment on model misspecification, we evaluated the accuracy of the proposed ABC method on choosing best-fit substitution models. In this experiment we performed model choice on the four sets of 1000 test data sets using a combined reference table of 200 000 simulations (that is, we joined the four reference tables of each model of evolution) with various tolerance intervals and using both ABC-rejection and ABC-logistic-regression methods, as described previously (see Beaumont, 2008; Cornuet et al., 2008). Figure 3 presents the support for the true model as the tolerance intervals decreased. For all models, the regression step increased model selection accuracy considerably, although this advantage decreased at smaller tolerance intervals. An effect that has been observed before in other ABC studies (for example, Guillemaud et al., 2010). For models B and C, when considering the widest tolerance

interval (that is, 800 data sets) ABC regression gave around 95% support to the true models, while ABC-rejection algorithm showed a support of about 65% (with the remaining 35% corresponding to model B or C depending on the model used to generate the test data). When considering a tolerance interval of 50 data sets, however, the support for the true model using ABC rejection increased up to 85% (again with the remaining support given to either model B or C). The similarities between models B and C may be responsible for the relative difficulty in being able to distinguish between them. For models A and D, as they are so different from the remaining ones, the method performed very well, identifying them as the correct models, irrespective of the tolerance intervals considered. Experiment VI. Comparison of ABC with other methods In order to evaluate how our approach compares with others, we compared parameter estimates obtained using ABC with estimates calculated by OmegaMap, PAML and LAMARC. Constrained by the extremely large computational time needed to reach convergence in OmegaMap, we simulated 10 replicates of 27 test data sets under model A using as true values every combination of the following settings: o ¼ 0.2, 1.0 and 2.0; r ¼ 0, 10 and 30; and y ¼ 60, 100 and 200. For the ABC estimates we used the reference table generated under model A, as this is the model implemented in OmegaMap. We used the best set of summary statistics, as chosen in Experiment II, and accepted the closest 0.2% (100 samples). As before, we performed the regression step on the log-transformed accepted data and the point estimate was taken as the posterior mode. For OmegaMap we used

Figure 3 Performance of model choice using ABC in 1000 test data sets generated with models A D. The points (and error bars) correspond to the average (and standard error) of the posterior probabilities of the true model from a choice of the four models in the study (A, B, C and D) using both the rejection (dashed lines) and regression (full lines) algorithms and different tolerance intervals. (a) Model A. (b) Model B. (c) Model C. (d) Model D. Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 261

two sets of priors for r: a uniform prior as it was set in ABC, and an exponential with the same mean. The prior for o was set to be the same as the one used in ABC, and the prior for the synonymous mutation rate was set to the software default, as recommended by the authors. In accordance with the generating model, the transition over transversion ratio rate was set to 2.0. All the other parameters were set as the software default. Note that OmegaMap uses the codon model NY98 (Nielsen and Yang, 1998), equivalent to the model GY94 K80 used for simulating the test data sets. Estimates were obtained by performing two independent runs for each data set until reaching convergence, the latter being assessed by comparing o and r estimates of the two runs. Note also that NetRecodon enables the user to differentiate between N and r instead of using the scaled parameter r. OmegaMap, on the other hand, estimates r directly. In order to better compare both approaches, and to remain in line with the results presented in Experiments I V, we show the results in terms of the scaled parameters. For estimation of o with PAML, we first reconstructed a maximum likelihood phylogenetic tree using PhyML (Guindon and Gascuel, 2003) and estimated o with the codeml computer program (included in the PAML package) under the M0 codon model. To estimate r with LAMARC, we used the default parameters. In Figure 4 we present a comparison between the recombination and adaptation rates estimates obtained with our ABC method and OmegaMap, PAML and LAMARC. ABC estimates of recombination rate were in general accurate with tight credible intervals, albeit showing some tendency for underestimation of higher values of r. These results were expected, as the effect of fitting a prior for r with mean 25 was to cause the posterior to underestimate r when r425 and potentially to overestimate r when ro25. This effect, however, was quite mild and decreased as the number of simulations used for ABC increased (results not shown), indicating that there is some space for improvement of the method. By contrast, OmegaMap usually showed overestimations of r with credible intervals well off the true value. Even with a more informative

Figure 4 Comparison of ABC, OmegaMap, Lamarc and PAML estimates from simulated data. Tweenty-seven evolutionary scenarios were generated under different values of r, o and y (x axes). The height of the bars correspond to the average values over replicates. Error bars indicate 95% credible intervals and horizontal dashed-lines indicate the true value. (a) Estimates of r. (b) Estimates of o.

exponential prior on r the overestimation of recombination rate by OmegaMap still remained (see Supplementary Figure S8). LAMARC was able to accurately infer the absence of recombination and presented a better performance than OmegaMap. However, in the presence of recombination LAMARC underestimated recombination rates (Figure 4). Importantly, the ability of the ABC method to estimate recombination rates was robust to low levels of codon diversity, where the tendency of LAMARC to underestimate recombination rate was increased (see Figure 4a, scenarios with y ¼ 60 and presence of recombination). On the other hand, all methods were quite accurate in estimating adaptation rates (Figure 4b), where the true value was commonly contained within narrow credible intervals. Experiment VII. Application of ABC to real data Here we illustrate the application of our ABC method to 24 sequence alignments of HIV-1. These data were obtained from the PopSet database at the NCBI, and correspond to several genomic regions (for example, partial sequences of genes env, gag, nef, pol, tat and of long terminal repeats) and to different viral types and subtypes (for example, subtypes A, B, C, G)—details and references are shown in Supplementary Table S9. These HIV-1 data sets are particularly interesting to analyze owing to their generally high recombination, adaptation and substitution rates. In all cases, sequences were realigned and missing, ambiguous and stop codons were excluded. The priors were the same as those used in Experiments I–VI, as these were already chosen in order to cover a wide range of empirical scenarios. We used jModelTest (Posada, 2008) to select the best-fit substitution model and Hyphy to calculate the nucleotide frequencies at each codon position. We then considered 500 000 simulated data, from which the closest 0.2% (1000) were retained. This was followed by the regression step on the log-transformed parameters. We present the estimates obtained using our ABC method in Figure 5 and Supplementary Table S10. The method seemed to perform well, as the estimates fell well within the prior distributions and the posterior distributions were fairly sharply peaked with narrow credible intervals. Nevertheless there was a trend for credible intervals to become wider when estimated values get higher for all the parameters. This trend was also confirmed in Experiments II and VI. In order to assess the goodness of fit of the models to the data, we performed principal component analysis on the simulated summary statistics and plot the first two components for the accepted simulated data and the observed data (Supplementary Figure S9). Reassuringly, these plots show that the observed data falls consistently within the accepted simulated data. Additionally, we tried to obtain OmegaMap estimates from these same HIV-1 data sets. For a variety of data sets, particularly those with high recombination rates, the program had apparent problems to converge, and even after 20 million Markov chain Monte Carlo iterations (several months running on a cluster) r estimates from the two independent runs still differ by more than five units of r. Nevertheless, the comparison of molecular adaptation estimates obtained using ABC and those obtained with OmegaMap showed an encouraging agreement (Supplementary Figure S10). DISCUSSION We present a novel approach that allows one to assume a codonsubstitution model, while considering recombination at the nucleotide level. We show that this method provides accurate estimates of recombination, molecular adaptation and substitution rates from alignments of coding sequences under different conditions. Thus, for Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 262

Figure 5 Estimates of r, o and y from 24 HIV-1 published data sets. Error bars indicate 95% credible intervals. (a) Estimates of r. (b) Estimates of o. (c) Estimates of y.

the relatively simple models explored (A, B and C), all the three parameters were accurately estimated, except in the high recombination scenario, which seems to be more challenging. For more complex models, with both substitution and adaptation rates differing per site (case D), the precision of ABC decreased, as might be expected. The efficacy of ABC depends to a great extent on the choice of summary statistics that extract and summarize maximal information from data, while allowing simulated data sets to be easily compared with the real data set under investigation. The strategy for choosing summary statistics in the ABC study of Wilson et al. (2009) was to use previously developed method-of-moments estimators (a similar approach is justified theoretically in Fearnhead and Prangle (2012)). Note that Wilson et al. (2009) developed their ABC to analyze a single data set, and could afford for summary statistics that require a noticeable computation time. By contrast, we present a general method for choosing a subset of summary statistics based on the calculation of the Cp index. We used test data to perform informed trial-and-error tests that suggest that a set of 22 summary statistics, chosen to be rapidly computable, is adequate to provide accurate estimates of the parameters studied. ABC has been shown to be an efficient alternative to full-likelihood methods when the likelihood function is intractable. This approximate method has also the advantage of being potentially faster than the correspondent full-likelihood approach. Nevertheless, ABC can still be highly computational intensive if the simulation step is complex. For this reason it is important to refine the method in Heredity

order to find an optimal number of simulations both in terms of accuracy and speed. We showed that in a range of simple to complex models of molecular evolution a sample size of 100 points provided sufficient information to construct reliable posterior summaries. Given this number, at least 50 000 simulated points in total are needed to successfully perform ABC in the scenarios proposed. Under these conditions we obtained very good estimates, reaching correlations between the true and the estimated values approaching 1. Note, however, that we have worked with moderately small samples size (15 sequences with 333 codons). Larger data would require more computation time for each simulation, but might need fewer simulations to reach the same accuracy. An illustrative list of running times for NetRecodon is shown in the Supplementary Table S11, and as expected longer sequences and higher sample sizes require longer computational times. We also observe that the prior distribution of the parameters can affect the required simulation time, indeed a higher recombination rate requires more computation time. Importantly, NetRecodon simulations can be run in parallel, reducing the simulation time, which is the ABC step with the highest computational cost. Finally, the estimation part of ABC requires only a few minutes and does not depend on the simulation settings. We also examined the problem of misspecification of the model of evolution. In cases where the model used for the ABC approach and the model used to generate the data differed on the assumption of rate variation among sites, the estimation of the parameters was poor. Fortunately, rate variation among sites is easily detected using model selection techniques, and our results show that ABC is robust to slight model misspecification, in particular regarding different relative substitution matrices (see Supplementary Figures S5–S7). As noted by Ratmann et al. (2009), it is relatively straightforward to investigate model misspecification under an ABC framework through comparison of observed summary statistics with those generated under predictive distributions. ABC model choice enables different models to be compared, although such tests do not assure that the most supported model is correct. Recently, attention has been drawn to the potential sensitivity of ABC model choice to sufficiency of the summary statistics (Robert et al., 2011). These studies recommend the use of simulation testing to evaluate performance of model choice procedures. We have carried out extensive simulations, and showed that ABC can easily distinguish between distant models. In the comparisons between evolutionary models that were similar its efficiency was slightly reduced, but we still obtained a high support for the true model (see Figure 3). We compared estimates of recombination and adaptation rates made by our ABC method, OmegaMap, PAML and LAMARC on simulated test data sets. An important difference between these approaches is that only the ABC method considers intracodon recombination. Nevertheless, according to previous studies, our expectation was that the estimations of alignment-wise adaptation rates (o) would not be much affected by ignoring intracodon recombination (Arenas and Posada, 2010). Reassuringly, this was the case, and the similar estimates obtained by the different methods reinforce our beliefs about the validity of the ABC method proposed here (Figure 4b). More interesting, though, was the observation that overall recombination rates were most accurately estimated by ABC, OmegaMap showed a clear tendency for overestimation (Figure 4a). In fact, some authors have shown that there is an upward bias on product of approximating conditionals-likelihood estimations of r even on data without intracodon recombination (Li and Stephens, 2003; Wilson and McVean, 2006), although this can be corrected as suggested by Li and Stephens (2003). LAMARC performed well in the absence of recombination but for higher recombination rates tended

Coestimation of recombination and selection using ABC JS Lopes et al 263

to underestimate the rate. Interestingly, although such an underestimation was increased in scenarios with lower levels of codon diversity, ABC was quite robust to these scenarios. Additional advantages of the ABC approach are its speed and the possibility to analyze data sets with a vast variety of codon models, as implemented in NetRecodon or in other simulator, in contrast with the single codon model implemented in OmegaMap. This can be particularly beneficial when there is a need to consider substitution models that best-fit the data (see Yang et al., 2000; Arbiza et al., 2011). We collected 24 HIV-1 data sets with which we challenged our developed ABC method. In this case our ABC approach exhibited good behavior, in that estimates using different tolerance intervals and either the regression or the rejection algorithms showed considerable agreement, and as the tolerance interval decreased, the credible intervals got tighter (results not shown). Furthermore, the tests to assess the goodness of fit of the models show that the models explain the data reasonably well (Supplementary Figure S9). In accordance with previous studies, our results suggest that overall the analyzed HIV-1 data sets exhibit large recombination rates, and that, considering the whole sequences, selection pressure has been mostly purifying (Scheffler et al., 2006). Nevertheless, there is space for improvement. First, we have assumed a uniform recombination rate along sequences, although some studies revealed the presence of hotspots in HIV-1 (for example, Zhuang et al., 2002). Second, we have estimated molecular adaptation for the whole sequence, although estimating adaptation at local (codon) level is possible (for example, by using the Bayesian hierarchical model developed by Bazin et al., 2010). In any case, we argue that the ABC approach can be an ideal framework to incorporate more relaxed model assumptions. In this study we assumed a model that only considers molecular adaptation by the non-synonymous/synonymous ratio. Although this ratio is widely used as an indicator of molecular adaptation (for example, Nielsen and Yang, 1998; Wilson and McVean, 2006; Wilson et al., 2009), this methodology is not free from some controversy. One of the main criticisms is its use in single-population samples, where the close ancestry of the sequences may break the relation of this ratio to the adaptation coefficient (Kryazhimskiy and Plotkin, 2008). However, for populations with high substitution rate, non-synonymous/synonymous ratio approaches have been shown to perform well (for example, virus and bacteria). Nonetheless, this assumption should be considered when applying the methodology we present here. To summarize, our approach provides a novel ABC method to estimate jointly codon substitution, adaptation and recombination rates under realistic substitution models from coding data. This new method can also accurately choose the substitution model that best fits the data between a series of candidate models. In addition, further increase of model complexity is relatively straightforward, requiring only the introduction of new features to the simulation software. Such alterations may require testing different summary statistics and optimizing the number of simulations and the tolerance interval, a procedure that we believe can be easily attained by following the steps we have proposed here. DATA ARCHIVING There were no data to deposit. CONFLICT OF INTEREST The authors declare no conflict of interest. ACKNOWLEDGEMENTS We thank Maria Anisimova, Nicolas Galtier and one anonymous reviewer for thoughtful comments, which improved the quality of the original manuscript.

This work was supported by the Spanish Ministry of Science (grants BIO200761411 and BFU2009-08611 to DP and FPI fellowship BES-2005-9151 to MA). JSL and MAB were supported by EPSRC (grant H3043900). JSL was also supported by Fundacao Ciencia Tecnologia (grant SFRH/BD/43588/2008).

Anisimova M, Nielsen R, Yang Z (2003). Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164: 1229–1236. Arbiza L, Patricio M, Dopazo H, Posada D (2011). Genome-wide heterogeneity of nucleotide substitution model fit. Genome Biol Evol 3: 896–908. Arenas M, Posada D (2010). Coalescent simulation of intracodon recombination. Genetics 184: 429–437. Bazin E, Dawson KJ, Beaumont MA (2010). Likelihood-free inference of population structure and local adaptation in a Bayesian hierarchical model. Genetics 185: 587–602. Beaumont M (2008). Joint determination of topology, divergence time, and immigration in population trees. In: Matsumura S, Forster P, Renfrew C (eds) Simulations, Genetics, and Human Prehistory. McDonald Institute for Archaeological Research: Cambridge, pp 135–154. Beaumont MA (2010). Approximate Bayesian computation in evolution and ecology. Annu Rev Ecol Evol Syst 41: 379–405. Beaumont MA, Zhang W, Balding DJ (2002). Approximate Bayesian computation in population genetics. Genetics 162: 2025–2035. Blum MGB, Nunes MA, Prangle D, Sisson SA (2013). A comparative review of dimension reduction methods in approximate Bayesian computation. Stat Sci 28: 189–208. Bruen TC, Philippe H, Bryant D (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics 172: 2665–2681. Carvajal-Rodriguez A, Crandall KA, Posada D (2006). Recombination estimation under complex evolutionary models with the coalescent composite-likelihood method. Mol Biol Evol 23: 817–827. Cornuet JM, Santos F, Beaumont MA, Robert CP, Marin JM, Balding DJ et al. (2008). Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics 24: 2713–2719. Csillery K, Francois O, Blum MGB (2012). abc: an R package for approximate Bayesian computation (ABC). Methods Ecol Evol 3: 475–479. Fearnhead P, Prangle D (2012). Constructing ABC summary statistics: semi-automatic ABC. J R Statist Soc B 74: 1–28. Goldman N, Yang Z (1994). A codon-based model of nucleotide substitution for proteincoding DNA sequences. Mol Biol Evol 11: 725–736. Guillemaud T, Beaumont MA, Ciosi M, Cornuet JM, Estoup A (2010). Inferring introduction routes of invasive species using approximate Bayesian computation on microsatellite data. Heredity 104: 88–99. Guindon S, Gascuel O (2003). A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52: 696–704. Hughes AL, Packer B, Welch R, Bergen AW, Chanock SJ, Yeager M (2003). Widespread purifying selection at polymorphic sites in human protein-coding loci. Proc Natl Acad Sci USA 100: 15754–15757. Jakobsen IB, Easteal S (1996). A program for calculating and displaying compatibility matrices as an aid to determining reticulate evolution in molecular sequences. Comput Appl Biosci 12: 291–295. Kryazhimskiy S, Plotkin JB (2008). The population genetics of dN/dS. PLoS Genet 4: e1000304. Kuhner MK (2006). LAMARC 2.0: maximum likelihood and Bayesian estimation of population parameters. Bioinformatics 22: 768–770. Li N, Stephens M (2003). Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 2213–2233. Lumley T, Miller A (2004). Leaps: Regression Subset Selection, R package version 2.7. Available at, http://CRAN.R-project.org/web/packages/leaps. Nielsen R, Yang Z (1998). Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148: 929–936. Posada D (2008). jModelTest: phylogenetic model averaging. Mol Biol Evol 25: 1253–1256. Posada D, Crandall KA (2002). The effect of recombination on the accuracy of phylogeny estimation. J Mol Evol 54: 396–402. Pritchard JK, Seielstad MT, Perez-Lezaun A, Feldman MW (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol Biol Evol 16: 1791–1798. Ratmann O, Andrieu C, Wiuf C, Richardson S (2009). Model criticism based on likelihoodfree inference, with an application to protein network evolution. Proc Natl Acad Sci USA 106: 10576–10581. Robert CP, Cornuet JM, Marin JM, Pillai N (2011). Lack of confidence in approximate Bayesian computation model choice. Proc Natl Acad Sci USA 108: 15112–15117. Scheffler K, Martin DP, Seoighe C (2006). Robust inference of positive selection from recombining coding sequences. Bioinformatics 22: 2493–2499. Schierup MH, Hein J (2000). Consequences of recombination on traditional phylogenetic analysis. Genetics 156: 879–891.

Heredity

Coestimation of recombination and selection using ABC JS Lopes et al 264 Smith JM (1992). Analyzing the mosaic structure of genes. J Mol Evol 34: 126–129. Sullivan J, Joyce P (2005). Model selection in phylogenetics. Annu Rev Ecol Evol Syst 36: 445–466. van Rij RP, Worobey M, Visser JA, Schuitemaker H (2003). Evolution of R5 and X4 human immunodeficiency virus type 1 gag sequences in vivo: evidence for recombination. Virology 314: 451–459. Veeramah KR, Wegmann D, Woerner A, Mendez FL, Watkins JC, Destro-Bisol G et al. (2012). An early divergence of KhoeSan ancestors from those of other modern humans is supported by an ABC-based analysis of autosomal resequencing data. Mol Biol Evol 29: 617–630. Wegmann D, Leuenberger C, Excoffier L (2009). Efficient approximate bayesian computation coupled with markov chain monte carlo without likelihood. Genetics 182: 1207–1218.

Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, Bolton E et al. (2009). Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni. Mol Biol Evol 26: 385–397. Wilson DJ, McVean G (2006). Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172: 1411–1425. Yang Z (2007). PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24: 1586–1591. Yang Z, Nielsen R, Goldman N, Pedersen A-MK (2000). Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155: 431–449. Zhuang J, Jetzt AE, Sun G, Yu H, Klarmann G, Ron Y et al. (2002). Human immunodeficiency virus type 1 recombination: rate, fidelity, and putative hot spots. J Virol 76: 11273–11282.

Supplementary Information accompanies this paper on Heredity website (http://www.nature.com/hdy)

Heredity

Supplementary Material Coestimation of Recombination, Substitution and Molecular Adaptation rates by approximate Bayesian computation Joao S. Lopes, Miguel Arenas, David Posada and Mark A. Beaumont Table S1. Subsets of 1 up to 6 summary statistics obtained by calculating the Cp index, treating the parameter of interest as the dependent variable of a regression model and the summary statistics as the predictors. Summary statistics are computed from test data simulated under model A.

Number of predictors

R2

Recombination considered

1

0.45

PHI.

rate

2

0.54

PHI and cdHav.

3

0.67

PHI; χ2 and cdHav.

4

0.70

PHI; NSS; χ2 and cdHav.

5

0.71

PHI; NSS; χ2; aaHsd and cdHav.

6

0.71

PHI; NSS; χ2; cdπav; cdHav and nucS.

1

0.86

cdS.

2

0.87

cdπsd and cdS.

3

0.89

aaHsd; cdπsd and cdS.

4

0.89

aaHav; aaHsd; cdπsd and cdS.

5

0.89

aaπav; aaHsd; cdπsd; cdπsk and cdS.

6

0.89

PHI; χ2; aaHav; cdπsd; cdHsd and cdS.

1

0.58

NSRav.

2

0.94

aaS and nucS.

3

0.96

aaπav; aaS and nucS.

4

0.97

aaπav; aaHsk; cdπav; and NSRav.

5

0.98

aaπav; aaHsk; aaHsk; cdHav; and NSRav.

6

0.98

aaπav; aaHsk; aaS; cdπav; nucS and NSRav.

Parameter

Substitution rate

dN/dS rate

1  

Summary statistics (predictors)

Table S2. Subsets of 1 up to 6 summary statistics obtained by calculating the Cp index, treating the parameters of interest as the dependent variable of a regression model and the summary statistics as the predictors. Summary statistics are computed from test data simulated under model B.

Number of predictors

R2

Recombination considered

1

0.44

PHI.

rate

2

0.54

PHI and cdHsd.

3

0.65

PHI; χ2 and cdHav.

4

0.67

PHI; NSS; χ2 and cdHav.

5

0.68

PHI; NSS; χ2; cdHav and nucS.

6

0.69

PHI; NSS; χ2; cdπav; cdHav and nucS.

1

0.84

cdS.

2

0.86

cdπsd and cdS.

3

0.88

aaHav; cdπsd and cdS.

4

0.88

aaHav; cdπsd; cdHsd and cdS.

5

0.89

aaHav; aaHsd; cdπsd; cdHsd and cdS.

6

0.89

aaHav; cdπav; cdπsd; cdπsk; cdπku and cdS.

1

0.58

NSRsd.

2

0.94

aaS and nucS.

3

0.96

aaπsd; aaS and nucS.

4

0.97

aaπsd; aaS; nucS; and NSRav.

5

0.98

aaπsd; aaS; cdπsd; nucS; and NSRav.

6

0.98

aaπsd; aaπsk; aaS; cdπsd; nucS and NSRav.

Parameter

Substitution rate

dN/dS rate

2  

Summary statistics (predictors)

Table S3. Subsets of 1 up to 6 summary statistics obtained by calculating the Cp index, treating the parameters of interest as the dependent variable of a regression model and the summary statistics as the predictors. Summary statistics are computed from test data simulated under model C.

Number of predictors

R2

Recombination considered

1

0.39

PHI.

rate

2

0.51

PHI and cdHsd.

3

0.67

PHI; χ2 and cdHav.

4

0.68

PHI; NSS; χ2 and cdHav.

5

0.70

PHI; χ2; cdπsd; cdHav and nucS.

6

0.72

PHI; NSS; χ2; cdπsd; cdHav and nucS.

1

0.86

cdS.

2

0.87

NSS and cdS.

3

0.88

aaHav; cdπsd and cdS.

4

0.89

χ2; aaHav; cdπsd and cdS.

5

0.89

PHI; χ2; aaHav; cdπsd and cdS.

6

0.89

PHI; NSS; χ2; aaHav; cdπsd and cdS.

1

0.74

NSRsd.

2

0.91

aaS and nucS.

3

0.93

aaS; nucS and NSRsd.

4

0.95

aaHsd; aaS; cdHav and NSRav.

5

0.95

aaπsk; aaHsd; aaS; cdHav and NSRav.

6

0.95

aaπsk; aaHsd; aaS; cdHav; NSRav and NSRsd.

Parameter

Substitution rate

dN/dS rate

3  

Summary statistics (predictors)

Table S4. Subsets of 1 up to 6 summary statistics obtained by calculating the Cp index, treating the parameters of interest as the dependent variable of a regression model and the summary statistics as the predictors. Summary statistics are computed from test data simulated under model D.

Parameter

Number of

R2

Recombination rate considered

1 predictors

0.15

PHI.

2

0.28

PHI and aaHsd.

3

0.38

PHI; χ2 and cdHsd.

4

0.42

PHI; χ2; cdHsd and nucS.

5

0.44

PHI; χ2; cdπav; cdHsd and cdHku.

6

0.47

PHI; χ2; cdHsd; cdHsk; cdS and nucS.

1

0.50

aaπsd.

2

0.52

aaHsd and cdπsd.

3

0.53

PHI; aaHsk and cdπsd.

4

0.54

aaπsd; aaπsk; cdπav and cdπsd.

5

0.55

aaπsd; aaπsk; aaπku; cdπav and cdπsd.

6

0.56

PHI; aaπsd; aaπsk; cdπav; cdπsd and cdHsk.

1

0.20

NSRsd.

2

0.54

aaS and nucS.

3

0.57

aaS; nucS and NSRsd.

4

0.59

aaHsd; aaS; cdHav and NSRav.

5

0.60

aaπsk; aaHsd; aaS; cdHav and NSRav.

6

0.60

aaπsk; aaHsd; aaS; cdHav; NSRav and NSRsd.

Substitution rate

dN/dS rate

4  

Summary statistics (predictors)

Table S5. Median value of the MISE of ABC estimates of recombination (ρ), adaptation (ω) and substitution (θ) rates of 1000 test datasets generated using model A with different subsets of summary statistics. The number in square brackets indicates the size of the subset.

Set

Summary Statistics

1

PHI, NSS, χ2, aaπav, aaS, cdHav, cdS and nucS

ρ

ω

θ

1.47E-12

9.61E-04

1.07E-10

PHI, NSS, χ2, aaπav, aaS, cdπava, cdHav, cdS and 1.47E-12

7.65E-04

8.64E-11

1.31E-12

7.35E-04

8.19E-11

1.18E-12

6.66E-04

7.59E-11

8.55E-13

6.33E-04

6.96E-11

1.28E-12

7.33E-04

8.83E-11

1.02E-12

5.76E-04

7.66E-11

8.42E-13

5.98E-04

7.21E-11

6.95E-13

5.44E-04

5.70E-11

PHI, NSS, χ2, aaπav, aaπsd, aaπska, aaHav, aaHsd, 7.33E-13

5.29E-04

5.94E-11

[n=8]. 2

nucS [n=9]. 3

PHI, NSS, χ2, aaπav, aaS, cdπav, cdπsda, cdHav, cdS and nucS [n=10].

4

PHI, NSS, χ2, aaπav, aaHava, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=11].

5

PHI, NSS, χ2, aaπav, aaHav, aaHsda, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=12].

6

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπska, cdHav, cdS and nucS [n=13].

7

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS, nucS and NSRava [n=13].

8

PHI, NSS, χ2, aaπav, aaπsda, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=13].

9

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsda, cdS and nucS [n=14].

10

aaS, cdπav, cdπsd, cdHav, cdHsd, cdS and nucS [n=15].

5  

11

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS,

5.81E-13

5.23E-04

5.43E-11

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHska, 5.80E-13

5.23E-04

5.39E-11

5.73E-13

5.24E-04

5.09E-11

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHkua, 5.41E-13

5.19E-04

4.94E-11

5.06E-04

5.06E-11

5.78E-04

5.68E-11

4.99E-04

5.17E-11

5.59E-04

5.10E-11

5.46E-04

4.88E-11

5.69E-04

5.88E-11

cdπav, cdπsd, cdHav, cdHsd, cdHska, cdS and nucS [n=15]. 12

aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdS and nucS [n=16]. 13

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHkua, cdS and nucS [n=16].

14

aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=17]. 15

PHI, NSS, χ2, aaπav, aaπsd, aaπkua, aaHav, aaHsd, 5.45E-13 aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=18].

16

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHku, 7.64E-13 aaS, cdπav, cdπsd, cdπkua, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=18].

17

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHku, 6.11E-13 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRsda [n=18].

18

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHku, 5.67E-13 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRska [n=18].

19

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHku, 5.63E-13 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRkua [n=18].

Full set

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS

6  

8.30E-13

[n=22].

Full set2 PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS, NSRav, NSRsd, NSRsk and NSRku [n=26]. a

Summary statistic added to the set.

7  

1.01E-12

5.87E-04

6.50E-11

Table S6. Median value of the MISE of ABC estimates of recombination (ρ), adaptation (ω) and substitution (θ) rates of 1000 test datasets generated using model B with different subsets of summary statistics. The number in square brackets indicates the size of the subset.

Set

Summary Statistics

1

PHI, NSS, χ2, aaπav, aaS, cdHav, cdS and nucS

ρ

ω

θ

2.18E-12

1.40E-03

1.65E-10

PHI, NSS, χ2, aaπav, aaS, cdπava, cdHav, cdS and 1.98E-12

1.16E-03

1.36E-10

1.80E-12

1.10E-03

1.25E-10

1.68E-12

1.02E-03

1.17E-10

1.45E-12

9.78E-04

1.08E-10

1.64E-12

1.01E-03

1.10E-10

1.62E-12

9.26E-04

1.12E-10

1.47E-12

9.69E-04

1.08E-10

1.24E-12

9.31E-04

9.40E-11

1.30E-12

9.56E-04

9.58E-11

[n=8]. 2

nucS [n=9]. 3

PHI, NSS, χ2, aaπav, aaS, cdπav, cdπsda, cdHav, cdS and nucS [n=10].

4

PHI, NSS, χ2, aaπav, aaHava, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=11].

5

PHI, NSS, χ2, aaπav, aaHav, aaHsda, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=12].

6

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπska, cdHav, cdS and nucS [n=13].

7

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS, nucS and NSRava [n=13].

8

PHI, NSS, χ2, aaπav, aaπsda, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=13].

9

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsda, cdS and nucS [n=13].

10

PHI, NSS, χ2, aaπav, aaπska, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsd, cdS and nucS [n=14].

8  

11

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav,

1.10E-12

8.51E-04

8.22E-11

1.07E-12

8.57E-04

8.21E-11

1.02E-12

8.12E-04

7.71E-11

9.50E-13

8.04E-04

7.57E-11

PHI, NSS, χ2, aaπav, aaπkua, aaHav, aaHsd, aaHsk, 1.03E-12

8.81E-04

7.97E-11

8.20E-04

7.83E-11

8.01E-04

7.81E-11

9.05E-04

7.86E-11

8.73E-04

7.70E-11

8.40E-04

8.40E-11

cdπsd, cdHav, cdHsd, cdHska, cdS and nucS [n=14]. 12

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHska, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdS and nucS [n=15].

13

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHkua, cdS and nucS [n=16].

14

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaHkua, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=17].

15

aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=18]. 16

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaHku, 1.04E-12 aaS, cdπav, cdπsd, cdπkua, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=18].

17

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaHku, 1.06E-12 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRsda [n=18].

18

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaHku, 1.06E-12 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRska [n=18].

19

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaHsk, aaHku, 1.02E-12 aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRkua [n=18].

Full set

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS

9  

1.17E-12

[n=22].

Full set2 PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS, NSRav, NSRsd, NSRsk and NSRku [n=26]. a

Summary statistic added to the set.

10  

1.45E-12

9.20E-04

9.64E-11

Table S7. Median value of the MISE of ABC estimates of recombination (ρ), adaptation (ω) and substitution (θ) rates of 1000 test datasets generated using model C with different subsets of summary statistics. The number in square brackets indicates the size of the subset.

Set

Summary Statistics

1

PHI, NSS, χ2, aaπav, aaS, cdHav, cdS and nucS

ρ

ω

θ

1.97E-12

2.45E-03

1.80E-10

PHI, NSS, χ2, aaπav, aaS, cdπava, cdHav, cdS and 1.79E-12

1.96E-03

1.51E-10

1.58E-12

1.87E-03

1.40E-10

1.50E-12

1.75E-03

1.31E-10

1.44E-12

1.58E-03

1.23E-10

1.54E-12

1.74E-03

1.24E-10

1.57E-12

1.59E-03

1.26E-10

1.40E-12

1.60E-03

1.17E-10

9.62E-13

1.45E-03

1.07E-10

1.01E-12

1.41E-03

1.07E-10

[n=8]. 2

nucS [n=9]. 3

PHI, NSS, χ2, aaπav, aaS, cdπav, cdπsda, cdHav, cdS and nucS [n=10].

4

PHI, NSS, χ2, aaπav, aaHava, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=11].

5

PHI, NSS, χ2, aaπav, aaHav, aaHsda, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=12].

6

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπska, cdHav, cdS and nucS [n=13].

7

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS, nucS and NSRava [n=13].

8

PHI, NSS, χ2, aaπav, aaπsda, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=13].

9

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsda, cdS and nucS [n=14].

10

PHI, NSS, χ2, aaπav, aaπsd, aaπska, aaHav, aaHsd, aaS, cdπav, cdπsd, cdHav, cdHsd, cdS and nucS [n=15].

11  

11

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS,

8.65E-13

1.31E-03

9.66E-11

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHska, 8.56E-13

1.26E-03

9.65E-11

7.98E-13

1.23E-03

9.12E-11

8.17E-13

1.18E-03

8.79E-11

PHI, NSS, χ2, aaπav, aaπsd, aaπkua, aaHav, aaHsd, 8.53E-13

1.20E-03

8.89E-11

8.24E-13

1.22E-03

8.57E-11

9.22E-13

1.21E-03

9.33E-11

8.64E-13

1.27E-03

9.41E-11

8.88E-13

1.25E-03

8.98E-11

1.01E-12

1.28E-03

9.70E-11

cdπav, cdπsd, cdHav, cdHsd, cdHska, cdS and nucS [n=15]. 12

aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdS and nucS [n=16]. 13

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHkua, cdS and nucS [n=17].

14

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaHkua, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=18].

15

aaHsk, aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=19]. 16

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπkua, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=19].

17

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRsda [n=19].

18

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRska [n=19].

19

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRkua [n=19].

Full set

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS

12  

[n=22].

Full set2 PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS, NSRav, NSRsd, NSRsk and NSRku [n=26]. a

Summary statistic added to the set.

13  

1.20E-12

1.24E-03

1.05E-10

Table S8. Median value of the MISE of ABC estimates of recombination (ρ), adaptation (ω) and substitution (θ) rates of 1000 test datasets generated using model D with different subsets of summary statistics. The number in square brackets indicates the size of the subset.

Set

Summary Statistics

1

PHI, NSS, χ2, aaπav, aaS, cdHav, cdS and nucS

ρ

ω

θ

7.61E-12

6.36E-02 1.20E-09

PHI, NSS, χ2, aaπav, aaS, cdπava, cdHav, cdS and 6.81E-12

5.40E-02 1.09E-09

[n=8]. 2

nucS [n=9]. 3

PHI, NSS, χ2, aaπav, aaS, cdπav, cdπsda, cdHav,

6.30E-12

5.34E-02 1.03E-09

5.85E-12

4.76E-02 9.84E-10

4.69E-12

4.68E-02 8.52E-10

4.32E-12

4.52E-02 8.48E-10

4.61E-12

4.00E-02 8.80E-10

4.24E-12

4.42E-02 8.06E-10

3.91E-12

4.26E-02 7.50E-10

3.59E-12

3.91E-02 7.13E-10

cdS and nucS [n=10]. 4

PHI, NSS, χ2, aaπav, aaHava, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=11].

5

PHI, NSS, χ2, aaπav, aaHav, aaHsda, aaS, cdπav, cdπsd, cdHav, cdS and nucS [n=12].

6

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπska, cdHav, cdS and nucS [n=13].

7

PHI, NSS, χ2, aaπav, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπsk, cdHav, cdS, nucS and NSRava [n=14].

8

PHI, NSS, χ2, aaπav, aaπsda, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπsk, cdHav, cdS and nucS [n=14].

9

PHI, NSS, χ2, aaπav, aaπsd, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsda, cdS and nucS [n=15].

10

PHI, NSS, χ2, aaπav, aaπsd, aaπska, aaHav, aaHsd, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdS and nucS [n=16].

14  

11

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd,

3.49E-12

3.65E-02 6.34E-10

3.30E-12

3.54E-02 6.01E-10

3.13E-12

3.40E-02 5.62E-10

3.08E-12

3.40E-02 5.19E-10

3.04E-12

3.48E-02 5.50E-10

3.08E-12

3.43E-02 5.51E-10

3.03E-12

3.31E-02 5.28E-10

3.05E-12

3.55E-02 5.49E-10

3.21E-12

3.53E-02 5.47E-10

aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHska, cdS and nucS [n=17]. 12

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHska, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdS and nucS [n=18].

13

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdHkua, cdS and nucS [n=19].

14

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaHkua, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=20].

15

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπkua, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=21].

16

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπkua, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=21].

17

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRsda [n=21].

18

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS and NSRska [n=21].

19

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdHav,

15  

cdHsd, cdHsk, cdHku, cdS, nucS and NSRkua [n=21]. Full set

PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav,

3.07E-12

3.29E-02 5.72E-10

3.40E-12

3.48E-02 6.07E-10

aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS and nucS [n=22]. Full set2 PHI, NSS, χ2, aaπav, aaπsd, aaπsk, aaπku, aaHav, aaHsd, aaHsk, aaHku, aaS, cdπav, cdπsd, cdπsk, cdπku, cdHav, cdHsd, cdHsk, cdHku, cdS, nucS, NSRav, NSRsd, NSRsk and NSRku [n=26]. a

Summary statistic added to the set.

16  

Table S9. 24 HIV-1 datasets with different number of sampled sequences and sequence length.

a

Dataset

na

lb

PopSetc

Reference

1

29

288

56112142

(Bhanja et al. 2005)

2

23

336

169883464

(Yabar, Salvatierra, and Quijano 2008)

3

27

387

28822436

(Williamson et al. 2003)

4

23

264

9887265

(Bobkov et al. 2001)

5

33

435

40806232

(Dykes et al. 2004)

6

44

429

37195022

(van Rij et al. 2003)

7

22

207

5456702

(Quinones-Mateu et al. 1999)

8

22

864

219808032

(Malet et al. 2009)

9

37

1002

170675189

(de Souza et al. 2008)

10

16

120

216360888

(Buonaguro et al. 2008)

11

26

210

162756032

(Sivakumaran et al. 2007)

12

14

1320

125486754

(Ojesina et al. 2007)

13

40

294

77812451

(Bessong et al. 2006)

14

10

570

154550148

(Rodriguez et al. 2007)

15

35

423

145206873

(Ryan et al. 2007)

16

10

606

133950452

(Chen et al. 2007)

17

23

432

34980089

(Zhong et al. 2003)

18

19

906

118139354

(Drexler et al. 2007)

19

57

414

126023789

(Harrington et al. 2007)

20

32

984

113720407

(Kurle et al. 2007)

21

17

405

61373131

(Ramirez de Arellano et al. 2007)

22

20

894

62958734

(Agnihotri et al. 2006)

23

43

528

47779053

(Kumar et al. 2006)

24

27

402

113376620

(Kitchen et al. 2006)

b

number of sampled sequences; sequence length in nucleotides; cPopSet ID.

17  

Table S10. ABC estimates (95% credible intervals in square brackets) of the recombination (ρ) adaptation (ω) and substitution (θ) rates for 24 HIV-1 datasets. ABC Dataset

ρ

ω

θ

1

25.63 [25.28, 26.00]

1.47 [1.30, 1.67]

57.57 [55.08, 59.88]

2

31.47 [25.80, 36.68]

0.13 [0.10, 0.15]

119.28 [102.01, 135.28]

3

6.19 [4.22, 8.40]

0.12 [0.10, 0.13]

237.96 [198.67, 273.64]

4

0.00 [0.00, 3.14]

1.25 [0.53, 2.39]

12.66 [8.25, 17.52]

5

49.68 [49.45, 49.87]

0.33 [0.19, 0.50]

24.08 [20.18, 28.73]

6

10.44 [0.00, 26.31]

0.27 [0.18, 0.40]

15.56 [11.07, 20.09]

7

17.46 [9.75, 26.69]

0.38 [0.28, 0.51]

34.24 [24.22, 45.39]

8

2.88 [0.66, 5.85]

0.14 [0.12, 0.16]

54.74 [51.02, 58.90]

9

3.55 [2.44, 4.76]

0.16 [0.15, 0.18]

282.81 [265.75, 297.27]

10

19.23 [12.74, 27.31]

1.31 [0.86, 1.86]

105.81 [88.55, 124.69]

11

3.58 [2.08, 5.64]

1.59 [1.07, 2.27]

14.75 [12.72, 17.06]

12

5.90 [5.60, 6.16]

0.08 [0.07, 0.08]

281.68 [266.92, 292.35]

13

43.02 [30.55, 50.00]

0.12 [0.08, 0.17]

60.07 [50.26, 71.82]

14

1.14 [0.73, 1.61]

0.23 [0.20, 0.26]

76.38 [71.37, 81.68]

15

28.85 [23.42, 34.03]

1.35 [1.15, 1.55]

233.90 [221.87, 246.79]

16

16.39 [15.43, 17.35]

0.24 [0.22, 0.26]

73.88 [71.12, 76.73]

17

18.83 [18.23, 19.43]

0.25 [0.23, 0.28]

299.70 [299.40, 299.91]

18

20.90 [15.55, 26.17]

0.05 [0.05, 0.06]

295.86 [294.54, 297.09]

19

32.00 [24.27, 40.58]

0.90 [0.71, 1.07]

34.64 [29.60, 40.30]

20

10.23 [8.60, 12.07]

0.09 [0.08, 0.09]

158.18 [152.99, 162.97]

21

20.64 [16.07, 25.03]

1.21 [0.94, 1.50]

67.97 [63.45, 73.43]

22

0.12 [0.00, 0.30]

0.76 [0.60, 0.93]

92.70 [83.86, 101.27]

23

29.88 [26.87, 32.86]

0.16 [0.15, 0.18]

281.96 [276.50, 286.92]

24

1.79 [0.38, 3.85]

0.25 [0.21, 0.31]

53.09 [41.55, 68.08]

18  

Table S11. Running times of NetRecodon to simulate 100 datasets under the model A (see Methods) for different combinations of sample size and sequence length. Running times on an Intel® Xeon® CPU 2.33GHz. Sample size Sequence length (codons)

5

15

111

3min 56.807s

9min 9.986s

14min 0.545s

333

10min 54.541s

23min 38.853s

40min 52.681

999

39min 11.247s

67min 22.337s

119min 32.092s

19  

45

Figure S1. Effect of ignoring recombination (ρ) or adaptation (ω) rates. Model 1 and 2 consider the same priors as Model A but with ρ = 0 and ω = 1, respectively. Full-line indicates the estimated value (y-axis) corresponding to each of the 1000 test datasets generated using model A under a given true value (x-axis). Dotted-line indicates de 95% credible-interval.

20  

Figure S2. Effect of number of simulations and tolerance interval on accuracy of recombination rate estimates using 1000 test datasets generated with models A-D. MISE (yaxis) is plotted against different tolerance intervals (x-axis) in analysis with several number of simulated data [12.5k (full line), 25k (dotted line), 50k (dashed line)].

21  

Figure S3. Effect of number of simulations and tolerance interval on accuracy of adaptation rate estimates using 1000 test datasets generated with models A-D. MISE (y-axis) is plotted against different tolerance intervals (x-axis) in analysis with several number of simulated data [12.5k (full line), 25k (dotted line), 50k (dashed line)].

22  

Figure S4. Effect of number of simulations and tolerance interval on accuracy of substitution rate estimates using 1000 test datasets generated with models A-D. MISE (y-axis) is plotted against different tolerance intervals (x-axis) in analysis with several number of simulated data [12.5k (full line), 25k (dotted line), 50k (dashed line)].

23  

Figure S5. Effect of model misspecification in ABC when using model A. 1000 test datasets were generated using model A and analyzed assuming model A (black) and models B and C (white). True values (x-axis) are plotted against estimated values (y-axis).

24  

 

Figure S6. Effect of model misspecification in ABC when using model B. 1000 test datasets were generated using model B and analyzed assuming model B (black) and models A and C (white). True values (x-axis) are plotted against estimated values (y-axis).

25  

Figure S7. Effect of model misspecification in ABC when using model C. 1000 test datasets were generated using model C and analyzed assuming model C (black) and models A and B (white). True values (x-axis) are plotted against estimated values (y-axis).

26  

Figure S8. Comparison of adaptation and recombination rates estimations using OmegaMap with either a uniform prior or an exponential prior for ρ in 27 test datasets with different true values for ω, ρ and θ (error bars correspond to 95% credible intervals).

27  

Figure S9. Assessing model fit for the 24 HIV-1 datasets. Plot of the first two components of a principal component analysis on the summary statistics used for ABC estimations (as in Additional file 3 of Cornuet et al. (2010). Points in black are a 10,000 random sample of all

28  

the simulated data, points in red are the accepted simulated data, and the yellow point is the observed data for each dataset.

29  

Figure S9. (continuation)

30  

Figure S9. (continuation)

31  

Figure S9. (continuation)

32  

Figure S10. Comparison of adaptation rate estimations using OmegaMap (preliminary analysis based on 2 million MCMC iterations) and our ABC method for 24 HIV-1 published datasets.

33  

References of the Supplementary material Agnihotri  KD,  Tripathy  SP,  Jere  AP,  Kale  SM,  and  Paranjape  RS  (2006).  Molecular  analysis  of  gp41  sequences of HIV type 1 subtype C from India. J Acquir Immune Defic Syndr 41:345‐351.  Bessong  PO,  Mphahlele  J,  Choge  IA,  Obi  LC,  Morris  L,  Hammarskjold  ML,  and  Rekosh  DM  (2006).  Resistance mutational analysis of HIV type 1 subtype C among rural South African drug‐naive  patients  prior  to  large‐scale  availability  of  antiretrovirals.  AIDS  Res  Hum  Retroviruses  22:1306‐1312.  Bhanja P, Sengupta S, Singh NY, Sarkar K, Bhattacharya SK, and Chakrabarti S (2005). Determination  of  gag  and  env  subtypes  of  HIV‐1  detected  among  injecting  drug  users  (IDUs)  in  Manipur,  India: evidence for intersubtype recombination. Virus Res 114:149‐153.  Bobkov A, Kazennova E, Khanina T, Bobkova M, Selimova L, Kravchenko A, Pokrovsky V, and Weber J  (2001).  An  HIV  type  1  subtype  A  strain  of  low  genetic  diversity  continues  to  spread  among  injecting drug users in Russia: study of the new local outbreaks in Moscow and Irkutsk. AIDS  Res Hum Retroviruses 17:257‐261.  Buonaguro L, Petrizzo  A, Tagliamonte  M  et  al. (2008). Molecular and phylogenetic analysis of  HIV‐1  variants circulating in Italy. Infect Agent Cancer 3:13.  Chen  Y,  Caruso  L,  Shen  C,  Wu  H,  Zhou  Y,  and  Gupta  P  (2007).  Lack  of  differences  in  HIV‐1  nef  functional domains in infected chinese blood donors at different stages of disease. AIDS Res  Hum Retroviruses 23:1150‐1154.  Cornuet  JM,  Ravigne  V,  and  Estoup  A  (2010).  Inference  on  population  history  and  model  checking  using  DNA  sequence  and  microsatellite  data  with  the  software  DIYABC  (v1.0).  BMC  Bioinformatics 11:401.  de  Souza  AC,  de  Oliveira  CM,  Rodrigues  CL,  Silva  SA,  and  Levi  JE  (2008).  Short  communication:  Molecular  characterization  of  HIV  type  1  BF  pol  recombinants  from  Sao  Paulo,  Brazil.  AIDS  Res Hum Retroviruses 24:1521‐1525.  Drexler  JF,  de  Souza  Luna  LK,  Pedroso  C,  Pedral‐Sampaio  DB,  Queiroz  AT,  Brites  C,  Netto  EM,  and  Drosten C (2007).  Rates of and reasons for failure of  commercial human immunodeficiency  virus type 1 viral load assays in Brazil. J. Clin. Microbiol. 45:2061‐2063.  Dykes C, Balakrishnan M, Planelles V, Zhu Y, Bambara RA, and Demeter LM (2004). Identification of a  preferred region for recombination and mutation in HIV‐1 gag. Virology 326:262‐279.  Harrington  PR,  Nelson  JA,  Kitrinos  KM,  and  Swanstrom  R  (2007).  Independent  evolution  of  human  immunodeficiency  virus  type  1  env  V1/V2  and  V4/V5  hypervariable  regions  during  chronic  infection. J Virol 81:5413‐5417.  Kitchen  CM,  Lu  J,  Suchard  MA,  Hoh  R,  Martin  JN,  Kuritzkes  DR,  and  Deeks  SG  (2006).  Continued  evolution  in  gp41  after  interruption  of  enfuvirtide  in  subjects  with  advanced  HIV  type  1  disease. AIDS Res Hum Retroviruses 22:1260‐1266.  Kumar  M,  Jain  SK,  Pasha  ST,  Chattopadhaya  D,  Lal  S,  and  Rai  A  (2006).  Genomic  diversity  in  the  regulatory  nef  gene  sequences  in  Indian  isolates  of  HIV  type  1:  emergence  of  a  distinct  subclade and predicted implications. AIDS Res Hum Retroviruses 22:1206‐1219.  Kurle  SN,  Gangakhedkar  RR,  Sen  S,  Hayatnagarkar  SS,  Tripathy  SP,  and  Paranjape  RS  (2007).  Emergence of NNRTI drug resistance mutations after single‐dose nevirapine exposure in HIV  type 1 subtype C‐infected infants in India. AIDS Res Hum Retroviruses 23:682‐685.  Malet  I,  Delelis  O,  Soulie  C  et  al.  (2009).  Quasispecies  variant  dynamics  during  emergence  of  resistance to raltegravir in HIV‐1‐infected patients. J Antimicrob Chemother 63:795‐804.  Ojesina  AI,  Mullins  C, Imade  G, Samuels  J, Sankale  JL, Pam S, Sagay S, Idoko  J, and  Kanki PJ (2007).  Characterization of HIV type 1 reverse transcriptase mutations in infants infected by mothers  who received peripartum nevirapine prophylaxis in Jos,  Nigeria.  AIDS  Res Hum  Retroviruses  23:1587‐1592. 

34  

Quinones‐Mateu  ME,  Albright  JL,  Torre  V,  Reinis  M,  Vandasova  J,  Bruckova  M,  and  Arts  EJ  (1999).  Molecular epidemiology of  HIV type 1  isolates from the Czech Republic: identification of an  env E subtype case. AIDS Res Hum Retroviruses 15:85‐89.  Ramirez de Arellano E,  Martin  C, Soriano V, Alcami J, and  Holguin A (2007). Genetic analysis of  the  long terminal repeat (LTR) promoter region in HIV‐1‐infected individuals with different rates  of disease progression. Virus Genes 34:111‐116.  Rodriguez MA, Shen C, Ratner D, Paranjape RS, Kulkarni SS, Chatterjee R, and Gupta P (2007). Genetic  and functional characterization of the LTR of HIV‐1 subtypes A and C circulating in India. AIDS  Res Hum Retroviruses 23:1428‐1433.  Ryan CE, Gare J, Crowe SM, Wilson K, Reeder JC, and Oelrichs RB (2007). The heterosexual HIV type 1  epidemic  in  Papua  New  Guinea  is  dominated  by  subtype  C.  AIDS  Res  Hum  Retroviruses  23:941‐944.  Sivakumaran H, Wang B, Gill MJ, Beckholdt B, Saksena NK, and Harrich D (2007). Functional relevance  of  nonsynonymous  mutations  in  the  HIV‐1  tat  gene  within  an  epidemiologically‐linked  transmission cohort. Virol J 4:107.  van  Rij  RP,  Worobey  M,  Visser  JA,  and  Schuitemaker  H  (2003).  Evolution  of  R5  and  X4  human  immunodeficiency  virus type 1  gag sequences  in  vivo: evidence for recombination. Virology  314:451‐459.  Williamson C, Morris L, Maughan MF et al. (2003). Characterization and selection of HIV‐1 subtype C  isolates for use in vaccine development. AIDS Res Hum Retroviruses 19:133‐144.  Yabar  CA,  Salvatierra  J,  and  Quijano  E  (2008).  Polymorphism,  recombination,  and  mutations  in  HIV  type 1 gag‐infecting Peruvian male sex workers. AIDS Res Hum Retroviruses 24:1405‐1413.  Zhong  P,  S  BU,  Konings  F  et  al.  (2003).  Genetic  and  biological  properties  of  HIV  type  1  isolates  prevalent  in  villagers  of  the  Cameroon  equatorial  rain  forests  and  grass  fields:  further  evidence of broad HIV type 1 genetic diversity. AIDS Res Hum Retroviruses 19:1167‐1178.   

35