Units of genetic transfer in prokaryotes

Cheong Xin Chan1, Robert G. Beiko2 and Mark A. Ragan1

1 ARC Centre of Excellence in Bioinformatics, The University of Queensland, Brisbane QLD 4072, Australia
2 Department of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, Nova Scotia, Canada B3H 1W5

Abstract

The transfer of genetic material across species (lateral genetic transfer, LGT) contributes to genomic and physiological innovation in prokaryotes. The extent of LGT in prokaryotes has been examined in a number of studies, but the unit of transfer has not been studied in a rigorous manner. Using a rigorous phylogenetic approach, we analysed the units of LGT within families of single-copy genes obtained from 144 fully sequenced prokaryote genomes. A total of 30.3% of these gene families show evidence of LGT. We found that the transfer of gene fragments has been more frequent than the transfer of entire genes, suggesting that the extent of LGT has been underestimated. We found little functional bias between within-gene (fragmentary) and whole-gene (non-fragmentary) genetic transfer, but non-fragmentary transfer has been more frequent into pathogens than into non-pathogens. As gene families that contain probable paralogs were excluded from the current study, our results may still underestimate the extent of LGT; nonetheless this is the most comprehensive study to date of the unit of LGT among prokaryote genomes.

Introduction

In prokaryotes, exchange of genetic material between lineages can counteract the accumulation of deleterious mutations, replacing damaged DNA and helping to maintain genetic variation. In some cases the introgressed genetic material may confer a selective advantage to the new host organism, resulting in positive selection. One well-known example of this is the acquisition and spread of genes encoding antibiotic resistance in highly selective environments [1].
In the process, though, phylogenetic histories become entangled, and the very concept of a genomic or species phylogeny becomes fraught [2-4].

The inheritance of genetic materials in prokaryotes is largely vertical, i.e. transmitted from parent to offspring within a genomic and organismal lineage. However, a number of large-scale studies have identified substantial evidence for LGT. Some of these are based on the topological comparison of phylogenetic trees inferred for individual gene families, e.g. against a reference topology [5, 6]. The most careful such study published so far showed that organisms that are closely related phylogenetically, and/or are found in a common environmental niche, show a tendency to share genetic material via LGT [5]. Other approaches to quantify the extent of LGT include the examination of nucleotide composition or codon usage patterns [7], inference of gene gain and loss events [8, 9], and calculations of ancestral genome sizes under the assumption that in the absence of LGT, present-day diversity must be shared and derived from the common ancestral genome [10].

Because it is difficult to detect introgressed genomic regions that originate from closely related lineages (those with a high degree of sequence identity), the regions most confidently inferred to be of lateral origin may often be those that have come from more distantly related sources, perhaps via transduction through phage [11, 12]. However, sequences can be divergent not only due to temporal separation from their common ancestor, but also because they have become functionally specialised, as is often the case with paralogs. Following duplication, a genetic region can lose its original function (non-functionalisation), gain a novel function (neofunctionalisation), or take on a specialised part of the original function (subfunctionalisation) [13].
As these processes can of course take place not only in the new host lineage but also in candidate donor lineages, paralogy can complicate the inference of lateral transfer (and vice-versa).

In the process of LGT, exogenous genetic materials are first introduced into the recipient cell, and then integrated into the new host via recombination. The integrated genetic material can constitute an entire gene [14], a partial (fragmentary) gene [15, 16], or multiple (entire or fragmentary) adjacent genes [17, 18]. Although several studies have explored the frequency of LGT in prokaryotes and examined individual genes or functions affected [7, 19], none of these has taken a comprehensive rigorous approach to characterising the units of genetic transfer. Given the large number of completely sequenced prokaryote genomes now available in the public domain, such an analysis is now possible and timely.

Here we report results of the first systematic study of the unit of lateral genetic transfer across the diversity of sequenced prokaryote genomes. We characterise the frequencies of within- and entire-gene transfer, and discuss correlations with annotated gene functions and phyletic group. To minimise, to the extent possible, the complications of paralogy and to increase the confidence with which we can infer LGT events, we focus here on families of single-copy genes.

Results

For discovery of LGT events in prokaryote genomes we extracted a subset of the 22437 putatively orthologous families used in our previous large-scale study [5] on LGT in 144 phyletically diverse prokaryote genomes (see Materials and Methods). This subset, 1462 gene families, was restricted to families of single-copy genes, i.e. genes that are sufficiently unique within their respective genomes to make it unlikely that they have arisen by gene duplication. By applying this restriction, we ensure to the extent possible that any recombination we infer in any of these families arises from LGT, not paralogy.
The size distribution of these 1462 gene families is shown in Figure 1.

Figure 1. Size distribution of families of single-copy genes examined in this study. (Histogram; x-axis: number of sequences, 4 to 52; y-axis: number of families.)

Family sizes range from 4 to 52 members; 1229 (84.1%) of the families contain ≤ 10 sequences, with almost a quarter of them (362, 24.7%) of size 4. Gene families of size < 4 were excluded from the analysis, as they do not contribute to meaningful phylogenetic inference. Each of the 1462 families was examined for evidence of LGT, as described below.

Within-gene (fragmentary) genetic transfer

We applied a two-phase strategy [20] for detecting recombination in each of the 1462 families of single-copy genes. In the first phase we used three statistical measures [21-23] to search for evidence of phylogenetic discrepancies (i.e. a recombination signal) within the family; recombination was inferred if at least two of the three tests showed a p-value ≤ 0.10. In the second phase we utilised a Bayesian phylogenetic approach, implemented in the software program DualBrothers [24], to locate recombination breakpoints more precisely in the families that, in the first phase, showed evidence of recombination. DualBrothers employs reversible-jump Markov chain Monte Carlo (MCMC) and a dual multiple change-point model to identify, within a set of sequences, contiguous regions that share a common tree topology, and the boundaries (recombination breakpoints) between regions that show different topologies [24, 25]. Instances of recombination discovered using this approach are thus necessarily fragmentary, as at least one end of a topologically distinct region (i.e. a recombination breakpoint) occurs within the sequence set used in our analysis. Whole-gene transfer escapes detection because it does not result in topological discrepancy along the length of these sequences.
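The first-phase decision rule can be written down in a few lines. The following is a minimal sketch, assuming the three p-values have already been obtained from the recombination tests; the function name, the dictionary keys and the example values are illustrative assumptions, not output of any real run.

```python
from typing import Mapping

ALPHA = 0.10    # p-value threshold stated in the text
MIN_TESTS = 2   # at least two of the three tests must be significant


def passes_first_phase(p_values: Mapping[str, float],
                       alpha: float = ALPHA,
                       min_tests: int = MIN_TESTS) -> bool:
    """Return True if enough tests report p <= alpha (hypothetical helper)."""
    significant = sum(1 for p in p_values.values() if p <= alpha)
    return significant >= min_tests


# Hypothetical p-values for one gene family (not taken from the study).
example = {"NSS": 0.04, "MaxChi": 0.21, "PHI": 0.09}
print(passes_first_phase(example))
```

Families that pass this screen would go on to the second-phase breakpoint search; families that fail it are candidates for the whole-gene comparison described later.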
Our first-phase screening produced evidence of recombination in 426 (29.1%) of these 1462 families. Of these, we found clear evidence of recombination in 286 (19.6%), where “clear evidence” is defined as Bayesian posterior probability (BPP) support ≥ 0.500 for the dominant topology on at least one side of the inferred breakpoint. We found a further 80 cases (5.5%) in which a breakpoint was located, but no sequence region had BPP ≥ 0.500; we classified these as inconclusive. Finally, we observed 60 cases (4.1%) for which recombination was indicated in the initial screening, but no recombination breakpoint could be identified. First-phase screening did not detect recombination in 1036 families (70.9%).

Figure 2 shows the size distribution of these 286 gene families; the most-populated classes are of eight (42 families, 14.7%) and six sequences each (39 families, 13.6%).

Figure 2. Size distribution of gene families that show evidence of fragmentary lateral genetic transfer. The red bars indicate over-represented (frequency more-than-expected) groups; the blue bars indicate under-represented (frequency less-than-expected) groups; and the grey bars indicate groups indifferent (neither over- nor under-represented) in comparison with the 1462-family dataset, at p ≤ 0.05. (Histogram; x-axis: number of sequences, 4 to 52; y-axis: number of families.)

Among these 286 gene families, those with ≤ 5 members are under-represented (p ≤ 0.05) based on their frequencies in the 1462-family dataset, implying that fragmentary LGT in these smallest families either (a) has been less frequent than that into the larger families, or (b) is more difficult to detect. Conversely, almost all gene families of size > 5 are individually over-represented.
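Calls of over- and under-representation such as these rest on the hypergeometric model given in Materials and Methods. The sketch below shows one way such a test could be computed using only the Python standard library; the tail definitions and the observed count `k_observed` are assumptions made for illustration, since the per-size counts are not listed in the text.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> float:
    """P(k) for sampling n items without replacement from N, m of them targets."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def lower_tail(k: int, N: int, m: int, n: int) -> float:
    """P(X <= k): small values suggest under-representation."""
    lo = max(0, n - (N - m))
    return sum(hypergeom_pmf(x, N, m, n) for x in range(lo, k + 1))


def upper_tail(k: int, N: int, m: int, n: int) -> float:
    """P(X >= k): small values suggest over-representation."""
    hi = min(m, n)
    return sum(hypergeom_pmf(x, N, m, n) for x in range(k, hi + 1))


# N, m and n come from the text: 1462 families in total, 362 of size 4,
# and 286 families with fragmentary LGT. k_observed is HYPOTHETICAL.
N, m, n = 1462, 362, 286
k_observed = 40
print("expected count:", n * m / N)
print("P(X <= k):", lower_tail(k_observed, N, m, n))
print("P(X >= k):", upper_tail(k_observed, N, m, n))
```

Exact integer binomials from `math.comb` avoid overflow for populations of this size; a statistics library would be the more usual choice for larger datasets.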
Whole-gene (non-fragmentary) genetic transfer

We next inferred phylogenetic trees for each of the 1096 gene families for which no recombination was inferred (1036 from the first phase, 60 from the second), and compared the inferred topology with a reference tree. The reference (species) tree [5] was generated using Matrix Representation with Parsimony (MRP) [26], yielding a supertree that summarises all well-supported (BPP ≥ 0.95) bipartitions among the 22432 trees of putatively orthologous families in these 144 prokaryote genomes [5]. In the absence of a detectable recombination breakpoint within the gene, phylogenetic discordance between a well-supported gene tree and the reference supertree can most readily be interpreted as lateral transfer of the entire gene (and probably beyond).

We found 157 gene families of single-copy genes (10.7% of the 1462-family dataset) that are topologically incongruent with the reference tree, suggesting that non-fragmentary genetic transfer had affected these families. Their size distribution is depicted in Figure 3. As in the fragmentary transfer cases above (Figure 2), small gene families (size ≤ 5) are under-represented relative to the others, while two-thirds of the families of size ≥ 8 are over-represented (p ≤ 0.05).

Figure 3. Size distribution of gene families that show evidence of non-fragmentary lateral genetic transfer. The red bars indicate over-represented (more-than-expected) groups; the blue bars indicate under-represented (less-than-expected) groups; and the grey bars indicate groups indifferent (neither over- nor under-represented) in comparison with the 1462-family dataset, at p ≤ 0.05. (Histogram; x-axis: number of sequences, 4 to 52; y-axis: number of families.)

In total, among the 1462 families of single-copy genes we found evidence of LGT in 443 (30.3%), of which 286 (64.5%) show within-gene (fragmentary) and 157 (35.4%) whole-gene (non-fragmentary) recombination.
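The counts reported so far can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative. Shares are recomputed from the raw counts, so they may differ in the last digit from the rounded percentages quoted above.

```python
TOTAL_FAMILIES = 1462

clear = 286                  # breakpoint located, BPP >= 0.500 on at least one side
inconclusive = 80            # breakpoint located, no region with BPP >= 0.500
no_breakpoint = 60           # first-phase positive, no breakpoint identified
first_phase_negative = 1036  # no recombination signal in the first phase
whole_gene = 157             # gene tree discordant with the reference supertree

first_phase_positive = clear + inconclusive + no_breakpoint
tree_compared = first_phase_negative + no_breakpoint
lgt_total = clear + whole_gene


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


print("first-phase positive:", first_phase_positive)
print("families compared with the reference tree:", tree_compared)
print("families with evidence of LGT:", lgt_total,
      f"({pct(lgt_total, TOTAL_FAMILIES):.1f}% of {TOTAL_FAMILIES})")
print(f"fragmentary share of LGT families: {pct(clear, lgt_total):.1f}%")
print(f"non-fragmentary share of LGT families: {pct(whole_gene, lgt_total):.1f}%")
print(f"fragmentary : non-fragmentary ratio: {clear / whole_gene:.2f}")
```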
Functional biases of fragmentary and non-fragmentary genetic transfer

We used annotations from the TIGR Comprehensive Microbial Resource (http://cmr.tigr.org/) to assign a functional category (TIGR role category) to the protein associated with each gene in these 443 gene families. Details are provided in Materials and Methods. Figure 4 shows the proportions of proteins in each functional category, broken out by membership in families for which we inferred (a) fragmentary and (b) non-fragmentary LGT.

(Figure 4, bar charts; x-axis: functional category; y-axis: percentage. Category keys recovered from the figure boxes follow.)

Panel (a), fragmentary genetic transfer.
Over-represented: 1 Hypothetical proteins.
Indifferent: 2 Unknown function; 3 Unclassified; 4 Cell envelope; 5 Regulatory functions; 6 Cellular processes; 7 Protein fate; 8 DNA metabolism; 9 Central intermediary metabolism; 10 Purines, pyrimidines, nucleosides, and nucleotides; 11 Mobile and extrachromosomal element functions; 12 Transcription; 13 Signal transduction; 14 Viral functions.
Under-represented: 15 Energy metabolism; 16 Transport and binding proteins; 17 Protein synthesis; 18 Biosynthesis of cofactors, prosthetic groups, and carriers; 19 Amino acid biosynthesis; 20 Fatty acid and phospholipid metabolism.

Panel (b), non-fragmentary genetic transfer.
Over-represented: 1 Hypothetical proteins; 2 Viral functions.
Indifferent: 3 Energy metabolism; 4 Unknown function; 5 Unclassified; 6 Cell envelope; 7 Regulatory functions; 8 Cellular processes; 9 Protein fate; 10 Mobile and extrachromosomal element functions; 11 Signal transduction.
Under-represented: 12 Transport and binding proteins; 13 Protein synthesis; 14 DNA metabolism; 15 Biosynthesis of cofactors, prosthetic groups, and carriers; 16 Central intermediary metabolism; 17 Amino acid biosynthesis; 18 Purines, pyrimidines, nucleosides, and nucleotides; 19 Fatty acid and phospholipid metabolism; 20 Transcription.
Figure 4. Representation of functional categories assigned to protein sequences corresponding to gene families that show evidence of (a) fragmentary and (b) non-fragmentary genetic transfer (yellow bars). The blue bars show the representation of these same functional categories in the full dataset (16639 families of size ≥ 4, 119695 proteins). Categories are numbered (differently for panels a and b) as shown in the boxes. Significance of over- or under-representation is represented by single (p ≤ 0.05) and double asterisks (p ≤ 0.01).

Hypothetical proteins constitute the major over-represented category, both as a proportion of proteins corresponding to families for which we infer within-gene recombination, and as a proportion of proteins corresponding to families in which one or more entire genes has arisen by LGT. A relatively tiny category of proteins related to viral functions (including transduction of DNA by phages) is the only other category similarly over-represented, and it is over-represented only among proteins corresponding to families for which we infer non-fragmentary transfer. In contrast, proteins involved in a range of biosynthetic, metabolic, protein-synthetic, transport and binding functions are significantly under-represented in both within-gene and whole-gene transfer. Proteins that function in energy metabolism are under-represented only in the case of fragmentary transfer (Figure 4a), while those engaged in DNA metabolism, central intermediary metabolism, and transcription are under-represented only for non-fragmentary LGT (Figure 4b).

Phyletic biases of fragmentary and non-fragmentary genetic transfer

We next asked whether within-gene and whole-gene lateral transfer is over- or under-represented in particular taxa. Figure 5 shows the taxonomic origins (NCBI level-4 taxa) of proteins that correspond to families within which we infer fragmentary and non-fragmentary genetic transfer.
For clarity, the corresponding proportions over the entire 144-genome (16639-family) dataset are not shown; over- and under-representation (p ≤ 0.05) are indicated by red and blue colouration respectively.

(Figure 5, horizontal bar charts; x-axis: percentage, 0 to 60; y-axis: taxon group. Taxa are listed below in the order in which they appear in each panel.)
Panel (a), fragmentary genetic transfer: Thermotogales, Aquificales, Thermus/Deinococcus group, Chlorobi, Bacteroidetes, Spirochaetales, Planctomycetes, Crenarchaeota, Chlamydiales, Euryarchaeota, High G+C Firmicutes, Cyanobacteria, Low G+C Firmicutes, Proteobacteria.
Panel (b), non-fragmentary genetic transfer: Aquificales, Chlorobi, Thermus/Deinococcus group, Bacteroidetes, Thermotogales, Planctomycetes, Spirochaetales, Crenarchaeota, Chlamydiales, Euryarchaeota, Cyanobacteria, High G+C Firmicutes, Low G+C Firmicutes, Proteobacteria.

Figure 5. Taxonomic origins (NCBI level-4 taxa) of genes in families that show evidence of (a) fragmentary and (b) non-fragmentary genetic transfer. Over-representation relative to the 16639-family dataset is shown in red; under-representation is shown in blue; grey indicates that there is neither over- nor under-representation at p ≤ 0.05.

Our results reveal that families of single-copy genes affected by LGT contain a significantly (p ≤ 0.05) higher-than-expected proportion of genes originating from High-G+C Firmicutes, Planctomycetes and Spirochaetales. This is true for both the fragmentary and non-fragmentary transfer cases. Other taxonomic groups are over-represented in only the fragmentary transfer case, or only the non-fragmentary, but not both. The data and our approach do not allow us to extrapolate with certainty, but to the extent that these single-gene families are representative of complete genomes, the Cyanobacteria, Chlamydiales and crenarchaeotes appear to be relatively receptive to introgression of gene fragments, whereas euryarchaeotes, Chlorobi, and members of Thermotoga and Aquifex have been relatively receptive to transfer of entire genes.
We note that many of the latter taxa are extremophiles; it may therefore bear further analysis whether whole-gene transfer is more frequent than fragmentary genetic transfer among organisms that live in, for example, high-temperature and highly saline environments. Low-G+C Firmicutes are under-represented in families affected by both types of LGT, fragmentary and non-fragmentary. On the other hand Proteobacteria, the high-level taxon most abundantly represented in our dataset, is significantly under-represented in families affected by whole-gene transfer.

Table 1 shows the species that are significantly over-represented in families within which we infer recombination in this dataset. Many over-represented species are pathogens.

Table 1. Species that are over-represented (p ≤ 0.05) in gene families that show evidence of genetic transfer, in comparison with their contribution to the 16639-family dataset. Species are listed in descending order, from the most over-represented to the least over-represented, separately for fragmentary and non-fragmentary genetic transfer. The species represented in red are pathogens. The four species listed separately at the bottom of the Table are over-represented in both fragmentary and non-fragmentary cases.

Fragmentary genetic transfer Non-fragmentary genetic transfer
Nostoc sp. PCC 7120 Salmonella typhimurium LT2 Streptomyces avermitilis MA-4680 Salmonella enterica subsp. enterica serovar Typhi Ty2 Shewanella oneidensis MR-1 Mycobacterium tuberculosis CDC1551 Yersinia pestis CO92 Nitrosomonas europaea ATCC 19718 Synechococcus sp. WH 8102 Yersinia pestis KIM Thermosynechococcus elongatus BP-1 Methanothermobacter thermautotrophicus Pasteurella multocida Leptospira interrogans serovar lai str. 56601 Methanococcus jannaschii Thermotoga maritima Methanopyrus kandleri AV19 Fusobacterium nucleatum subsp. nucleatum ATCC 25586 Halobacterium sp.
NRC-1 Chlorobium tepidum TLS Chlamydophila pneumoniae CWL029 Coxiella burnetii RSA 493 Mycoplasma pulmonis Aquifex aeolicus Treponema pallidum Streptococcus pyogenes MGAS315 Photorhabdus luminescens subsp. laumondii TTO1 Haemophilus ducreyi 35000HP Pirellula sp. Borrelia burgdorferi

Discussion

Our results demonstrate that in these families of single-copy genes from diverse prokaryotes, transfer of genetic material is largely vertical, but a significant proportion of gene families (30.3%) show clear evidence of LGT. In previous studies, estimates of the frequency of LGT range widely: 2% [27], 13% [5], 16% [8], 60% [6], to as high as 90% [28] of genes or bipartitions. In a recent study based on inference of ancestral genome sizes [29], all genes in prokaryotes were proposed to have undergone LGT. Several factors contribute to this range of estimates, including but not limited to methodological approach and sampling of genes and genomes. Different methodologies can produce not only different estimates of the extent of LGT, but incompatible lists of lateral genes, on the same dataset [30].

The phylogenetic approach to detection of LGT is firmly grounded in biological principle (the same principles as those responsible for inheritance and diversification of lineages) and can be carried out in a statistically rigorous manner, although systematic biases, e.g. surrounding the model of sequence change, may still intrude. A limitation of the phylogenetic approach as adopted in previous studies [5, 6], however, has been the intrinsic assumption that the unit of genetic transfer is a whole gene. Topological discordance between a gene-family tree and the reference topology has been interpreted as prima facie evidence that a gene has been transferred from one lineage into another.
Here, we have employed a phylogenetic approach but without restricting the unit of transfer to be a whole gene, and have shown that among these diverse prokaryotic species, LGT can involve the recombination of a fragment smaller than a gene and/or the interruption of an existing gene. Indeed, over the set of families of single-copy genes in these genomes, within-gene transfer is about twice as frequent as the transfer of entire genes (or larger).

The dataset used in this study is a subset of that used by Beiko et al. [5], who concluded that some 13-14% of bipartitions are affected by LGT. Here we report that 30% of families are affected by LGT. These two numbers are not directly comparable, for three reasons: (1) our present subset is non-representative, comprising only families of single-copy genes; (2) our present dataset is smaller, having 74% as many families and 54% as many sequences; and (3) we base our analyses on gene families, not on bipartitions, as genetic transmission involving within-gene recombination can only partially be mapped into the paradigm of bipartitions and subtrees. Neither we nor Beiko et al. [5] attempted to estimate LGT in paralogous families, or among very closely related genomes. Again we are reminded of the multifaceted trade-offs between methodological rigour and the goal of a more-global estimate of frequency of LGT in prokaryotes.

Extrapolation of results from families of single-copy genes to entire genomes might be on the least-solid ground in Proteobacteria, which in other studies have been found to show high rates of LGT [3, 31]. Gene duplications are more common among Proteobacteria than among many other prokaryotes [32]. However, as discussed elsewhere, in this study we excluded families with duplicates in individual genomes.

Here we have also shown that LGT is less evident in small gene families (N ≤ 5) than in larger gene families.
In our data, gene family size is correlated with degree of sequence divergence, as many small families represent closely related organisms (genomes) that have only recently diverged from a common ancestor [33]. As it is difficult to detect genetic transfer events involving highly similar sequences, the frequency of LGT among small gene families can be underestimated. Conversely, larger families typically are constituted by representatives from phyletically more-diverse organisms with a more-ancient common ancestor, making it easier to detect phylogenetic discrepancies and hence to infer LGT [34].

We observed only modest differences in functional bias between fragmentary and non-fragmentary transfer in families of single-copy genes; hypothetical proteins are very significantly (p ≤ 0.01) over-represented in both cases. Gene products not classified by the TIGR Comprehensive Microbial Resource, or of unknown function, are not significantly over- (or under-) represented relative to our full dataset, suggesting that an exogenous or hybrid origin does not significantly decrease (or increase) annotation of a functional role category.

We also found that genes encoding viral functions are more likely to be laterally transferred in their entirety than as fragments. A similar trend is observed for pathogenic bacteria, which are prominent among the organisms that contribute disproportionately to families affected by non-fragmentary transfer. Genes that encode virulence factors (e.g. toxins, adhesins and invasins) are known to be commonly located on mobile genetic elements such as plasmids and transposons, or in specific genomic regions called pathogenicity islands [35, 36].

We observed that genes annotated as involved in DNA metabolism, transcription, and protein synthesis are under-represented among families for which we infer whole-gene LGT, although of these only the protein synthesis functional category is also under-represented in fragmentary transfer.
The complexity hypothesis [37] postulates that “informational” proteins involved in processes related to transcription and translation, including many in these three categories, typically function in the cell within large multi-protein complexes and hence must interact in finely tuned ways with many other biomolecules, and as a consequence their genes are less likely to be susceptible to transfer via LGT than are genes encoding the putatively less-interactive “operational” proteins. However, the susceptibility of the genomes to transfer of “informational” genes can still be underestimated, especially among highly similar sequences, in which detection of recombination is difficult. Our results do not speak directly to the validity of this hypothesis, but suggest that any bias against transfer of informational genes may be expressed more strongly in the case of whole-gene than within-gene transfer.

Materials and Methods

Data

From 144 completely sequenced prokaryote genomes we generated 22437 putatively orthologous protein families of size N ≥ 4 via a hybrid clustering approach [38]. We aligned these families [5] and validated the alignments using a pattern-centric objective function [19]. These protein sequence alignments were converted into DNA sequence alignments by retrieving the corresponding nucleotide sequences from GenBank (http://www.ncbi.nlm.nih.gov/) and arranging the nucleotide triplets to parallel exactly the protein alignment in each case, yielding 18809 gene families (N ≥ 4) containing a total of 139707 genes. We require N ≥ 4 because 4 is the minimum size that can yield distinct topologies; however, this is true only if every member of the family has a unique sequence. Therefore we identified sets of identical sequences and removed (at random) all but one copy of each, yielding 16639 families (N ≥ 4) and 119695 genes.
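The two data-preparation steps just described, threading nucleotide triplets onto a protein alignment and collapsing identical sequences, can be illustrated as follows. This is a minimal sketch, not the pipeline used in the study: the function names are hypothetical, the gap character is assumed to be "-", and duplicates are resolved by keeping the first copy seen, whereas the study removed copies at random.

```python
from typing import Dict


def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Lay the codons of `cds` along an aligned protein, 3 gap chars per gap."""
    n_residues = sum(1 for c in aligned_protein if c != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for this protein")
    out = []
    pos = 0
    for c in aligned_protein:
        if c == gap:
            out.append(gap * 3)           # one alignment gap spans a whole codon
        else:
            out.append(cds[pos:pos + 3])  # next codon of the coding sequence
            pos += 3
    return "".join(out)


def drop_identical(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep one representative per distinct sequence (first seen is kept)."""
    seen = set()
    kept = {}
    for name, seq in seqs.items():
        if seq not in seen:
            seen.add(seq)
            kept[name] = seq
    return kept


# Toy example with made-up sequences.
print(thread_codons("M-K", "ATGAAA"))
print(drop_identical({"a": "ATGAAA", "b": "ATGAAA", "c": "ATGAAG"}))
```

A family would then be retained only if at least four distinct sequences remain after this step.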
In every case, the identical copies removed from consideration represented organisms either in the same genus (99.7%), or within the Escherichia-Shigella species pair (0.3%); many represent different strains within the same species (89.1%). It is possible that some of these represent (within-gene or whole-gene) LGT, but such cases could not have been detected by our (or any other existing) approach in any case.

To minimise, to the extent possible, erroneous inference arising from the presence of paralogous sequences within these families, we further restricted our dataset to those 1462 families for which each member represents a different genome. In this dataset, these families of single-copy genes range in size from 4 to 52 (Figure 1).

Detecting fragmentary genetic transfer

We applied a two-phase strategy for detecting recombination [20] in this study. In the first phase, PhiPack [21] was used to detect the occurrences of recombination based on discrepancies of phylogenetic signal within the sequence alignments. The program incorporates p-values of the NSS statistics in Reticulate [23], the MaxChi test [22], and PHI [21]. Datasets with at least two of the three p-values ≤ 0.10 were considered as positive for recombination. In the second phase, for each sequence set that showed evidence of recombination (above), a Bayesian phylogenetic approach was used to delineate recombination breakpoints; this was implemented in DualBrothers [24] run with MCMC chain length = 2500000, burnin = 500000, window_length = 5, and Green’s constant C = 0.25.
The tree search space for each run of DualBrothers is defined by a list of unrooted tree topologies inferred using MRBAYES [39], for which we used parameter settings MCMC chain length = 2500000 and burnin = 500000 (nucmodel = 4by4, rates = gamma, ngammacat = 4) on smaller partitions of the sequence set, via a window-sliding approach (window length = 100, sliding size = 50, unit in alignment position); tree topologies within a 90% Bayesian confidence interval were included, with 1000 trees maximum. Gene families that show evidence of recombination are inferred to have undergone one or more events of fragmentary genetic transfer.

Detecting non-fragmentary genetic transfer

For each gene family for which no evidence of recombination was found in the first-phase screen, and for those positive in the first-phase screen but for which no recombination breakpoint could be detected, we inferred a Bayesian phylogenetic tree (see below) and compared its topology against that of a reference tree; whole-gene (non-fragmentary) genetic transfer was inferred if the topologies were significantly discordant. These individual gene-family trees were inferred from DNA alignments (above). As reference we used the MRP supertree [26] computed from all well-supported (BPP ≥ 0.95) bipartitions among all individual protein-family trees in these 144 genomes [5]. The individual gene-family trees were inferred using MRBAYES [39] with MCMC chain length = 2500000, burnin = 500000, and model = K2P [40]. Possible discordance between individual gene-family trees and the reference supertree topology was assessed under likelihood models captured in the Shimodaira-Hasegawa test [41], the one- and two-sided Kishino-Hasegawa tests [42, 43], and expected likelihood weights [44], all as implemented in Tree-Puzzle 5.1 [45].
Discordance was inferred if any tree was rejected by more than two of the four ML tests at a confidence level of 95% (p ≤ 0.05), and was taken as prima facie evidence of whole-gene (non-fragmentary) lateral genetic transfer.

Functional analysis of gene families

Functional information for each protein sequence was retrieved from the Comprehensive Microbial Resource (CMR) at The Institute for Genomic Research website (http://cmr.tigr.org/), based on TIGR role identifiers and categorisation at Level 1. Over- or under-representation of functional categories and taxonomic groups was based on the probability of observing a defined number of target groups (or categories) in a subsample, given a process of sampling without replacement from the whole dataset (as defined in each case: see text) under a hypergeometric distribution [46]. The probability of observing x members of a particular target category is described as:

\[
P(k = x) = f(k; N, m, n) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}
\]

in which N is the total population size, m is the size of the target category within the population, n is the total size of the subsample, and k is the size of the target category within the subsample.

Acknowledgements

This study was supported by Australian Research Council (ARC) grant CE0348221. We thank Aaron Darling and Vladimir Minin for valuable advice on the use of DualBrothers. CXC was supported by a University of Queensland UQIPRS scholarship.

References

1. Grundmann H, Aires-de-Sousa M, Boyce J and Tiemersma E (2006). Emergence and resurgence of meticillin-resistant Staphylococcus aureus as a public-health threat. Lancet, 368: 874-885.
2. Doolittle WF (1999). Phylogenetic classification and the universal tree. Science, 284: 2124-2128.
3. Gogarten JP, Doolittle WF and Lawrence JG (2002). Prokaryotic evolution in light of gene transfer. Mol. Biol. Evol., 19: 2226-2238.
4. Wolf YI, Rogozin IB, Grishin NV and Koonin EV (2002). Genome trees and the Tree of Life.
Trends Genet., 18: 472-479.
5. Beiko RG, Harlow TJ and Ragan MA (2005). Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. U. S. A., 102: 14332-14337.
6. Lerat E, Daubin V, Ochman H and Moran NA (2005). Evolutionary origins of genomic repertoires in bacteria. PLoS Biol., 3: e130.
7. Nakamura Y, Itoh T, Matsuda H and Gojobori T (2004). Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat. Genet., 36: 760-766.
8. Kunin V and Ouzounis CA (2003). The balance of driving forces during genome evolution in prokaryotes. Genome Res., 13: 1589-1594.
9. Hao W and Golding GB (2006). The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res., 16: 636-643.
10. Doolittle WF, Boucher Y, Nesbø CL, Douady CJ, Andersson JO and Roger AJ (2003). How big is the iceberg of which organellar genes in nuclear genomes are but the tip? Philos. Trans. R. Soc. Lond. B Biol. Sci., 358: 39-57.
11. Daubin V, Lerat E and Perrière G (2003). The source of laterally transferred genes in bacterial genomes. Genome Biol., 4: R57.
12. Pedulla ML, Ford ME, Houtz JM, Karthikeyan T, Wadsworth C, Lewis JA, Jacobs-Sera D, Falbo J, Gross J, Pannunzio NR, Brucker W, Kumar V, Kandasamy J, Keenan L, Bardarov S, Kriakov J, Lawrence JG, Jacobs WR, Jr., Hendrix RW and Hatfull GF (2003). Origins of highly mosaic mycobacteriophage genomes. Cell, 113: 171-182.
13. Lynch M and Conery JS (2000). The evolutionary fate and consequences of duplicate genes. Science, 290: 1151-1155.
14. Hartl DL, Lozovskaya ER and Lawrence JG (1992). Nonautonomous transposable elements in prokaryotes and eukaryotes. Genetica, 86: 47-53.
15. Inagaki Y, Susko E and Roger AJ (2006). Recombination between elongation factor 1-alpha genes from distantly related archaeal lineages. Proc. Natl. Acad. Sci. U. S. A., 103: 4528-4533.
16. Bork P and Doolittle RF (1992). Proposed acquisition of an animal protein domain by bacteria. Proc. Natl. Acad. Sci. U. S. A., 89: 8990-8994.
17. Omelchenko MV, Makarova KS, Wolf YI, Rogozin IB and Koonin EV (2003). Evolution of mosaic operons by horizontal gene transfer and gene displacement in situ. Genome Biol., 4: 55.
18. Igarashi N, Harada J, Nagashima S, Matsuura K, Shimada K and Nagashima KVP (2001). Horizontal transfer of the photosynthesis gene cluster and operon rearrangement in purple bacteria. J. Mol. Evol., 52: 333-341.
19. Beiko RG, Chan CX and Ragan MA (2005). A word-oriented approach to alignment validation. Bioinformatics, 21: 2230-2239.
20. Chan CX, Beiko RG and Ragan MA (2007). A two-phase strategy for detecting recombination in nucleotide sequences. South African Computer Journal, 38: 20-27.
21. Bruen TC, Philippe H and Bryant D (2006). A simple and robust statistical test for detecting the presence of recombination. Genetics, 172: 2665-2681.
22. Maynard Smith J (1992). Analyzing the mosaic structure of genes. J. Mol. Evol., 34: 126-129.
23. Jakobsen IB and Easteal S (1996). A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. CABIOS, 12: 291-295.
24. Minin VN, Dorman KS, Fang F and Suchard MA (2005). Dual multiple change-point model leads to more accurate recombination detection. Bioinformatics, 21: 3034-3042.
25. Suchard MA, Weiss RE, Dorman KS and Sinsheimer JS (2003). Inferring spatial phylogenetic variation along nucleotide sequences: a multiple change-point model. J. Am. Stat. Assoc., 98: 427-437.
26. Ragan MA (1992). Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol., 1: 53-58.
27. Ge F, Wang LS and Kim J (2005). The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol., 3: e316.
28. Mirkin BG, Fenner TI, Galperin MY and Koonin EV (2003). Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol., 3: 2.
29. Dagan T and Martin W (2007). Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc. Natl. Acad. Sci. U. S. A., 104: 870-875.
30. Ragan MA (2001). On surrogate methods for detecting lateral gene transfer. FEMS Microbiol. Lett., 201: 187-191.
31. Lerat E, Daubin V and Moran NA (2003). From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-proteobacteria. PLoS Biol., 1: e19.
32. Gevers D, Vandepoele K, Simillon C and Van de Peer Y (2004). Gene duplication and biased functional retention of paralogs in bacterial genomes. Trends Microbiol., 12: 148-154.
33. Pushker R, Mira A and Rodríguez-Valera F (2004). Comparative genomics of gene-family size in closely related bacteria. Genome Biol., 5: R27.
34. Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson LD, Nelson WC, Ketchum KA, McDonald L, Utterback TR, Malek JA, Linher KD, Garrett MM, Stewart AM, Cotton MD, Pratt MS, Phillips CA, Richardson D, Heidelberg J, Sutton GG, Fleischmann RD, Eisen JA, White O, Salzberg SL, Smith HO, Venter JC and Fraser CM (1999). Evidence for lateral gene transfer between archaea and bacteria from genome sequence of Thermotoga maritima. Nature, 399: 323-329.
35. Ilyina TS and Romanova YM (2002). Bacterial genomic islands: organization, function, and evolutionary role. Molecular Biology, 36: 171-179.
36. Hacker J, Hochhut B, Middendorf B, Schneider G, Buchrieser C, Gottschalk G and Dobrindt U (2004). Pathogenomics of mobile genetic elements of toxigenic bacteria. Int. J. Med. Microbiol., 293: 453-461.
37. Jain R, Rivera MC and Lake JA (1999). Horizontal gene transfer among genomes: the complexity hypothesis. Proc. Natl. Acad. Sci. U. S. A., 96: 3801-3806.
38. Harlow TJ, Gogarten JP and Ragan MA (2004). A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics, 5: 45.
39. Huelsenbeck JP and Ronquist F (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17: 754-755.
40. Kimura M (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16: 111-120.
41. Shimodaira H and Hasegawa M (1999). Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol., 16: 1114-1116.
42. Kishino H and Hasegawa M (1989). Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol., 29: 170-179.
43. Goldman N, Anderson JP and Rodrigo AG (2000). Likelihood-based tests of topologies in phylogenetics. Syst. Biol., 49: 652-670.
44. Strimmer K and Rambaut A (2002). Inferring confidence sets of possibly misspecified gene trees. Proceedings of the Royal Society of London B: Biological Sciences, 269: 137-142.
45. Strimmer K and von Haeseler A (1996). Quartet puzzling: A quartet maximum-likelihood method for reconstructing tree topologies. Mol. Biol. Evol., 13: 964-969.
46. Johnson NL, Kotz S and Kemp AW (1992). Univariate Discrete Distributions. 2nd ed. New York: Wiley.
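
Note on the hypergeometric probability

The hypergeometric probability described under "Functional analysis of gene families" can be illustrated with a minimal Python sketch. It is not the code used in the study. The function names, the tail-probability helpers and the example counts are assumptions introduced here for illustration; the text gives only the point probability P(k = x), and summing it into one-sided tails is one common way (assumed here) to assess over- or under-representation.

```python
from math import comb


def hypergeom_pmf(x, N, m, n):
    """P(k = x): probability of x target items in a subsample of size n,
    drawn without replacement from N items of which m are targets."""
    lowest = max(0, n - (N - m))   # fewest targets the subsample can hold
    highest = min(m, n)            # most targets the subsample can hold
    if x < lowest or x > highest:
        return 0.0
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)


def over_representation_p(x, N, m, n):
    """Upper tail P(k >= x). Assumption: a one-sided tail is used as the
    measure of over-representation."""
    return sum(hypergeom_pmf(i, N, m, n) for i in range(x, min(m, n) + 1))


def under_representation_p(x, N, m, n):
    """Lower tail P(k <= x). Assumption: a one-sided tail is used as the
    measure of under-representation."""
    lowest = max(0, n - (N - m))
    return sum(hypergeom_pmf(i, N, m, n) for i in range(lowest, x + 1))


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study: 10 gene families in
    # total, 3 in the target category, and a subsample of 4 families.
    N, m, n = 10, 3, 4
    print(hypergeom_pmf(1, N, m, n))          # C(3,1) * C(7,3) / C(10,4) = 0.5
    print(over_representation_p(2, N, m, n))
    print(under_representation_p(1, N, m, n))
```

Exact integer binomial coefficients from `math.comb` avoid the floating-point overflow that factorial-based formulas can hit at large N; for dataset-scale counts, a statistics library's hypergeometric routines would be a reasonable alternative.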