Alignment-free methods, in which shared properties of sub-sequences (e.g. identity or match length) are extracted and used to compute a distance matrix, have recently been explored for phylogenetic inference. However, the scalability and robustness of these methods to key evolutionary processes remain to be investigated. Here, using simulated sequence sets of various sizes in both nucleotides and amino acids, we systematically assess the accuracy of phylogenetic inference using an alignment-free approach, based on D2 statistics, under different evolutionary scenarios. We find that compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation. Across diverse empirical datasets, the alignment-free methods perform well for sequences sharing low divergence, at greater computation speed. Our findings provide strong evidence for the scalability and the potential use of alignment-free methods in large-scale phylogenomics.
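As a rough illustration of the approach described above, the sketch below computes the plain D2 statistic (the sum, over all k-mers, of the product of their counts in two sequences) and turns it into a pairwise dissimilarity matrix. The function names and the cosine-style normalisation are assumptions made for illustration; the study may well use different D2 variants and a different distance transformation.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """Plain D2 statistic: sum over shared k-mers of the product of their counts."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


def d2_dissimilarity(x, y, k):
    """Illustrative normalisation (an assumption, not the published formula):
    1 - D2(x, y) / sqrt(D2(x, x) * D2(y, y))."""
    denom = sqrt(d2(x, x, k) * d2(y, y, k))
    return 1.0 - d2(x, y, k) / denom if denom else 1.0


def distance_matrix(seqs, k):
    """Pairwise dissimilarities, ready for a distance-based tree method."""
    return [[d2_dissimilarity(a, b, k) for b in seqs] for a in seqs]
```

A matrix produced this way could be passed to any distance-based tree-building method such as neighbour-joining; the choice of k is left to the user.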
From 1971 to 1985, Carl Woese and colleagues generated oligonucleotide catalogs of 16S/18S rRNAs from more than 400 organisms. Using these incomplete and imperfect data, Carl and his colleagues developed unprecedented insights into the structure, function, and evolution of the large RNA components of the translational apparatus. They recognized a third domain of life, revealed the phylogenetic backbone of bacteria (and its limitations), delineated taxa, and explored the tempo and mode of microbial evolution. For these discoveries to have stood the test of time, oligonucleotide catalogs must carry significant phylogenetic signal; they thus bear re-examination in view of the current interest in alignment-free phylogenetics based on k-mers. Here we consider the aims, successes, and limitations of this early phase of molecular phylogenetics. We computationally generate oligonucleotide sets (e-catalogs) from 16S/18S rRNA sequences, calculate pairwise distances between them based on D2 statistics, compute distance trees, and compare their performance against alignment-based and k-mer trees. Although the catalogs themselves were superseded by full-length sequences, this stage in the development of computational molecular biology remains instructive for us today.
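The generation of an e-catalog can be sketched as an in silico digestion. The example below assumes cleavage after every G residue, mimicking ribonuclease T1, the enzyme used to produce the original oligonucleotide catalogs; the function name, the minimum-length threshold of 6, and the use of a set rather than counts are hypothetical choices for illustration, not details taken from the abstract.

```python
import re


def e_catalog(rna, min_len=6):
    """Return the sorted set of fragments obtained by cutting after each G.

    min_len is a hypothetical cut-off: shorter fragments recur by chance
    and carry little phylogenetic information.
    """
    seq = rna.upper().replace("T", "U")  # accept DNA-style input
    # Each fragment ends at a G; a trailing piece without G is kept as well.
    fragments = re.findall(r"[^G]*G|[^G]+$", seq)
    return sorted({f for f in fragments if len(f) >= min_len})
```

For example, `e_catalog("AUCCUAGCUUAACGAAG", min_len=3)` yields the three fragments `AAG`, `AUCCUAG` and `CUUAACG`.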
Sugarcane is a globally important food, biofuel and biomaterials crop. High nitrogen (N) fertilizer rates aimed at increasing yield often result in environmental damage because of excess and inefficient application. Inoculation with diazotrophic bacteria is an attractive option for reducing N fertilizer needs. However, the efficacy of bacterial inoculants is variable, and their effective formulation remains a knowledge frontier. Here, we take a new approach to investigating diazotrophic bacteria associated with roots using culture-independent microbial community profiling of a commercial sugarcane variety (Q208A) in a field setting. We first identified bacteria that were markedly enriched in the rhizosphere to guide isolation and then tested putative diazotrophs for the ability to colonize axenic sugarcane plantlets (Q208A) and promote growth in suboptimal N supply.
Algae and plants rely on the plastid (e.g., chloroplast) to carry out photosynthesis. This organelle traces its origin to a cyanobacterium that was captured over a billion years ago by a single-celled protist. Three major photosynthetic lineages (the green algae and plants [Viridiplantae], red algae [Rhodophyta], and Glaucophyta) arose from this primary endosymbiotic event and are putatively united as the Plantae (also known as Archaeplastida). Glaucophytes comprise a handful of poorly studied species that retain ancestral features of the cyanobacterial endosymbiont such as a peptidoglycan cell wall. Testing the Plantae hypothesis and elucidating glaucophyte evolution has in the past been thwarted by the absence of complete genome data from these taxa. Furthermore, multi-gene phylogenetics has fueled controversy about the frequency of primary plastid acquisitions during eukaryote evolution because these approaches have generally failed to recover Plantae monophyly. Here we review some of the key insights about Plantae evolution that were gleaned from a recent analysis of a draft genome assembly from Cyanophora paradoxa (Glaucophyta). We present results that conclusively demonstrate Plantae monophyly. We also describe new insights that were gained into peptidoglycan biosynthesis in glaucophytes and the carbon concentrating mechanism (CCM) in C. paradoxa plastids.
The recently published genome of the unicellular red alga Porphyridium purpureum revealed a gene-rich, intron-poor species, which is surprising for a free-living mesophile. Of the 8,355 predicted protein-coding regions, up to 773 (9.3%) were implicated in horizontal genetic transfer (HGT) events involving other prokaryote and eukaryote lineages. A much smaller number, up to 174 (2.1%), showed unambiguous evidence of vertical inheritance. Together with other red algal genomes, nearly all published in 2013, these data provide an excellent platform for studying diverse aspects of algal biology and evolution. This novel information will help investigators test existing hypotheses about the impact of endosymbiosis and HGT on algal evolution and enable comparative analysis within a more-refined, hypothesis-driven framework that extends beyond HGT. Here we explore the impacts of this infusion of red algal genome data on addressing questions regarding the complex nature of algal evolution and highlight the need for scalable phylogenomic approaches to handle the forthcoming deluge of sequence information.
Diatoms are highly successful marine and freshwater algae that contribute up to 20% of global carbon fixation. These species are leading candidates for biofuel production owing to ease of culturing and high fatty acid content. To assist in strain improvement and downstream applications for potential use as a biofuel, it is important to understand the evolution of lipid biosynthesis in diatoms. The evolutionary history of diatoms is, however, complicated by likely multiple endosymbioses involving the capture of foreign cells and horizontal gene transfer into the host genome. Using a phylogenomic approach, we assessed the evolutionary history of 12 diatom genes putatively encoding functions related to lipid biosynthesis. We found evidence of gene transfer, likely from a green algal source, for seven of these genes, with the remaining genes showing either vertical inheritance or evolutionary histories too complicated to interpret given current genome data. The functions of horizontally transferred genes encompass all aspects of lipid biosynthesis (initiation, biosynthesis, and desaturation of fatty acids) as well as fatty acid elongation, and are not restricted to plastid-targeted proteins. Our findings demonstrate that the transfer, duplication, and subfunctionalization of genes were key steps in the evolution of lipid biosynthesis in diatoms and other photosynthetic eukaryotes. This target pathway for biofuel research is highly chimeric and, surprisingly, our results suggest that research done on related genes in green algae may have application to diatom models.
A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
RNAi (RNA interference) relies on the production of small RNAs (sRNAs) from double-stranded RNA and comprises a major pathway in eukaryotes to restrict the propagation of selfish genetic elements. Amplification of the initial RNAi signal by generation of multiple secondary sRNAs from a targeted mRNA is catalyzed by RNA-dependent RNA polymerases (RdRPs). This phenomenon is known as transitivity and is particularly important in plants to limit the spread of viruses. Here we describe, using a genome-wide approach, the distribution of sRNAs in the glaucophyte alga Cyanophora paradoxa. C. paradoxa is a member of the supergroup Plantae (also known as Archaeplastida) that includes red algae, green algae, and plants. The ancient (>1 billion years ago) split of glaucophytes within Plantae suggests that C. paradoxa may be a useful model to learn about the early evolution of RNAi in the supergroup that ultimately gave rise to plants. Using next-generation sequencing and bioinformatic analyses we find that sRNAs in C. paradoxa are preferentially associated with mRNAs, including a large number of transcripts that encode proteins arising from different functional categories. This pattern of exonic sRNAs appears to be a general trend that affects a large fraction of mRNAs in the cell. In several cases we observe that sRNAs have a bias for a specific strand of the mRNA, including many instances of antisense predominance. The genome of C. paradoxa encodes four sequences that are homologous to RdRPs in Arabidopsis thaliana. We discuss the possibility that exonic sRNAs in the glaucophyte may be secondarily derived from mRNAs by the action of RdRPs. If this hypothesis is confirmed, then transitivity may have had an ancient origin in Plantae.
The limited knowledge we have about red algal genomes comes from the highly specialized extremophiles, Cyanidiophyceae. Here, we describe the first genome sequence from a mesophilic, unicellular red alga, Porphyridium purpureum. The 8,355 predicted genes in P. purpureum, hundreds of which are likely to be implicated in a history of horizontal gene transfer, reside in a genome of 19.7 Mbp with 235 spliceosomal introns. Analysis of light-harvesting complex proteins reveals a nuclear-encoded phycobiliprotein in the alga. We uncover a complex set of carbohydrate-active enzymes, identify the genes required for the methylerythritol phosphate pathway of isoprenoid biosynthesis, and find evidence of sexual reproduction. Analysis of the compact, function-rich genome of P. purpureum suggests that ancestral lineages of red algae acted as mediators of horizontal gene transfer between prokaryotes and photosynthetic eukaryotes, thereby significantly enriching genomes across the tree of photosynthetic life.
Background
Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches.

Results
Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment.

Conclusions
Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.
Abstract: Thanks to advances in next-generation technologies, genome sequences are now being generated at breadth (e.g. across environments) and depth (thousands of closely related strains, individuals or samples) unimaginable only a few years ago.
De novo genome and transcriptome data from a number of marine algal species have recently become available, ranging from red, green and brown algae to other photosynthetic eukaryotes, e.g. diatoms and dinoflagellates. Phylogenomic approaches are widely adopted to decipher the evolutionary relationships among diverse lineages. Novel algal genomes therefore provide an exciting analysis platform for understanding algal biology, ecophysiology and diversity, and at a broader scale, eukaryote evolution. In this brief communication, I highlight major findings from recent phylogenomic studies of marine algae and their impact on the research field. I discuss the current trends and future directions of phylogenomics, and how we can apply this approach in studying marine diversity in the South China Sea.
The red seaweed Porphyra (Bangiophyceae) and related Bangiales have global economic importance. Here, we report the analysis of a comprehensive transcriptome comprising ca. 4.7 million expressed sequence tag (EST) reads from P. umbilicalis (L.) J. Agardh and P. purpurea (Roth) C. Agardh (ca. 980 Mbp of data generated using 454 FLX pyrosequencing). These ESTs were isolated from the haploid gametophyte (blades from both species) and diploid conchocelis stage (from P. purpurea). In a bioinformatic analysis, only 20% of the contigs were found to encode proteins of known biological function. Comparative analysis of predicted protein functions in mesophilic (including Porphyra) and extremophilic red algae suggests that the former have more putative functions related to signaling, membrane transport processes, and establishment of protein complexes. These enhanced functions may reflect general mesophilic adaptations. A near-complete repertoire of genes encoding histones and ribosomal proteins was identified, with some differentially regulated between the blade and conchocelis stage in P. purpurea. This finding may reflect specific regulatory processes associated with these distinct phases of the life history. Fatty acid desaturation patterns, in combination with gene expression profiles, demonstrate differences from seed plants with respect to the transport of fatty acid/lipid among subcellular compartments and the molecular machinery of lipid assembly. We also recovered a near-complete gene repertoire for enzymes involved in the formation of sterols and carotenoids, including candidate genes for the biosynthesis of lutein. Our findings provide key insights into the evolution, development, and biology of Porphyra, an important lineage of red algae.
Microbial eukaryotes may extinguish much of their nuclear phylogenetic history due to endosymbiotic/horizontal gene transfer (E/HGT). We studied E/HGT in 32,110 contigs of expressed sequence tags (ESTs) from the dinoflagellate Alexandrium tamarense (Dinophyceae) using a conservative phylogenomic approach. The vast majority of predicted proteins (86.4%) in this alga are novel or dinoflagellate-specific. We searched for putative homologs of these predicted proteins against a taxonomically broadly sampled protein database that includes all currently available data from algae and protists, and reconstructed a phylogeny from each of the putative homologous protein sets. Of the 2,523 resulting phylogenies, 14%–17% are potentially impacted by E/HGT involving both prokaryote and eukaryote lineages, with 2%–4% showing clear evidence of reticulate evolution. The complex evolutionary histories of the remaining proteins, many of which may also have been affected by E/HGT, cannot be interpreted using our approach with currently available gene data. We present empirical evidence of reticulate genome evolution that combined with inadequate or highly complex phylogenetic signal in many proteins may impede genome-wide approaches to infer the tree of microbial eukaryotes.
Little is known about the genetic and biochemical mechanisms that underlie red algal development, for example, why the group failed to evolve complex parenchyma and tissue differentiation. Here we examined expressed sequence tag (EST) data from two closely related species, Porphyra umbilicalis (L.) J. Agardh and P. purpurea (Roth) C. Agardh, for conserved developmental regulators known from model eukaryotes, and their expression levels in several developmental stages. Genes for most major developmental families were present, including MADS-box and homeodomain (HD) proteins, SNF2 chromatin-remodelers, and proteins involved in sRNA biogenesis. Some of these genes displayed altered expression correlating with different life history stages or cell types. Notably, two ESTs encoding HD proteins showed eightfold higher expression in the P. purpurea sporophyte (conchocelis) than in the gametophyte (blade), whereas two MADS domain-containing paralogs showed significantly different patterns of expression in the conchocelis and blade respectively. These developmental gene families do not appear to have undergone the kinds of dramatic expansions in copy number found in multicellular land plants and animals, which are important for regulating developmental processes in those groups. Analyses of small RNAs did not validate the presence of miRNAs, but homologs of Argonaute were present. In general, it appears that red algae began with a similar molecular toolkit for directing development as did other multicellular eukaryotes, but probably evolved altered roles for many key proteins, as well as novel mechanisms yet to be discovered.
Lateral genetic transfer (LGT) involves the movement of genetic material from one lineage into another and its subsequent incorporation into the new host genome via genetic recombination. Studies in individual taxa have indicated lateral origins for stretches of DNA of greatly varying length, from a few nucleotides to chromosome size. Here we analyze 1,462 sets of single-copy, putatively orthologous genes from 144 fully sequenced prokaryote genomes, asking to what extent complete genes and fragments of genes have been transferred and recombined in LGT. Using a rigorous phylogenetic approach, we find evidence for LGT in at least 476 (32.6%) of these 1,462 gene sets: 286 (19.6%) clearly show one or more “observable recombination breakpoints” within the boundaries of the open reading frame, while a further 190 (13.0%) yield trees that are topologically incongruent with the reference tree but do not contain a recombination breakpoint within the open reading frame. We refer to these gene sets as observable recombination breakpoint positive (ORB+) and negative (ORB−) respectively. The latter are prima facie instances of lateral transfer of an entire gene or beyond. We observe little functional bias between ORB+ and ORB− gene sets, but find that incorporation of entire genes is potentially more frequent in pathogens than in nonpathogens. As ORB+ gene sets are about 50% more common than ORB− sets in our data, the transfer of gene fragments has been relatively frequent, and the frequency of LGT may have been systematically underestimated in phylogenetic studies.
Background
Genetic recombination can produce heterogeneous phylogenetic histories within a set of homologous genes. These recombination events can be obscured by subsequent residue substitutions, which consequently complicate their detection. While there are many algorithms for the identification of recombination events, little is known about the effects of subsequent substitutions on the accuracy of available recombination-detection approaches.
Abstract. Genetic recombination following a genetic transfer event can produce heterogeneous phylogenetic histories within sets of genes that share a common ancestral origin. Delineating recombination events will enhance our understanding of genome evolution. However, detecting recombination is not trivial, because more-recent evolutionary changes can obscure such events.
Seaweeds are of economic and ecological importance, and for centuries have been used in the food, phycocolloid, and pharmaceutical industries. With advances in computational biology and bioinformatics, the study of seaweeds is entering a new revolutionary phase.