Species tree estimation from very large gene trees

With the generation of large genomic data sets came the realization that new methods were needed to extract the evolutionary history common to all gene families. My objective, in collaboration with David Posada, is to use all available data to estimate the posterior distribution of species trees compatible with the gene histories. For this we devised a multivariate model of discordance between trees that takes into account all sources of disagreement, such as duplications, losses, and incomplete lineage sorting.

The resulting software is guenomu, an ongoing open-source project for estimating species trees given arbitrary sets of gene trees. Within this project we also offer a program that can generate a distribution of gene topologies given a point estimate, using its branch lengths as a source of "noise". We also offer a program to quickly estimate the species tree using several gene distance matrix approaches, as well as an application to calculate several distances between gene and species trees (using reconciliation costs or an approximate SPR distance).

Below you can watch my talk at Evol2014 about the software, guenomu, and follow the slides at slideshare.

Talk at Evolution 2014 -- Raleigh/NC

Bayesian detection of phylogenetic recombination

In HIV and many other fast-evolving viruses, there is rampant recombination that disrupts the phylogenetic signal but which is in itself an important clinical and evolutionary marker. During my PhD I worked with Hirohisa Kishino on the development of a Bayesian method for estimating recombination that takes into account the amount of phylogenetic disagreement it generates. This led, for the first time, to a quantitative estimation of the number of events at recombination breakpoints -- that is, we could distinguish one ancestral recombination event from several recurring events (a hotspot).

The method was first published in PLoS ONE (doi:10.1371/journal.pone.0002651), and is reviewed in this blog post. It was then compared to simpler models in a follow-up article in the Annals of the ISM (doi:10.1007/s10463-009-0259-8). In that paper we also describe a distance measure between recombination mosaics that can be used to summarise a posterior distribution of recombination scenarios or to objectively assess a method's accuracy.

On the left we see HIV recombinant mosaics from South America, while on the right we see a quantitative estimation of recombination
using biomc2 (in red, at the top) compared with a traditional, qualitative estimation (in blue, at the bottom).

Adaptive Evolution of Viral Proteins

I am also interested in the evolutionary history of protein structures, and in previous work we tried to understand whether the three-dimensional structures of the influenza virus hemagglutinin (HA) were correlated with their primary sequences, since such a correlation could be used as a predictor of evolutionary change. To accomplish this, we collected all known HA1 sequences from the H3N2 subtype, and used template-based protein structure prediction to obtain point estimates of their 3D structures. The resulting structures did not have a clear correlation with the protein primary sequences or their year of sampling.

Through cross-validation studies we realized that the structure prediction algorithm had an average prediction error similar to the expected distance between samples. Therefore our naïve protein modeling didn't have enough resolution to distinguish between closely related HA sequences. Ultimately, based on these results, Prof. Teruaki Watabe managed to develop a sequence-structure fitness index which could be successfully applied to estimate the binding ability of HA complexed with a set of antibodies (doi:10.1093/molbev/msm079).

MDS plot and NJ tree for protein sequences and structures of H3N2 HA1
Multidimensional scaling plots and dendrograms for the protein primary sequences and predicted structures for a sample of HA1 sequences of the
influenza H3N2 subtype. Notice the correlation between the primary sequences and year of sampling, which is not recovered by the protein structures.
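
For readers unfamiliar with multidimensional scaling, the idea behind plots like these can be illustrated with a minimal sketch of classical (Torgerson) MDS, which embeds items in a low-dimensional space given only their pairwise distances. This is an illustration under assumptions, not the code used to produce the figure: the function name classical_mds and the toy distance matrix are hypothetical, and the MDS variant used for the original analysis may differ.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed items in n_components dimensions from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Eigendecomposition of the symmetric Gram matrix (eigenvalues ascending).
    eigvals, eigvecs = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip tiny negatives caused by rounding
    # or by distances that are not exactly Euclidean.
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

# Toy example (made-up values): three items lying on a line at 0, 1 and 3.
toy_dist = np.array([[0.0, 1.0, 3.0],
                     [1.0, 0.0, 2.0],
                     [3.0, 2.0, 0.0]])
coords = classical_mds(toy_dist, n_components=1)
```

In practice the input would be a matrix of pairwise distances between sequences or between predicted structures, and the first two coordinates would be plotted against each other, for instance coloured by year of sampling.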

Me and my colleagues at the Bioinformatics and Molecular Evolution Lab
(University of Vigo, Spain)

With Elcio Leal and Kishino先生 at the University of Tokyo