Phylogenomics Learning

This project aims to make gene family data sets amenable to statistical learning (classification, clustering, and visualisation). This is my main project as a senior SNSF researcher at the University of Lausanne under the supervision of Christophe Dessimoz. With the increase in size and complexity of phylogenomic data sets, it is no longer possible to directly compare different genomic data, even when they represent a common evolutionary history. For example, gene families representing different genomic regions cannot be directly compared, even though they represent the same species, whenever their leaves cannot be bijectively mapped (such trees are called mul-trees, since multiple leaves have the same label/species). It is therefore hard to pinpoint the evolutionary processes behind the observed evolution of these genes. This project offers an alternative to the wasteful process of ortholog selection and can handle all available information, including paralogs and samples from a single population.

My idea is to create a set of reference species trees, against which any gene tree can be compared using specialised measures, and to use them as a coordinate system (as in landmark MDS or pivot-based indexing). Any set of gene family trees can then be described through this “spectral signature” (importing the term from chemometrics) and thereby compared. This system will allow the detection of outlier genes (due to e.g. HGT), the classification and selection of genes of interest, as well as the analysis and summary of tree collections (from MCMC, for instance). We are observing that this is a robust ‘decomposition’ of trees that can handle, at the same time, huge gene family trees and pairs of trees with only a few species in common. It also provides a good visualisation of sampled trees, which makes outliers easy to detect and can be used to assess convergence of Bayesian phylogenetic MCMC samples. The algorithms are being implemented in a C library, which is available as a Python module (using the low-level C/Python API).
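The idea of using reference trees as a coordinate system can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the project's C library or its API: trees are reduced to sets of clades, the function names (`clade_distance`, `spectral_signature`, `outlier_scores`) are hypothetical, and the symmetric difference of clade sets stands in for the specialised measures that real mul-trees require.

```python
import numpy as np

def clades(*groups):
    """Toy tree: a set of clades, each clade a set of species labels."""
    return frozenset(frozenset(g) for g in groups)

def clade_distance(tree_a, tree_b):
    """Stand-in measure (an assumption): number of clades found in only one tree."""
    return len(tree_a ^ tree_b)

def spectral_signature(trees, references, distance=clade_distance):
    """Describe each tree by its distances to every reference tree.

    Returns an (n_trees, n_references) matrix: one row of coordinates per tree.
    """
    return np.array([[distance(t, r) for r in references] for t in trees],
                    dtype=float)

def outlier_scores(signatures):
    """Distance of each signature from the median signature, scaled by the MAD."""
    centre = np.median(signatures, axis=0)
    dev = np.linalg.norm(signatures - centre, axis=1)
    mad = np.median(np.abs(dev - np.median(dev)))
    return (dev - np.median(dev)) / (mad if mad > 0 else 1.0)

# Two hypothetical reference species trees on species A, B, C, D.
references = [clades("AB", "ABC"), clades("CD", "BCD")]

# Four hypothetical gene trees; the last one follows the second reference.
gene_trees = [
    clades("AB", "ABC"),
    clades("AB", "ABC"),
    clades("AB", "ABD"),
    clades("CD", "BCD"),
]

signatures = spectral_signature(gene_trees, references)
scores = outlier_scores(signatures)
```

Each row of `signatures` places one gene tree in the space spanned by the references, so trees that share few or no leaves with each other remain comparable through their coordinates. Any standard classifier, clustering method or projection can then be applied to the rows.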

Example showing several gene family trees simulated under four distinct species trees (coloured yellow, orange, blue and purple) using our genomic simulator simphy.
On the left we see how pairs of gene trees have few species in common, and on the right we see a projection of these trees using our model.

Hyperspectral Imaging

Under the supervision of Prof. Molly Stevens from Imperial College London, I developed classification tools for stem cell lines from Raman spectroscopy signals. With the help of many other biomaterials and regenerative medicine researchers, I learned how to analyse a variety of high-volume data (hyperspectral and microscopic) and how to report it to a diverse audience. Specifically, I developed a Python module for the analysis (pre-processing, learning, unmixing) of hyperspectral imaging data, such as Raman, surface-enhanced Raman spectroscopy (SERS), Fourier-transform infrared spectroscopy (FT-IR), and time-of-flight secondary ion mass spectrometry (ToF-SIMS).

Spectroscopic imaging allows for quantification from image data, given that spectral intensities are correlated with the abundance of molecules. Many spectroscopic analyses rely on the manual application of protocols (pre-processing such as background and baseline correction, interpolation, etc.). This makes it hard and slow to replicate the steps on different data sets, or even to compare different protocols on the same data set. My idea was to create a standardised Python module for storing and handling hyperspectral data with optional image information, since spectral data does not necessarily contain spatial information and can represent a single pixel, e.g. SIMS or circular dichroism (CD) spectroscopy. Many functions from this module use information from the pixels’ locations to help in spike removal, background correction and classification. During this time I also employed statistical learning techniques to classify cells from fluorescence microscopy images, using the same library. The source code and notebooks can be found in

Median polish algorithm applied to a Raman image, showing the pixel and wavelength effects.
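For readers unfamiliar with median polish, here is a minimal NumPy sketch of Tukey's procedure applied to an image unfolded as a pixels × wavelengths matrix. It is not the code from the module described above: the function name `median_polish`, its parameters and the toy data are all assumptions made for illustration.

```python
import numpy as np

def median_polish(data, max_iter=10, tol=1e-6):
    """Decompose data[i, j] as overall + pixel[i] + wavelength[j] + residual[i, j].

    Rows are pixels and columns are wavelengths (an assumed layout).
    Medians are swept out of rows and columns alternately until they vanish.
    """
    resid = np.array(data, dtype=float)
    overall = 0.0
    pixel = np.zeros(resid.shape[0])
    wavelength = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        # Sweep row medians into the pixel effects.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        pixel += row_med
        # Keep the wavelength effects centred; move their median to the overall term.
        shift = np.median(wavelength)
        wavelength -= shift
        overall += shift
        # Sweep column medians into the wavelength effects.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        wavelength += col_med
        # Keep the pixel effects centred as well.
        shift = np.median(pixel)
        pixel -= shift
        overall += shift
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, pixel, wavelength, resid

# Toy "image": 3 pixels x 4 wavelengths, additive effects plus a single spike.
true_pixel = np.array([0.0, 1.0, 5.0])
true_wavelength = np.array([10.0, 20.0, 40.0, 30.0])
image = 100.0 + true_pixel[:, None] + true_wavelength[None, :]
image[1, 2] += 50.0  # simulated spike, e.g. a cosmic ray

overall, pixel, wavelength, resid = median_polish(image)
spike_position = np.unravel_index(np.argmax(np.abs(resid)), resid.shape)
```

Because medians are robust, an isolated spike ends up in the residuals rather than distorting the pixel or wavelength effects, which is one reason the decomposition is convenient for spike removal.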

Species tree estimation from very large gene trees

With the generation of large genomic data sets, there came the realization that new methods were needed to extract the evolutionary history common to all gene families. My objective, in collaboration with David Posada, is to make use of all available data to estimate the posterior distribution of species trees compatible with the gene histories. For this we devised a multivariate model of discordance between trees that takes into account all sources of disagreement, such as duplications, losses, and incomplete lineage sorting.

The resulting software is guenomu, an ongoing open-source project for the estimation of species trees given arbitrary sets of gene trees. Within this project we also offer a program that can generate a distribution of gene topologies given a point estimate, using its branch lengths as a source of "noise". We also offer a program to quickly estimate the species tree using several gene distance matrix approaches, as well as an application to calculate several distances between gene and species trees (using reconciliation costs or an approximate SPR distance).

Below you can watch my talk about guenomu at Evol2014, and follow the slides at slideshare.

Talk at Evolution 2014 -- Raleigh/NC

Bayesian detection of phylogenetic recombination

In HIV and many other fast-evolving viruses, there is rampant recombination that disrupts the phylogenetic signal but which is in itself an important clinical and evolutionary marker. During my PhD I worked with Hirohisa Kishino on the development of a Bayesian estimation of recombination that takes into account the amount of phylogenetic disagreement it generates. This led, for the first time, to a quantitative estimation of events for recombination breakpoints -- that is, we could distinguish one ancestral recombination event from several recurring events (a hotspot).

The method was first published in PLoS ONE (doi:10.1371/journal.pone.0002651), which is reviewed in this blog post. This method was then compared to simpler models in a follow-up article at the Annals of the ISM (doi:10.1007/s10463-009-0259-8). In this paper we also describe a distance measure between recombination mosaics that can be used to summarise a posterior distribution of recombination scenarios or to objectively infer a method's accuracy.

On the left we see HIV recombinant mosaics from South America, while on the right we see a quantitative estimation of recombination
using biomc2 (in red, at the top) compared with a traditional, qualitative estimation (in blue, at the bottom).

Adaptive Evolution of Viral Proteins

I am also interested in the evolutionary history of protein structures, and in previous work we tried to understand whether the three-dimensional structures of the influenza virus hemagglutinin (HA) were correlated with their primary sequences, which could then be used as a predictor of evolutionary change. To accomplish this, we collected all known HA1 sequences from the H3N2 subtype, and used template-based protein structure prediction to obtain point estimates of their 3D structures. The resulting structures did not have a clear correlation with the protein primary sequences or their year of sampling.

Through cross-validation studies we realized that the structure prediction algorithm had an average prediction error similar to the expected distance between samples. Therefore our naïve protein modeling didn't have enough resolution to distinguish between closely related HA sequences. Ultimately, based on these results, Prof. Teruaki Watabe managed to develop a sequence-structure fitness index which could be successfully applied to estimate the binding ability of HA complexed with a set of antibodies (doi:10.1093/molbev/msm079).

MDS plot and NJ tree for protein sequences and structures of H3N2 HA1
Multidimensional scaling plots and dendrograms for the protein primary sequences and predicted structures for a sample of HA1 sequences of the
influenza H3N2 subtype. Notice the correlation between the primary sequences and year of sampling, which is not recovered by the protein structures.
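As a reminder of how such a projection is obtained from pairwise distances alone, the following is a minimal sketch of classical (Torgerson) MDS in NumPy. It is an assumption that this variant matches the one used for the figure; the function name `classical_mds` and the toy points are illustrative only.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Embed items in n_components dimensions from a symmetric distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ d2 @ centring
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigval, eigvec = np.linalg.eigh(gram)
    order = np.argsort(eigval)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigval[order], 0.0, None))
    return eigvec[:, order] * scale

# Toy check: four points in the plane are recovered up to rotation and reflection.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The same function can be applied to a matrix of sequence distances or of structural distances, which is what makes the two kinds of data comparable in a single type of plot.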

Relaxed Molecular Clock Models

In collaboration with several groups in Japan, I applied a relaxed molecular clock model to unravel the evolutionary history of loaches, deep-sea mussels and malaria parasites (see list of publications). At the same time I studied the performance of this model in multi-gene analyses under an empirical Bayesian framework, to estimate the distribution of autocorrelation rates.

Phylogeography and divergence times of loaches of the genus Lefua (doi:10.2108/zsj.22.157)

Me and my colleagues at the Bioinformatics and Molecular Evolution Lab
(University of Vigo, Spain)

With Elcio Leal and Kishino先生 at the University of Tokyo