Congratulations to CSE Assistant Professor Derek Aguiar, who received an NSF CAREER award for his project titled “Practical algorithms and high dimensional statistical methods for multimodal haplotype modelling.” The project addresses major challenges in computational biology and applied machine learning by developing robust mathematical models that make few assumptions, along with efficient training algorithms that can leverage massive and complex cellular data.
Source: NSF
Massive and diverse datasets have been generated from human cells with the goal of explaining the many ways cellular differences affect the observed differences in traits between people. Mathematical models of the genetic differences between people can be used to explain, for example, why some individuals are predisposed to developing a particular disease. However, most mathematical models make overly simplistic assumptions about how genetic differences interact to influence an observed trait. This project addresses major challenges in computational biology and applied machine learning by innovating new robust mathematical models that make few assumptions and efficient training algorithms to leverage massive and complex cellular data. Specifically, the project considers: (a) methods for computing sequences of genetic differences by integrating different types of data, machine learning, and algorithmic techniques; (b) mathematical models for characterizing the genetic similarity between people; and (c) efficient algorithms that scale to large datasets. The results of this project include new methods that are broadly applicable to clustering massive and diverse sequential data, and specifically helpful for researchers trying to understand how genetic differences affect disease and other traits. Furthermore, the research supports the math and science high school and university communities by developing interactive learning modules and networking resources.
This project develops the statistical and algorithmic foundations for sequences of multimodal variation (i.e., multiomic haplotypes) in two research directions. The first direction introduces the multiomic haplotype data structure and develops new Bayesian nonparametric models and fast inference algorithms for clustering multiomic haplotypes from heterogeneous and high dimensional biomolecular data. Computational tractability is achieved through novel and efficient inference algorithms that operate in data-space (Bayesian coresets), model-space (deep approximations), and algorithm-space (variational approximations). The second direction develops the first model that unifies the combinatorial domain of haplotype assembly with the probabilistic haplotype phasing domain to infer latent haplotypes. The investigator will accomplish this unification goal by combining directed and undirected graphical modeling techniques with efficient particle-based inference algorithms. The completion of these research tasks will result in new methods for developing deep approximations for high dimensional Bayesian nonparametric models, models for multimodal sequential clustering, and methods to accelerate the training of high dimensional statistical models. Additionally, the research addresses (a) the longstanding open problem of haplotype assembly and haplotype phasing unification; and (b) potential sources of missing heritability in association studies: phase-dependent genetic and haplotype-epigenetic interactions. Partnerships with the university and regional high school communities will translate the research findings into educational modules and resources to motivate, engage, and retain computer science students and teachers.