Interview with Maria Poptsova, Head of the Artificial Intelligence in Bioinformatics Project
Why did you choose this particular project? What is the relevance of the topic?
Our project focuses on building AI systems to recognise the functional elements of the genome. We envision the genome as a genetic computer, the operation of which can be controlled by switching functional elements on and off. Functional coding is realized at different levels of genome organization — at the level of DNA itself, epigenetic markers, secondary DNA structures and chromatin packing. Experimental data on the location of functional elements is collectively referred to as omics data. This is the big data of molecular biology that can only be analysed using AI systems. The control of functional elements is very important because it opens up the possibility of reprogramming cells.
What are the current priorities of the project?
We are currently developing ML models for predicting DNA secondary structures: a transformer model, a model based on generative adversarial networks and a domain adaptation model. We are also developing a system for interpreting neural networks that have been trained on omics data matrices.
Simultaneously, we are developing algorithms to predict the shape of proteins that bind to secondary DNA structures and algorithms to predict adaptive introgression.
What artificial intelligence technologies are paramount in this research?
We are currently adapting state-of-the-art deep learning algorithms, such as convolutional neural networks, transformers, generative adversarial networks, graph networks and domain adaptation models, to work with genomic data. We develop representation learning approaches for graph neural networks. We also use explainable artificial intelligence (XAI) approaches, such as layer-wise relevance propagation and integrated gradients.
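To give a sense of what an attribution method such as integrated gradients computes, here is a minimal sketch in Python. It is not the project's code: the toy logistic model, the random weights, the all-zero baseline and the example sequence are all assumptions made purely for illustration. The method averages the model's gradient along a straight path from a baseline input to the real input and scales it by the difference between the two, so that each position of a one-hot encoded DNA sequence receives a relevance score.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, BASES.index(base)] = 1.0
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, weights, bias):
    """Hypothetical toy scorer: logistic regression over the one-hot matrix."""
    return sigmoid(np.sum(weights * x) + bias)

def model_grad(x, weights, bias):
    """Analytic gradient of the toy scorer with respect to its input."""
    p = model(x, weights, bias)
    return p * (1.0 - p) * weights

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate the path integral of the gradient with a midpoint sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += model_grad(baseline + a * (x - baseline), weights, bias)
    avg_grad /= steps
    return (x - baseline) * avg_grad

# Illustrative inputs only: random weights stand in for a trained network.
rng = np.random.default_rng(0)
x = one_hot("GGGTTAGGG")
baseline = np.zeros_like(x)          # assumed "no sequence" reference input
weights = rng.normal(size=x.shape)
bias = 0.0

attr = integrated_gradients(x, baseline, weights, bias)
per_base = attr.sum(axis=1)          # one relevance score per sequence position
```

A useful sanity check is that the attributions add up, to within numerical error, to the difference between the model's output on the real input and on the baseline. With a real network the analytic gradient would be replaced by automatic differentiation.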
What challenges do the researchers face in the course of their work? Do the problems change as the work progresses?
The main difficulty is computational power. Although our own laboratory server is powerful enough to run neural network models, we still have to resort to the Faculty's computing resources, which are overloaded, so we often have to queue.
Although the lab's research interest is focused on the role of secondary DNA structures, we are developing a more general system for recognising any functional elements of the genome from any omics input. This means that the modules will be easily adaptable to determine the functional role of different elements of the genome and their multidimensional relationships to other elements.
What are the future plans of the research team within the project?
During the first year, we plan to develop models with various neural network architectures for recognising the functional elements of the genome from omics data. We plan to test at least two approaches for interpreting neural networks: layer-wise relevance propagation and integrated gradients. After that, we will focus on developing omics data representation modules for more efficient ML algorithms.
We will also test different approaches to predicting protein shape in order to recognise certain building blocks; a whole year will be spent developing the approach of learning representations in graph neural networks. The next step will be to select the best-performing architecture and carry out biological interpretation.
What applications are possible once the project is complete?
We hope that once the intended modular system is implemented, it can be applied to a broad class of tasks that involve finding the functionally relevant elements that turn a particular cellular mode on or off. It will then be possible to run cell differentiation programmes and turn stem cells (undifferentiated cells) into cells of a particular tissue type. In the treatment of genome-altering diseases such as cancer, the system will be able to recognise functional elements that have been corrupted by the disease.
Different molecules may bind to and block precisely those secondary DNA structures that trigger or shut down different genetic programmes. The system we are developing to find the shape of proteins that can bind to secondary DNA structures can be used to develop this new class of drugs.