Neural networks will help study DNA
Scientists at the Higher School of Economics have proposed a way to improve the accuracy of finding Z-DNA, the sections of the DNA molecule that are left-handed rather than right-handed. To do so, they used neural networks and a dataset of more than 30,000 experiments conducted by different laboratories around the world. Details of the study are published in Scientific Reports.
In the 67 years since the structure of DNA was discovered, scientists have found many variations in the structure of this molecule. Sometimes the structural elements of DNA look nothing like the familiar double helix, which is called B-DNA. They may differ from it in the number of strands (from two to four), density and thickness, the way the nitrogenous bases are paired and the direction of the helix.
Z-DNA is a DNA structure that is also a double helix but is wound in the opposite direction: to the left instead of to the right. Z-DNA sites are known to occur in the cells of various organisms (from bacteria to humans), to form under certain conditions (such as high humidity or salt concentration) and to coexist with other structural variants in the same molecule. For example, if a B-DNA molecule is wound so tightly that it impedes transcription (the synthesis of RNA from a DNA template), some parts of it can twist in the opposite direction, relieving the excess "tension". Scientists also hypothesize that Z-DNA can regulate transcription and increase the likelihood of mutations. Some studies suggest that Z-DNA formation may be associated with certain diseases, such as cancer, diabetes and Alzheimer's disease. Recently, a growing number of studies have demonstrated a role for Z-DNA in the innate immune response, that is, the response to viruses and other pathogens within the cell itself.
To learn more about the origins and biological role of Z-DNA sites, scientists first need to be able to locate them in the genome. The first genetic map annotated with Z-DNA sites was made back in 1997, based on experimental data on the structural connection of consecutive nucleotides. In recent years, methods have emerged that predict the location of non-B-DNA sites using computer algorithms. Advances in machine learning have added another powerful tool for this task: neural networks. Unlike most methods, they can take many factors into account and do not require scientists to choose a few of the most probable ones in advance. But even for neural networks, finding Z-DNA remains difficult because experimental data are insufficient: Z-DNA appears and disappears, and an experiment captures only a small fraction of such sites. The authors of the article decided to test whether the accuracy of neural networks would improve if they were also given omics data, that is, information about how gene activity and protein synthesis are regulated in cells.
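One way to picture how a network can be given omics data alongside the DNA sequence is to describe every nucleotide by a single feature vector. The sketch below is only an illustration under stated assumptions, not the authors' code: the function name encode_window, the one-hot encoding of the four bases and the idea of one numeric value per position for each omics track are all hypothetical choices made here for clarity.

```python
from typing import List, Sequence

# One-hot codes for the four bases; any other symbol (for example "N")
# is encoded as all zeros. This encoding is an assumption of this sketch.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode_window(sequence: str,
                  omics_tracks: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return one feature row per nucleotide: 4 one-hot values for the base
    followed by one value from each (hypothetical) omics track."""
    for track in omics_tracks:
        if len(track) != len(sequence):
            raise ValueError("each omics track needs one value per nucleotide")

    rows = []
    for i, base in enumerate(sequence.upper()):
        one_hot = ONE_HOT.get(base, [0, 0, 0, 0])
        omics_values = [float(track[i]) for track in omics_tracks]
        rows.append(one_hot + omics_values)
    return rows


# Two nucleotides described by two made-up omics tracks.
features = encode_window("CG", [[1, 0], [0.5, 2.0]])
```

With this layout, adding a new kind of omics measurement simply adds one more column to every row, so the model does not need to be told in advance which factors matter most.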
The scientists began by comparing how three types of neural networks cope with the task: convolutional, recurrent and a hybrid of the two. Convolutional neural networks are most often applied to image processing, while recurrent networks are used to analyze sequences, such as handwritten text or speech. All three types of neural networks have already been tested on tasks related to the study of the genome. The authors of the paper trained and evaluated a total of 151 models on an extended dataset; the best results were shown by one of the recurrent neural networks. It was named DeepZ and was used to predict new Z-DNA sites in the human genome. Its accuracy far exceeds that of the existing algorithm, Z-Hunt.
Using DeepZ, the scientists annotated the entire human genome sequence, determining for each nucleotide the probability that it lies inside a Z-DNA site. Each run of several consecutive nucleotides with probabilities above a certain threshold value was marked as a potential site of interest.
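The step from per-nucleotide probabilities to candidate sites can be sketched in a few lines. This is a minimal illustration, not the procedure from the paper: the function name call_regions, the threshold of 0.5 and the minimum run length of 2 are assumptions chosen for the example, since the article does not state the actual values.

```python
from typing import List, Sequence, Tuple


def call_regions(probs: Sequence[float],
                 threshold: float = 0.5,
                 min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for every run of
    consecutive positions whose probability is above the threshold and
    whose length is at least min_len. Both defaults are placeholders."""
    regions = []
    start = None  # index where the current run began, or None if no run is open
    for i, p in enumerate(probs):
        if p > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that reaches the end of the sequence.
    if start is not None and len(probs) - start >= min_len:
        regions.append((start, len(probs)))
    return regions


# Made-up probabilities for nine consecutive nucleotides.
example = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.95, 0.99]
sites = call_regions(example)  # [(1, 3), (6, 9)]
```

In this example the single high-probability nucleotide at index 4 is not reported, because a lone position does not form a run of several nucleotides.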
Maria Poptsova, Head of the Laboratory of Bioinformatics at the Faculty of Computer Science of the HSE University, Research Director
The results of this study are important because, with the help of neural networks, we were able not only to reproduce the experiments, but also to predict the potential sites of Z-DNA formation in the genome. The abundance of Z-DNA signals suggests that they are actively used to turn genes on and off. It is a more rapid signal than motifs in the genome itself. For example, research by a group of scientists from Australia has shown that Z-DNA serves as a signal when learning to suppress fear. Apparently, Z-DNA emerged in evolution for situations in which a rapid response to a sudden event is required. We are planning to initiate collaborative projects with experimental groups to test the predictions.
The authors demonstrated a new approach to predicting Z-DNA sites using omics data and deep learning techniques. The genome annotation generated by the neural network will help scientists design experiments to detect Z-DNA, the full spectrum of which is only beginning to come to light.