How the nuclear lamina (NL) impacts on global chromatin architecture is poorly understood. Here, we show that NL disruption in Drosophila S2 cells leads to chromatin compaction and repositioning from the nuclear envelope. This increases the chromatin density in a fraction of topologically-associating domains (TADs) enriched in active chromatin and enhances interactions between active and inactive chromatin. Importantly, upon NL disruption the NL-associated TADs become more acetylated at histone H3 and less compact, while background transcription is derepressed. Two-colour FISH confirms that a TAD becomes less compact following its release from the NL. Finally, polymer simulations show that chromatin binding to the NL can per se compact attached TADs. Collectively, our findings demonstrate a dual function of the NL in shaping the 3D genome. Attachment of TADs to the NL makes them more condensed but decreases the overall chromatin density in the nucleus by stretching interphase chromosomes.
The role of 3’-end stem-loops in transposition has been experimentally demonstrated for transposons of various species, in which LINE and SINE transposons share the same 3’-end sequences containing a stem-loop. We have discovered that 62-68% of processed pseudogenes and mRNAs also have 3’-end stem-loops. We investigated the properties of 3’-end stem-loops of human L1s, Alus, processed pseudogenes and mRNAs, which do not share the same sequences but all have 3’-end stem-loops. We have built sequence-based and structure-based machine-learning models that recognize 3’-end L1, Alu, processed pseudogene and mRNA stem-loops with high performance. The sequence-based models use only sequence information and capture compositional bias in 3’-ends. The structure-based models consider physical, chemical and geometrical properties of the dinucleotides composing a stem and the position-specific nucleotide content of a loop and a bulge. The most important parameters include shift, tilt, rise, and hydrophilicity. The obtained results clearly point to the existence of structural constraints for 3’-end stem-loops of L1 and Alu, which are probably important for transposition, and reveal the potential of mRNAs to be recognized by the L1 machinery. The constructed models are freely available on GitHub (https://github.com/AlexShein/transposons/) and can be used for de novo discovery of transposon-related stem-loops.
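As a rough illustration of what a 3’-end stem-loop is at the sequence level, the sketch below searches an RNA sequence for the longest hairpin with a perfectly paired stem. It is not the method of the abstract above: the function name `longest_hairpin`, the loop-size limits, and the decision to ignore bulges and mismatches are all assumptions made for this example.

```python
# Watson-Crick pairs plus the G-U wobble pair
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def longest_hairpin(rna, min_loop=3, max_loop=10):
    """Return (stem_length, loop_start, loop_end) of the longest perfect-stem hairpin.

    The loop is rna[loop_start:loop_end]. Bulges and mismatches are not
    modelled, so this is a simplification of real stem-loop prediction.
    """
    rna = rna.upper().replace("T", "U")  # accept DNA-style input as well
    n = len(rna)
    best = (0, None, None)
    for loop_start in range(1, n):
        for loop_len in range(min_loop, max_loop + 1):
            loop_end = loop_start + loop_len
            if loop_end >= n:
                break
            # Extend the stem outwards from both sides of the loop
            stem = 0
            left, right = loop_start - 1, loop_end
            while left >= 0 and right < n and (rna[left], rna[right]) in PAIRS:
                stem += 1
                left -= 1
                right += 1
            if stem > best[0]:
                best = (stem, loop_start, loop_end)
    return best


example = longest_hairpin("GGGAAAACCC")
```

For the toy sequence, the three G residues pair with the three C residues around a four-nucleotide loop. A structure-based model of the kind described above would then compute physical and geometrical features on the stem and loop positions found this way.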
Background: Chromosomal rearrangements are typical phenomena in cancer genomes, causing gene disruptions and fusions, corruption of regulatory elements, and damage to chromosome integrity. Among the factors contributing to genomic instability are non-B DNA structures, with stem-loops and quadruplexes being the most prevalent. We aimed to investigate the impact of these two classes of non-B DNA structures on cancer breakpoint hotspots using a machine learning approach.
Methods: We developed a procedure for building and evaluating machine learning models, because the data are extremely imbalanced and a reliable estimate of predictive power was required. We built logistic regression models predicting cancer breakpoint hotspots based on the densities of stem-loops and quadruplexes, jointly and separately. We also tested Random Forest models with different resampling schemes (leave-one-out cross-validation, train-test split, 3-fold cross-validation) and class balancing techniques (oversampling, stratification, synthetic minority oversampling).
Results: We analyzed 487,425 breakpoints from 2234 samples covering 10 cancer types available from the International Cancer Genome Consortium. We showed that the distributions of breakpoint hotspots in different types of cancer are not correlated, confirming the heterogeneous nature of cancer. The stem-loop-based model best explains the blood, brain, liver, and prostate cancer breakpoint hotspot profiles, while the quadruplex-based model has higher performance for bone, breast, ovary, pancreatic, and skin cancer. For the overall cancer profile and uterus cancer, the joint model shows the highest performance. For particular datasets the constructed models reach high predictive power using just one predictor, and in the majority of cases the model built on both predictors does not improve performance.
Conclusion: Despite the heterogeneity in the distribution of breakpoint hotspots across different cancer types, our results demonstrate an association between cancer breakpoint hotspots and stem-loops and quadruplexes. For approximately half of the cancer types stem-loops are the most influential factor, while for the others quadruplexes are. This reflects differences in the regulatory potential of stem-loops and quadruplexes at the tissue-specific level, which are yet to be discovered at the genome-wide scale. The performed analysis demonstrates that the influence of stem-loops and quadruplexes on breakpoint hotspot formation is tissue-specific.
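The Methods above mention stratification and 3-fold cross-validation as ways to obtain reliable estimates on extremely imbalanced data. The following is a minimal standard-library sketch of a stratified k-fold split, not the authors' procedure; the function name `stratified_k_fold` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=3, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the shuffled indices of each class round-robin into the folds
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Toy data: 6 hotspot windows (label 1) among 60 genomic windows
labels = [1] * 6 + [0] * 54
splits = list(stratified_k_fold(labels, k=3))
```

Dealing each class separately guarantees that the rare hotspot class appears in every test fold, which a plain random split does not. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.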
Background. Restriction-modification (R-M) systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. An R-M system specifically recognizes short sites in foreign DNA and cleaves it, while such sites in the host DNA are protected by methylation. Prokaryotic viruses have developed a number of strategies to overcome this host defense. The simplest anti-restriction strategy is the elimination of recognition sites in the viral genome: no sites, no DNA cleavage. Even a decrease of the number of recognition sites can help a virus to overcome this type of host defense. Recognition site avoidance has been a known anti-restriction strategy of prokaryotic viruses for decades. However, recognition site avoidance has not been systematically studied with the currently available sequence data. We analyzed the complete genomes of almost 4000 prokaryotic viruses with known host species and more than 17,000 restriction endonucleases with known specificities in terms of recognition site avoidance.
Results. We observed considerable limitations of recognition site avoidance as an anti-restriction strategy. Namely, the avoidance of recognition sites is specific for dsDNA and ssDNA prokaryotic viruses. Avoidance is much more pronounced in the genomes of non-temperate bacteriophages than in the genomes of temperate ones. Avoidance is not observed for the sites of Type I and Type IIG systems and is very rarely observed for the sites of Type III systems. The vast majority of avoidance cases concern recognition sites of orthodox Type II restriction-modification systems. Even under these constraints, complete or almost complete elimination of sites is observed for approximately one-tenth of viral genomes and a significant under-representation for approximately one-fourth of them.
Conclusions. Avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses.
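One simple way to quantify recognition site avoidance is the ratio of observed to expected site counts in a genome. The sketch below is an illustration only: it assumes an expectation based on independent base frequencies, which is a deliberately naive model and not necessarily the statistic used in the study. The function names are hypothetical.

```python
def count_site(seq, site):
    """Count (possibly overlapping) occurrences of site in seq on the given strand."""
    k = len(site)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == site)


def expected_count(seq, site):
    """Expected number of occurrences, assuming independent base frequencies."""
    n = len(seq)
    prob = 1.0
    for base in site:
        prob *= seq.count(base) / n
    return prob * (n - len(site) + 1)


def obs_exp_ratio(seq, site):
    """Observed/expected ratio; values well below 1 suggest under-representation."""
    expected = expected_count(seq, site)
    return count_site(seq, site) / expected if expected > 0 else float("nan")


# Toy genome with no GAATTC site. For a palindromic site such as GAATTC,
# counting one strand is enough; other sites also need the reverse complement.
toy_genome = "ACGT" * 50
ratio = obs_exp_ratio(toy_genome, "GAATTC")
```

A ratio near zero corresponds to complete or almost complete elimination of the site. Deciding whether an intermediate ratio is significant requires a statistical test and a better background model, both of which are outside this sketch.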
Riboswitches are conserved RNA structures located in non-coding regions of mRNA that bind small molecules (e.g. metabolites) and change conformation upon binding. This feature enables them to function as regulators of gene expression. The thiamin pyrophosphate (TPP) riboswitch is the only type of riboswitch found not only in bacteria but also in eukaryotes: in plants, green algae, protists, and fungi. Two main mechanisms of fungal TPP riboswitch action, both involving alternative splicing, have been established so far. Here, we report a large-scale bioinformatic study of riboswitch structural features, action mechanisms, and distribution across fungal taxonomic groups. For each putatively regulated gene, we reconstruct the riboswitch structure, identify other components of the regulation machinery, and establish mechanisms of riboswitch-mediated regulation. In addition to three genes known to be regulated by TPP riboswitches, thiazole synthase THI4, hydroxymethylpyrimidine synthase NMT1, and putative transporter NCU01977, we identify two new genes, a putative thiamin transporter THI9 and a transporter of unknown specificity. While the riboswitch sequence and structure remain highly conserved in all species and genes, the mode of riboswitch-mediated regulation varies between regulated genes. The riboswitch usage varies strongly between fungal taxa, with the largest number of riboswitch-regulated genes found in Pezizomycotina and no riboswitch-mediated regulation established in Saccharomycotina.
While most endosymbiotic bacteria are transmitted only vertically, Holospora spp., an alphaproteobacterium from the Rickettsiales order, can desert its host and invade a new one. All bacteria from the genus Holospora are intranuclear symbionts of ciliates Paramecium spp. with strict species and nuclear specificity. Comparative metabolic reconstruction based on the newly sequenced genome of Holospora curviuscula, a macronuclear symbiont of Paramecium bursaria, and known genomes of other Holospora species shows that even though all Holospora spp. can persist outside the host, they cannot synthesize most of the essential small molecules, such as amino acids, and lack some central energy metabolic pathways, including glycolysis and the citric acid cycle. As the main energy source, Holospora spp. likely rely on nucleotides pirated from the host. Holospora-specific genes absent from other Rickettsiales are possibly involved in the lifestyle switch from the infectious to the reproductive form and in cell invasion.
Changes in splicing are known to affect the function and regulation of genes. We analyzed splicing events that take place during the postnatal development of the prefrontal cortex in humans, chimpanzees, and rhesus macaques based on data obtained from 168 individuals. Our study revealed that among the 38,822 quantified alternative exons, 15% are differentially spliced among species, and more than 6% are spliced differently at different ages. Mutations in splicing acceptor and/or donor sites might explain more than 14% of all splicing differences among species and up to 64% of high-amplitude differences. A reconstructed trans-regulatory network containing 21 RNA-binding proteins explains a further 4% of splicing variations within species. While most age-dependent splicing patterns are conserved among the three species, developmental changes in intron retention are substantially more pronounced in humans.
Polypedilum vanderplanki is a striking and unique example of an insect that can survive almost complete desiccation. Its genome and a set of dehydration-rehydration transcriptomes, together with the genome of Polypedilum nubifer (a congeneric desiccation-sensitive midge), were recently released. Here, using published and newly generated datasets reflecting detailed transcriptome changes during anhydrobiosis, as well as a developmental series, we show that the TCTAGAA DNA motif, which closely resembles the binding motif of the Drosophila melanogaster heat shock transcription activator (Hsf), is significantly enriched in the promoter regions of desiccation-induced genes in P. vanderplanki, such as genes encoding late embryogenesis abundant (LEA) proteins, thioredoxins, or trehalose metabolism-related genes, but not in P. nubifer. Unlike P. nubifer, P. vanderplanki has double TCTAGAA sites upstream of the Hsf gene itself, which is probably responsible for the stronger activation of Hsf in P. vanderplanki during desiccation compared with P. nubifer. To confirm the role of Hsf in desiccation-induced gene activation, we used the Pv11 cell line, derived from P. vanderplanki embryo. After preincubation with trehalose, Pv11 cells can enter anhydrobiosis and survive desiccation. We showed that Hsf knockdown suppresses trehalose-induced activation of multiple predicted Hsf targets (including P. vanderplanki-specific LEA protein genes) and reduces the desiccation survival rate of Pv11 cells fivefold. Thus, cooption of the heat shock regulatory system has been an important evolutionary mechanism for adaptation to desiccation in P. vanderplanki.
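Counting occurrences of the TCTAGAA motif in promoter sequences is the first step of any enrichment analysis like the one described above. The sketch below counts the motif on both strands of a sequence; the function names are hypothetical, and the statistical comparison between induced and non-induced genes is not shown.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif_both_strands(seq, motif="TCTAGAA"):
    """Count positions where the motif occurs on either strand of seq."""
    seq = seq.upper()
    # A hit on the reverse strand appears as the reverse complement on the
    # forward strand. Using a set avoids double counting palindromic motifs.
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in targets)


forward_hit = count_motif_both_strands("GGTCTAGAACC")
reverse_hit = count_motif_both_strands("GGTTCTAGACC")
```

Comparing such counts between promoters of desiccation-induced genes and a background gene set, for example with Fisher's exact test, would give an enrichment estimate.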
Genome rearrangements have played an important role in the evolution of Yersinia pestis from its progenitor Yersinia pseudotuberculosis. Traditional phylogenetic trees for Y. pestis based on sequence comparison have short internal branches and low bootstrap support, as only a small number of nucleotide substitutions have occurred. On the other hand, even a small number of genome rearrangements may resolve topological ambiguities in a phylogenetic tree. We reconstructed phylogenetic trees based on genome rearrangements using several popular approaches such as Maximum Likelihood for Gene Order and the Bayesian model of genome rearrangements by inversions. We also reconciled phylogenetic trees for each of the three CRISPR loci to obtain an integrated scenario of the CRISPR cassette evolution. Analysis of contradictions between the obtained evolutionary trees yielded numerous parallel inversions and gain/loss events. Our data indicate that an integrated analysis of sequence-based and inversion-based trees enhances the resolution of phylogenetic reconstruction. In contrast, reconstructions of strain relationships based solely on CRISPR loci may not be reliable, as the history is obscured by large deletions that obliterate the order of spacer gains. Similarly, numerous parallel gene losses preclude reconstruction of phylogeny based on gene content.
BACKGROUND: The genus Burkholderia consists of species that occupy remarkably diverse ecological niches. Its best known members are important pathogens, B. mallei and B. pseudomallei, which cause glanders and melioidosis, respectively. Burkholderia genomes are unusual due to their multichromosomal organization, generally comprising 2-3 chromosomes. RESULTS: We performed integrated genomic analysis of 127 Burkholderia strains. The pan-genome is open, with saturation to be reached between 86,000 and 88,000 genes. The reconstructed rearrangements indicate a strong avoidance of intra-replichore inversions that is likely caused by selection against the transfer of large groups of genes between the leading and the lagging strands. Translocated genes also tend to retain their position in the leading or the lagging strand, and this selection is stronger for large syntenies. Integrated reconstruction of chromosome rearrangements in the context of strain phylogeny reveals parallel rearrangements that may indicate inversion-based phase variation and integration of new genomic islands. In particular, we detected parallel inversions in the second chromosomes of B. pseudomallei with breakpoints formed by genes encoding membrane components of a multidrug resistance complex, which may be linked to a phase variation mechanism. Two genomic islands, spreading horizontally between chromosomes, were detected in the B. cepacia group. CONCLUSIONS: This study demonstrates the power of integrated analysis of pan-genomes, chromosome rearrangements, and selection regimes. Non-random inversion patterns indicate selective pressure; inversions are particularly frequent in the recent pathogen B. mallei and, together with periods of positive selection at other branches, may indicate adaptation to new niches. One such adaptation could be a possible phase variation mechanism in B. pseudomallei.
With advances in sequencing technology, the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have collected data on more than 16 000 genome-wide tumor-normal tissue pairs, providing a valuable resource for studying cancer mutations. In this research we focus on a preliminary evaluation of the relationship between cancer breakpoint hotspots and DNA regions potentially forming secondary structures such as stem-loops (cruciforms) and quadruplexes. We analyzed 2 234 samples covering 10 cancer types and built machine-learning models predicting the distribution of cancer breakpoints over a chromosome based on the density distribution of stem-loops and quadruplexes. We developed a procedure for building and evaluating machine learning models, because the data are extremely imbalanced and a reliable estimate of predictive power is needed. We conducted a set of experiments to select the most appropriate resampling scheme, class balancing technique and parameters of machine learning algorithms. The best final models were applied to cancer breakpoint data. The analysis indicates that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists; however, this relationship is generally weak for stem-loops and stronger for quadruplexes. We also found differences in model predictive power depending on cancer type. The stem-loop-based model performs better for pancreatic, prostate, ovary, uterus, brain and liver cancer, and the quadruplex-based model works better for blood, bone, skin and breast cancer.
Corals harbor complex and diverse microbial communities that strongly impact host fitness and resistance to diseases, but these microbes themselves can be influenced by stresses, like those caused by the presence of macroscopic symbionts. In addition to directly influencing the host, symbionts may transmit pathogenic microbial communities. We analyzed two coral gall-forming copepod systems by using 16S rRNA gene metagenomic sequencing: (1) the sea fan Gorgonia ventalina with copepods of the genus Sphaerippe from the Caribbean and (2) the scleractinian coral Stylophora pistillata with copepods of the genus Spaniomolgus from the Saudi Arabian part of the Red Sea. We show that bacterial communities in these two systems were substantially different with Actinobacteria, Alphaproteobacteria, and Betaproteobacteria more prevalent in samples from Gorgonia ventalina, and Gammaproteobacteria in Stylophora pistillata. In Stylophora pistillata, normal coral microbiomes were enriched with the common coral symbiont Endozoicomonas and some unclassified bacteria, while copepod and gall-tissue microbiomes were highly enriched with the family ME2 (Oceanospirillales) or Rhodobacteraceae. In Gorgonia ventalina, no bacterial group had significantly different prevalence in the normal coral tissues, copepods, and injured tissues. The total microbiome composition of polyps injured by copepods was different. Contrary to our expectations, the microbial community composition of the injured gall tissues was not directly affected by the microbiome of the gall-forming symbiont copepods.
Sequencing of complete nuclear genomes of Neanderthal and Denisovan stimulated studies about their relationship with modern humans demonstrating, in particular, that DNA alleles from both Neanderthal and Denisovan genomes are present in genomes of modern humans. The Papuan genome is a unique object because it contains both Neanderthal and Denisovan alleles. Here, we have shown that the Papuan genomes contain different gene functional groups inherited from each of the ancient people. The Papuan genomes demonstrate a relative prevalence of Neanderthal alleles in genes responsible for the regulation of transcription and neurogenesis. The enrichment of specific functional groups with Denisovan alleles is less pronounced; these groups are responsible for bone and tissue remodeling. This analysis shows that introgression of alleles from Neanderthals and Denisovans to Papuans occurred independently and retention of these alleles may carry specific adaptive advantages.
The pangenome is the collection of all groups of orthologous genes (OGGs) from a set of genomes. We apply the pangenome analysis to propose a definition of prokaryotic species based on identification of lineage-specific gene sets. While being similar to the classical biological definition based on allele flow, it does not rely on DNA similarity levels and does not require analysis of homologous recombination. Hence this definition is relatively objective and independent of arbitrary thresholds. A systematic analysis of 110 accepted species with the largest numbers of sequenced strains yields results largely consistent with the existing nomenclature. However, it has revealed that abundant marine cyanobacteria Prochlorococcus marinus should be divided into two species. As a control we have confirmed the paraphyletic origin of Yersinia pseudotuberculosis (with embedded, monophyletic Y. pestis) and Burkholderia pseudomallei (with B. mallei). We also demonstrate that by our definition and in accordance with recent studies Escherichia coli and Shigella spp. are one species.
Background. Many algorithms and programs are available for phylogenetic reconstruction of families of proteins. Methods used widely at present use either a number of distance-based principles or character-based principles of maximum parsimony or maximum likelihood.
Results. We developed a novel program, named PQ, for reconstructing protein and nucleic acid phylogenies following a new character-based principle. When tested on natural sequences, PQ improves upon the results of maximum parsimony and maximum likelihood. Working with alignments of 10 and 15 sequences, it also outperforms the FastME program, which is based on one of the distance-based principles. Among all tested programs, PQ proved to be the least susceptible to long branch attraction. However, FastME outperforms PQ when processing alignments of 45 sequences. We confirm a recent result that on natural sequences FastME outperforms maximum parsimony and maximum likelihood. At the same time, both PQ and FastME are inferior to maximum parsimony and maximum likelihood on simulated sequences. PQ is open source and available to the public via an online interface.
Conclusions. The software we developed offers an open-source alternative for phylogenetic reconstruction for relatively small sets of proteins and nucleic acids, with up to a few tens of sequences.
We found earlier that L1 and Alu transposons in the human genome contain a conserved stem-loop structure in their 3’-UTR. We built a machine-learning model that could distinguish L1 3’-UTR stem-loop structures from stem-loops from different genomic locations. Later we found that all LINE transposons contain stem-loops at their 3’-end. Since the 3’-end stem-loop structure was experimentally shown to play an important role in recognition of transposon RNA by the LINE-encoded reverse transcriptase in several species [2-4], we hypothesize that this structure could be preserved for that purpose in other species. Here we built a machine learning model using the random forest algorithm to study structural properties of 3’-end transposon stem-loops. The constructed model is based on physical, chemical and structural RNA characteristics such as enthalpy, entropy, Gibbs free energy, hydrophilicity, and helical structural parameters of dinucleotides: Shift, Roll, Slide, Rise, Tilt, Bend. Each stem-loop structure was split into 30 positions and each position was characterized by 23 characteristics so that the final property vector contained 602 position-specific characteristics for each stem-loop. 2200 sequences of all available LINE transposons from different species across the tree of life were extracted from the RepBase database. We constructed a machine-learning model using random forest that was able to distinguish 3’-end LINE stem-loops from random stem-loops with 78% accuracy. Analysis of predictor importance revealed that enthalpy and entropy in loop positions and hydrophilicity and stacking energy in stem positions were the most influential factors for model predictive power. The obtained results support the idea that 3’-end transposon stem-loops share similar structural properties, which are probably required for transposition.
Non-B DNA structures have a great potential to form and influence various genomic processes including transcription. One of the mechanisms of transcription regulation is nucleosome positioning. Even though only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three types of patterns of nucleosome positioning around non-B DNA structures: a structure is flanked by nucleosomes on both sides, on one side, or lies in a nucleosome-free region. Machine learning models based on random forest and XGBoost algorithms were constructed to recognize DNA regions of 1 kb length containing a particular pattern of nucleosome positioning for four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites) based on statistics of di- and trinucleotides. The best performance (94% accuracy) was reached for G-quadruplexes, while for other types of structures the accuracy was under 70%. We conclude that 1 kb regions containing G-quadruplexes have distinct compositional properties; this points to preferential locations of such patterns in the genome and requires further investigation. Gene ontology analysis revealed that the genes intersecting with the discovered patterns are enriched in channel and transmembrane activity, transcription factor and receptor binding. The direction for further research is to study the distribution of the discovered patterns in different tissues to identify well-positioned and dynamic nucleosomes and to reveal genes regulated via DNA structures and nucleosome positioning.
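The models above are built on statistics of di- and trinucleotides. A minimal sketch of how a DNA region could be turned into such a compositional feature vector is given below; the function names are hypothetical, and the exact statistics used by the authors may differ.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers, in lexicographic order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def composition_features(seq):
    """Concatenate dinucleotide (16) and trinucleotide (64) frequencies."""
    return kmer_frequencies(seq, 2) + kmer_frequencies(seq, 3)


features = composition_features("ACGTACGT")
```

Each region is thus described by 80 numbers, which can be passed to a classifier such as a random forest.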
Non-B DNA structures have a great potential to form and influence various genomic processes including transcription. One of the mechanisms of transcription regulation is nucleosome positioning. Even though only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three types of patterns of nucleosome positioning around non-B DNA structures: a structure is flanked by nucleosomes on both sides, on one side, or lies in a nucleosome-free region. Machine learning models based on random forest and XGBoost algorithms were constructed to recognize DNA regions of 1 kb length containing a particular pattern of nucleosome positioning for four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites) based on statistics of di- and trinucleotides. The best performance (94% accuracy) was reached for G-quadruplexes, while for other types of structures the accuracy was under 70%. We conclude that 1 kb regions containing G-quadruplexes have distinct compositional properties; this points to preferential locations of such patterns in the genome and requires further investigation. For other DNA structures, region composition is not a sufficient predictive factor, and other physical and structural DNA properties should be taken into account to improve recognition of nucleosome-DNA-structure patterns.
Comparative genomics analysis of conserved gene cassettes demonstrated resemblance between a recently described cassette of genes involved in sulphoquinovose degradation in Escherichia coli K-12 MG1655 and a Bacilli cassette linked with lactose degradation. Six genes from both cassettes had similar functions related to carbohydrate metabolism, namely, hydrolase, aldolase, kinase, isomerase, transporter, and transcription factor. The Escherichia coli sulphoglycolysis cassette was thus predicted to be associated with lactose degradation. This prediction was confirmed experimentally: expression of genes coding for aldolase (yihT), isomerase (yihS), and kinase (yihV) was dramatically increased during growth on lactose. These genes were previously shown to be activated during growth on sulphoquinovose, so our observation may indicate multi-functional capabilities of the respective proteins. Transcription starts for yihT, yihV and yihW were mapped in silico, in vitro and in vivo. Out of three promoters for yihT, one was active only during growth on lactose. We further showed that switches in yihT transcription are controlled by YihW, a DeoR-family transcription factor in the Escherichia coli cassette. YihW acted as a carbon source-dependent dual regulator involved in sustaining the baseline growth in the absence of lac-operon, with function either complementary, or opposite to a global regulator of carbohydrate metabolism, cAMP-CRP.
Next-generation sequencing technologies have made it possible to map numerous functional genomic elements, and thus to define the positions of epigenetic factors, including methylation, histone modifications, sites of open chromatin, and regulatory RNA, as well as binding sites for transcription factors and other important proteins. Data generated by NGS experiments are stored on project websites and are freely available, usually in BED format. Finding associations between different functional genomic annotations, both experimental and theoretical, is an important problem. The existing programs for pattern search have significant limitations: most of them are developed for Unix-like systems, lack a graphical interface, and are complex to use. Here we present a program that runs in a browser on any operating system and has a graphical interface. It accepts two files of genome annotations in .bed format as input, visualizes the distribution of functional elements as densities at the chromosome level, and searches for patterns of association between the annotations. The detected patterns are visualized, and information about their positions is given in a list. The presented program is designed to solve a broad class of bioinformatics problems of finding patterns of association between different functional genome annotations.
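The simplest form of association between two BED annotations is direct overlap of their intervals. The sketch below illustrates that idea and is not the algorithm of the presented program: the function names are hypothetical, only the first three BED columns are read, and the quadratic per-chromosome comparison is chosen for clarity over speed.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end) tuples (0-based, half-open)."""
    intervals = []
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments and optional header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.append((chrom, int(start), int(end)))
    return intervals


def overlapping_pairs(a, b):
    """Return all pairs of intervals from a and b that overlap on the same chromosome."""
    by_chrom = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))
    pairs = []
    for chrom, start, end in a:
        for b_start, b_end in by_chrom.get(chrom, []):
            # Half-open intervals overlap only if each starts before the other ends
            if start < b_end and b_start < end:
                pairs.append(((chrom, start, end), (chrom, b_start, b_end)))
    return pairs


first = parse_bed(["chr1\t100\t200", "chr2\t50\t60"])
second = parse_bed(["track name=x", "chr1\t150\t250", "chr1\t200\t300", "chr2\t0\t50"])
hits = overlapping_pairs(first, second)
```

Because BED coordinates are half-open, intervals that merely touch (such as 100-200 and 200-300) are not reported as overlapping. For large annotations, sorting the intervals or using an interval tree would replace the nested loop.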