BACKGROUND: Chlamydia are ancient intracellular pathogens with reduced, though strikingly conserved, genomes. Despite their parasitic lifestyle and isolated intracellular environment, these bacteria have managed to avoid the accumulation of deleterious mutations and the subsequent genome degradation characteristic of many parasitic bacteria. RESULTS: We report a pan-genomic analysis of sixteen species from the genus Chlamydia, including identification and functional annotation of orthologous genes, and characterization of gene gains, losses, and rearrangements. We demonstrate the overall genome stability of these bacteria as indicated by a large fraction of common genes with conserved genomic locations. On the other hand, extreme evolvability is confined to several paralogous gene families, such as polymorphic membrane proteins and phospholipase D, and is likely caused by pressure from the host immune system. CONCLUSIONS: This combination of a large, conserved core genome and a small, evolvable periphery likely reflects the balance between the selective pressure towards genome reduction and the need to adapt to escape from the host immunity.
Aromatic compounds are a common carbon and energy source for many microorganisms, some of which can even degrade toxic chloroaromatic xenobiotics. This comparative study of aromatic metabolism in 32 Betaproteobacteria species describes the links between several transcription factors (TFs) that control benzoate (BenR, BenM, BoxR, BzdR), catechol (CatR, CatM, BenM), chlorocatechol (ClcR), methylcatechol (MmlR), 2,4-dichlorophenoxyacetate (TfdR, TfdS), phenol (AphS, AphR, AphT), biphenyl (BphS), and toluene (TbuT) metabolism. We characterize the complexity and variability in the organization of aromatic metabolism operons and the structure of regulatory networks that may differ even between closely related species. Generally, the upper parts of pathways, rare pathway variants, and degradative pathways of exotic and complex, in particular, xenobiotic compounds are often controlled by a single TF, while the regulation of more common and/or central parts of the aromatic metabolism may vary widely and often involves several TFs with shared and/or dual, or cascade regulation. The most frequent and at the same time variable connections exist between AphS, AphR, AphT, and BenR. We have identified a novel LysR-family TF that regulates the metabolism of catechol (or some catechol derivative) and either substitutes CatR(M)/BenM, or shares functions with it. We have also predicted several new members of aromatic metabolism regulons, in particular, some COGs regulated by several different TFs.
DNA secondary structures are important functional elements that may influence cellular processes. One of their possible functions is regulation of nucleosome positioning. Here MNase-seq and ssDNA-seq data were used to define patterns of positional relationship of DNA structures such as Z-DNA, H-DNA and G-quadruplexes with nucleosomes. Three types of patterns were found: a structure is flanked by nucleosomes on both sides, flanked on one side, or located in a nucleosome-free region. Machine-learning models based on the Random Forest algorithm and XGBoost were trained to recognize DNA regions of 500 bp length containing a pattern of nucleosome positioning for three types of DNA structures (Z-DNA, H-DNA and G-quadruplexes) based on DNA sequence compositional properties. The best performance (more than 86% for ROC-AUC, accuracy, recall and precision scores) was reached for G-quadruplexes. The 500 bp regions containing G-quadruplexes have distinct compositional properties and point to the preferential locations of the defined patterns, whose regulatory functions require further investigation. For other DNA structures, region composition is a less powerful predictive factor, and one should take into account other physical and structural DNA properties to improve nucleosome-DNA-structure pattern recognition.
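As an illustration of what "sequence compositional properties" of a region might look like as model input, the following minimal sketch computes GC content and k-mer frequencies for a DNA string. The function names, the choice of k = 2, and the toy sequence are assumptions made for this example; the abstract does not specify which compositional features were actually used.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return relative frequencies of all 4**k DNA k-mers found in seq.

    k-mers containing symbols other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Hypothetical toy region (a real region in the study is 500 bp long).
region = "GGGTTAGGGTTAGGGTTAGGG"
features = kmer_frequencies(region, k=2)
features["GC"] = gc_content(region)
```

A feature dictionary of this kind, computed for every region, could then be passed as one row of a feature matrix to any classifier.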
The genus Streptococcus comprises pathogens that strongly influence the health of humans and animals. Genome sequencing of multiple Streptococcus strains demonstrated high variability in gene content and order even in closely related strains of the same species and gave rise to a new object of genomic analysis, the pan-genome. Here we analysed the genome evolution of 25 strains of Streptococcus suis, 50 strains of Streptococcus pyogenes and 28 strains of Streptococcus pneumoniae.
Fractions of the pan-genome (unique, periphery, and universal genes) differ in size, functional composition, the level of nucleotide substitutions, and predisposition to horizontal gene transfer and genomic rearrangements. The density of substitutions in intergenic regions appears to be correlated with selection acting on adjacent genes, implying that more conserved genes tend to have more conserved regulatory regions. The total pan-genome of the genus is open, but only due to strain-specific genes, whereas other pan-genome fractions reach saturation. We have identified the set of genes with phylogenies inconsistent with species and non-conserved location in the chromosome; these genes are rare in at least one species and have likely experienced recent horizontal transfer between species. The strain-specific fraction is enriched with mobile elements and hypothetical proteins, but also contains a number of candidate virulence-related genes, so it may have a strong impact on adaptability and pathogenicity. Mapping the rearrangements to the phylogenetic tree revealed large parallel inversions in all species. A parallel inversion of length 15 kb with breakpoints formed by genes encoding surface antigen proteins PhtD and PhtB in S. pneumoniae leads to replacement of gene fragments, which likely indicates the action of an antigen variation mechanism.
Members of the genus Streptococcus have a highly dynamic, open pan-genome that potentially confers on them the ability to adapt to changing environmental conditions, e.g. through antibiotic resistance or transmission between different hosts. Hence, integrated analysis of all aspects of genome evolution is important for the identification of potential pathogens and design of drugs and vaccines.
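The split of a pan-genome into universal, periphery, and unique fractions can be sketched from a presence/absence table of orthologous groups. The data structure, function name, and toy strain labels below are hypothetical; the sketch only shows the bookkeeping and says nothing about how orthologous groups were built in the study.

```python
def pan_genome_fractions(presence):
    """Classify orthologous groups by how many strains carry them.

    presence maps an orthologous-group id to the set of strain names
    in which at least one member of the group was found.
    """
    strains = set().union(*presence.values())
    n_strains = len(strains)
    fractions = {"universal": [], "periphery": [], "unique": []}
    for group, carriers in sorted(presence.items()):
        if len(carriers) == n_strains:
            fractions["universal"].append(group)   # present in every strain
        elif len(carriers) == 1:
            fractions["unique"].append(group)      # strain-specific
        else:
            fractions["periphery"].append(group)   # present in some strains
    return fractions


# Hypothetical toy input with three strains and four orthologous groups.
toy_presence = {
    "og1": {"strainA", "strainB", "strainC"},
    "og2": {"strainA", "strainB"},
    "og3": {"strainC"},
    "og4": {"strainA", "strainB", "strainC"},
}
toy_fractions = pan_genome_fractions(toy_presence)
```

With a single strain, every group would be both universal and unique; the sketch reports such groups as universal.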
How the nuclear lamina (NL) impacts on global chromatin architecture is poorly understood. Here, we show that NL disruption in Drosophila S2 cells leads to chromatin compaction and repositioning from the nuclear envelope. This increases the chromatin density in a fraction of topologically-associating domains (TADs) enriched in active chromatin and enhances interactions between active and inactive chromatin. Importantly, upon NL disruption the NL-associated TADs become more acetylated at histone H3 and less compact, while background transcription is derepressed. Two-colour FISH confirms that a TAD becomes less compact following its release from the NL. Finally, polymer simulations show that chromatin binding to the NL can per se compact attached TADs. Collectively, our findings demonstrate a dual function of the NL in shaping the 3D genome. Attachment of TADs to the NL makes them more condensed but decreases the overall chromatin density in the nucleus by stretching interphase chromosomes.
The role of 3’-end stem-loops in transposition was experimentally demonstrated for transposons of various species, where LINE-SINE transposons share the same 3’-end sequences, containing a stem-loop. We have discovered that 62-68% of processed pseudogenes and mRNAs also have 3’-end stem-loops. We investigated the properties of 3’-end stem-loops of human L1s, Alus, processed pseudogenes and mRNAs that do not share the same sequences, but all have 3’-end stem-loops. We have built sequence-based and structure-based machine-learning models that are able to recognize 3’-end L1, Alu, processed pseudogene and mRNA stem-loops with high performance. The sequence-based models use only sequence information and capture compositional bias in 3’-ends. The structure-based models consider physical, chemical and geometrical properties of dinucleotides composing a stem and position-specific nucleotide content of a loop and a bulge. The most important parameters include shift, tilt, rise, and hydrophilicity. The obtained results clearly point to the existence of structural constraints for 3’-end stem-loops of L1 and Alu, which are probably important for transposition, and reveal the potential of mRNAs to be recognized by the L1 machinery. The constructed models are freely available on GitHub (https://github.com/AlexShein/transposons/) and can be used for de novo discovery of transposon-related stem-loops.
We trained a Random Forest model to recognize patterns of nucleosomes and non-B DNA structures, considered as potential nucleosome barriers, in the mouse genome. We showed that among four types of structures (Z-DNA, H-DNA, G-quadruplexes and SIDD regions), recognition of G-quadruplexes and H-DNA showed the best performance.
We built and evaluated two types of models, sequence-based and structure-based, for recognition of 3’-end stem-loops of human L1s and Alus, and found the most important parameters contributing to recognition: shift, tilt and rise, and also hydrophilicity.
Background: Chromosomal rearrangements are typical phenomena in cancer genomes, causing gene disruptions and fusions, corruption of regulatory elements, and damage to chromosome integrity. Among the factors contributing to genomic instability are non-B DNA structures, with stem-loops and quadruplexes being the most prevalent. We aimed to investigate the impact of specifically these two classes of non-B DNA structures on cancer breakpoint hotspots using a machine learning approach.
Methods: We developed a procedure for machine learning model building and evaluation, as the considered data are extremely imbalanced and a reliable estimate of the prediction power was required. We built logistic regression models predicting cancer breakpoint hotspots based on the densities of stem-loops and quadruplexes, jointly and separately. We also tested Random Forest models, varying the resampling schemes (leave-one-out cross-validation, train-test split, 3-fold cross-validation) and class balancing techniques (oversampling, stratification, synthetic minority oversampling).
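Two of the ingredients named above, stratification and oversampling, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names, the 6% hotspot rate in the toy labels, and the fixed seeds are all hypothetical, and plain random oversampling is shown rather than synthetic minority oversampling.

```python
import random


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class round-robin
    return folds


def oversample_minority(train_idx, labels, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Hypothetical toy labels: 6 hotspot windows (1) among 100 windows.
toy_labels = [1] * 6 + [0] * 94
toy_folds = stratified_folds(toy_labels, k=3)

# Use fold 0 as the test fold; oversample only the training part.
toy_train = [i for fold in toy_folds[1:] for i in fold]
toy_balanced_train = oversample_minority(toy_train, toy_labels)
```

Oversampling is applied after the split and only to the training indices, so that duplicated minority samples never leak into the test fold.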
Results: We analysed 487,425 breakpoints from 2234 samples covering 10 cancer types available from the International Cancer Genome Consortium. We showed that the distributions of breakpoint hotspots in different types of cancer are not correlated, confirming the heterogeneous nature of cancer. The stem-loop-based model best explains the blood, brain, liver, and prostate cancer breakpoint hotspot profiles, while the quadruplex-based model has higher performance for bone, breast, ovary, pancreatic, and skin cancer. For the overall cancer profile and uterus cancer, the joint model shows the highest performance. For particular datasets the constructed models reach high predictive power using just one predictor, and in the majority of cases the model built on both predictors does not increase the model performance.
Conclusion: Despite the heterogeneity in the distribution of breakpoint hotspots across different cancer types, our results demonstrate an association between cancer breakpoint hotspots and stem-loops and quadruplexes. For approximately half of the cancer types, stem-loops are the most influential factor, while for the others it is quadruplexes. This fact reflects the differences in regulatory potential of stem-loops and quadruplexes at the tissue-specific level, which are yet to be explored at the genome-wide scale. The performed analysis demonstrates that the influence of stem-loops and quadruplexes on breakpoint hotspot formation is tissue-specific.
Background. Restriction-modification (R-M) systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. An R-M system specifically recognizes short sites in foreign DNA and cleaves it, while such sites in the host DNA are protected by methylation. Prokaryotic viruses have developed a number of strategies to overcome this host defense. The simplest anti-restriction strategy is the elimination of recognition sites in the viral genome: no sites, no DNA cleavage. Even a decrease of the number of recognition sites can help a virus to overcome this type of host defense. Recognition site avoidance has been a known anti-restriction strategy of prokaryotic viruses for decades. However, recognition site avoidance has not been systematically studied with the currently available sequence data. We analyzed the complete genomes of almost 4000 prokaryotic viruses with known host species and more than 17,000 restriction endonucleases with known specificities in terms of recognition site avoidance.
Results. We observed considerable limitations of recognition site avoidance as an anti-restriction strategy. Namely, the avoidance of recognition sites is specific for dsDNA and ssDNA prokaryotic viruses. Avoidance is much more pronounced in the genomes of non-temperate bacteriophages than in the genomes of temperate ones. Avoidance is not observed for the sites of Type I and Type IIG systems and is very rarely observed for the sites of Type III systems. The vast majority of avoidance cases concern recognition sites of orthodox Type II restriction-modification systems. Even under these constraints, complete or almost complete elimination of sites is observed for approximately one-tenth of viral genomes and a significant under-representation for approximately one-fourth of them.
Conclusions. Avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses.
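Under-representation of a recognition site in a viral genome can be illustrated with a simple observed-to-expected ratio. The sketch below assumes a zero-order background model (independent bases with genome-wide frequencies), which is an assumption of this example and not necessarily the statistic used in the study; the function names and the toy sequences are hypothetical. GAATTC is the recognition site of EcoRI.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_overlapping(seq, site):
    """Count occurrences of site in seq, allowing overlaps."""
    return sum(1 for i in range(len(seq) - len(site) + 1)
               if seq.startswith(site, i))


def site_obs_exp(genome, site):
    """Return (observed, expected, ratio) for a site on both strands.

    Expected counts assume independent bases with genome-wide frequencies.
    A palindromic site equals its reverse complement and is counted once.
    The genome is assumed to be non-empty.
    """
    genome, site = genome.upper(), site.upper()
    n = len(genome)
    freq = {base: genome.count(base) / n for base in "ACGT"}

    def expected(word):
        p = 1.0
        for base in word:
            p *= freq[base]
        return p * (n - len(word) + 1)

    words = {site, reverse_complement(site)}
    observed = sum(count_overlapping(genome, w) for w in words)
    exp = sum(expected(w) for w in words)
    return observed, exp, (observed / exp if exp else float("nan"))


# Hypothetical toy sequences, one with and one without the site.
with_site = "GAATTC" + "ACGT" * 10
without_site = "ACGT" * 10
obs_with, exp_with, ratio_with = site_obs_exp(with_site, "GAATTC")
obs_without, exp_without, ratio_without = site_obs_exp(without_site, "GAATTC")
```

A ratio well below 1 suggests avoidance and a ratio of 0 corresponds to complete elimination of the site; on real genomes a significance test would be needed before drawing conclusions.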
Riboswitches are conserved RNA structures located in non-coding regions of mRNA that are able to bind small molecules (e.g. metabolites), changing conformation upon binding. This feature enables them to function as regulators of gene expression. The thiamin pyrophosphate (TPP) riboswitch is the only type of riboswitch found not only in bacteria, but also in eukaryotes: in plants, green algae, protists, and fungi. Two main mechanisms of fungal TPP riboswitch action, involving alternative splicing, have been established so far. Here, we report a large-scale bioinformatic study of riboswitch structural features, action mechanisms, and distribution across fungal taxonomic groups. For each putatively regulated gene, we reconstruct the riboswitch structure, identify other components of the regulation machinery, and establish mechanisms of riboswitch-mediated regulation. In addition to three genes known to be regulated by TPP riboswitches, thiazole synthase THI4, hydroxymethylpyrimidine synthase NMT1, and putative transporter NCU01977, we identify two new genes, a putative thiamin transporter THI9 and a transporter of unknown specificity. While the riboswitch sequence and structure remain highly conserved in all species and genes, the mode of riboswitch-mediated regulation varies between regulated genes. The riboswitch usage varies strongly between fungal taxa, with the largest number of riboswitch-regulated genes found in Pezizomycotina and no riboswitch-mediated regulation established in Saccharomycotina.
While most endosymbiotic bacteria are transmitted only vertically, Holospora spp., an alphaproteobacterium from the Rickettsiales order, can desert its host and invade a new one. All bacteria from the genus Holospora are intranuclear symbionts of ciliates Paramecium spp. with strict species and nuclear specificity. Comparative metabolic reconstruction based on the newly sequenced genome of Holospora curviuscula, a macronuclear symbiont of Paramecium bursaria, and known genomes of other Holospora species shows that even though all Holospora spp. can persist outside the host, they cannot synthesize most of the essential small molecules, such as amino acids, and lack some central energy metabolic pathways, including glycolysis and the citric acid cycle. As the main energy source, Holospora spp. likely rely on nucleotides pirated from the host. Holospora-specific genes absent from other Rickettsiales are possibly involved in the lifestyle switch from the infectious to the reproductive form and in cell invasion.
Changes in splicing are known to affect the function and regulation of genes. We analyzed splicing events that take place during the postnatal development of the prefrontal cortex in humans, chimpanzees, and rhesus macaques based on data obtained from 168 individuals. Our study revealed that among the 38,822 quantified alternative exons, 15% are differentially spliced among species, and more than 6% splice differently at different ages. Mutations in splicing acceptor and/or donor sites might explain more than 14% of all splicing differences among species and up to 64% of high-amplitude differences. A reconstructed trans-regulatory network containing 21 RNA-binding proteins explains a further 4% of splicing variations within species. While most age-dependent splicing patterns are conserved among the three species, developmental changes in intron retention are substantially more pronounced in humans.
Polypedilum vanderplanki is a striking and unique example of an insect that can survive almost complete desiccation. Its genome and a set of dehydration-rehydration transcriptomes, together with the genome of Polypedilum nubifer (a congeneric desiccation-sensitive midge), were recently released. Here, using published and newly generated datasets reflecting detailed transcriptome changes during anhydrobiosis, as well as a developmental series, we show that the TCTAGAA DNA motif, which closely resembles the binding motif of the Drosophila melanogaster heat shock transcription activator (Hsf), is significantly enriched in the promoter regions of desiccation-induced genes in P. vanderplanki, such as genes encoding late embryogenesis abundant (LEA) proteins, thioredoxins, or trehalose metabolism-related genes, but not in P. nubifer. Unlike P. nubifer, P. vanderplanki has double TCTAGAA sites upstream of the Hsf gene itself, which is probably responsible for the stronger activation of Hsf in P. vanderplanki during desiccation compared with P. nubifer. To confirm the role of Hsf in desiccation-induced gene activation, we used the Pv11 cell line, derived from P. vanderplanki embryo. After preincubation with trehalose, Pv11 cells can enter anhydrobiosis and survive desiccation. We showed that Hsf knockdown suppresses trehalose-induced activation of multiple predicted Hsf targets (including P. vanderplanki-specific LEA protein genes) and reduces the desiccation survival rate of Pv11 cells fivefold. Thus, cooption of the heat shock regulatory system has been an important evolutionary mechanism for adaptation to desiccation in P. vanderplanki.
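Enrichment of a motif such as TCTAGAA in a set of promoters can be assessed, for example, with a one-sided hypergeometric test. The sketch below is one possible way to do this and is not claimed to be the test used in the study; the function names and the idea of scoring a promoter simply by presence of the motif on either strand are assumptions of the example.

```python
from math import comb

MOTIF = "TCTAGAA"


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_motif(promoter, motif=MOTIF):
    """True if the motif occurs on either strand of the promoter."""
    promoter = promoter.upper()
    return motif in promoter or reverse_complement(motif) in promoter


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: all promoters considered; K: promoters carrying the motif;
    n: promoters of the gene set of interest (e.g. induced genes);
    k: promoters in that set carrying the motif.
    """
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def motif_enrichment(induced_promoters, other_promoters):
    """Return (k, n, K, N, p) for motif presence in the induced set."""
    k = sum(has_motif(p) for p in induced_promoters)
    n = len(induced_promoters)
    K = k + sum(has_motif(p) for p in other_promoters)
    N = n + len(other_promoters)
    return k, n, K, N, hypergeom_enrichment_p(k, n, K, N)
```

For example, `motif_enrichment(induced, background)` could be called with two lists of upstream sequences, where `induced` and `background` are hypothetical names for the desiccation-induced and remaining promoters.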
Genome rearrangements have played an important role in the evolution of Yersinia pestis from its progenitor Yersinia pseudotuberculosis. Traditional phylogenetic trees for Y. pestis based on sequence comparison have short internal branches and low bootstrap supports, as only a small number of nucleotide substitutions have occurred. On the other hand, even a small number of genome rearrangements may resolve topological ambiguities in a phylogenetic tree. We reconstructed phylogenetic trees based on genome rearrangements using several popular approaches, such as Maximum Likelihood for Gene Order and the Bayesian model of genome rearrangements by inversions. We also reconciled phylogenetic trees for each of the three CRISPR loci to obtain an integrated scenario of the CRISPR cassette evolution. Analysis of contradictions between the obtained evolutionary trees yielded numerous parallel inversions and gain/loss events. Our data indicate that an integrated analysis of sequence-based and inversion-based trees enhances the resolution of phylogenetic reconstruction. In contrast, reconstructions of strain relationships based solely on CRISPR loci may not be reliable, as the history is obscured by large deletions, obliterating the order of spacer gains. Similarly, numerous parallel gene losses preclude reconstruction of phylogeny based on gene content.
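To make the notion of rearrangement-based comparison concrete, the sketch below counts breakpoints between two signed gene orders. This is a much simpler measure than the likelihood and Bayesian approaches named above and is shown only as an illustration; the function names and the encoding of genes as signed integers on a linear chromosome are assumptions of the example.

```python
def adjacency_set(order):
    """All adjacencies of a signed linear gene order, in both readings.

    Genes are non-zero integers; the sign encodes the strand. Two cap
    genes are added at the chromosome ends so that end positions count.
    """
    cap = max(abs(g) for g in order) + 1
    extended = [cap] + list(order) + [cap + 1]
    pairs = set()
    for a, b in zip(extended, extended[1:]):
        pairs.add((a, b))
        pairs.add((-b, -a))  # the same adjacency read from the other strand
    return pairs


def breakpoints(order_a, order_b):
    """Count adjacencies of order_a that are not preserved in order_b.

    Both orders must contain the same genes exactly once.
    """
    cap = max(abs(g) for g in order_a) + 1
    extended = [cap] + list(order_a) + [cap + 1]
    preserved = adjacency_set(order_b)
    return sum(1 for pair in zip(extended, extended[1:])
               if pair not in preserved)


# Hypothetical toy orders: genes 2 and 3 are inverted in the second one.
reference = [1, 2, 3, 4, 5]
inverted = [1, -3, -2, 4, 5]
n_breakpoints = breakpoints(reference, inverted)
```

A single inversion creates at most two breakpoints, so the breakpoint count gives a rough lower bound on the number of inversions separating two gene orders.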
BACKGROUND: The genus Burkholderia consists of species that occupy remarkably diverse ecological niches. Its best known members are important pathogens, B. mallei and B. pseudomallei, which cause glanders and melioidosis, respectively. Burkholderia genomes are unusual due to their multichromosomal organization, generally comprising 2-3 chromosomes. RESULTS: We performed integrated genomic analysis of 127 Burkholderia strains. The pan-genome is open with the saturation to be reached between 86,000 and 88,000 genes. The reconstructed rearrangements indicate a strong avoidance of intra-replichore inversions that is likely caused by selection against the transfer of large groups of genes between the leading and the lagging strands. Translocated genes also tend to retain their position in the leading or the lagging strand, and this selection is stronger for large syntenies. Integrated reconstruction of chromosome rearrangements in the context of strain phylogeny reveals parallel rearrangements that may indicate inversion-based phase variation and integration of new genomic islands. In particular, we detected parallel inversions in the second chromosomes of B. pseudomallei with breakpoints formed by genes encoding membrane components of a multidrug resistance complex, which may be linked to a phase variation mechanism. Two genomic islands, spreading horizontally between chromosomes, were detected in the B. cepacia group. CONCLUSIONS: This study demonstrates the power of integrated analysis of pan-genomes, chromosome rearrangements, and selection regimes. Non-random inversion patterns indicate selective pressure; inversions are particularly frequent in the recent pathogen B. mallei and, together with periods of positive selection at other branches, may indicate adaptation to new niches. One such adaptation could be a possible phase variation mechanism in B. pseudomallei.
With advances in sequencing technology, the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have collected data on more than 16,000 genome-wide tumor-normal tissue pairs, providing a valuable resource for studying cancer mutations. In this research we focus on a preliminary evaluation of the relationship between cancer breakpoint hotspots and DNA regions potentially forming secondary structures such as stem-loops (cruciforms) and quadruplexes. We analysed 2,234 samples covering 10 cancer types and built machine-learning models predicting the distribution of cancer breakpoints over a chromosome based on the density distribution of stem-loops and quadruplexes. We developed a procedure for building and evaluating machine learning models, as the considered data are extremely imbalanced and a reliable estimate of prediction power is needed. We conducted a set of experiments to select the most appropriate resampling scheme, class balancing technique and parameters of machine learning algorithms. The best final models were applied to the cancer breakpoint data. From the performed analysis it can be concluded that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists; however, this relationship is generally weak for stem-loops and stronger for quadruplexes. We also found differences in model predictive power depending on cancer type. Thus, the stem-loop-based model performs better for pancreatic, prostate, ovary, uterus, brain and liver cancer, and the quadruplex-based model works better for blood, bone, skin and breast cancer.
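The density distributions used as predictors can be pictured as counts of features per fixed-size window along a chromosome. The sketch below shows one way to compute them; the window size of 1 Mb, the 0-based coordinates, and the function names are assumptions of this example, since the abstract does not state the binning used.

```python
def window_counts(positions, chrom_length, window=1_000_000):
    """Count positions falling into consecutive windows of a chromosome.

    positions are 0-based coordinates; positions outside the chromosome
    are ignored. The last window may be shorter than the others.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if 0 <= pos < chrom_length:
            counts[pos // window] += 1
    return counts


def densities_per_mb(counts, chrom_length, window=1_000_000):
    """Convert window counts to densities per megabase.

    The (possibly shorter) last window is normalized by its true length.
    """
    densities = []
    for i, count in enumerate(counts):
        length = min(window, chrom_length - i * window)
        densities.append(count * 1_000_000 / length)
    return densities


# Hypothetical toy positions on a 2.5 Mb chromosome.
toy_positions = [0, 999_999, 1_000_000, 2_400_000]
toy_counts = window_counts(toy_positions, 2_500_000)
toy_densities = densities_per_mb(toy_counts, 2_500_000)
```

Applying the same binning to breakpoints and to predicted stem-loops or quadruplexes yields aligned vectors, one value per window, that can serve as the target and the predictors of a model.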
Corals harbor complex and diverse microbial communities that strongly impact host fitness and resistance to diseases, but these microbes themselves can be influenced by stresses, like those caused by the presence of macroscopic symbionts. In addition to directly influencing the host, symbionts may transmit pathogenic microbial communities. We analyzed two coral gall-forming copepod systems by using 16S rRNA gene metagenomic sequencing: (1) the sea fan Gorgonia ventalina with copepods of the genus Sphaerippe from the Caribbean and (2) the scleractinian coral Stylophora pistillata with copepods of the genus Spaniomolgus from the Saudi Arabian part of the Red Sea. We show that bacterial communities in these two systems were substantially different with Actinobacteria, Alphaproteobacteria, and Betaproteobacteria more prevalent in samples from Gorgonia ventalina, and Gammaproteobacteria in Stylophora pistillata. In Stylophora pistillata, normal coral microbiomes were enriched with the common coral symbiont Endozoicomonas and some unclassified bacteria, while copepod and gall-tissue microbiomes were highly enriched with the family ME2 (Oceanospirillales) or Rhodobacteraceae. In Gorgonia ventalina, no bacterial group had significantly different prevalence in the normal coral tissues, copepods, and injured tissues. The total microbiome composition of polyps injured by copepods was different. Contrary to our expectations, the microbial community composition of the injured gall tissues was not directly affected by the microbiome of the gall-forming symbiont copepods.
Sequencing of complete nuclear genomes of Neanderthal and Denisovan stimulated studies about their relationship with modern humans demonstrating, in particular, that DNA alleles from both Neanderthal and Denisovan genomes are present in genomes of modern humans. The Papuan genome is a unique object because it contains both Neanderthal and Denisovan alleles. Here, we have shown that the Papuan genomes contain different gene functional groups inherited from each of the ancient people. The Papuan genomes demonstrate a relative prevalence of Neanderthal alleles in genes responsible for the regulation of transcription and neurogenesis. The enrichment of specific functional groups with Denisovan alleles is less pronounced; these groups are responsible for bone and tissue remodeling. This analysis shows that introgression of alleles from Neanderthals and Denisovans to Papuans occurred independently and retention of these alleles may carry specific adaptive advantages.
The pangenome is the collection of all groups of orthologous genes (OGGs) from a set of genomes. We apply pangenome analysis to propose a definition of prokaryotic species based on identification of lineage-specific gene sets. While being similar to the classical biological definition based on allele flow, it does not rely on DNA similarity levels and does not require analysis of homologous recombination. Hence this definition is relatively objective and independent of arbitrary thresholds. A systematic analysis of 110 accepted species with the largest numbers of sequenced strains yields results largely consistent with the existing nomenclature. However, it has revealed that the abundant marine cyanobacterium Prochlorococcus marinus should be divided into two species. As a control, we have confirmed the paraphyletic origin of Yersinia pseudotuberculosis (with embedded, monophyletic Y. pestis) and Burkholderia pseudomallei (with B. mallei). We also demonstrate that, by our definition and in accordance with recent studies, Escherichia coli and Shigella spp. are one species.
Background. Many algorithms and programs are available for phylogenetic reconstruction of families of proteins. Methods used widely at present use either a number of distance-based principles or character-based principles of maximum parsimony or maximum likelihood.
Results. We developed a novel program, named PQ, for reconstructing protein and nucleic acid phylogenies following a new character-based principle. When tested on natural sequences, PQ improves upon the results of maximum parsimony and maximum likelihood. Working with alignments of 10 and 15 sequences, it also outperforms the FastME program, which is based on one of the distance-based principles. Among all tested programs, PQ proved to be the least susceptible to long branch attraction. FastME outperforms PQ when processing alignments of 45 sequences, however. We confirm a recent result that on natural sequences FastME outperforms maximum parsimony and maximum likelihood. At the same time, both PQ and FastME are inferior to maximum parsimony and maximum likelihood on simulated sequences. PQ is open source and available to the public via an online interface.
Conclusions. The software we developed offers an open-source alternative for phylogenetic reconstruction for relatively small sets of proteins and nucleic acids, with up to a few tens of sequences.
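For readers unfamiliar with character-based reconstruction, the sketch below scores a fixed tree under maximum parsimony, one of the established principles mentioned above, using Fitch's algorithm. It does not implement the principle used by PQ, which is not described here; the tree encoding as nested tuples, the function names, and the toy alignment are assumptions of the example.

```python
def fitch_column_score(tree, states):
    """Minimum number of state changes for one alignment column.

    tree is a rooted binary tree written as nested 2-tuples of leaf names;
    states maps each leaf name to its character in the column.
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        # No shared state: one change is required at this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum of Fitch scores over all columns of an alignment.

    alignment maps leaf names to equal-length aligned sequences.
    """
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_column_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )


# Hypothetical toy alignment of four sequences and two candidate topologies.
toy_alignment = {"a": "AAG", "b": "AAG", "c": "GAC", "d": "GAC"}
tree_ab_cd = (("a", "b"), ("c", "d"))
tree_ac_bd = (("a", "c"), ("b", "d"))
score_ab_cd = parsimony_score(tree_ab_cd, toy_alignment)
score_ac_bd = parsimony_score(tree_ac_bd, toy_alignment)
```

Under maximum parsimony the topology with the lower score is preferred; a full reconstruction program would additionally search over tree topologies.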