AIC Lab Seminar "Progress report on Lab projects"

Event has ended

Title: Progress report on Lab projects
Abstract: The seminar will present the results of five subprojects carried out in the AIC Lab.
Link: Zoom

Asad Muhammad,
Research and Educational Laboratory of Artificial Intelligence for Computational Biology: Research Intern

This coursework presents a Retrieval-Augmented Generation (RAG) chatbot designed to provide accurate and contextually relevant responses by drawing on a corpus of 30,000 PDF documents. We describe the methodology, including PDF parsing, text chunking, embedding creation, vector storage, and response generation using the Ollama platform. The system's performance is evaluated, and its limitations, such as retrieval accuracy and PDF parsing challenges, are discussed.
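The chunking, embedding, storage, and retrieval steps named above can be illustrated with a minimal sketch. It is not the author's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector store, and the function names, chunk sizes, and prompt wording are all assumptions. The final prompt would be passed to a language model (served through Ollama in the project); that call is omitted here.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding: word-count vector (assumption; a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, top_k=2):
    """Return the top_k stored chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query, context_chunks):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Two short strings stand in for text already extracted from PDFs.
documents = [
    "Tandem mass spectrometry measures fragment ions of peptides.",
    "The Levenshtein distance counts single-character edits between two strings.",
]
store = [(chunk, embed(chunk)) for doc in documents for chunk in chunk_text(doc)]
question = "What does the Levenshtein distance count?"
top_chunks = retrieve(question, store, top_k=1)
prompt = build_prompt(question, top_chunks)
```

Overlapping chunks are a common choice because they reduce the chance that a relevant sentence is split across a chunk boundary; the retrieval accuracy limitation mentioned in the abstract depends heavily on how well the embedding captures meaning, which this toy version does not attempt.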

Tevyashov Mikhail Mikhailovich,
Research and Educational Laboratory of Artificial Intelligence for Computational Biology: Research Intern

This paper analyzes the efficiency and accuracy of de novo peptide sequencing and database search methods applied to mass spectrometry data. Prediction accuracy was compared by matching theoretically calculated peaks against experimentally observed ones. The Levenshtein distance was also used to quantify the differences between amino acid sequences, and the accuracy of both methods was compared at each individual Levenshtein distance value. In the course of the work, software tools were implemented for comparing spectrometric data, calculating theoretical peaks, analyzing peak matches, visualizing distributions, and assessing how prediction accuracy depends on the Levenshtein distance. The results confirm the authors' initial hypothesis: de novo sequencing is, on average, less accurate than database search and therefore needs to be combined with other methods.
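The two comparison measures described above can be sketched as follows. This is an illustrative sketch rather than the tools built in the project: it assumes singly charged b and y fragment ions, standard monoisotopic residue masses, no modifications, and a hypothetical matching tolerance of 0.02 m/z; the function names are made up for the example.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def theoretical_peaks(peptide):
    """Sorted m/z values of singly charged b and y ions for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for i in range(1, len(masses)):
        peaks.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        peaks.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return sorted(peaks)


def matched_fraction(theoretical, observed, tol=0.02):
    """Fraction of theoretical peaks with an observed peak within tol (assumed value)."""
    if not theoretical:
        return 0.0
    hits = sum(1 for t in theoretical if any(abs(t - o) <= tol for o in observed))
    return hits / len(theoretical)


distance = levenshtein("PEPTIDE", "PEPTLDE")   # one substitution
peaks = theoretical_peaks("GA")                # one b ion and one y ion
fraction = matched_fraction(peaks, [58.03, 200.0])
```

Grouping predictions by their Levenshtein distance to the reference sequence and averaging `matched_fraction` within each group is one way to examine how peak agreement changes with sequence error, in the spirit of the per-distance comparison described above.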

Gladkikh Roman Evgenievich,
Research and Educational Laboratory of Artificial Intelligence for Computational Biology: Research Intern

Tandem mass spectrometry is the only method that can quickly analyze the protein content of complex biological samples, and it is the main technology driving the development of proteomics. One of the main challenges in this field is to determine the amino acid sequence that generated each observed spectrum without relying on a pre-existing database of peptide sequences. With the advent of deep learning, models such as Casanovo, a transformer-based architecture designed specifically for de novo peptide sequencing, have achieved notable improvements in prediction performance. Despite these advances, a persistent issue remains: the reliability of these predictions is often undermined by the absence of well-calibrated confidence estimates. To address this limitation, this paper describes the development and evaluation of a new statistical calibration strategy for Casanovo's prediction outputs. Drawing inspiration from methods such as Tailor, which successfully apply nonparametric p-value calibration to database-driven peptide identification, we adapt and extend this principle to the de novo sequencing scenario. Our methodology introduces a confidence scoring system based on p-values derived from the Casanovo transformer scores of the chosen amino acid sequence.
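The general idea of a nonparametric p-value can be shown with a short sketch. It is a generic empirical p-value, not the calibration procedure developed in this work and not Tailor's algorithm: it assumes a list of null scores is already available for a spectrum (how such scores would be obtained from Casanovo is not specified here), and the function name is hypothetical.

```python
from bisect import bisect_left


def empirical_p_value(score, null_scores):
    """Empirical p-value: share of null scores at least as large as the observed score.

    The +1 in numerator and denominator keeps the estimate strictly above zero,
    a common convention for finite null samples.
    """
    ordered = sorted(null_scores)
    n_at_least = len(ordered) - bisect_left(ordered, score)
    return (n_at_least + 1) / (len(ordered) + 1)


# Hypothetical null scores for one spectrum (illustrative values only).
null = [0.1, 0.2, 0.95, 0.5]
p_top = empirical_p_value(0.9, null)
```

A small p-value means few null scores reach the observed score, so the prediction stands out from its background; unlike a raw model score, the value is comparable across spectra as long as each spectrum's null sample is representative.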

Joshi Kartik,
Research and Educational Laboratory of Artificial Intelligence for Computational Biology: Research Intern

De novo peptide sequencing from mass spectrometry (MS) data is pivotal for proteomics, yet challenges persist in achieving robustness against spectral variability and experimental biases. This thesis evaluates five state-of-the-art deep learning models (PepNet, Casanovo, π-PrimeNovo, π-HelixNovo, and ContraNovo) to assess their accuracy and resilience under modulated data conditions, including noise injection and ion shifts. Using a curated dataset (ProteomeXchange PXD004452), we benchmarked the models on high-confidence spectra perturbed with Gaussian noise and systematic m/z shifts that mimic real-world artifacts. Results revealed ContraNovo's superior performance on unperturbed data (34.23% peptide-level precision) and PepNet's noise resilience (30.18% vs. 31.36% for ContraNovo). However, all models failed completely under ion shift perturbations, highlighting critical vulnerabilities to structural spectral changes. Comparative analysis underscored architectural strengths: ContraNovo's contrastive learning enhanced discriminative alignment, PepNet's convolutional design improved noise tolerance, and π-HelixNovo's complementary spectra mitigated missing peaks. The study emphasizes the limitations of current models, particularly their overreliance on homogeneous training data and their inability to generalize across structural perturbations. These findings advocate for enriched training datasets encompassing diverse fragmentation patterns and species-specific variability, alongside architectural innovations integrating adversarial training, multi-modal data, and enhanced attention mechanisms. This work advances the development of robust computational frameworks for de novo sequencing, essential for applications in clinical proteomics and precision medicine.
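The two perturbation types and the peptide-level metric mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: a spectrum is represented as a list of `(m/z, intensity)` pairs, the Gaussian noise is assumed to act on intensities, the shift is a constant added to every m/z value, and peptide-level precision is taken to be the fraction of predictions that match the reference sequence exactly. All function names and parameter values are hypothetical.

```python
import random


def add_gaussian_noise(spectrum, sigma, rng):
    """Add zero-mean Gaussian noise to each intensity, clipping at zero."""
    return [(mz, max(0.0, inten + rng.gauss(0.0, sigma))) for mz, inten in spectrum]


def shift_mz(spectrum, delta):
    """Apply a systematic shift of delta to every m/z value."""
    return [(mz + delta, inten) for mz, inten in spectrum]


def peptide_precision(predicted, reference):
    """Fraction of predicted peptides that equal their reference exactly."""
    if not predicted:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(predicted)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
spectrum = [(100.0, 1.0), (200.5, 0.4), (350.25, 0.8)]
noisy = add_gaussian_noise(spectrum, sigma=0.05, rng=rng)
shifted = shift_mz(spectrum, delta=0.5)
precision = peptide_precision(["PEPTIDE", "ACDK"], ["PEPTIDE", "ACDR"])
```

Running each model on the original, noisy, and shifted versions of the same spectra and comparing `peptide_precision` across the three conditions gives the kind of robustness comparison the abstract reports.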

Date

June 4, 14:00
