Seminar of the Laboratory of Artificial Intelligence for Computational Biology
The Laboratory of Artificial Intelligence for Computational Biology will host a seminar at which the laboratory's research assistants, Afina Poderni and Daniil Sulimov, will present their research.
The seminar will take place on June 6 at 14:00.
Title: "Deep Neural Networks for Peptide Identification in Data Independent Acquisition Mass Spectrometry"
Abstract: Mass spectrometry is an analytical technique for quantifying molecules and determining their structure, and it is widely used in fields such as medicine, cosmetology and marine science. It was developed over 100 years ago but continues to evolve alongside computer technology. Mass spectrometers produce huge amounts of data that must be analyzed and stored, and researchers have proposed various approaches to data acquisition and preprocessing. A notable advance in this domain is data-independent acquisition (DIA) mass spectrometry, which minimizes data loss and looks promising for improving the quality of peptide identification. DIA also increases the size and complexity of the data, but as technology advances this problem is becoming less significant, and researchers' interest in DIA mass spectrometry is growing. Machine learning algorithms and deep neural networks offer a promising direction for mass spectrometry (MS) data analysis. MS data is prone to sudden appearances and disappearances of spectral ions, but at the same time, owing to its chemical nature, it contains many connected components, which neural networks could exploit to generalize and find hidden patterns in the data. The goal of this work is to develop a convolutional neural network for preprocessing DIA spectra in order to improve the quality of peptide identification. To this end, basic methods and tools for MS analysis were studied and implemented in a pipeline together with model training. The thesis presents the results of experiments on training the model with different parameters on a DIA dataset of mass spectrometry experiments with Saccharomyces cerevisiae (baker's yeast); these experiments helped to improve the quality of peptide identification for low-resolution data.
Laboratory of Artificial Intelligence for Computational Biology: Research Assistant
Title: PEFT (Parameter-Efficient Fine-Tuning) for GPT-like Deep Models to Reduce Hallucinations and to Improve Reproducibility in Scientific Text Generation Using Stochastic Optimization Techniques
Abstract: Large Language Models (LLMs) have demonstrated impressive performance in a variety of language-related tasks, including text generation, machine translation and text summarization. However, the output of an LLM is sometimes inaccurate. This thesis aims to fine-tune an existing LLM, GPT-2 by OpenAI, to reduce the model's hallucinations and increase the reproducibility of its answers in the mass spectrometry domain. The research drew on data engineering, stochastic modelling, data science and statistics. I used two servers for all experiments: the cHARISMa server of the Higher School of Economics (HSE) for fine-tuning, and the AI for Computational Biology (AIC) server, where I ran the Docker images needed for data preprocessing. The fine-tuned model was named MassSpecGPT (MS-GPT). The thesis includes a novel approach to computing a reproducibility score and uses the Wilcoxon rank-sum test to compare the fine-tuned MS-GPT against the base GPT-2 in terms of reproducibility. The selection of optimal parameters (optimizer, learning rate) was based on several factors: validation error, run time, random-access memory (RAM) usage and electricity usage. The model was fine-tuned with Low-Rank Adaptation (LoRA) adapters, currently a state-of-the-art (SOTA) method. I used common Natural Language Generation (NLG) evaluation metrics to compare the models: Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and perplexity. As a result, the BLEU score increased from 0.33 to 0.34, ROUGE-1 from 0.42 to 0.44 and ROUGE-L from 0.57 to 0.62, perplexity decreased from 13586.37 to 10092.12, and the reproducibility score rose from 0.83 to 0.84. The changes in perplexity and reproducibility were statistically significant at the 5% significance level.
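For readers unfamiliar with two of the quantities named in this abstract, the following is a minimal sketch of how perplexity and a Wilcoxon rank-sum test are commonly computed. It is not the thesis's implementation: the function names `perplexity` and `rank_sum_test`, the normal approximation for the p-value, and the toy scores at the bottom are all assumptions made for illustration, and the thesis's own reproducibility score is not reproduced here.

```python
import math
from statistics import NormalDist


def perplexity(token_log_probs):
    """Perplexity as exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Tied values receive the average of the ranks they span. The variance
    is not tie-corrected, so this is only a sketch for small examples.
    Returns (z, p_value).
    """
    # Pool both samples, remembering which sample (0 = x, 1 = y) each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - mean) / std
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical per-answer scores for a base and a fine-tuned model (made-up numbers).
    base_scores = [0.81, 0.79, 0.84, 0.80, 0.83]
    tuned_scores = [0.85, 0.86, 0.82, 0.88, 0.87]
    z, p = rank_sum_test(base_scores, tuned_scores)
    print(f"z = {z:.3f}, p = {p:.3f}")
    # Hypothetical token log-probabilities for one generated sentence.
    print(f"perplexity = {perplexity([-2.1, -0.7, -3.4, -1.2]):.2f}")
```

A difference would be called significant at the 5% level when the returned p-value is below 0.05. For real experiments an established routine such as `scipy.stats.ranksums` would normally be preferred over a hand-written test.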