The Faculty of Computer Science was created with the goal of becoming one of the world’s leading faculties for developers and researchers in data analysis, machine learning, big data, theoretical computer science, bioinformatics, system and software engineering, system programming, and distributed computing. In cooperation with major companies like Yandex, Sberbank, SAS, Samsung, 1C, and many others, the Faculty provides both deep theoretical knowledge and hands-on practical experience in many branches of contemporary computer science.
September 28, 18:10
Speaker: Tatiana Likhomanenko (Apple)
Title: Positional Embedding in Transformer-based Models
Transformers have proven highly effective on sequence modeling problems such as machine translation (MT) and natural language processing (NLP). Following this success, the Transformer architecture attracted immediate interest in other domains: automatic speech recognition (ASR), music generation, object detection, and eventually image recognition and video understanding. Two major components of the Transformer are the attention mechanism and the positional encoding. Without the latter, vanilla attention Transformers are invariant with respect to permutations of the input tokens (making "cat eats fish" and "fish eats cat" identical to the model). In this talk we will discuss different approaches to encoding positional information and their pros and cons: absolute and relative, fixed and learnable, 1D and multidimensional, additive and multiplicative, continuous and augmented positional embeddings. We will also focus on how well different positional embeddings generalize to unseen positions, in both interpolation and extrapolation settings.
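To make one of the variants above concrete, here is a minimal NumPy sketch of the fixed, absolute, additive sinusoidal encoding from the original Transformer paper ("Attention Is All You Need"); the function name and the toy shapes are illustrative choices, not material from the talk.

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Fixed absolute encoding (d_model assumed even):
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]       # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model/2), holds 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

# Additive use: the encoding is summed with the token embeddings, so that
# "cat eats fish" and "fish eats cat" are no longer identical inputs.
seq_len, d_model = 3, 8
token_embeddings = np.random.randn(seq_len, d_model)  # stand-in for learned embeddings
x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)

Because the encoding is a fixed function of position rather than a learned table, it can be evaluated at positions longer than any sequence seen in training, which is exactly the extrapolation question the talk examines.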