Transformer Architectures for Multimodal Signal Processing and Decision Making

  • Course Program

Transformer neural architectures have become the de facto models of choice in natural language processing (NLP). In computer vision, there has recently been a surge of interest in end-to-end Transformers, prompting efforts to replace hand-wired features and inductive biases with general-purpose neural architectures powered by data-driven training. Transformer architectures have also achieved state-of-the-art performance in multimodal learning, protein structure prediction, decision making, and other domains.

These results indicate that Transformer architectures hold great potential beyond the domains mentioned above, including in the signal processing (SP) community. We envision that these efforts may lead to a unified knowledge base that produces versatile representations for different data modalities, simplifying the inference and deployment of deep learning models in various application scenarios. Hence, we believe it is timely to offer a short course on Transformer architectures and the associated learning algorithms.

What you will learn:

  • Become familiar with self-attention and the other building blocks of Transformers, the vanilla Transformer architecture, and its variations (a minimal code sketch of self-attention follows this list)
  • Learn about Transformers’ applications in computer vision and natural language processing: ViT, Swin-Transformers, BERT, GPT-3, etc.
  • Understand supervised, self-supervised, and multimodal self-supervised learning algorithms for training a Transformer
  • Acquire visualization methods to inspect a Transformer
  • Learn advanced topics: related neural architectures (e.g., MLP-Mixer), applications in visual navigation, decision Transformers, etc.
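
To give a concrete flavor of the first learning objective, here is a minimal NumPy sketch of scaled dot-product self-attention, the core building block referenced above. The token count, model dimension, and random projection matrices are illustrative assumptions only, not course material.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Compute softmax(Q K^T / sqrt(d_k)) V, the core Transformer operation."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # weighted sum of values

    # Self-attention: Q, K, and V are linear projections of the same token sequence.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))                         # 4 tokens, model dimension 8
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
    print(out.shape)                                    # (4, 8): one output per token

The 1/sqrt(d_k) scaling keeps the dot products from growing with the key dimension, which would otherwise push the softmax into a low-gradient regime; the course covers this and the remaining components (multi-head attention, positional encodings, feed-forward blocks) in detail.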

Instructors

Chen Sun

Chen Sun received his Ph.D. from the University of Southern California in 2016, advised by Prof. Ram Nevatia, and completed his bachelor's degree in Computer Science at Tsinghua University in 2011. He is an assistant professor of computer science at Brown University, where he directs the PALM research lab, studying computer vision, machine learning, and artificial intelligence; he also works part-time as a staff research scientist at Google DeepMind. He has received Brown University's Richard B. Salomon Faculty Research Award and Samsung AIT's Global Research Outreach Award for multimodal concept learning from videos. His research on behavior prediction was a best paper finalist at CVPR 2019. He currently serves as an area chair for the CVPR, NeurIPS, and ACL conferences and is a junior faculty teaching fellow at Brown.

Boqing Gong

Boqing Gong received his Ph.D. in 2015 from the University of Southern California, where his work was partially supported by the Viterbi Fellowship. He is a research scientist at Google in Seattle. His research in machine learning and computer vision focuses on efficiency, generalization, and the visual analytics of objects, scenes, human activities, and their attributes. Before joining Google in 2019, he worked at Tencent and was a tenure-track assistant professor at the University of Central Florida (UCF). He received an NSF CRII award in 2016 and an NSF BIGDATA award in 2017, each the first of its kind granted to UCF.

Course Details

  • Course Provider: Signal Processing Society
  • Course Number: SPSILN001
  • Credits: 1 CEU / 10 PDH
  • Publication Year: 2024