Mohit Bansal - Multimodal Generative LLMs: Unification, Interpretability, Evaluation


In this talk, I will present our journey of large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts, etc.) and enhancing important aspects such as unification, efficiency, interpretability, and evaluation. We will start by discussing early cross-modal vision-and-language pretraining models (LXMERT). We will then look at early unified models (VL-T5) to combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating all tasks as text generation. We will also look at recent advanced unified models (with joint objectives and architecture, as well as newer unified modalities during encoding and decoding) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), and composable any-to-any text-audio-image-video multimodal generation (CoDi). Second, we will discuss interpretable and controllable multimodal generation (to improve faithfulness) via LLM-based planning and programming, such as layout-controllable image generation via visual programming (VPGen), consistent multi-scene video generation via LLM-guided planning (VideoDirectorGPT), and open-domain, open-platform diagram generation (DiagrammerGPT). I will conclude with important faithfulness and bias evaluation aspects of multimodal generation models, based on fine-grained skill and social bias evaluation (DALL-Eval), as well as interpretable and explainable visual programs (VPEval).


Dr. Mohit Bansal is the John R. & Louise S. Parker Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science department at UNC Chapel Hill. He received his PhD from UC Berkeley in 2013 and his BTech from IIT Kanpur in 2008. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on multimodal generative models, grounded and embodied semantics, faithful language generation, and interpretable and generalizable deep learning. He is a recipient of IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. He has been a keynote speaker for the AACL 2023, CoNLL 2023, and INLG 2022 conferences. His service includes ACL Executive Committee, ACM Doctoral Dissertation Award Committee, CoNLL Program Co-Chair, ACL Americas Sponsorship Co-Chair, and Associate/Action Editor for TACL, CL, IEEE/ACM TASLP, and CSL journals.