Generalizable and Transparent Vision Language Modeling

  • Bansal, Mohit M. (PI)

Project Details

Description

The objective of the proposed research is to develop a noise-robust, generalizable, open-domain vision-and-language (VL) framework that can learn about complex real-world events via strong compositional representations and by harnessing complementary information from diverse modalities, with the ability to learn in a continuous, transparent, and collaborative manner. The proposed research program will address the shortcomings of modern VL systems, such as (i) the inability to perform complex/structured compositional reasoning, (ii) the lack of robustness to noise, (iii) poor open-domain generalization, and (iv) limited explainability/interpretability. In doing so, we will realize VL agents capable of reasoning about complex open-domain real-world events and activities.

We will first extend our prior work to develop a novel unified deep learning (DL) architecture that can learn from complex data streams across many modalities, including images, videos, text, and knowledge graphs. Such joint multimodal VL representations will allow our system to incorporate complementary characteristics from each modality, such as descriptive properties of language, functional visual cues useful for physical interactions, and structured knowledge from knowledge graphs that may not be learned from the vision or language modalities alone. Moreover, to facilitate transparency as well as human-agent collaboration, we will decode and represent the outputs of our VL system as explainable and editable relational graphs that capture diverse spatial and temporal relationships between objects, actions, scenes, etc. This will allow human collaborators to leverage our framework for transparent and trustworthy high-stakes decision-making.

Additionally, to alleviate the effect of noisy vision-language alignment in Web-based VL datasets, we will use noise-robust VL calibration schemes, which will allow us to train our system effectively on noisy VL inputs. Furthermore, to enable generalization and few-shot open-domain recognition, we will develop compositional representations for the visual recognition of difficult and rare activities and objects. We will do so by learning to ground visual representations using language descriptions and vice versa, as well as by learning disentangled object and context representations, which will lead to compositional generalization and also allow our system to encode visual properties that cannot be easily conveyed via language (e.g., object affordances, grasp points, etc.) but are useful for physical interactions.

Lastly, our VL system will allow human collaborators to edit the relational graph outputs and correct its mistakes, thus allowing our system to continually learn and improve its performance over time. We will incorporate this human feedback using a novel long-term memory mechanism and parameter-efficient fine-tuning. In particular, instead of re-training our full VL system for every new task/dataset or every time it receives new human feedback, we will only update a small number of parameters, which will enable efficient adaptation.
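The parameter-efficient fine-tuning described above updates only a small set of new weights while the pretrained VL backbone stays frozen. As a rough illustration only (the project description does not specify the method), the sketch below shows one common realization of this idea, a low-rank adapter wrapped around a frozen layer; the LoRA-style design, layer sizes, and names are assumptions for illustration, not details from the project.

```python
# Illustrative sketch (not the project's actual method): parameter-efficient
# fine-tuning via a low-rank adapter around a frozen pretrained layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Adds a small trainable low-rank update to a frozen linear layer."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Hypothetical frozen "VL backbone" layer, reduced to one projection for brevity.
backbone = nn.Linear(512, 512)
adapted = LoRALinear(backbone, rank=8)

trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable params: {trainable} / {total}")  # only the adapter is updated
```

In this sketch, incorporating new human feedback would mean training just the adapter weights (a few thousand parameters here) rather than the full model, which is the kind of efficient adaptation the proposal aims for.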
Application Scenarios. Our robust, open-domain VL framework will enable a wide range of real-world applications relevant to the Navy's missions, including (i) large-scale text-to-image/video retrieval from noisy VL datasets, (ii) structured reasoning by harnessing the complementary information from images, video, language, and knowledge graphs, (iii) zero-shot open-domain recognition of visually ambiguous and rare human activities/objects, (iv) spatiotemporal grounding of complex visual concepts using language, (v) visual captioning from multimodal image, video, and knowledge graph inputs, and (vi) continual VL multi-task learning.

The proposed team brings together expertise in multimodal deep learning, natural language processing, and computer vision. Weekly team meetings and strong interdisciplinary research are planned, leveraging the close proximity and existing collaborations between the investigators.

Status: Active
Effective start/end date: 1/4/23 → …

Funding

  • U.S. Navy: US$894,188.00

ASJC Scopus Subject Areas

  • Artificial Intelligence
  • Social Sciences (all)
