Accelerating Foundation Models Research

Multimodal and Crossmodal Learning

Academic research plays such an important role in advancing science, technology, culture, and society. This grant program helps ensure this community has access to the latest and leading AI models.

Brad Smith, Vice Chair and President

AFMR Goal: Align AI with shared human goals, values, and preferences via research on models, which enhances safety, robustness, sustainability, responsibility, and transparency, while ensuring rapid progress can be measured via new evaluation methods.

The research projects focus on improving and applying multimodal foundation models in a variety of ways. Some projects address foundational aspects, such as enhancing the efficiency of vision and language foundation models, training audio-visual foundation models for tasks like segmentation and localization, curating multimodal video datasets, and aligning multimodal vision-language foundation models to understand their capabilities and limitations. Others address applications, such as advancing traffic monitoring, geospatial data interaction, and human mobility prediction with multimodal foundation models, enhancing video-based foundation models for reasoning, and addressing demographic bias in image generation. Together, these projects ensure a broad and deep exploration of multimodal models and their potential applications.

  • NC A&T State University: Leila Hashemi-Beni (PI)

    Effective traffic monitoring is critical for transportation agencies and city planners to understand traffic patterns, congestion, and safety hazards. This proposal outlines a project to develop an advanced AI-based traffic monitoring system. The system leverages Physics-Informed Neural Networks (PINN) to model traffic state and employs Generative Pre-trained Transformer (GPT) models to interpret user input and PINN model outputs. To train the PINN models, a high-resolution dataset collected by unmanned aerial vehicles (UAVs) will be utilized. The primary goal of this project is to create a highly accurate and efficient traffic monitoring system capable of identifying traffic states and computing traffic state parameters.
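
    A minimal sketch of the physics-informed piece, assuming the LWR (Lighthill-Whitham-Richards) conservation law with a Greenshields fundamental diagram; the network architecture, constants, and placeholder data below are illustrative, not the project's actual design:

    ```python
    import torch
    import torch.nn as nn

    # Hypothetical Greenshields fundamental-diagram constants.
    V_MAX, RHO_MAX = 30.0, 0.15   # free-flow speed (m/s), jam density (veh/m)

    class TrafficPINN(nn.Module):
        """Maps space-time coordinates (x, t) to traffic density rho(x, t)."""
        def __init__(self, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(2, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1), nn.Sigmoid(),   # fraction of jam density
            )

        def forward(self, x, t):
            return RHO_MAX * self.net(torch.cat([x, t], dim=-1))

    def physics_residual(model, x, t):
        """Residual of the LWR conservation law: d(rho)/dt + d(rho * v(rho))/dx = 0."""
        x = x.clone().requires_grad_(True)
        t = t.clone().requires_grad_(True)
        rho = model(x, t)
        flow = rho * V_MAX * (1.0 - rho / RHO_MAX)    # q = rho * v(rho)
        drho_dt = torch.autograd.grad(rho.sum(), t, create_graph=True)[0]
        dq_dx = torch.autograd.grad(flow.sum(), x, create_graph=True)[0]
        return drho_dt + dq_dx

    # Training mixes a data-fit loss on UAV observations with the physics loss.
    model = TrafficPINN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x_obs, t_obs = torch.rand(256, 1), torch.rand(256, 1)   # placeholder UAV samples
    rho_obs = torch.full((256, 1), 0.05)                     # placeholder density labels
    for _ in range(100):
        opt.zero_grad()
        data_loss = ((model(x_obs, t_obs) - rho_obs) ** 2).mean()
        phys_loss = (physics_residual(model, x_obs, t_obs) ** 2).mean()
        (data_loss + phys_loss).backward()
        opt.step()
    ```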

  • University of Wisconsin-Madison: Yong Jae Lee (PI)

    The proposal aims to advance research on aligning multi-modal vision-language foundation models by building on and improving the Large Language and Vision Assistant (LLaVA) framework. The project will study, enhance, and seek to understand the capabilities, limitations, and emergent behaviors of these models, exploring questions such as how to evaluate foundation models, mitigate risks and harms, and ensure their fidelity.

  • University of Bath: Vinay Namboodiri (PI)

    A proposal to improve the efficiency and adaptation of foundational vision and language models through the development of spectral operator-based models and improved adapters. These advancements may lead to a wide range of applications including assistive technology.
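
    A minimal sketch of the general adapter idea (not the proposed spectral operator-based design): a small residual bottleneck module is trained on top of a frozen backbone, so only a few parameters are updated per downstream task. Module names and sizes are illustrative.

    ```python
    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Small residual adapter: down-project, non-linearity, up-project.

        Only these few parameters are trained; the backbone stays frozen.
        """
        def __init__(self, dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)   # start as an identity mapping
            nn.init.zeros_(self.up.bias)

        def forward(self, hidden):
            return hidden + self.up(torch.relu(self.down(hidden)))

    # Illustrative usage: wrap a frozen backbone block with an adapter.
    backbone_block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    for p in backbone_block.parameters():
        p.requires_grad = False              # backbone weights stay fixed

    adapter = BottleneckAdapter(dim=768)
    tokens = torch.randn(2, 196, 768)        # e.g., a batch of ViT patch tokens
    out = adapter(backbone_block(tokens))    # only adapter parameters receive gradients
    ```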

  • York University: Mojgan Jadidi (PI)

    This proposal aims to transform the interaction and analysis of geospatial data by integrating OpenAI NLP models. It plans to create a user-friendly interface allowing natural language prompts to be turned into actionable geospatial queries. The output will be map-based visualisations generated by the system. The system’s adaptability will promote cross-sector collaboration and responsible AI use. The proof of concept will employ the City of Toronto open data platform.
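
    A minimal sketch of the prompt-to-query step, assuming a hypothetical llm_complete helper standing in for an OpenAI model call and GeoPandas layers loaded from the City of Toronto open data platform; the JSON schema and column handling are illustrative only.

    ```python
    import json
    import geopandas as gpd

    def llm_complete(prompt: str) -> str:
        """Placeholder for a call to an OpenAI NLP model (swap in a real API call here)."""
        raise NotImplementedError

    QUERY_SCHEMA = (
        "Translate the user's request into JSON with keys "
        "'layer' (dataset name), 'filter_column', and 'filter_value'."
    )

    def nl_to_geospatial_query(user_prompt, layers):
        """Turn a natural-language prompt into a filtered GeoDataFrame for mapping."""
        raw = llm_complete(f"{QUERY_SCHEMA}\nRequest: {user_prompt}")
        spec = json.loads(raw)               # e.g. {"layer": "parks", "filter_column": ...}
        gdf = layers[spec["layer"]]
        return gdf[gdf[spec["filter_column"]] == spec["filter_value"]]

    # Illustrative usage with a hypothetical Toronto open-data extract:
    # layers = {"parks": gpd.read_file("toronto_parks.geojson")}
    # nl_to_geospatial_query("Show all parks in Ward 10", layers).plot()
    ```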

  • University of Washington: Ali Farhadi (PI)

    This proposal focuses on enhancing multi-modal video datasets using foundation models in an automated pipeline at scale. We aim to explore various aspects of pre-training and post-pretraining alignment data and their effects on optimizing the performance of large-scale video models. Some of these aspects include domain coverage and diversity of the videos, quality of the paired texts, and the temporality of actions in the videos. We also propose a paradigm shift from traditional video-text model training to LLM-based domain adaptation tailored to the downstream task and dataset.

  • University of California, Santa Cruz: Xin Wang (PI)

    The rapid evolution of Multimodal Large Language Models (MLLMs) has spurred significant interest in their application across various fields. However, a critical gap exists in their ability to generate coherent images alongside relevant texts. To close this gap, we introduce VilGen, a novel framework that effectively unifies vision and language modalities for interleaved multimodal generation.

    Central to VilGen is the concept of “generative vokens”, a novel mechanism that adeptly bridges LLMs and diffusion-based text-to-image generation models. This bridge is established by aligning generative vokens with latent visual features in diffusion models, enhancing the coherence and relevance of the multimodal output. Moreover, VilGen incorporates a retrieval-augmented approach, leveraging contextual image-text pairs to further refine and ensure the fidelity of generated content.

    Our model will be rigorously evaluated across a variety of benchmarks, focusing on multimodal storytelling and dialog, to validate its efficacy. We foresee that VilGen will outperform existing baselines in generating more consistent and faithful multimodal outputs, paving the way for next-generation MLLMs to adopt our methodologies for enhanced in-context learning and more robust multimodal generation capabilities.
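
    A minimal sketch of the voken-alignment idea described above: hidden states at special "voken" positions in the LLM output are projected into the conditioning space of a diffusion text-to-image model and trained to match reference conditioning features. The module names, dimensions, and MSE objective are assumptions for illustration, not VilGen's actual implementation.

    ```python
    import torch
    import torch.nn as nn

    class VokenProjector(nn.Module):
        """Maps LLM hidden states at voken positions into the diffusion model's
        text-conditioning space so the image generator can consume them."""
        def __init__(self, llm_dim=4096, cond_dim=768):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(llm_dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
            )

        def forward(self, llm_hidden_states, voken_positions):
            # Gather hidden states of the generative vokens: (batch, num_vokens, llm_dim)
            voken_states = torch.stack(
                [llm_hidden_states[b, pos] for b, pos in enumerate(voken_positions)]
            )
            return self.proj(voken_states)   # (batch, num_vokens, cond_dim)

    def alignment_loss(projected_vokens, caption_features):
        """Make projected vokens mimic the conditioning features the diffusion model
        would receive from a ground-truth caption."""
        return nn.functional.mse_loss(projected_vokens, caption_features)

    # Illustrative shapes only.
    hidden = torch.randn(2, 32, 4096)                      # LLM hidden states
    positions = torch.tensor([[24, 25, 26, 27, 28, 29, 30, 31]] * 2)
    target = torch.randn(2, 8, 768)                        # e.g., frozen text-encoder features
    loss = alignment_loss(VokenProjector()(hidden, positions), target)
    ```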

  • University of Georgia: Gengchen Mai (PI)

    This research seeks to determine the best way to represent geospatial data so that it can be used for various downstream tasks. Given the complexity of mapping spatial and temporal information, finding the right representation for such data is crucial for any multimodal foundation model. The proposal focuses on three parts: representation, fine-tuning, and application to a few downstream tasks. Such geospatial multimodal models will be useful for several real-world applications. As part of the program, we plan to develop and fine-tune new geospatial models, curate datasets, and evaluate on various downstream tasks.

  • Rutgers University: Hao Wang (PI)

    This project will develop new techniques to interpret LLM behavior, including interpretation methods for both black-box and white-box models. Interpretability of LLMs is critical for their use in real-world scenarios such as education and healthcare.

  • University of North Carolina at Chapel Hill: Mohit Bansal (PI)

    This project explores multimodal reasoning and explanation generation using Large Language Models (LLMs) such as GPT-3 and GPT-4. Our goal is to enable LLMs to generate visualizations and textual explanations that complement each other, improving the understanding and trustworthiness of responses.

  • University of North Texas: Yunhe Feng (PI)

    The proposal presents an end-to-end framework, PreciseDebias, that aims to rectify demographic bias in image generation from text prompts. The core component of PreciseDebias is a novel instruction-following Large Language Model (LLM) designed with an emphasis on assessing and balancing model bias. The framework transforms generic text prompts to produce images that reflect specified demographic distributions. The proposed method autonomously refines prompts to match demographic distributions and guides the biased image generation model towards more statistically representative demographic outputs.
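
    A minimal sketch of the distribution-matching step: sample demographic attributes according to a target distribution and rewrite the generic prompt before it reaches the image generator. The attribute lists, target proportions, and simple string rewrite stand in for the proposal's instruction-following LLM and are illustrative only.

    ```python
    import random

    # Hypothetical target demographic distribution for a batch of generated images.
    TARGET_DISTRIBUTION = {
        ("female", "East Asian"): 0.25,
        ("male", "East Asian"): 0.25,
        ("female", "Black"): 0.25,
        ("male", "Black"): 0.25,
    }

    def sample_attributes(distribution):
        """Draw one demographic attribute tuple according to the target proportions."""
        groups, weights = zip(*distribution.items())
        return random.choices(groups, weights=weights, k=1)[0]

    def debias_prompt(generic_prompt: str, distribution=TARGET_DISTRIBUTION) -> str:
        """Rewrite a generic prompt so repeated generations match the target distribution.

        In the proposed framework an instruction-following LLM performs this rewrite;
        here a plain string template stands in for it.
        """
        gender, ethnicity = sample_attributes(distribution)
        return f"{generic_prompt}, depicting a {ethnicity} {gender} person"

    # Illustrative usage: over many samples the demographic mix follows the target.
    prompts = [debias_prompt("a photo of a software engineer at work") for _ in range(8)]
    ```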

  • University of Southern California: Cyrus Shahabi (PI)

    Large language models (LLMs) have proven to be very useful for understanding and generating text. Our goal is to apply the same techniques to human location data. While an LLM uses sequences of words, our models will use sequences of locations, such as a visit sequence consisting of a coffee shop, gym, work, restaurant, and home. Generating location sequences can be useful for creating synthetic data for research when privacy restrictions prevent using real sequences. The model could also be used to predict future locations and fill in missing locations. The applications include traffic prediction, pandemic modeling, and detecting anomalous behavior. For this task, we investigate the use of existing LLMs as well as training our own transformers.
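
    A minimal sketch of the analogy described above: treat each place as a token, build a vocabulary, and train a small causal transformer to predict the next location. The toy vocabulary, model size, and synthetic sequence are placeholders, not the project's actual setup.

    ```python
    import torch
    import torch.nn as nn

    # Toy vocabulary of place types (real data would use POI or grid-cell IDs).
    PLACES = ["home", "coffee_shop", "gym", "work", "restaurant"]
    TOK = {p: i for i, p in enumerate(PLACES)}

    class NextLocationModel(nn.Module):
        """Tiny causal transformer over location tokens, analogous to a language model."""
        def __init__(self, vocab, dim=64, heads=4, layers=2, max_len=32):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.pos = nn.Embedding(max_len, dim)
            block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, layers)
            self.head = nn.Linear(dim, vocab)

        def forward(self, tokens):
            seq_len = tokens.size(1)
            pos_ids = torch.arange(seq_len, device=tokens.device)
            h = self.embed(tokens) + self.pos(pos_ids)
            causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
            h = self.encoder(h, mask=causal)
            return self.head(h)              # logits over the next location

    # One synthetic daily visit sequence: coffee shop -> gym -> work -> restaurant -> home
    seq = torch.tensor([[TOK["coffee_shop"], TOK["gym"], TOK["work"],
                         TOK["restaurant"], TOK["home"]]])
    model = NextLocationModel(vocab=len(PLACES))
    logits = model(seq[:, :-1])              # predict each next visit
    loss = nn.functional.cross_entropy(logits.reshape(-1, len(PLACES)), seq[:, 1:].reshape(-1))
    ```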

  • University of Virginia: Sheng Li (PI)

    The proposal aims to enhance the theory-of-mind (ToM) capabilities of large language models (LLMs) towards understanding video scenes. The main objective is to enable video-based LLMs to reason about characters’ mental states in dynamic scenes, answer questions focused on social and emotional reasoning, and retrieve the key moments of transformation in the mental states of actors. The researchers propose leveraging a new multimodal temporal graph neural network (TGNN) architecture, Evolutionary Theory of Mind (EToM), that models the temporal evolution of mental states synthesized from video frames, transcripts, and video captioning data.

  • University of Rochester: Zhiyao Duan (PI)

    The proposal aims to develop a novel approach to train an audio-visual foundation model that models fine-grained dependencies within and across modalities to benefit various challenging downstream tasks, such as audio-visual segmentation, localization, and source separation. The key idea is to apply the masked auto-encoder (MAE) self-supervised learning paradigm to a large amount of unlabeled audio-visual data. The proposal also introduces the development of innovative masking strategies and auxiliary contrastive objectives to improve the effectiveness of the model training.
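
    A minimal sketch of the masked auto-encoder idea applied jointly to audio and visual tokens: mask a large fraction of patch tokens from both streams, encode only the visible ones, and reconstruct the masked ones. The shared-encoder design, token dimensions, and masking ratio are illustrative assumptions, and the proposal's auxiliary contrastive objectives are omitted.

    ```python
    import torch
    import torch.nn as nn

    class AudioVisualMAE(nn.Module):
        """Joint masked auto-encoder over concatenated audio and video patch tokens."""
        def __init__(self, dim=256, mask_ratio=0.75):
            super().__init__()
            self.mask_ratio = mask_ratio
            self.audio_type = nn.Parameter(torch.zeros(1, 1, dim))   # modality embeddings
            self.video_type = nn.Parameter(torch.zeros(1, 1, dim))
            self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
            enc_layer = nn.TransformerEncoderLayer(dim, 8, dim * 4, batch_first=True)
            dec_layer = nn.TransformerEncoderLayer(dim, 8, dim * 4, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, 4)
            self.decoder = nn.TransformerEncoder(dec_layer, 2)
            self.recon_head = nn.Linear(dim, dim)

        def forward(self, audio_tokens, video_tokens):
            raw = torch.cat([audio_tokens, video_tokens], dim=1)      # reconstruction target
            tokens = torch.cat([audio_tokens + self.audio_type,
                                video_tokens + self.video_type], dim=1)
            b, n, d = tokens.shape
            keep = int(n * (1 - self.mask_ratio))
            perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)   # random mask
            vis_idx = perm[:, :keep].unsqueeze(-1).expand(-1, -1, d)
            msk_idx = perm[:, keep:].unsqueeze(-1).expand(-1, -1, d)

            encoded = self.encoder(torch.gather(tokens, 1, vis_idx))  # visible tokens only
            full = torch.scatter(self.mask_token.expand(b, n, d), 1, vis_idx, encoded)
            recon = self.recon_head(self.decoder(full))

            # Loss: reconstruct only the masked tokens.
            pred = torch.gather(recon, 1, msk_idx)
            target = torch.gather(raw, 1, msk_idx)
            return nn.functional.mse_loss(pred, target)

    # Placeholder pre-extracted patch embeddings for a small batch of clips.
    loss = AudioVisualMAE()(torch.randn(2, 64, 256), torch.randn(2, 196, 256))
    ```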

  • University of California, Riverside: Amr Magdy (PI)

    Visual foundation models (VFMs) are transforming various domains but face challenges in deployment on edge devices due to their high computational demands and memory requirements. These models, including advanced computer vision models like ViT and Microsoft's Visual ChatGPT, require a significant number of floating-point operations (FLOPs), far exceeding the capabilities of resource-limited edge devices. Despite the potential of edge devices to enhance AI application responsiveness, their limited processing power restricts the use of such sophisticated models. To address this, researchers are investigating several techniques to reduce the computational burden of VFMs while maintaining their output quality. These strategies aim to deploy large foundation models on edge devices, ushering in a new era of efficient AI-powered edge computing applications.

    This project focuses on using knowledge distillation to optimize VFMs for edge devices. Knowledge distillation involves transferring knowledge from a larger model to a smaller one, a concept similar to the approach used in Microsoft's Orca 2 model for language knowledge. Our research aims to enhance inference cameras with self-inference capabilities, enabling VFMs to support near-real-time processing for various applications. We highlight one use case in sustainable agriculture and briefly mention other areas that could benefit from this technology.
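
    A minimal sketch of response-based knowledge distillation: a compact student is trained to match a frozen teacher's softened logits alongside the ground-truth labels. The toy models, temperature, and loss weighting are illustrative, not the project's configuration.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
        """Blend soft-target KL against the teacher with the usual hard-label loss."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)                                   # rescale gradient magnitude
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Illustrative stand-ins: a large frozen teacher VFM and a small edge-deployable student.
    teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024),
                            nn.ReLU(), nn.Linear(1024, 10)).eval()
    student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 64),
                            nn.ReLU(), nn.Linear(64, 10))

    images = torch.randn(8, 3, 224, 224)
    labels = torch.randint(0, 10, (8,))
    with torch.no_grad():
        t_logits = teacher(images)                    # teacher runs offline / on a server
    loss = distillation_loss(student(images), t_logits, labels)
    loss.backward()                                   # only the student is updated
    ```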