Multimodal Artificial Intelligent Assistants


PI
Xifeng Yan,  University of California at Santa Barbara
 
Publications

LaViA: Fine-Tuning Multimodal LLMs as Task Assistants with Video Instructions

LaViA, the Large Language and Video Assistant, is an AI prototype that demonstrates the possibility of leveraging multimodal LLMs to assist users in completing physical tasks when they need help. It extracts knowledge from training video clips, augments it with manuals and additional knowledge sources, perceives task execution, reasons about the task state, and enables conversational guidance. The goal is to help users perform tasks by providing just-in-time feedback (e.g., identifying and correcting an error during a task, telling users the next step, and answering their questions) along with the knowledge required to complete tasks successfully.

 

LaViA is powered by Multimodal Large Language Models (MLLMs). It takes task instructions and real-time video as input and tells you the right next step to perform. It can also answer task-specific questions. With LaViA, you can quickly build your own task guidance assistant. While it can take advantage of annotations generated by GPT-4o, LaViA uses a local vision/language model that can be deployed on premises, offering complete privacy, data security, and low inference cost.
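Before a real-time video stream reaches the MLLM, it has to be reduced to a small, fixed number of frames and combined with the task instructions into a prompt. The helper below is a hypothetical sketch of that input step; the function names, frame budget, and prompt layout are illustrative assumptions, not LaViA's actual API:

```python
def sample_frame_indices(total_frames: int, num_frames: int = 8) -> list[int]:
    """Pick num_frames indices spread evenly over a clip of total_frames.

    Uniform sampling is a common way to turn a video clip into a
    fixed-size frame sequence before feeding it to a multimodal LLM.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    # take the middle frame of each of num_frames equal segments
    return [int(step * i + step / 2) for i in range(num_frames)]


def build_prompt(task_instructions: str, question: str) -> str:
    """Combine the task instructions and the user's question into one prompt."""
    return (
        f"Task instructions:\n{task_instructions}\n\n"
        f"Based on the video, {question}"
    )
```

For a 240-frame clip, `sample_frame_indices(240)` returns the midpoints of eight equal segments: `[15, 45, 75, 105, 135, 165, 195, 225]`.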

 

Framework

The framework of LaViA consists of three major steps:

  1. Video data preparation from a real task;
  2. Instruction data construction with GPT-4o;
  3. Video instruction tuning with multimodal LLMs.
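Step 2 turns annotated video clips into instruction data for fine-tuning. A minimal sketch of what one training record might look like, using a LLaVA-style conversation schema; the field names and `<video>` placeholder here are assumptions for illustration, not the project's exact format:

```python
import json


def make_instruction_record(video_path: str, question: str, answer: str) -> dict:
    """Build one video instruction-tuning sample.

    The answer would come from GPT-4o annotation of the clip (step 2);
    records like this are then used to fine-tune the multimodal LLM (step 3).
    """
    return {
        "video": video_path,
        "conversations": [
            {"from": "human", "value": "<video>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }


record = make_instruction_record(
    "clips/change_tire_03.mp4",
    "What is the next step in this task?",
    "Loosen the lug nuts before jacking up the car.",
)
print(json.dumps(record, indent=2))
```

A dataset of such records, serialized to JSON, is the typical input to multimodal instruction tuning.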


[Figure: overview of the LaViA framework (lavia.png)]


Download

LaViA is available at https://github.com/Victorwz/LaViA

Its pre-trained model is available at https://huggingface.co/weizhiwang/LLaVA-Video-Llama-3


Acknowledgment

LaViA is led by Weizhi Wang (weizhiwang@ucsb.edu) in Xifeng Yan's lab at UCSB. It is part of the "Autonomous Multimodal Ingestion for Goal-Oriented Support" (AMIGOS) team directed by Dr. Charles Ortiz at PARC/SRI. AMIGOS is funded by the DARPA Perceptually-enabled Task Guidance (PTG) program.