Multimodal Artificial Intelligent Assistants
PI Xifeng Yan, University of California at Santa Barbara
Publications
LaViA, the Large Language and Video Assistant, is an AI prototype that demonstrates how multimodal LLMs can assist users in completing physical tasks when they need help. It extracts knowledge from training video clips, augments it with manuals and additional knowledge sources, perceives the task execution, reasons about the task state, and enables conversational guidance. The goal is to help users perform tasks by providing just-in-time feedback (e.g., identifying and correcting an error during a task, telling users what the next step is, and answering their questions) and the knowledge required to complete tasks successfully.
LaViA is powered by Multimodal Large Language Models (MLLMs). It takes task instructions and real-time video as input and tells you the right next step to perform. It can also answer task-specific questions. With LaViA, you can quickly build your own task guidance assistants. While it can take advantage of annotations generated by GPT-4o, LaViA has a local vision/language model that can be deployed on premises, offering complete privacy, data security, and low inference cost.
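The "instructions plus observed progress in, next step out" idea can be illustrated with a small Python sketch. This is not LaViA's code: the `TaskTracker` class, the `build_prompt` function, and the prompt layout are all hypothetical names and choices made for illustration, and the step that would recognize completed actions from video is assumed to happen elsewhere.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class TaskTracker:
    """Hypothetical helper: tracks progress through an ordered list of steps."""
    instructions: List[str]
    completed: Set[int] = field(default_factory=set)

    def observe(self, step_index: int) -> None:
        """Record that a step was seen completed (e.g. reported by a vision model)."""
        if not 0 <= step_index < len(self.instructions):
            raise IndexError(f"no step {step_index}")
        self.completed.add(step_index)

    def next_step(self) -> Optional[str]:
        """Return the first instruction not yet completed, or None when all are done."""
        for i, text in enumerate(self.instructions):
            if i not in self.completed:
                return text
        return None


def build_prompt(tracker: TaskTracker, question: Optional[str] = None) -> str:
    """Assemble a text prompt for a multimodal model; this format is an assumption."""
    lines = ["Task instructions:"]
    for i, text in enumerate(tracker.instructions):
        mark = "x" if i in tracker.completed else " "
        lines.append(f"[{mark}] {i + 1}. {text}")
    # Fall back to the default guidance question when the user asks nothing specific.
    lines.append(question or "What is the next step?")
    return "\n".join(lines)
```

In such a setup, a perception component would call `observe` as steps are detected, and the text from `build_prompt` would be sent to the model together with recent video frames. A user question can be passed as `question` to replace the default one.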
The framework of LaViA consists of three major steps:
LaViA is available at https://github.com/Victorwz/LaViA
Its pre-trained model is available at https://huggingface.co/weizhiwang/LLaVA-Video-Llama-3
LaViA is led by Weizhi Wang (weizhiwang@ucsb.edu) at Xifeng Yan's lab at UCSB. It is part of the "Autonomous Multimodal Ingestion for Goal-Oriented Support" (AMIGOS) team directed by Dr. Charles Ortiz at PARC/SRI. AMIGOS is funded by the DARPA Perceptually-enabled Task Guidance (PTG) program.