Meet VideoChat: An End-to-End Chat-Centric Video Understanding System Developed by Merging Language and Visual Models

May 18, 08:52



Real-world applications such as autonomous driving and human-robot interaction rely heavily on intelligent visual understanding. Current video comprehension methods do not generalize well in their spatial and temporal interpretation; instead, they depend on task-specific fine-tuning of video foundation models. Because pre-trained video foundation models are tailored to individual tasks, the existing video understanding paradigm falls short of providing general spatiotemporal understanding for user-level needs. In recent years, vision-centric multimodal dialogue systems have emerged as an important research area. By combining a pre-trained large language model (LLM), an image encoder, and additional learnable modules, these systems can carry out image-related tasks through multi-round dialogue with users. This is a significant shift for many applications, but current solutions have yet to properly address video-centric problems from a data-centric perspective.

Researchers from the Shanghai AI Laboratory’s OpenGVLab, Nanjing University, the University of Hong Kong, and the Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences collaborated to create VideoChat. This end-to-end, chat-centric video understanding system combines state-of-the-art video and language models to improve spatiotemporal reasoning, event localization, and causal relationship inference. The group also built a new dataset of thousands of videos paired with detailed descriptions and conversations, generated by feeding dense captions to ChatGPT in chronological order. Because it emphasizes spatiotemporal objects, actions, events, and causal relationships, the dataset is well suited to training video-centric multimodal dialogue systems.

VideoChat approaches the problem from a data perspective and connects state-of-the-art video foundation models to LLMs through a learnable neural interface. The framework has two parts: the video and language foundation models themselves, and a learnable video-language token interface (VLTF), tuned on video-text data, that encodes videos as embeddings. The resulting video tokens are then fed to an LLM together with the user's queries and the dialogue context so that it can generate a response.
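
As a rough illustration of that last step, the sketch below shows one way the video tokens, dialogue context, and a new user query could be ordered into a single input for an LLM. It is not VideoChat's code: the `Turn` and `Conversation` classes, their fields, and the segment labels are hypothetical names chosen for this example, and the video tokens are stand-in lists of floats.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str


@dataclass
class Conversation:
    # Hypothetical stand-in for the embeddings produced by the token interface.
    video_tokens: list
    history: list = field(default_factory=list)

    def build_llm_input(self, query: str) -> list:
        """Return the ordered segments handed to the LLM for one reply."""
        segments = [("video", self.video_tokens)]  # video tokens come first
        for turn in self.history:                  # then the earlier dialogue
            segments.append((turn.role, turn.text))
        segments.append(("user", query))           # then the new question
        return segments

    def record(self, query: str, answer: str) -> None:
        """Store a finished exchange so later turns can refer back to it."""
        self.history.append(Turn("user", query))
        self.history.append(Turn("assistant", answer))


conv = Conversation(video_tokens=[[0.1, 0.2], [0.3, 0.4]])
conv.record("What happens first?", "A person opens a door.")
for label, _ in conv.build_llm_input("Why do they do that?"):
    print(label)
```

Because the video is encoded once and reused on every turn in this sketch, follow-up questions only add text to the context.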

The stack consists of a pre-trained vision transformer, extended with a global multi-head relation aggregator for temporal modeling, and a pre-trained QFormer that serves as the token interface, with an additional linear projection and extra query tokens. The resulting video embeddings are compact and LLM-compatible, which makes them practical to reuse across the turns of a conversation. To fine-tune the system, the researchers also designed a video-centric instruction dataset of thousands of videos matched with detailed descriptions and conversations, along with a two-stage joint training paradigm that additionally uses publicly available image instruction data.
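
To make the idea of a query-token interface concrete, here is a heavily simplified sketch in plain Python. It assumes a single attention head with no learned key or value projections, random values in place of trained weights, and made-up dimensions. A real QFormer stacks several transformer layers, so this only shows the shape of the computation: a fixed set of query tokens attends over per-frame features, and a linear projection maps the result to the LLM's embedding width.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frame_features):
    """Each query token becomes a softmax-weighted average of the frame features."""
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in frame_features])
        out.append([sum(w * f[i] for w, f in zip(weights, frame_features))
                    for i in range(dim)])
    return out


def linear(tokens, weight):
    """Project each token with an (out_dim x in_dim) weight matrix."""
    return [[dot(row, t) for row in weight] for t in tokens]


def video_to_llm_tokens(frame_features, queries, weight):
    return linear(cross_attend(queries, frame_features), weight)


# Illustrative sizes only; none of these numbers come from the paper.
rng = random.Random(0)
feat_dim, llm_dim, num_queries = 4, 6, 3
queries = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_queries)]
weight = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(llm_dim)]

for num_frames in (8, 64):
    frames = [[rng.uniform(-1, 1) for _ in range(feat_dim)] for _ in range(num_frames)]
    tokens = video_to_llm_tokens(frames, queries, weight)
    print(num_frames, "frames ->", len(tokens), "tokens of width", len(tokens[0]))
```

In this sketch the number of output tokens is set by the number of query tokens, not by the number of frames, which is one way an interface of this kind can keep the video representation compact.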

With VideoChat, a multimodal dialogue system built for video, the researchers have begun exploring general video comprehension. A text-based version of VideoChat shows how well large language models work as universal decoders for video tasks, while an end-to-end version makes an initial attempt to solve video understanding through an instructed video-to-text formulation. The pieces are tied together by a trainable neural interface that connects video foundation models to large language models. To boost the system’s performance, the researchers also introduced a video-centric instruction dataset that emphasizes spatiotemporal reasoning and causality and serves as a learning resource for video-based multimodal dialogue systems. Early qualitative assessments demonstrate the system’s potential across a range of video applications and motivate its continued development.

Challenges and Constraints

  • Long-form videos (> 1 minute) are difficult to handle in both VideoChat-Text and VideoChat-Embed. On the one hand, how to model the context of long videos efficiently and effectively still needs further investigation. On the other hand, it is hard to keep interactions user-friendly when processing longer videos, because response time, GPU memory use, and user expectations of system performance must all be balanced.
  • The system's temporal and causal reasoning abilities are still at an early stage. These limits stem from the current scale of the instruction data and the way it was constructed, as well as from the overall scale of the system and the models employed.
  • Closing performance gaps remains an ongoing problem in time-sensitive and performance-critical applications, such as egocentric task instruction prediction and intelligent monitoring.
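
The long-video limitation in the first point comes down to a fixed compute and memory budget. A common generic way to stay within such a budget, shown here purely as an illustration and not as VideoChat's actual method, is to sample a fixed number of frames spread evenly across the video. The function name and parameters below are hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list:
    """Pick num_samples frame indices spread evenly over the video.

    The video is split into equal segments and the middle frame of each
    segment is taken, so the cost downstream no longer grows with video length.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))  # short video: keep every frame
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))  # [12, 37, 62, 87]
print(sample_frame_indices(3, 8))    # [0, 1, 2]
```

The trade-off is visible in the numbers: the longer the video, the wider the gap between sampled frames, so short events can be missed entirely. That is one reason long-video context modeling remains an open problem.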

The group aims to pave the way for real-world applications across multiple fields by advancing the integration of video and natural language processing for video understanding and reasoning. According to the team, future work will focus on:

  • Scaling up the capacity and data of video foundation models to improve their spatiotemporal modeling.
  • Building video-centric multimodal training data and reasoning benchmarks for large-scale evaluation.
  • Developing techniques for processing long videos.


Dhanshree Shripad Shenwai

Dhanshree Shenwai is a computer science engineer with experience at FinTech companies in the financial, cards and payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone's life easier.
