Sa2VA: A Unified Model for Dense Grounded Understanding of Images and Videos

2025/01/13 03:31

Researchers from UC Merced, Bytedance Seed, Wuhan University, and Peking University have proposed Sa2VA, a groundbreaking unified model for dense grounded understanding of images and videos.

Multi-Modal Large Language Models (MLLMs) have seen rapid advancements in handling various image and video-related tasks, including visual question answering, narrative generation, and interactive editing. However, achieving fine-grained video content understanding, such as pixel-level segmentation, tracking with language descriptions, and performing visual question answering on specific video prompts, still poses a critical challenge in this field. State-of-the-art video perception models excel at tasks like segmentation and tracking but lack open-ended language understanding and conversation capabilities. At the same time, video MLLMs demonstrate strong performance in video comprehension and question answering but fall short in handling perception tasks and visual prompts.

Existing attempts to address video understanding challenges have followed two main approaches: MLLMs and Referring Segmentation systems. Initially, MLLMs focused on developing improved multi-modal fusion methods and feature extractors, eventually evolving towards instruction tuning on LLMs with frameworks like LLaVA. Recent developments have attempted to unify image, video, and multi-image analysis in single frameworks, such as LLaVA-OneVision. In parallel, Referring Segmentation systems have progressed from basic fusion modules to transformer-based methods that integrate segmentation and tracking within videos. However, these solutions lack a comprehensive integration of perception and language understanding capabilities.

To overcome this limitation, researchers from UC Merced, Bytedance Seed, Wuhan University, and Peking University have proposed Sa2VA, a groundbreaking unified model for a dense grounded understanding of images and videos. The model differentiates itself by supporting a comprehensive range of image and video tasks through minimal one-shot instruction tuning, addressing the limitations of existing multi-modal large language models. Sa2VA’s innovative approach integrates SAM-2 with LLaVA, unifying text, image, and video in a shared LLM token space. The researchers have also introduced Ref-SAV, an extensive auto-labeled dataset containing over 72K object expressions in complex video scenes, with 2K manually validated video objects to ensure robust benchmarking capabilities.
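
The article does not spell out Ref-SAV’s exact annotation format, so the record below is only a hypothetical illustration of what a referring video-object sample in such a dataset typically contains: a clip, one target object, a free-form language expression, and per-frame masks.

```python
# Hypothetical illustration only; the Ref-SAV schema is not specified in the article.
ref_sav_style_sample = {
    "video_id": "scene_00042",
    "object_id": 3,
    "expression": "the person in a red jacket walking a dog on the left sidewalk",
    "frames": ["00000.jpg", "00001.jpg", "00002.jpg"],
    "masks_rle": ["<rle>", "<rle>", "<rle>"],  # one run-length-encoded mask per frame
    "manually_verified": False,                # 2K of the 72K+ video objects were human-checked
}
```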

Sa2VA’s architecture integrates two main components: a LLaVA-like model and SAM-2, connected through a novel decoupled design. The LLaVA-like component consists of a visual encoder processing images and videos, a visual projection layer, and an LLM for text token prediction. The system employs a unique decoupled approach where SAM-2 operates alongside the pre-trained LLaVA model without direct token exchange, maintaining computational efficiency and enabling plug-and-play functionality with various pre-trained MLLMs. The key innovation lies in the connection mechanism using a special “[SEG]” token, allowing SAM-2 to generate segmentation masks while enabling gradient backpropagation through the “[SEG]” token to optimize the MLLM’s prompt generation capabilities.
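
To make the decoupled design concrete, here is a minimal PyTorch-style sketch, not the authors’ code, of how a “[SEG]” token can bridge an MLLM and a mask decoder: the LLM hidden state at the [SEG] position is projected into a prompt embedding for the segmentation model, and the segmentation loss backpropagates through that single connection into the MLLM. The module names, dimensions, and token id are illustrative assumptions; the real system uses a pre-trained LLaVA-like MLLM and SAM-2 in place of the stand-ins.

```python
# Minimal sketch of the "[SEG]"-token bridge described above (not the authors' implementation).
import torch
import torch.nn as nn

SEG_TOKEN_ID = 32001  # hypothetical vocabulary id reserved for the special "[SEG]" token

class SegTokenBridge(nn.Module):
    """Projects LLM hidden states at [SEG] positions into prompt embeddings for a mask decoder."""
    def __init__(self, llm_dim: int = 4096, prompt_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, prompt_dim)
        )

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        seg_positions = token_ids == SEG_TOKEN_ID      # (batch, seq_len) boolean mask
        seg_hidden = hidden_states[seg_positions]      # (num_seg, llm_dim)
        return self.proj(seg_hidden)                   # (num_seg, prompt_dim)

# Stand-ins for the pre-trained components (Sa2VA uses a LLaVA-like MLLM and SAM-2 instead).
mllm_trunk = nn.TransformerEncoderLayer(d_model=4096, nhead=8, batch_first=True)
mask_decoder = nn.Linear(256, 64 * 64)   # placeholder for SAM-2's mask decoder
bridge = SegTokenBridge()

def training_step(inputs_embeds, token_ids, gt_masks):
    hidden = mllm_trunk(inputs_embeds)           # forward pass over visual + text tokens
    prompts = bridge(hidden, token_ids)          # [SEG] hidden states become decoder prompts
    pred_masks = mask_decoder(prompts)           # (num_seg, 64*64) mask logits
    loss = nn.functional.binary_cross_entropy_with_logits(pred_masks, gt_masks.float())
    loss.backward()  # gradients flow back through [SEG] into the MLLM; no other tokens are exchanged
    return loss
```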

The Sa2VA model achieves state-of-the-art results on referring segmentation tasks, with Sa2VA-8B scoring 81.6, 76.2, and 78.9 cIoU on RefCOCO, RefCOCO+, and RefCOCOg respectively, outperforming previous systems like GLaMM-7B. In conversational capabilities, Sa2VA shows strong performance with scores of 2128 on MME, 81.6 on MMbench, and 75.1 on SEED-Bench. The model excels in video benchmarks, surpassing previous state-of-the-art VISA-13B by substantial margins on MeVIS, RefDAVIS17, and ReVOS. Moreover, Sa2VA’s performance is noteworthy considering its smaller model size compared to competitors, showing its efficiency and effectiveness across both image and video understanding tasks.
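
For readers unfamiliar with the metric, cIoU in referring-segmentation benchmarks is usually the cumulative IoU: the total intersection divided by the total union accumulated over the whole evaluation set, rather than a per-sample average. The short sketch below shows that common definition (it is not code from the paper).

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Common cIoU definition: summed intersection / summed union over all samples."""
    inter, union = 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter += np.logical_and(pred, gt).sum()
        union += np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

# Tiny worked example with two 2x2 masks: 3 intersecting pixels, 6 union pixels -> 0.5
preds = [np.array([[1, 0], [1, 1]], bool), np.array([[0, 1], [0, 0]], bool)]
gts   = [np.array([[1, 1], [1, 0]], bool), np.array([[0, 1], [0, 1]], bool)]
print(round(cumulative_iou(preds, gts), 3))
```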

In this paper, the researchers introduced Sa2VA, which represents a significant advancement in multi-modal understanding by successfully integrating SAM-2’s video segmentation capabilities with LLaVA’s language processing abilities. The framework’s versatility is shown through its ability to handle diverse image and video understanding tasks with minimal one-shot instruction tuning, addressing the long-standing challenge of combining perception and language understanding. Sa2VA’s strong performance across multiple benchmarks, from referring segmentation to conversational tasks, validates its effectiveness as a unified solution for a dense, grounded understanding of visual content, marking a significant step forward in the field of multi-modal AI systems.

Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t forget to join our 65k+ ML SubReddit.

FREE UPCOMING AI WEBINAR (JAN 15, 2025): Boost LLM Accuracy with Synthetic Data and Evaluation Intelligence

Join this webinar to gain actionable insights into boosting LLM model performance and accuracy while safeguarding data privacy.

News source: www.marktechpost.com

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
