VideoLLaMA3: A Vision-Centric Framework for Multimodal Models with Any-Resolution Vision Tokenization and a Differential Frame Pruner

2025/01/26 14:00

Advancements in multimodal intelligence hinge on the ability to process and understand images and videos. While images provide a snapshot of a static scene, offering details on objects, text, and spatial relationships, videos introduce an additional layer of complexity. Video comprehension entails tracking changes over time and ensuring consistency across frames, demanding dynamic content management and an understanding of temporal relationships. However, the collection and annotation of video-text datasets pale in comparison to the abundance of image-text datasets.

Traditional methods for multimodal large language models (MLLMs) encounter challenges in video understanding. Approaches such as sparsely sampled frames, basic connectors, and image-based encoders fail to effectively capture temporal dependencies and dynamic content. Techniques like token compression and extended context windows struggle with long-form video complexity, while integrating audio and visual inputs often lacks seamless interaction. Efforts in real-time processing and scaling model sizes remain inefficient, and existing architectures are not optimized for handling long video tasks.

To address these challenges in video understanding, researchers from Alibaba Group proposed the VideoLLaMA3 framework, which incorporates Any-resolution Vision Tokenization (AVT) and a Differential Frame Pruner (DiffFP). AVT improves upon traditional fixed-resolution tokenization by enabling vision encoders to process variable resolutions dynamically, reducing information loss. This is achieved by adapting ViT-based encoders with 2D-RoPE (two-dimensional rotary position embedding) for flexible position encoding.
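
As a rough illustration of this idea, the sketch below builds 2D rotary position embeddings for an arbitrary patch grid, so the same encoder weights can attend over images tokenized at different resolutions. The function names are hypothetical and not taken from the VideoLLaMA3 code; this is a minimal sketch of the mechanism, not the released implementation.

```python
# Illustrative sketch of any-resolution vision tokenization with 2D-RoPE.
# Function and variable names are hypothetical; the released code may differ.
import torch

def rope_2d(grid_h: int, grid_w: int, dim: int, base: float = 10000.0):
    """Return cos/sin tables for a (grid_h x grid_w) patch grid.

    Half of the rotated channels follow the row index and half the column
    index, so the same construction works for any input resolution.
    """
    assert dim % 4 == 0, "need dim divisible by 4 (two axes, interleaved pairs)"
    half = dim // 2                                             # channels per axis
    freqs = base ** (-torch.arange(0, half, 2).float() / half)  # (half/2,)

    ys = torch.arange(grid_h).float()                           # row positions
    xs = torch.arange(grid_w).float()                           # column positions
    ang_y = torch.outer(ys, freqs)                              # (H, half/2)
    ang_x = torch.outer(xs, freqs)                              # (W, half/2)

    # Broadcast to every (row, col) patch and concatenate the two axes.
    ang = torch.cat([
        ang_y[:, None, :].expand(grid_h, grid_w, -1),
        ang_x[None, :, :].expand(grid_h, grid_w, -1),
    ], dim=-1).reshape(grid_h * grid_w, half)                   # (N, dim/2)
    return ang.cos(), ang.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate patch tokens x of shape (N, dim) by the 2D position angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]                         # interleaved pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A 24x24 patch grid and a 32x48 grid yield different token counts,
# but both reuse exactly the same frequency construction.
cos, sin = rope_2d(grid_h=24, grid_w=24, dim=64)
tokens = torch.randn(24 * 24, 64)
rotated = apply_rope(tokens, cos, sin)
```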

To preserve vital information, DiffFP handles redundant, long video token sequences by pruning frames that differ only minimally from the last retained frame, with the difference measured as the 1-norm distance between corresponding patches. Dynamic resolution handling, combined with this token reduction, improves the representation while lowering computational cost.
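
A minimal sketch of such a pruning rule, assuming frames have already been encoded into patch tokens, might look like the following; the function name and the example threshold are illustrative choices rather than the paper's exact settings.

```python
# Illustrative sketch of differential frame pruning (DiffFP-style).
# The threshold and patch granularity here are assumptions for the example.
import torch

def prune_redundant_frames(frames: torch.Tensor, threshold: float = 0.1):
    """frames: (T, N, D) patch tokens per frame.

    Keep the first frame, then keep a frame only if its mean per-element
    1-norm distance to the last *kept* frame exceeds the threshold.
    Returns the indices of retained frames.
    """
    kept = [0]
    for t in range(1, frames.shape[0]):
        diff = (frames[t] - frames[kept[-1]]).abs().mean().item()  # mean L1 distance
        if diff > threshold:
            kept.append(t)
    return kept

video = torch.randn(64, 196, 1024)      # 64 frames, 196 patches, feature dim 1024
video[10:20] = video[9]                 # a static segment: frames 10-19 repeat frame 9
keep = prune_redundant_frames(video, threshold=0.1)
pruned = video[keep]                    # redundant frames dropped before the LLM
print(len(keep), "of", video.shape[0], "frames retained")
```

In this sketch, comparing each frame against the last retained frame, rather than only its immediate predecessor, keeps a slow drift across many near-identical frames from escaping the prune.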

The model consists of a vision encoder, a video compressor, a projector, and a large language model (LLM); the vision encoder is initialized from a pre-trained SigLIP model. The encoder extracts visual tokens, while the video compressor reduces the size of the video token representation. The projector connects the vision encoder to the LLM, and Qwen2.5 models serve as the LLM.
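
The overall wiring can be pictured with the simplified sketch below. The compressor and projector are toy stand-ins, and the encoder and LLM are passed in as generic modules; the intended checkpoints (a SigLIP-initialized ViT and a Qwen2.5 decoder) are noted only in comments, since the released implementation may differ.

```python
# Structural sketch of the pipeline: vision encoder -> video compressor ->
# projector -> LLM. Component classes here are simplified placeholders.
import torch
import torch.nn as nn

class VideoCompressor(nn.Module):
    """Toy compressor: average-pool 2x2 patch neighborhoods in each frame."""
    def forward(self, tokens, grid_h, grid_w):
        T, N, D = tokens.shape
        x = tokens.view(T, grid_h, grid_w, D).permute(0, 3, 1, 2)  # (T, D, H, W)
        x = nn.functional.avg_pool2d(x, kernel_size=2)             # (T, D, H/2, W/2)
        return x.flatten(2).permute(0, 2, 1)                       # (T, N/4, D)

class Projector(nn.Module):
    """Two-layer MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vis_dim, llm_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))
    def forward(self, x):
        return self.net(x)

class VideoLLM(nn.Module):
    """Wiring only: the encoder and LLM stand in for SigLIP and Qwen2.5."""
    def __init__(self, vision_encoder, llm, vis_dim=1024, llm_dim=3584):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a SigLIP-initialized ViT
        self.compressor = VideoCompressor()
        self.projector = Projector(vis_dim, llm_dim)
        self.llm = llm                            # e.g. a Qwen2.5 decoder

    def encode_video(self, frames, grid_h, grid_w):
        vis = self.vision_encoder(frames)               # (T, N, vis_dim)
        vis = self.compressor(vis, grid_h, grid_w)      # fewer video tokens
        return self.projector(vis)                      # (T, N', llm_dim)

# Minimal smoke test with identity stand-ins for the encoder and LLM.
model = VideoLLM(nn.Identity(), llm=nn.Identity())
feats = model.encode_video(torch.randn(8, 24 * 24, 1024), grid_h=24, grid_w=24)
print(feats.shape)                                      # torch.Size([8, 144, 3584])
```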

Training occurs in four stages: Vision Encoder Adaptation, Vision-Language Alignment, Multi-task Fine-tuning, and Video-centric Fine-tuning. The first three stages focus on image understanding, and the final stage enhances video understanding by incorporating temporal information.

The Vision Encoder Adaptation Stage focuses on fine-tuning the vision encoder, initialized with SigLIP, on a large-scale image dataset, allowing it to process images at varying resolutions. The Vision-Language Alignment Stage introduces multimodal knowledge, making the LLM and the vision encoder trainable to integrate vision and language understanding.

In the Multi-task Fine-tuning Stage, instruction fine-tuning is performed using multimodal question-answering data, including image and video questions, improving the model’s ability to follow natural language instructions and process temporal information. The Video-centric Fine-tuning Stage unfreezes all parameters to enhance the model’s video understanding capabilities.
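
One way to picture the four-stage schedule is as a freeze/unfreeze configuration over the components sketched above. The groupings and data descriptions below are an illustrative reading of the stages described in this article, not the authors' published recipe.

```python
# Illustrative four-stage schedule: which submodules train at each stage and
# roughly what data each stage uses. Exact settings are assumptions.
TRAINING_STAGES = [
    {"name": "vision_encoder_adaptation",
     "trainable": ["vision_encoder", "projector"],
     "data": "large-scale images at varying resolutions"},
    {"name": "vision_language_alignment",
     "trainable": ["vision_encoder", "projector", "llm"],
     "data": "image-text pairs introducing multimodal knowledge"},
    {"name": "multi_task_fine_tuning",
     "trainable": ["vision_encoder", "projector", "llm"],
     "data": "multimodal QA instructions over images and videos"},
    {"name": "video_centric_fine_tuning",
     "trainable": ["vision_encoder", "compressor", "projector", "llm"],  # all unfrozen
     "data": "video instruction data emphasizing temporal information"},
]

def freeze_for_stage(model, stage):
    """Freeze every parameter, then unfreeze the submodules named in the stage.

    Assumes the model exposes attributes matching the names used above,
    as in the wiring sketch earlier in this article.
    """
    for p in model.parameters():
        p.requires_grad = False
    for name in stage["trainable"]:
        for p in getattr(model, name).parameters():
            p.requires_grad = True

# Example: run the stages in order on a model instance.
# for stage in TRAINING_STAGES:
#     freeze_for_stage(model, stage)
#     train_one_stage(model, data=stage["data"])   # hypothetical training loop
```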

The training data comes from diverse sources like scene images, documents, charts, fine-grained images, and video data, ensuring comprehensive multimodal understanding.

Experiments were conducted to evaluate the performance of VideoLLaMA3 across image and video tasks. For image-based tasks, the model was tested on document understanding, mathematical reasoning, and multi-image understanding, where it outperformed previous models, showing improvements in chart understanding and real-world knowledge question answering (QA).

In video-based tasks, VideoLLaMA3 performed strongly on benchmarks such as VideoMME and MVBench, demonstrating proficiency in general video understanding, long-form video comprehension, and temporal reasoning. The 2B and 7B models were both highly competitive, with the 7B model leading in most video tasks, underscoring the framework's effectiveness in multimodal tasks.

Other areas with reported improvements include OCR, mathematical reasoning, multi-image understanding, and long-form video comprehension.

In summary, the proposed framework advances vision-centric multimodal models, offering a strong foundation for understanding images and videos. By leveraging high-quality image-text datasets, it addresses video comprehension challenges and temporal dynamics, achieving strong results across benchmarks. However, challenges such as video-text dataset quality and real-time processing remain.

Future research can enhance video-text datasets, optimize for real-time performance, and integrate additional modalities like audio and speech. This work can serve as a baseline for future advancements in multimodal understanding, improving efficiency, generalization, and integration.

Check out the Paper and GitHub Page.

All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 70k+ ML SubReddit.
