VideoLLaMA3: A Vision-Centric Framework for Multimodal Models with Any-Resolution Vision Tokenization and a Differential Frame Pruner

2025/01/26 14:00

Advancements in multimodal intelligence hinge on the ability to process and understand images and videos. While images provide a snapshot of a static scene, offering details on objects, text, and spatial relationships, videos introduce an additional layer of complexity. Video comprehension entails tracking changes over time and ensuring consistency across frames, demanding dynamic content management and an understanding of temporal relationships. However, the collection and annotation of video-text datasets pale in comparison to the abundance of image-text datasets.

Traditional methods for multimodal large language models (MLLMs) encounter challenges in video understanding. Approaches such as sparsely sampled frames, basic connectors, and image-based encoders fail to effectively capture temporal dependencies and dynamic content. Techniques like token compression and extended context windows struggle with long-form video complexity, while integrating audio and visual inputs often lacks seamless interaction. Efforts in real-time processing and scaling model sizes remain inefficient, and existing architectures are not optimized for handling long video tasks.

To address these challenges in video understanding, researchers from Alibaba Group proposed the VideoLLaMA3 framework, which incorporates Any-resolution Vision Tokenization (AVT) and Differential Frame Pruner (DiffFP). AVT improves upon traditional fixed-resolution tokenization by enabling vision encoders to process variable resolutions dynamically, reducing information loss. This is achieved by adapting ViT-based encoders with 2D-RoPE for flexible position embedding.

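As a rough illustration of the idea (not the authors' code), the sketch below tokenizes an image of arbitrary resolution into patches and applies a 2D rotary position embedding, rotating half of each token's channels by its row index and the other half by its column index. The function names, shapes, and rotation layout are illustrative assumptions.

```python
import torch

def patchify(image: torch.Tensor, patch: int = 14):
    """Split a (C, H, W) image of arbitrary resolution into flat patch tokens.

    Unlike fixed-resolution ViT pipelines, the image is not resized to a
    square input; it only needs cropping to multiples of the patch size.
    """
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    x = image[:, : gh * patch, : gw * patch]
    x = x.reshape(c, gh, patch, gw, patch).permute(1, 3, 0, 2, 4)
    return x.reshape(gh * gw, c * patch * patch), (gh, gw)

def rope_2d(tokens: torch.Tensor, grid: tuple, base: float = 10000.0):
    """Apply a 2D rotary position embedding (2D-RoPE) to patch tokens.

    Half of each token's channels are rotated by the patch's row index,
    the other half by its column index, so the same encoder accepts any
    grid shape without a learned, fixed-size position table.
    """
    n, d = tokens.shape
    gh, gw = grid
    rows = torch.arange(gh).repeat_interleave(gw).float()  # row index per token
    cols = torch.arange(gw).repeat(gh).float()             # column index per token
    half = d // 2

    def rotate(x, pos):
        dh = x.shape[-1]
        freqs = base ** (-torch.arange(0, dh, 2).float() / dh)
        ang = pos[:, None] * freqs[None, :]
        x1, x2 = x[:, 0::2], x[:, 1::2]
        out = torch.empty_like(x)
        out[:, 0::2] = x1 * ang.cos() - x2 * ang.sin()
        out[:, 1::2] = x1 * ang.sin() + x2 * ang.cos()
        return out

    return torch.cat([rotate(tokens[:, :half], rows),
                      rotate(tokens[:, half:], cols)], dim=-1)
```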

To preserve vital information, DiffFP handles the redundant tokens of long videos by pruning frames that differ only minimally from their neighbors, as measured by the 1-norm distance between corresponding patches. Dynamic resolution handling, combined with efficient token reduction, improves the representation while reducing computational cost.

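A minimal sketch of such a pruning rule, assuming frames normalized to [0, 1] and an illustrative threshold (the actual patch layout and threshold in DiffFP may differ):

```python
import torch

def prune_redundant_frames(frames: torch.Tensor, threshold: float = 0.1):
    """Return indices of frames to keep from a (T, C, H, W) video tensor.

    A frame is dropped when its mean absolute (1-norm) difference to the
    most recently kept frame falls below `threshold`, so near-duplicate
    frames contribute no redundant vision tokens downstream.
    """
    kept = [0]  # always keep the first frame
    for t in range(1, frames.shape[0]):
        diff = (frames[t] - frames[kept[-1]]).abs().mean()
        if diff >= threshold:
            kept.append(t)
    return kept

# Usage: encode only the surviving frames.
# video = torch.rand(64, 3, 224, 224)
# frames_to_encode = video[prune_redundant_frames(video)]
```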

The model consists of a vision encoder, a video compressor, a projector, and a large language model (LLM), with the vision encoder initialized from a pre-trained SigLIP model. The encoder extracts visual tokens, while the video compressor reduces the size of the video token representation. The projector connects the vision encoder to the LLM, for which Qwen2.5 models are used.

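Schematically, the four components wire together as below. The class and call signatures are placeholders (the actual implementation is in the project's GitHub repository), and the interleaving of vision tokens with text embeddings is elided.

```python
import torch
import torch.nn as nn

class VideoLLaMA3Sketch(nn.Module):
    """Illustrative wiring of the four components described above."""

    def __init__(self, vision_encoder, compressor, projector, llm):
        super().__init__()
        self.vision_encoder = vision_encoder  # SigLIP-initialized ViT
        self.compressor = compressor          # reduces video token count
        self.projector = projector            # maps vision dim to LLM dim
        self.llm = llm                        # Qwen2.5 decoder

    def forward(self, frames, text_embeddings):
        vis = self.vision_encoder(frames)   # (n_tokens, d_vision)
        vis = self.compressor(vis)          # fewer video tokens
        vis = self.projector(vis)           # (n_tokens', d_llm)
        # Vision tokens are prepended to the text sequence; the real
        # interleaving and attention masking are omitted here.
        fused = torch.cat([vis, text_embeddings], dim=0)
        return self.llm(fused)
```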

Training occurs in four stages: Vision Encoder Adaptation, Vision-Language Alignment, Multi-task Fine-tuning, and Video-centric Fine-tuning. The first three stages focus on image understanding, and the final stage enhances video understanding by incorporating temporal information.

The Vision Encoder Adaptation Stage focuses on fine-tuning the vision encoder, initialized with SigLIP, on a large-scale image dataset, allowing it to process images at varying resolutions. The Vision-Language Alignment Stage introduces multimodal knowledge, making the LLM and the vision encoder trainable to integrate vision and language understanding.

In the Multi-task Fine-tuning Stage, instruction fine-tuning is performed using multimodal question-answering data, including image and video questions, improving the model’s ability to follow natural language instructions and process temporal information. The Video-centric Fine-tuning Stage unfreezes all parameters to enhance the model’s video understanding capabilities.

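As a sketch, the schedule can be written as a per-stage freeze/unfreeze map over the top-level components. The per-stage choices below paraphrase the description above and are illustrative, not an official configuration.

```python
# Which components receive gradients in each stage (illustrative summary;
# other modules such as the video compressor would follow the same pattern).
STAGES = {
    "vision_encoder_adaptation": {"vision_encoder": True, "projector": True, "llm": False},
    "vision_language_alignment": {"vision_encoder": True, "projector": True, "llm": True},
    "multi_task_fine_tuning":    {"vision_encoder": True, "projector": True, "llm": True},
    "video_centric_fine_tuning": {"vision_encoder": True, "projector": True, "llm": True},  # all unfrozen
}

def configure_stage(model, stage: str) -> None:
    """Freeze or unfreeze each top-level component for the given stage."""
    for component, trainable in STAGES[stage].items():
        for param in getattr(model, component).parameters():
            param.requires_grad = trainable
```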

The training data comes from diverse sources like scene images, documents, charts, fine-grained images, and video data, ensuring comprehensive multimodal understanding.

Experiments were conducted to evaluate the performance of VideoLLaMA3 across image and video tasks. For image-based tasks, the model was tested on document understanding, mathematical reasoning, and multi-image understanding, where it outperformed previous models, showing improvements in chart understanding and real-world knowledge question answering (QA).

On video-based tasks, VideoLLaMA3 performed strongly on benchmarks such as VideoMME and MVBench, proving proficient in general video understanding, long-form video comprehension, and temporal reasoning. Both the 2B and 7B models were highly competitive, with the 7B model leading in most video tasks, underscoring the model's effectiveness across multimodal tasks.

Other areas with notable reported improvements include OCR, mathematical reasoning, multi-image understanding, and long-term video comprehension.

In summary, the proposed framework advances vision-centric multimodal models, offering a strong foundation for understanding images and videos. By utilizing high-quality image-text datasets, it addresses the challenges of video comprehension and temporal dynamics, achieving strong results across benchmarks. However, challenges such as video-text dataset quality and real-time processing remain.

Future research can enhance video-text datasets, optimize for real-time performance, and integrate additional modalities like audio and speech. This work can serve as a baseline for future advancements in multimodal understanding, improving efficiency, generalization, and integration.

Check out the Paper and GitHub Page.

All credit for this research goes to the researchers of this project.
