LLaVA-o1: A new open-source vision language model that brings inference-time scaling to multimodal reasoning

2024/11/23 07:26

LLaVA-o1, a new model developed by researchers from multiple universities in China, brings this paradigm to open-source vision language models (VLMs).

OpenAI’s o1 model demonstrated the potential of inference-time scaling for enhancing language models’ reasoning abilities. Now, researchers from multiple universities in China have applied this paradigm to open-source vision language models (VLMs) with their new LLaVA-o1 model.

Most early open-source VLMs use a direct prediction approach, generating answers without explicitly reasoning about the prompt and the steps required to solve it. This approach limits their effectiveness on tasks that require logical reasoning. While advanced prompting techniques like chain-of-thought (CoT) prompting can encourage models to generate intermediate reasoning steps and produce some marginal improvements, VLMs are still prone to errors or hallucinations.

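As a rough illustration of the difference (the question and prompt wording below are invented for this sketch, not taken from the paper), direct prediction asks for the answer alone, while CoT prompting nudges the model to reason first:

# Illustrative prompts only; these strings are assumptions, not the paper's actual prompts.
question = "How many birds in the image are in flight?"

# Direct prediction: the model is asked for the answer and nothing else.
direct_prompt = f"{question}\nAnswer with a single number."

# Chain-of-thought prompting: the model is nudged to produce intermediate reasoning steps.
cot_prompt = (
    f"{question}\n"
    "Think step by step: first describe the relevant parts of the image, "
    "then reason about them, and only then state the final number."
)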

The researchers observed that a key issue is the lack of a systematic and structured reasoning process in existing VLMs. The models don’t generate reasoning chains and often get stuck in reasoning processes where they don’t know at what stage they are and what specific problem they must solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from a logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multistage reasoning

OpenAI o1 uses inference-time scaling to solve the systematic and structured reasoning problem and allows the model to pause and review its results as it gradually solves the problem. While OpenAI has not released much detail about the underlying mechanism of o1, its results show promising directions for improving the reasoning abilities of foundational models.

Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct stages:

Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user; the other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to improved performance on complex tasks.

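A minimal sketch of how such structured output could be consumed downstream, assuming the model delimits each stage with explicit tags (the tag names and the sample output below are illustrative assumptions; the article does not specify the exact format). Only the conclusion is surfaced to the user:

import re

# Hypothetical model output with one tagged block per stage (tag names assumed for illustration).
raw_output = (
    "<SUMMARY>The question asks which animal in the photo is largest.</SUMMARY>"
    "<CAPTION>The photo shows an elephant, a zebra and a gazelle at a waterhole.</CAPTION>"
    "<REASONING>Adult elephants are far larger than zebras or gazelles, and the elephant "
    "visibly dwarfs the other animals in the frame.</REASONING>"
    "<CONCLUSION>The elephant is the largest animal in the image.</CONCLUSION>"
)

STAGE_PATTERN = re.compile(r"<(SUMMARY|CAPTION|REASONING|CONCLUSION)>(.*?)</\1>", re.DOTALL)

def split_stages(text: str) -> dict:
    """Map each stage name to its content."""
    return {name: body.strip() for name, body in STAGE_PATTERN.findall(text)}

stages = split_stages(raw_output)
print(stages["CONCLUSION"])  # only this stage would be shown to the user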

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage. It then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate multiple complete responses before selecting one.

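The difference in control flow can be sketched as follows. The functions generate_stage, generate_full_response and select_best are stubs standing in for actual VLM sampling and for whatever criterion ranks candidates (the article does not describe that criterion), so this is only a schematic of the two search strategies, not the paper's implementation:

import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_stage(prefix: str, stage: str) -> str:
    # Stub: in practice this samples the next stage from the VLM, conditioned on the prefix.
    return f"<{stage} candidate {random.randint(0, 999)}>"

def generate_full_response(question: str) -> str:
    # Stub: one complete answer containing all four stages at once.
    return " ".join(generate_stage(question, stage) for stage in STAGES)

def select_best(candidates: list[str]) -> str:
    # Stub: stands in for the verification step that picks the strongest candidate.
    return random.choice(candidates)

def stage_level_beam_search(question: str, beam_size: int = 2) -> str:
    """Sample several candidates per stage, keep only the best one, then extend it."""
    prefix = question
    for stage in STAGES:
        candidates = [generate_stage(prefix, stage) for _ in range(beam_size)]
        prefix = prefix + "\n" + select_best(candidates)
    return prefix

def best_of_n(question: str, n: int = 4) -> str:
    """Classic baseline: sample n complete responses and pick one only at the end."""
    return select_best([generate_full_response(question) for _ in range(n)])

Selecting at every stage means weak candidates are pruned early, rather than only after entire responses have been generated.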

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”

Training LLaVA-o1

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs obtained from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.

The researchers used GPT-4o to generate the detailed four-stage reasoning processes for each example, including the summary, caption, reasoning and conclusion stages.

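A rough sketch of what that annotation step might look like with the OpenAI Python client. The system prompt, tag names and the annotate helper are assumptions for illustration; the paper's actual annotation prompt is not given in this article:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction asking GPT-4o for a four-stage, tagged reasoning trace.
ANNOTATION_SYSTEM_PROMPT = (
    "Answer the visual question in four tagged stages: "
    "<SUMMARY>...</SUMMARY><CAPTION>...</CAPTION>"
    "<REASONING>...</REASONING><CONCLUSION>...</CONCLUSION>"
)

def annotate(question: str, image_url: str) -> str:
    """Request a four-stage reasoning trace for one image-question pair."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": ANNOTATION_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return response.choices[0].message.content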

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. The researchers have not released the model but plan to release the dataset, called the LLaVA-o1-100k.

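One plausible way to package each annotated example for supervised fine-tuning is as a chat-style record with the tagged trace as the assistant target. The schema below (to_sft_record and its field names) is an assumption for illustration, since vision fine-tuning frameworks differ in the exact format they expect:

def to_sft_record(question: str, image_path: str, four_stage_trace: str) -> dict:
    """Bundle one image-question pair with its GPT-4o trace as a training example."""
    return {
        "images": [image_path],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},
                    {"type": "text", "text": question},
                ],
            },
            # The loss is computed on the assistant turn, i.e. on all four tagged stages.
            {"role": "assistant", "content": four_stage_trace},
        ],
    }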

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

Furthermore, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models.

Source: venturebeat.com
