LLaVA-o1: A New Open-Source Vision Language Model That Brings Inference-Time Scaling to Multimodal Reasoning
Nov 23, 2024 at 07:26 am
LLaVA-o1, a new model developed by researchers from multiple universities in China, brings inference-time scaling to open-source vision language models (VLMs).
OpenAI’s o1 model demonstrated the potential of inference-time scaling for enhancing language models’ reasoning abilities. Now, researchers from multiple universities in China have applied this paradigm to open-source vision language models (VLMs) with their new LLaVA-o1 model.
Most early open-source VLMs use a direct prediction approach, generating answers without explicitly reasoning about the prompt and the steps required to solve it. This approach limits their effectiveness on tasks that require logical reasoning. While advanced prompting techniques like chain-of-thought (CoT) prompting can encourage models to generate intermediate reasoning steps and produce some marginal improvements, VLMs are still prone to errors or hallucinations.
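The difference between direct prediction and CoT prompting comes down to the prompt itself. A minimal illustration (the exact cue wording is illustrative, not taken from any specific paper):

```python
# Direct prediction vs. chain-of-thought (CoT) prompting for a VLM.
# Appending a reasoning cue nudges the model to emit intermediate
# steps before its answer; the cue wording here is illustrative.

def direct_prompt(question: str) -> str:
    """The model answers immediately, with no explicit reasoning."""
    return question

def cot_prompt(question: str) -> str:
    """A reasoning cue encourages intermediate steps before the answer."""
    return f"{question}\nLet's think step by step."

print(cot_prompt("How many red apples are in the image?"))
```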
The researchers observed that a key issue is the lack of a systematic and structured reasoning process in existing VLMs. The models don’t generate reasoning chains and often get stuck in reasoning processes where they don’t know at what stage they are and what specific problem they must solve.
“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from logical reasoning toward conclusions; instead, they present a conclusion prematurely and subsequently attempt to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”
Multistage reasoning
OpenAI’s o1 addresses this lack of structure through inference-time scaling, which allows the model to pause and review its results as it gradually solves the problem. While OpenAI has not released much detail about the underlying mechanism of o1, its results show promising directions for improving the reasoning abilities of foundational models.
Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct stages:
Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.
Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.
Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.
Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.
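The four stages above are delimited in the model’s output with explicit stage tags, which makes each stage easy to locate and verify. A minimal sketch of parsing such a staged response (the tag names follow the four stage names; the helper functions and example text are illustrative):

```python
import re

# The four stages of LLaVA-o1's structured output, in order.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(response: str) -> dict:
    """Split a staged response into its four labeled sections."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", response, re.DOTALL)
        parsed[stage] = match.group(1).strip() if match else ""
    return parsed

def user_visible_answer(response: str) -> str:
    """Only the conclusion stage is surfaced to the user."""
    return parse_stages(response)["CONCLUSION"]

# Illustrative staged response for a chart-reading question.
example = (
    "<SUMMARY>The question asks which bar is tallest.</SUMMARY>"
    "<CAPTION>The chart shows three bars: A=4, B=7, C=2.</CAPTION>"
    "<REASONING>7 > 4 and 7 > 2, so bar B is tallest.</REASONING>"
    "<CONCLUSION>Bar B.</CONCLUSION>"
)
print(user_visible_answer(example))
```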
Only the conclusion stage is visible to the user; the other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1.
“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.
LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage. It then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate multiple complete responses before selecting one.
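The mechanics of stage-level beam search can be sketched as follows. Assuming a sampler and a verifier that can score a partial (per-stage) generation, the search keeps only the best candidate at each stage rather than scoring complete responses at the end; `generate_candidates` and `score` below are hypothetical stand-ins, not the paper’s actual implementation:

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidates(prefix, stage, n, rng):
    # Stand-in for sampling n continuations of `prefix` for one stage.
    return [f"{prefix}[{stage} v{rng.randint(0, 9)}]" for _ in range(n)]

def score(candidate):
    # Stand-in for the verifier that rates a partial generation.
    return len(candidate)  # placeholder heuristic

def stage_level_beam_search(question, n=2, seed=0):
    """Generate n candidates per stage and keep only the best one,
    unlike best-of-N, which samples N complete responses and picks
    a winner only at the end."""
    rng = random.Random(seed)
    best = question
    for stage in STAGES:
        candidates = generate_candidates(best, stage, n, rng)
        best = max(candidates, key=score)  # prune before the next stage
    return best
```

Because pruning happens after every stage, a wrong turn in an early stage (say, a bad caption) can be discarded before the model spends tokens reasoning on top of it, which is what makes per-stage verification cheaper and more targeted than scoring N full responses.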
“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”
Training LLaVA-o1
To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs obtained from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.
The researchers used GPT-4o to generate the detailed four-stage reasoning processes for each example, including the summary, caption, reasoning and conclusion stages.
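The dataset-construction step amounts to asking GPT-4o to rewrite each question–answer pair into the four-stage format. A hedged sketch of that loop, where `call_gpt4o` is a hypothetical stand-in for an actual API call and the prompt wording is illustrative:

```python
# Illustrative sketch of the annotation pipeline: each VQA example is
# sent to GPT-4o with instructions to produce the four-stage response.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")

def annotation_prompt(question: str, answer: str) -> str:
    """Build the instruction asking for a staged rewrite of one example."""
    stage_spec = ", ".join(f"<{s}>...</{s}>" for s in STAGES)
    return (
        "Given a question and its ground-truth answer, write a response "
        f"structured into the stages {stage_spec}.\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def build_record(question, answer, call_gpt4o):
    """Produce one training record: the original pair plus the staged text."""
    staged = call_gpt4o(annotation_prompt(question, answer))
    return {"question": question, "answer": answer, "staged_response": staged}
```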
The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. They have not released the model itself but plan to release the dataset, called LLaVA-o1-100k.
LLaVA-o1 in action
The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.
Furthermore, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.
Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models.