
LLaVA-o1: A New Open-Source Vision Language Model That Brings Inference-Time Scaling to Multimodal Reasoning

2024/11/23 07:26

LLaVA-o1, a new model developed by researchers from multiple universities in China, brings this paradigm to open-source vision language models (VLMs).

OpenAI’s o1 model demonstrated the potential of inference-time scaling for enhancing language models’ reasoning abilities. Now, researchers from multiple universities in China have applied this paradigm to open-source vision language models (VLMs) with their new LLaVA-o1 model.

Most early open-source VLMs use a direct prediction approach, generating answers without explicitly reasoning about the prompt and the steps required to solve it. This approach limits their effectiveness on tasks that require logical reasoning. While advanced prompting techniques like chain-of-thought (CoT) prompting can encourage models to generate intermediate reasoning steps and produce some marginal improvements, VLMs are still prone to errors or hallucinations.

The researchers observed that a key issue is the lack of a systematic and structured reasoning process in existing VLMs. The models don’t generate reasoning chains and often get stuck in reasoning processes where they don’t know at what stage they are and what specific problem they must solve.

“We observe that VLMs often initiate responses without adequately organizing the problem and the available information,” the researchers write. “Moreover, they frequently deviate from a logical reasoning toward conclusions, instead presenting a conclusion prematurely and subsequently attempting to justify it. Given that language models generate responses token-by-token, once an erroneous conclusion is introduced, the model typically continues along a flawed reasoning path.”

Multistage reasoning

OpenAI o1 uses inference-time scaling to solve the systematic and structured reasoning problem and allows the model to pause and review its results as it gradually solves the problem. While OpenAI has not released much detail about the underlying mechanism of o1, its results show promising directions for improving the reasoning abilities of foundational models.

Inspired by o1, the researchers designed LLaVA-o1 to perform stage-by-stage reasoning. Instead of generating a direct reasoning chain, LLaVA-o1 breaks down the reasoning process into four distinct stages:

Summary: The model first provides a high-level summary of the question, outlining the core problem it needs to address.

Caption: If an image is present, the model describes the relevant parts, focusing on elements related to the question.

Reasoning: Building on the summary, the model performs structured, logical reasoning to derive a preliminary answer.

Conclusion: Finally, the model presents a concise summary of the answer based on the preceding reasoning.

Only the conclusion stage is visible to the user; the other three stages represent the model’s internal reasoning process, similar to the hidden reasoning trace of o1. This structured approach allows LLaVA-o1 to manage its reasoning process independently, leading to improved performance on complex tasks.
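The split between internal stages and the user-visible conclusion can be sketched as a small parser. This is an illustrative sketch only: the tag names and delimiter format below are assumptions for demonstration, not necessarily the exact output format LLaVA-o1 uses.

```python
import re

# Illustrative four-stage model output, one tagged block per stage.
# The tag names mirror the stage names in the article; the exact
# delimiters used by LLaVA-o1 are an assumption here.
raw_output = (
    "<SUMMARY>The question asks which fruit appears most often.</SUMMARY>"
    "<CAPTION>The image shows three apples and one banana.</CAPTION>"
    "<REASONING>Three apples versus one banana, so apples dominate.</REASONING>"
    "<CONCLUSION>Apples.</CONCLUSION>"
)

def parse_stages(text: str) -> dict:
    """Split a tagged response into its four reasoning stages."""
    stages = {}
    for tag in ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        stages[tag.lower()] = match.group(1).strip() if match else ""
    return stages

stages = parse_stages(raw_output)
# Only the conclusion is surfaced to the user; the other three
# stages stay internal, like o1's hidden reasoning trace.
user_visible_answer = stages["conclusion"]
```

A deployment would show `user_visible_answer` alone, while the summary, caption, and reasoning stages remain available for inspection or verification.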

“This structured approach enables the model to independently manage its reasoning process, improving its adaptability and performance on complex reasoning tasks,” the researchers write.

LLaVA-o1 also introduces a novel inference-time scaling technique called “stage-level beam search.” Stage-level beam search generates multiple candidate outputs at each reasoning stage. It then selects the best candidate at each stage to continue the generation process. This is in contrast to the classic best-of-N approach, in which the model is prompted to generate multiple complete responses before selecting one.
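The contrast between stage-level beam search and classic best-of-N can be sketched as follows. This is a toy illustration: the generator and scorer are stand-ins (a real system would sample from the VLM and verify candidates with the model itself), and the function names are hypothetical.

```python
import random

random.seed(0)

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidates(prefix, stage, n):
    """Stand-in for sampling n candidate continuations for one stage.
    A real system would call the VLM; here we fabricate variants."""
    return [f"{prefix}[{stage} v{i}]" for i in range(n)]

def score(candidate):
    """Stand-in verifier rating a (partial) response; a real system
    might ask the model itself to judge the candidates."""
    return random.random()

def stage_level_beam_search(question, n_per_stage=2):
    """Generate several candidates per stage and keep only the best
    one to continue from, as the article describes."""
    best = question
    for stage in STAGES:
        candidates = generate_candidates(best + " ", stage, n_per_stage)
        best = max(candidates, key=score)
    return best

def best_of_n(question, n=4):
    """Classic baseline: generate n complete responses, then pick one."""
    completions = []
    for _ in range(n):
        text = question
        for stage in STAGES:
            text = generate_candidates(text + " ", stage, 1)[0]
        completions.append(text)
    return max(completions, key=score)

result = stage_level_beam_search("Q:")
baseline = best_of_n("Q:")
```

The key difference: stage-level search prunes bad candidates early, at each stage boundary, instead of scoring only finished responses.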

“Notably, it is the structured output design of LLaVA-o1 that makes this approach feasible, enabling efficient and accurate verification at each stage,” the researchers write. “This validates the effectiveness of structured output in improving inference time scaling.”

Training LLaVA-o1

To train LLaVA-o1, the researchers compiled a new dataset of around 100,000 image-question-answer pairs obtained from several widely used VQA datasets. The dataset covers a variety of tasks, from multi-turn question answering to chart interpretation and geometric reasoning.

The researchers used GPT-4o to generate the detailed four-stage reasoning processes for each example, including the summary, caption, reasoning and conclusion stages.
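A training example in this style might look like the record below. This is a hypothetical sketch: the field names, file path, and serialization format are illustrative assumptions, not the published LLaVA-o1-100k schema.

```python
# A hypothetical record in the style described: an image-question-answer
# pair annotated (e.g., by GPT-4o) with the four reasoning stages.
# All field names and values here are made up for illustration.
record = {
    "image": "charts/example.png",  # hypothetical path
    "question": "Which quarter had the highest revenue?",
    "summary": "Identify the quarter with the largest revenue bar.",
    "caption": "A bar chart with four bars labeled Q1-Q4; Q3 is tallest.",
    "reasoning": "Q3's bar exceeds the others, so Q3 had the highest revenue.",
    "conclusion": "Q3",
}

def to_training_text(rec: dict) -> str:
    """Serialize the staged annotation into one supervised target string."""
    return "".join(
        f"<{key.upper()}>{rec[key]}</{key.upper()}>"
        for key in ("summary", "caption", "reasoning", "conclusion")
    )

target = to_training_text(record)
```

Fine-tuning on targets like this teaches the base model to emit all four stages in order rather than jumping straight to an answer.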

The researchers then fine-tuned Llama-3.2-11B-Vision-Instruct on this dataset to obtain the final LLaVA-o1 model. The researchers have not released the model but plan to release the dataset, called the LLaVA-o1-100k.

LLaVA-o1 in action

The researchers evaluated LLaVA-o1 on several multimodal reasoning benchmarks. Despite being trained on only 100,000 examples, LLaVA-o1 showed significant performance improvements over the base Llama model, with an average benchmark score increase of 6.9%.

Furthermore, stage-level beam search led to additional performance gains, demonstrating the effectiveness of inference-time scaling. Due to computational resource constraints, the researchers were only able to test the technique with a beam size of 2. They expect even greater improvements with larger beam sizes.

Impressively, LLaVA-o1 outperformed not only other open-source models of the same size or larger but also some closed-source models.
