Market Cap: $3.1222T (-1.420%) | 24h Volume: $128.7185B (+1.130%)
bitcoin        $95988.360365 USD    -1.02%
ethereum       $2616.372283 USD     -3.20%
tether         $1.000079 USD        -0.02%
xrp            $2.379544 USD         3.41%
solana         $191.021998 USD      -0.17%
bnb            $579.394785 USD       0.28%
usd-coin       $0.999980 USD         0.00%
dogecoin       $0.246368 USD        -0.99%
cardano        $0.694285 USD        -2.52%
tron           $0.232453 USD         1.91%
chainlink      $18.089071 USD       -3.16%
stellar        $0.324940 USD         1.41%
avalanche      $24.110410 USD       -2.54%
toncoin        $3.700057 USD        -0.98%
unus-sed-leo   $9.767020 USD         0.09%

Cryptocurrency News Articles

Implicit PRM: A Reinforcement Learning Framework That Eliminates the Need for Explicit Step-Wise Annotations

2025/02/08 11:49

A group of researchers from Tsinghua University, Shanghai AI Lab, University of Illinois Urbana-Champaign, Peking University, Shanghai Jiaotong University, and CUHK has proposed a reinforcement learning framework.

Reinforcement learning (RL) for large language models (LLMs) has traditionally relied on outcome-based rewards, which provide feedback only on the final output. This sparsity of reward makes it challenging to train models that need multi-step reasoning, like those employed in mathematical problem-solving and programming. Additionally, credit assignment becomes ambiguous, as the model does not get fine-grained feedback for intermediate steps.
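
To make the sparsity problem concrete, the toy sketch below (made-up values, not taken from the paper) contrasts an outcome-only reward, which is nonzero only at the final step, with the dense per-step signal a process reward model would provide:

```python
# Illustrative only: sparse outcome reward vs. dense per-step reward.
# All numbers are hypothetical; they do not come from the paper.

reasoning_steps = ["parse the problem", "set up the equation",
                   "solve for x", "state the final answer"]

# Outcome-based reward: feedback arrives only once the full solution is done,
# so intermediate steps get no direct learning signal (credit assignment is ambiguous).
outcome_rewards = [0.0, 0.0, 0.0, 1.0]

# Process-style reward: every step is scored, making credit assignment explicit,
# but this normally requires costly human step-level annotations.
process_rewards = [0.2, 0.5, -0.3, 1.0]

for step, r_out, r_proc in zip(reasoning_steps, outcome_rewards, process_rewards):
    print(f"{step:25s} outcome={r_out:+.1f}  process={r_proc:+.1f}")
```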

Process reward models (PRMs) try to address this by offering dense step-wise rewards, but they need costly human-annotated process labels, making them infeasible for large-scale RL. In addition, static reward functions are plagued by overoptimization and reward hacking, where the model takes advantage of the reward system in unforeseen ways, eventually compromising generalization performance. These limitations restrict RL’s efficiency, scalability, and applicability for LLMs, calling for a new solution that effectively combines dense rewards without high computational expense or human annotations.

Most existing RL methods for LLMs employ outcome reward models (ORMs), which offer scores only for the final output. This results in low sample efficiency as models must generate and test whole sequences before getting feedback. Some methods employ value models that estimate future rewards from past actions to counter this. However, these models have high variance and do not handle reward sparsity properly. PRMs offer more fine-grained feedback but need costly manual annotations for intermediate steps and are prone to reward hacking because of static reward functions. Additionally, most existing methods need an extra training phase for the reward model, adding to the computational expense and making them infeasible for scalable online RL.
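
The contrast between the two reward-model interfaces described above can be sketched as follows; the class and method names are hypothetical placeholders for illustration, not an API from the paper:

```python
# Hypothetical interfaces, for illustration only.
from typing import List


class OutcomeRewardModel:
    """ORM: scores only the finished solution, giving sparse feedback."""

    def score(self, question: str, full_solution: str) -> float:
        # Placeholder check; a real ORM is a trained verifier or rule-based grader.
        return 1.0 if "42" in full_solution else 0.0


class ProcessRewardModel:
    """PRM: scores every intermediate step, giving dense feedback,
    but is normally trained on costly step-level human annotations."""

    def score_steps(self, question: str, steps: List[str]) -> List[float]:
        # Placeholder scores; a real PRM is a learned model over partial solutions.
        return [0.5 for _ in steps]
```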

A group of researchers from Tsinghua University, Shanghai AI Lab, University of Illinois Urbana-Champaign, Peking University, Shanghai Jiaotong University, and CUHK has proposed a reinforcement learning framework that eliminates the need for explicit step-wise annotations through efficient use of dense feedback. The main contribution is the introduction of an Implicit Process Reward Model (Implicit PRM), which is trained only on outcome labels yet produces token-level rewards, thus eliminating the need for human-annotated step-level guidance. The approach allows for continuous online improvement of the reward model, eliminating the problem of overoptimization while allowing dynamic policy rollout adjustments. The framework successfully integrates implicit process rewards with outcome rewards during advantage estimation, offering computational efficiency and eliminating reward hacking. Unlike previous methods, which require a separate training phase for process rewards, the new approach initializes the PRM directly from the policy model itself, greatly reducing development overhead. It is also compatible with a range of RL algorithms, including REINFORCE, PPO, and GRPO, making it generalizable and scalable for training large language models (LLMs).
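
Concretely, the implicit-PRM formulation that this framework builds on defines the token-level process reward as a scaled log-ratio between the learned model and a frozen reference model; the rendering below is a standard way of writing that idea and is supplied here for clarity rather than quoted from the article:

```latex
% Implicit process reward for token y_t, given prompt x and prefix y_{<t}.
% \pi_\phi is the learned (implicit PRM) model, \pi_\mathrm{ref} a frozen
% reference model, and \beta a scaling coefficient.
r_\phi(y_t) \;=\; \beta \, \log \frac{\pi_\phi\!\left(y_t \mid x,\, y_{<t}\right)}
                                     {\pi_\mathrm{ref}\!\left(y_t \mid x,\, y_{<t}\right)}
```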

This reinforcement learning system provides token-level implicit process rewards, calculated through a log-ratio formulation between a learned reward model and a reference model. Rather than relying on manual annotation, the reward function is learned from raw outcome labels that are already collected for policy training. The system also updates the reward function online to avoid overoptimization and reward hacking. It uses a hybrid advantage estimation approach that combines implicit process rewards and outcome rewards through a leave-one-out Monte Carlo estimator. Policy optimization is performed with Proximal Policy Optimization (PPO), using a clipped surrogate loss function for stability. The model was trained from Qwen2.5-Math-7B-Base, a model optimized for mathematical reasoning, using 150K queries with four samples per query, whereas Qwen2.5-Math-7B-Instruct relied on 618K in-house annotations, which demonstrates the efficiency of the training process.
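
A minimal sketch of the two ingredients named above, assuming PyTorch and hypothetical tensor shapes: token-level implicit rewards from the log-ratio between the learned reward model and the reference model, and a leave-one-out Monte Carlo baseline over the K responses sampled per query. The function names, the beta coefficient, and the toy inputs are illustrative assumptions, not the authors' released code:

```python
import torch


def implicit_token_rewards(logp_reward: torch.Tensor,
                           logp_ref: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Token-level implicit process rewards as a scaled log-probability ratio.

    Both inputs hold the log-probabilities of the sampled tokens under the
    learned reward model and the frozen reference model, shape (K, T) for
    K sampled responses of length T. beta is a hypothetical scaling factor.
    """
    return beta * (logp_reward - logp_ref)


def leave_one_out_advantage(outcome_rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out Monte Carlo baseline across the K responses to one query.

    outcome_rewards: shape (K,), e.g. 1.0 if the final answer is correct, else 0.0.
    Each response is compared against the mean reward of the other K - 1 responses.
    """
    k = outcome_rewards.shape[0]
    baseline = (outcome_rewards.sum() - outcome_rewards) / (k - 1)
    return outcome_rewards - baseline


# Toy usage with stand-in numbers: 4 samples per query, 6 tokens each.
K, T = 4, 6
logp_reward = -torch.rand(K, T)          # stand-in log-probs from the reward model
logp_ref = -torch.rand(K, T)             # stand-in log-probs from the reference model
dense = implicit_token_rewards(logp_reward, logp_ref)   # (K, T) per-token rewards
outcome = torch.tensor([1.0, 0.0, 1.0, 0.0])            # outcome labels from a verifier
adv = leave_one_out_advantage(outcome)                  # (K,) outcome-level advantages
# In the full method the dense per-token rewards and the outcome-level advantages
# are combined during advantage estimation and optimized with a PPO clipped
# surrogate objective.
```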

The reinforcement learning system demonstrates significant gains in sample efficiency and reasoning performance across several benchmarks. It provides a 2.5× gain in sample efficiency and a 6.9% gain in mathematical problem-solving compared to standard outcome-based RL. The model outperforms Qwen2.5-Math-7B-Instruct on standard mathematical benchmarks, with better accuracy on competition-level tasks such as AIME and AMC. Models trained with this process outperform larger models, including GPT-4o, in pass@1 accuracy on challenging reasoning tasks, even when using only 10% of the training data used by Qwen2.5-Math-7B-Instruct. The results affirm that online updates to the reward model avoid overoptimization, improve training stability, and sharpen credit assignment, making this an extremely powerful method for reinforcement learning in LLMs.

This reinforcement learning approach provides an efficient and scalable LLM training process with dense implicit process rewards. It eliminates explicit step-level annotations and minimizes training costs while enhancing sample efficiency, stability, and performance. The method combines online reward modeling with token-level feedback, solving the long-standing problems of reward sparsity and credit assignment in RL for LLMs. These improvements strengthen the reasoning capability of AI models and make them well suited to problem-solving applications in mathematics and programming. This research is a substantial contribution to RL-based LLM training.

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
