Reinforcement learning (RL) for large language models (LLMs) has traditionally relied on outcome-based rewards, which provide feedback only on the final output. This reward sparsity makes it difficult to train models on tasks that require multi-step reasoning, such as mathematical problem-solving and programming. Credit assignment also becomes ambiguous, because the model receives no fine-grained feedback on its intermediate steps.
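To make the contrast concrete, here is a minimal, purely illustrative sketch (not from the paper) of how an outcome-based reward scores only the final step of a reasoning trajectory, while a process-style reward gives every step its own signal; the step scorer and answer check are hypothetical placeholders.

```python
# Illustrative sketch: sparse outcome reward vs. dense per-step reward.
# All function names and the toy trajectory are hypothetical.

def outcome_reward(steps: list, final_answer: str, gold_answer: str) -> list:
    """Outcome-based RL: every intermediate step gets zero; only the end is scored."""
    reward = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    return [0.0] * (len(steps) - 1) + [reward]

def dense_process_reward(steps: list, step_scorer) -> list:
    """Process-style RL: each step receives its own score from some scorer."""
    return [step_scorer(s) for s in steps]

steps = ["Let x = 3.", "Then 2x + 1 = 7.", "So the answer is 7."]
print(outcome_reward(steps, "7", "7"))             # [0.0, 0.0, 1.0] -> sparse signal
print(dense_process_reward(steps, lambda s: 0.5))  # [0.5, 0.5, 0.5] -> dense signal
```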
Process reward models (PRMs) attempt to address this by offering dense, step-wise rewards, but they require costly human-annotated process labels, making them impractical for large-scale RL. In addition, static reward functions suffer from overoptimization and reward hacking, where the model exploits the reward signal in unintended ways and ultimately degrades generalization. These limitations constrain the efficiency, scalability, and applicability of RL for LLMs, motivating an approach that delivers dense rewards without heavy computational cost or human annotation.
Most existing RL methods for LLMs employ outcome reward models (ORMs), which score only the final output. This leads to low sample efficiency, since models must generate and evaluate entire sequences before receiving any feedback. Some methods counter this with value models that estimate future rewards from past actions, but these estimates have high variance and do not handle reward sparsity well. PRMs offer finer-grained feedback but require costly manual annotation of intermediate steps and remain prone to reward hacking because their reward functions are static. Moreover, most existing methods need an extra training phase for the reward model, which adds computational expense and makes them impractical for scalable online RL.
A group of researchers from Tsinghua University, Shanghai AI Lab, University of Illinois Urbana-Champaign, Peking University, Shanghai Jiaotong University, and CUHK has proposed a reinforcement learning framework that eliminates the need for explicit step-wise annotations while making efficient use of dense feedback. The central contribution is an Implicit Process Reward Model (Implicit PRM), which produces token-level rewards from outcome labels alone, removing the need for human-annotated step-level guidance. The approach supports continuous online updating of the reward model, mitigating overoptimization while adapting to the shifting distribution of policy rollouts. The framework integrates implicit process rewards with outcome rewards during advantage estimation, which keeps computation efficient and curbs reward hacking. Unlike previous methods that require a separate training phase for process rewards, the new approach initializes the PRM directly from the policy model itself, greatly reducing development overhead. It is also compatible with a range of RL algorithms, including REINFORCE, PPO, and GRPO, making it generalizable and scalable for training large language models (LLMs).
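A rough sketch of how such token-level implicit rewards could be computed as a log-ratio between a learned reward model and a frozen reference model, assuming Hugging Face-style causal LMs that expose `.logits`; the `beta` scale, function names, and initialization note are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def implicit_token_rewards(prm_model, ref_model, input_ids, beta: float = 0.05):
    """Token-level implicit process rewards as beta * (log p_prm - log p_ref).

    Both models are assumed to be causal LMs; `input_ids` is a (batch, seq_len)
    tensor holding a sampled response (prompt positions can be masked by the caller).
    """
    with torch.no_grad():
        ref_logits = ref_model(input_ids).logits   # reference model stays frozen
    prm_logits = prm_model(input_ids).logits       # reward model is trained online

    # Log-probability of each realized next token under both models.
    targets = input_ids[:, 1:]
    prm_logp = F.log_softmax(prm_logits[:, :-1], dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits[:, :-1], dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)

    # Dense per-token rewards, shape (batch, seq_len - 1).
    return beta * (prm_logp - ref_logp)

# Because the implicit PRM is a language model itself, it can be initialized as a
# copy of the policy model, avoiding a separate reward-model pretraining phase
# (hypothetical setup): prm_model = copy.deepcopy(policy_model)
```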
This reinforcement learning system provides token-level implicit process rewards, calculated through a log-ratio formulation between a learned reward model and a reference model. Rather than relying on manual annotation, the reward function is learned from raw outcome labels that are already collected for policy training. The system also updates the reward function online to avoid overoptimization and reward hacking. It uses a hybrid advantage estimation approach that combines implicit process rewards and outcome rewards through a leave-one-out Monte Carlo estimator. Policy optimization is performed with Proximal Policy Optimization (PPO), using a clipped surrogate loss for stability. The model was trained from Qwen2.5-Math-7B-Base, a model optimized for mathematical reasoning, on 150K queries with four samples per query, compared with the 618K in-house annotations used for Qwen2.5-Math-7B-Instruct, underscoring the data efficiency of the training process.
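A sketch, under assumed shapes and names, of the two components described above: a leave-one-out Monte Carlo baseline computed across the samples drawn for the same query, and PPO's clipped surrogate loss. This illustrates the general recipe rather than the exact implementation.

```python
import torch

def leave_one_out_advantages(returns: torch.Tensor) -> torch.Tensor:
    """returns: (K,) total rewards for K samples of the same query.
    The baseline for sample i is the mean of the other K - 1 returns."""
    k = returns.numel()
    baseline = (returns.sum() - returns) / (k - 1)
    return returns - baseline

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: 4 samples per query; each return combines the outcome reward with the
# summed implicit token rewards (values here are hypothetical).
outcome = torch.tensor([1.0, 0.0, 1.0, 0.0])
process = torch.tensor([0.3, -0.1, 0.2, 0.0])
print(leave_one_out_advantages(outcome + process))
```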
The reinforcement learning system demonstrates significant gains in sample efficiency and reasoning performance across several benchmarks. It delivers a 2.5× gain in sample efficiency and a 6.9% improvement in mathematical problem-solving compared with standard outcome-based RL. The model outperforms Qwen2.5-Math-7B-Instruct on key mathematical benchmarks, with higher accuracy on competition-level tasks such as AIME and AMC. Models trained with this process surpass larger models, including GPT-4o, in pass@1 accuracy on challenging reasoning tasks, even while using only 10% of the training data used by Qwen2.5-Math-7B-Instruct. The results confirm that online updates to the reward model avoid overoptimization, improve training stability, and sharpen credit assignment, making this an effective method for reinforcement learning in LLMs.
This reinforcement learning approach provides an efficient and scalable LLM training process built on dense implicit process rewards. It eliminates explicit step-level annotations and reduces training costs while improving sample efficiency, stability, and performance. The method combines online reward modeling with token-level feedback, addressing the long-standing problems of reward sparsity and credit assignment in RL for LLMs. These improvements strengthen the reasoning capability of AI models and make them better suited to problem-solving applications in mathematics and programming. This research is a substantial contribution to RL-based LLM training.