
Implicit PRM: A Reinforcement Learning Framework That Eliminates the Need for Explicit Step-Wise Annotations

2025/02/08 11:49

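A group of researchers from Tsinghua University, Shanghai AI Lab, University of Illinois Urbana-Champaign, Peking University, Shanghai Jiao Tong University, and CUHK has proposed a reinforcement learning framework that eliminates the need for explicit step-wise annotations by making efficient use of dense feedback.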

Reinforcement learning (RL) for large language models (LLMs) has traditionally relied on outcome-based rewards, which provide feedback only on the final output. This sparsity of reward makes it challenging to train models that need multi-step reasoning, like those employed in mathematical problem-solving and programming. Additionally, credit assignment becomes ambiguous, as the model does not get fine-grained feedback for intermediate steps.

Process reward models (PRMs) try to address this by offering dense step-wise rewards, but they need costly human-annotated process labels, which makes them infeasible for large-scale RL. In addition, static reward functions are prone to overoptimization and reward hacking, where the model exploits the reward system in unforeseen ways and eventually compromises generalization performance. These limitations restrict the efficiency, scalability, and applicability of RL for LLMs, and call for a solution that provides dense rewards without high computational expense or human annotations.

Most existing RL methods for LLMs employ outcome reward models (ORMs), which offer scores only for the final output. This results in low sample efficiency as models must generate and test whole sequences before getting feedback. Some methods employ value models that estimate future rewards from past actions to counter this. However, these models have high variance and do not handle reward sparsity properly. PRMs offer more fine-grained feedback but need costly manual annotations for intermediate steps and are prone to reward hacking because of static reward functions. Additionally, most existing methods need an extra training phase for the reward model, adding to the computational expense and making them infeasible for scalable online RL.
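The difference between outcome-only and step-wise feedback can be made concrete with a small sketch. The function names and the reward values below are hypothetical and chosen only for illustration. The sketch shows that with an outcome reward every position in a response ends up with the same return, so the signal cannot tell good steps from bad ones, while dense rewards give each position its own return.

```python
from typing import List


def outcome_rewards(num_tokens: int, final_score: float) -> List[float]:
    """Sparse signal: zero everywhere except the last token, which gets the outcome score."""
    if num_tokens <= 0:
        return []
    return [0.0] * (num_tokens - 1) + [final_score]


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of the rewards from each position to the end of the response."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


if __name__ == "__main__":
    sparse = outcome_rewards(5, 1.0)
    dense = [0.2, -0.1, 0.3, 0.1, 0.5]  # made-up step-wise rewards
    print("outcome-only returns:", returns_to_go(sparse))
    print("dense returns:       ", returns_to_go(dense))
```

With `gamma = 1.0` the outcome-only returns are identical at every position, which is the credit assignment ambiguity the article describes.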

A group of researchers from Tsinghua University, Shanghai AI Lab, University of Illinois Urbana-Champaign, Peking University, Shanghai Jiao Tong University, and CUHK has proposed a reinforcement learning framework that eliminates the need for explicit step-wise annotations by making efficient use of dense feedback. Its main contribution is an Implicit Process Reward Model (Implicit PRM), which produces token-level rewards while being trained only on outcome labels, so no human-annotated step-level guidance is needed. Because the reward model is updated online on the policy's own rollouts, it keeps pace with the changing policy, which mitigates overoptimization. The framework integrates implicit process rewards with outcome rewards during advantage estimation, which keeps computation efficient and reduces the risk of reward hacking. Unlike previous methods, which require a separate training phase for the process reward model, the new approach initializes the PRM directly from the policy model itself, greatly reducing development overhead. It is also compatible with a range of RL algorithms, including REINFORCE, PPO, and GRPO, which makes it general and scalable for training large language models (LLMs).
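As a rough illustration of how a model trained only on outcome labels can still emit token-level rewards, the sketch below uses the log-ratio form the article describes: each token's reward is a scaled difference between the log-probability under the reward model and under a reference model. The scale `beta`, the function names, and the choice of binary cross-entropy on the summed rewards as the training loss are assumptions made for this example, not details confirmed by the article.

```python
import math
from typing import List


def implicit_token_rewards(
    logp_rm: List[float], logp_ref: List[float], beta: float = 0.05
) -> List[float]:
    """Per-token reward: beta * (log p_rm(token) - log p_ref(token)). beta is an assumed scale."""
    if len(logp_rm) != len(logp_ref):
        raise ValueError("log-probability sequences must have equal length")
    return [beta * (a - b) for a, b in zip(logp_rm, logp_ref)]


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def outcome_cross_entropy(token_rewards: List[float], label: int) -> float:
    """Binary cross-entropy between sigmoid(sum of token rewards) and a 0/1 outcome label.

    Assumed training signal: only the final correctness label is needed,
    yet the per-token terms it is built from can be read off as dense rewards.
    """
    score = sum(token_rewards)
    # -log(sigmoid(s)) = softplus(-s); -log(1 - sigmoid(s)) = softplus(s)
    return label * _softplus(-score) + (1 - label) * _softplus(score)


if __name__ == "__main__":
    # Made-up log-probabilities for a three-token response.
    rewards = implicit_token_rewards([-1.0, -0.5, -2.0], [-1.5, -0.5, -1.0], beta=1.0)
    print("token rewards:", rewards)
    print("loss if the response was correct:  ", outcome_cross_entropy(rewards, 1))
    print("loss if the response was incorrect:", outcome_cross_entropy(rewards, 0))
```

Under these assumptions no step-level label is ever consumed: the loss sees only the 0/1 outcome, and the dense rewards fall out of the parameterization.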

The system provides token-level implicit process rewards, computed as a log-ratio between a learned reward model and a reference model. Instead of relying on manual annotation, the reward function is learned from raw outcome labels, which are already collected for policy training. The reward function is also updated online to avoid overoptimization and reward hacking. A hybrid advantage estimation approach combines implicit process rewards and outcome rewards through a leave-one-out Monte Carlo estimator. The policy is optimized with Proximal Policy Optimization (PPO), using a clipped surrogate loss for stability. Training started from Qwen2.5-Math-7B-Base, a base model specialized for mathematical reasoning. The system was trained on 150K queries with four samples per query, whereas Qwen2.5-Math-7B-Instruct used 618K in-house annotations, which points to the data efficiency of the training process.
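The leave-one-out estimator and the clipped surrogate mentioned above can be sketched in a few lines. This is a simplified, response-level illustration under stated assumptions: the function names are hypothetical, the process rewards are collapsed to one total per response, and the two advantage terms are simply added. The article does not specify how the framework combines them in detail.

```python
from typing import List


def leave_one_out_advantages(sample_returns: List[float]) -> List[float]:
    """A_i = R_i minus the mean return of the other K-1 samples for the same query."""
    k = len(sample_returns)
    if k < 2:
        raise ValueError("need at least two samples per query")
    total = sum(sample_returns)
    return [r - (total - r) / (k - 1) for r in sample_returns]


def hybrid_advantages(outcome: List[float], process_totals: List[float]) -> List[float]:
    """Assumed combination: add the leave-one-out advantages of both reward sources."""
    if len(outcome) != len(process_totals):
        raise ValueError("need one outcome and one process total per sample")
    a_out = leave_one_out_advantages(outcome)
    a_proc = leave_one_out_advantages(process_totals)
    return [o + p for o, p in zip(a_out, a_proc)]


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO objective term (to be maximized): min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Four samples for one query, matching the four-samples-per-query setup.
    outcome = [1.0, 0.0, 0.0, 1.0]        # 1.0 = correct final answer
    process_totals = [0.3, -0.2, 0.1, 0.4]  # made-up summed implicit process rewards
    advantages = hybrid_advantages(outcome, process_totals)
    print("hybrid advantages:", advantages)
    # ratio = new policy probability / old policy probability for a token (made-up value).
    print("clipped objective:", clipped_surrogate(1.5, advantages[0]))
```

The leave-one-out baseline needs no separate value model: each sample is compared against the other samples drawn for the same query, and the clipping keeps a single update from moving the policy too far from the one that generated the samples.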

The system shows significant gains in sample efficiency and reasoning performance across several benchmarks. Compared to standard outcome-based RL, it delivers a 2.5× gain in sample efficiency and a 6.9% gain in mathematical problem-solving. The model outperforms Qwen2.5-Math-7B-Instruct on mathematical benchmarks, with better accuracy on competition-level tasks such as AIME and AMC. Models trained with this process also outperform larger models, including GPT-4o, in pass@1 accuracy on challenging reasoning tasks, even though they use only 10% of the training data used by Qwen2.5-Math-7B-Instruct. The results indicate that online updates to the reward model avoid overoptimization, improve training stability, and sharpen credit assignment, making this a strong method for reinforcement learning in LLMs.

This reinforcement learning approach provides an efficient and scalable LLM training process built on dense implicit process rewards. It removes the need for explicit step-level annotations and lowers training costs while improving sample efficiency, stability, and performance. By combining online reward modeling with token-level feedback, it addresses the long-standing problems of reward sparsity and credit assignment in RL for LLMs. These improvements strengthen the reasoning capability of AI models and make them better suited to problem-solving applications in mathematics and programming. This research is a substantial contribution to RL-based L
