This post is early to accommodate some last-minute travel on my end!
The new models trained to express extended chain of thought are going to generalize outside of their breakthrough domains of code and math. The "reasoning" process of the language models we use today is chain-of-thought reasoning. We ask the model to work step by step because it helps the model manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other "reasoning" tasks1. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.
Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but draw on hours and hours of background processing as I converge on this argument.
Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information-processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.
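To make the "parametric probability distribution" framing concrete, here is a toy sketch, with an illustrative hand-written probability table standing in for a model's raw weights. It is not a real language model; it only shows that generation is nothing more than repeatedly sampling the next token from a fixed parametric distribution.

```python
import random

# Toy autoregressive "model": all knowledge lives in a fixed parametric
# table of next-token probabilities -- a stand-in for raw weights.
WEIGHTS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def generate(seed: int = 0) -> list[str]:
    """Sample one token at a time from the parametric distribution."""
    rng = random.Random(seed)
    tokens, cur = [], "<s>"
    while cur != "</s>":
        dist = WEIGHTS[cur]
        cur = rng.choices(list(dist), weights=list(dist.values()))[0]
        if cur != "</s>":
            tokens.append(cur)
    return tokens
```

Every token is drawn from the same frozen table; nothing is written back into the "weights" at inference time, which is exactly why the context window has to carry any intermediate state.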
Chain of thought reasoning allows information to be naturally processed in smaller chunks, allowing the large, brute force probability distribution to work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.
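The point about storing intermediate information in the context window can be sketched with a toy example (plain Python arithmetic standing in for model rollouts): each small step is appended to a growing context string, so the next step can read the stored intermediate result instead of recomputing everything in one shot.

```python
def solve_with_cot(a: int, b: int, c: int) -> tuple[str, int]:
    """Compute (a + b) * c one small step at a time, appending each
    intermediate result to the growing context -- the chain of thought
    acts as external working memory, with no explicit recurrence."""
    context = f"Compute ({a} + {b}) * {c}.\n"
    step1 = a + b                      # first small, verifiable step
    context += f"Step 1: {a} + {b} = {step1}\n"
    step2 = step1 * c                  # second step reads the stored intermediate
    context += f"Step 2: {step1} * {c} = {step2}\n"
    context += f"Answer: {step2}"
    return context, step2
```

The intermediate value lives only in the context string, which is the analogue of a transformer keeping state in its token window rather than in recurrent hidden state.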
Recurrence is required for reasoning, and it can happen either in the parameters or in the state-space. Chain of thought with transformers handles all of this in the state-space of the problem. The humans we look to as the most intelligent have information embedded directly in the parameters of their brains that they can draw on.
Here is the only assumption of this piece: chain of thought is a natural fit for language models to "reason," and therefore one should be optimistic that training methods designed to enhance it will generalize to many domains.2 By the end of 2025 we should have ample evidence of this, given the pace of technological development.
If the analogies between types of intelligence aren't convincing enough, a far more practical way to view the new style of training is as a method that teaches the model to be better at allocating more compute to harder problems. If the skill is compute allocation, it is fundamental to models handling a variety of tasks. Today's reasoning models do not solve this perfectly, but they open the door to doing so precisely.
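A minimal sketch of what "allocating more compute to harder problems" looks like at inference time, assuming a toy noisy rollout in place of a real model: scale the number of sampled rollouts with an estimated difficulty, then aggregate by averaging (a crude self-consistency / best-of-n scheme). The difficulty-to-budget rule here is invented for illustration.

```python
import random

def one_rollout(rng: random.Random, difficulty: float) -> float:
    """Toy stand-in for a single model rollout: noisier on harder problems."""
    return rng.gauss(1.0, difficulty)

def answer_with_budget(difficulty: float, base_samples: int = 1) -> tuple[float, int]:
    """Spend more rollouts (compute) on harder problems, then aggregate
    by averaging. Returns (estimate, number of rollouts used)."""
    rng = random.Random(0)
    n = base_samples + int(difficulty * 8)  # harder problem => larger budget
    samples = [one_rollout(rng, difficulty) for _ in range(n)]
    return sum(samples) / n, n
```

An easy problem (difficulty 0.0) gets a single rollout, while a hard one (difficulty 1.0) gets nine; reasoning models learn a soft version of this trade-off through the length of their chains of thought.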
The nature of this coming generalization is not that these models are one size fits all, best in all cases of speed, intelligence, price, etc. There's still no free lunch. A realistic outcome for reasoning-heavy models in the next 0-3 years is a world where:
- Reasoning-trained models are superhuman on tasks in verifiable domains, like those with initial progress: code, math, etc.
- Reasoning-trained models achieve notably better peak performance than existing autoregressive models in many domains we would not expect and that are not necessarily verifiable.
- Reasoning-trained models still perform better on the long tail of tasks, but at worse cost given the high inference costs of long context.
Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be "spikey" when it shows up, meaning that the capabilities and improvements will vary substantially across domains, but encountering this reality is very unintuitive.
Some evidence for generalization of reasoning models already exists.
OpenAI has already published multiple safety-oriented research projects using their new reasoning models: Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show that their new methods can be translated to various safety domains, i.e. model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training: having a language model check how the safety policies apply to outputs.
An unsurprising quote from the deliberative alignment release related to generalization:
we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.
Safety, qualitatively, is largely orthogonal to traditional reasoning problems. Safety is very sensitive to the information provided and to subtle context, whereas math and coding problems are often about many small, forward processing steps toward a final goal. Most behaviors will fall somewhere in between.
This generative verifier for safety is not a ground-truth signal and could in theory be subject to reward hacking, but here that was avoided. Generative verifiers will be crucial to expanding this training to countless domains: they're easy to use and largely a new development.
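A minimal sketch of a generative verifier used as a soft reward signal, in the spirit of deliberative alignment. The real systems prompt a language model with the safety policy; here a rule-based checker stands in so the loop is runnable, and all names and phrases below are illustrative, not drawn from the papers.

```python
# Toy "policy": phrases the checker treats as violations.
BANNED_PHRASES = ["build a weapon", "home address"]

def verifier_score(response: str) -> float:
    """Soft reward in [0, 1]: deduct 0.5 per policy violation found.
    In deliberative-alignment-style training, this check would itself be
    performed by a language model reasoning over the written policy."""
    violations = sum(p in response.lower() for p in BANNED_PHRASES)
    return max(0.0, 1.0 - 0.5 * violations)

def reward(prompt: str, response: str) -> float:
    # Not a ground-truth signal: a gameable checker invites reward hacking,
    # so the verifier must be harder to fool than the policy it enforces.
    return verifier_score(response)
```

Because the verifier only needs to read text and emit a score, the same pattern extends to any domain where a judgment can be written down as a policy, which is what makes generative verifiers a path to training far beyond code and math.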