Autoregressive (AR) models have transformed the field of image generation, setting a new benchmark for producing high-quality visuals. These models break the image-creation process into sequential steps, generating each token conditioned on the ones before it, which yields outputs of exceptional realism and coherence.
Autoregressive (AR) models have revolutionized the field of image generation, pushing the boundaries of visual realism and coherence. These models operate sequentially, generating each token based on the preceding ones, resulting in outputs of exceptional quality. Researchers have widely employed AR techniques in computer vision, gaming, and digital content creation applications. However, the potential of AR models is often limited by inherent inefficiencies, particularly their slow generation speed, which poses a significant challenge in real-time scenarios.
Among various concerns, a critical aspect that hinders the practical deployment of AR models is their speed. The sequential nature of token-by-token generation inherently limits scalability and introduces high latency during image generation tasks. For instance, generating a 256×256 image using traditional AR models like LlamaGen requires 256 steps, which translates to approximately five seconds on modern GPUs. Such delays hinder their application in scenarios demanding instantaneous results. Moreover, while AR models excel in maintaining the fidelity of their outputs, they face difficulties in meeting the growing demand for both speed and quality in large-scale implementations.
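The sequential bottleneck described above can be sketched in a few lines. This is a toy illustration, not the actual LlamaGen code: the `next_token` function is a hypothetical stand-in for a large transformer forward pass, which is what makes each of the 256 steps expensive in practice.

```python
def next_token(prefix):
    # Hypothetical stand-in for a transformer forward pass over the prefix;
    # returns a dummy token id purely for illustration.
    return len(prefix) % 1024

def generate_ar(num_tokens=256):
    # One model call per token: 256 strictly sequential steps for a
    # 16x16 grid of image tokens -- none of them can be parallelized,
    # which is the latency bottleneck described above.
    tokens = []
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))  # each token conditions on all previous ones
    return tokens

tokens = generate_ar()
print(len(tokens))  # 256
```

Because every call depends on the output of the previous one, wall-clock latency grows linearly with the token count, regardless of how wide the GPU is.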
Efforts to accelerate AR models have led to various methods, such as predicting multiple tokens simultaneously or adopting masking strategies during generation. These approaches aim to reduce the required steps but often compromise the quality of the generated images. For example, in multi-token generation techniques, the assumption of conditional independence among tokens introduces artifacts, ultimately undermining the cohesiveness of the output. Similarly, masking-based methods allow for faster generation by training models to predict specific tokens based on others, but their effectiveness diminishes when generation steps are drastically reduced. These limitations highlight the need for a novel approach to enhance AR model efficiency.
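The artifact problem with the conditional-independence assumption can be demonstrated with a deliberately simple two-token example (not taken from any of the cited papers): suppose the true joint distribution only ever produces the coherent pairs (0, 0) or (1, 1). Sampling each token independently from its marginal, as multi-token schemes implicitly do, produces incoherent mixed pairs about half the time.

```python
import random

random.seed(0)

# True joint distribution: the two tokens are perfectly correlated,
# so only the coherent pairs (0, 0) and (1, 1) ever occur.
def sample_joint():
    v = random.randint(0, 1)
    return (v, v)

# Conditional-independence approximation: each token is drawn from its
# marginal (uniform over {0, 1}) with no knowledge of the other token.
def sample_independent():
    return (random.randint(0, 1), random.randint(0, 1))

joint = [sample_joint() for _ in range(10_000)]
indep = [sample_independent() for _ in range(10_000)]

# An "artifact" here is any incoherent mixed pair (0, 1) or (1, 0).
artifact_rate_joint = sum(a != b for a, b in joint) / len(joint)
artifact_rate_indep = sum(a != b for a, b in indep) / len(indep)
print(artifact_rate_joint)  # 0.0 -- joint sampling never mixes the modes
print(artifact_rate_indep)  # roughly 0.5 -- half the samples are incoherent
```

In an image, the analogous failure is adjacent patches drawn from incompatible modes, which is exactly the kind of artifact that grows as more tokens are predicted in parallel.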
A recent research collaboration between Tsinghua University and Microsoft Research has devised a solution to these challenges: Distilled Decoding (DD). This method builds on flow matching, a deterministic mapping that connects Gaussian noise to the output distribution of pre-trained AR models. Unlike conventional methods, DD does not require access to the original training data of the AR models, making it more practical for deployment. The research demonstrated that DD can transform the generation process from hundreds of steps to as few as one or two while preserving the quality of the output. For example, on ImageNet-256, DD achieved a speed-up of 6.3x for VAR models and an impressive 217.8x for LlamaGen, reducing generation steps from 256 to just one.
The technical foundation of DD is based on its ability to create a deterministic trajectory for token generation. Using flow matching, DD maps noisy inputs to tokens to align their distribution with the pre-trained AR model. During training, the mapping is distilled into a lightweight network that can directly predict the final data sequence from a noise input. This process ensures faster generation and provides flexibility in balancing speed and quality by allowing intermediate steps when needed. Unlike existing methods, DD eliminates the trade-off between speed and fidelity, enabling scalable implementations across diverse tasks.
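The distillation idea can be sketched with a toy stand-in: a deterministic "teacher" map from Gaussian noise to outputs (playing the role of the multi-step flow-matching trajectory through a pre-trained AR model) is distilled into a "student" that predicts the teacher's endpoint from the noise in a single step. Everything here is a minimal assumption-laden sketch; the real DD student is a full network over token sequences, not a linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher": a deterministic noise-to-output map, standing in for the
# flow-matching trajectory of a pre-trained AR model (collapsed to one call).
W_teacher = 0.1 * rng.normal(size=(8, 8))

def teacher_trajectory(noise):
    return np.tanh(noise @ W_teacher)

# "Student": distilled to jump from noise to the teacher's endpoint directly,
# trained by regressing onto teacher outputs -- no original training data needed,
# only the ability to query the teacher (mirroring DD's data-free setup).
W_student = np.zeros((8, 8))
lr = 0.1
for _ in range(2000):
    noise = rng.normal(size=(32, 8))
    target = teacher_trajectory(noise)          # teacher endpoint for this noise
    pred = noise @ W_student                    # student's one-step prediction
    grad = noise.T @ (pred - target) / len(noise)  # gradient of the MSE loss
    W_student -= lr * grad

noise = rng.normal(size=(32, 8))
err = np.mean((noise @ W_student - teacher_trajectory(noise)) ** 2)
print(err)  # small: the one-step student closely tracks the teacher's endpoint
```

The determinism of the teacher map is what makes this regression well-posed: each noise input has exactly one correct endpoint, so a single-step student can, in principle, match it.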
In experiments, DD highlights its superiority over traditional methods. For instance, using VAR-d16 models, DD achieved one-step generation with an FID score increase from 4.19 to 9.96, showcasing minimal quality degradation despite a 6.3x speed-up. For LlamaGen models, the reduction in steps from 256 to one resulted in an FID score of 11.35, compared to 4.11 in the original model, with a remarkable 217.8x speed improvement. DD demonstrated similar efficiency in text-to-image tasks, reducing generation steps from 256 to two while maintaining a comparable FID score of 28.95 against 25.70. The results underline DD’s ability to drastically enhance speed without significant loss in image quality, a feat unmatched by baseline methods.
Several key takeaways from the research on DD include:

- DD distills the deterministic flow-matching trajectory of a pre-trained AR model into a network that generates images in as few as one or two steps.
- The method requires no access to the AR model's original training data, which simplifies practical deployment.
- On ImageNet-256, DD delivered a 6.3x speed-up for VAR models and a 217.8x speed-up for LlamaGen, reducing generation from 256 steps to one.
- Intermediate steps can be reintroduced when needed, giving a tunable balance between speed and quality.
In conclusion, with the introduction of Distilled Decoding, researchers have successfully addressed the longstanding speed-quality trade-off that has plagued AR generation processes by leveraging flow matching and deterministic mappings. The method accelerates image synthesis by reducing steps drastically and preserves the outputs’ fidelity and scalability. With its robust performance, adaptability, and practical deployment advantages, Distilled Decoding opens new frontiers in real-time applications of AR models. It sets the stage for further innovation in generative modeling.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.