Autoregressive visual generation models have emerged as a groundbreaking approach to image synthesis, drawing inspiration from the token prediction mechanisms of language models. These models rely on image tokenizers to transform visual content into discrete or continuous tokens, which facilitates flexible multimodal integration and allows architectural innovations from LLM research to be adapted for vision. However, the field faces a critical challenge: determining the optimal token representation strategy. The choice between discrete and continuous tokens remains a fundamental dilemma, affecting both model complexity and generation quality.
Existing visual tokenization methods explore two primary approaches: continuous and discrete token representations. Variational autoencoders establish continuous latent spaces that maintain high visual fidelity and have become foundational to diffusion model development. Discrete methods such as VQ-VAE and VQGAN enable straightforward autoregressive modeling but suffer from significant limitations, including codebook collapse and information loss.
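To make the two token styles concrete, the following minimal sketch (hypothetical, not drawn from any of the cited papers' code) contrasts them: a VAE-style tokenizer keeps the encoder output as continuous vectors, whereas a VQ-style tokenizer replaces each vector with the index of its nearest codebook entry, which is where codebook collapse and information loss can arise.

```python
import torch

def continuous_tokens(z: torch.Tensor) -> torch.Tensor:
    """VAE-style tokenization: the encoder output is kept as-is,
    a grid of continuous latent vectors (no quantization)."""
    return z  # shape: (batch, num_tokens, dim)

def vq_tokens(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """VQ-VAE/VQGAN-style tokenization (minimal illustration): each latent
    vector is replaced by the index of its nearest codebook entry."""
    # z: (batch, num_tokens, dim); codebook: (vocab_size, dim)
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    return dists.argmin(dim=-1)  # discrete indices: (batch, num_tokens)

# Toy usage: 4 images, 256 latent tokens of dimension 16, a 1024-entry codebook.
z = torch.randn(4, 256, 16)
codebook = torch.randn(1024, 16)
print(continuous_tokens(z).shape)  # torch.Size([4, 256, 16])
print(vq_tokens(z, codebook).shape)  # torch.Size([4, 256])
```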
Autoregressive image generation has evolved from computationally intensive pixel-based methods to more efficient token-based strategies. While models like DALL-E show promising results, hybrid methods such as GIVT and MAR introduce complex architectural modifications to improve generation quality, complicating the traditional autoregressive modeling pipeline.
To bridge this critical gap between continuous and discrete token representations in visual generation, researchers from the University of Hong Kong, ByteDance Seed, Ecole Polytechnique, and Peking University propose TokenBridge. It aims to retain the strong representation capacity of continuous tokens while maintaining the modeling simplicity of discrete tokens. TokenBridge decouples the discretization process from the initial tokenizer training by introducing a novel post-training quantization technique. It also implements a dimension-wise quantization strategy that discretizes each feature dimension independently, complemented by a lightweight autoregressive prediction mechanism that efficiently manages the expanded token space while preserving high-quality visual generation capabilities.
TokenBridge introduces a training-free, dimension-wise quantization technique that operates independently on each feature channel, effectively addressing the limitations of previous token representations. The approach capitalizes on two crucial properties of Variational Autoencoder features: their bounded value range, which follows from the KL constraint, and their near-Gaussian distribution.
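As a rough illustration of this idea (a minimal sketch under assumed settings, not the authors' released code), the snippet below quantizes each latent channel independently into a fixed number of levels over a clipped range; the bin count and clipping bound are hypothetical, and uniform bins stand in for whatever spacing the paper actually uses.

```python
import torch

def quantize_per_dim(z: torch.Tensor, num_bins: int = 64, bound: float = 5.0) -> torch.Tensor:
    """Training-free, dimension-wise quantization of continuous VAE latents.
    Each channel is clipped to [-bound, bound] and mapped to one of `num_bins`
    uniformly spaced levels (illustrative choices, not the paper's exact settings)."""
    z = z.clamp(-bound, bound)
    step = 2 * bound / (num_bins - 1)
    return torch.round((z + bound) / step).long()  # one discrete code per dimension

def dequantize_per_dim(codes: torch.Tensor, num_bins: int = 64, bound: float = 5.0) -> torch.Tensor:
    """Map the per-dimension codes back to continuous values for the VAE decoder."""
    step = 2 * bound / (num_bins - 1)
    return codes.float() * step - bound

# Toy usage: a grid of 256 latent tokens with 16 channels each.
z = torch.randn(1, 256, 16)
codes = quantize_per_dim(z)        # integer codes, shape (1, 256, 16)
z_rec = dequantize_per_dim(codes)  # continuous reconstruction
print((z - z_rec).abs().max())     # quantization error stays below half a bin width
```

Because the discretization is applied after the continuous tokenizer is trained and separately per channel, the tokenizer itself never needs to be retrained.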
The autoregressive model adopts a Transformer architecture in two primary configurations: a default L model with 32 blocks and a width of 1024 (approximately 400 million parameters) for the initial studies, and a larger H model with 40 blocks and a width of 1280 (around 910 million parameters) for the final evaluations. This design allows a detailed exploration of the proposed quantization strategy across different model scales.
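The paragraph above specifies only the backbone; the sketch below illustrates, under assumed shapes and with a stand-in GRU module (none of these names or sizes come from the paper), how a lightweight head on top of a Transformer hidden state could predict the per-dimension codes one channel at a time with a standard cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DimensionWiseHead(nn.Module):
    """Illustrative lightweight head: given the backbone state for one spatial
    token, predict its per-dimension codes autoregressively across channels."""

    def __init__(self, hidden_dim=1024, num_dims=16, num_bins=64, head_dim=256):
        super().__init__()
        self.num_bins = num_bins
        self.code_embed = nn.Embedding(num_bins, head_dim)
        self.proj = nn.Linear(hidden_dim, head_dim)
        self.rnn = nn.GRU(head_dim, head_dim, batch_first=True)  # small AR module over channels
        self.classifier = nn.Linear(head_dim, num_bins)

    def forward(self, h: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
        # h: (B, hidden_dim) backbone states; codes: (B, num_dims) integer targets
        ctx = self.proj(h).unsqueeze(1)               # (B, 1, head_dim)
        prev = self.code_embed(codes[:, :-1])         # teacher forcing: shifted targets
        out, _ = self.rnn(torch.cat([ctx, prev], 1))  # (B, num_dims, head_dim)
        logits = self.classifier(out)                 # (B, num_dims, num_bins)
        return F.cross_entropy(logits.reshape(-1, self.num_bins), codes.reshape(-1))

# Toy usage with hypothetical shapes.
head = DimensionWiseHead()
h = torch.randn(8, 1024)               # backbone states for 8 spatial tokens
codes = torch.randint(0, 64, (8, 16))  # per-dimension discrete codes
print(head(h, codes).item())           # standard cross-entropy loss
```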
The results demonstrate that TokenBridge outperforms traditional discrete token models, achieving a better Fréchet Inception Distance (FID) with significantly fewer parameters. For instance, TokenBridge-L reaches an FID of 1.76 with only 486 million parameters, compared with LlamaGen's 2.18 obtained with 3.1 billion parameters. Benchmarked against continuous approaches, TokenBridge-L also outperforms GIVT, achieving an FID of 1.76 versus 3.35.
The H-model configuration further validates the method's effectiveness, matching MAR-H in FID (1.55) while delivering superior Inception Score and Recall metrics with marginally fewer parameters. These results highlight TokenBridge's capability to bridge discrete and continuous token representations.
In conclusion, researchers present TokenBridge, which bridges the longstanding gap between discrete and continuous token representations. It achieves high-quality visual generation with remarkable efficiency by introducing a post-training quantization approach and dimension-wise autoregressive decomposition. The research demonstrates that discrete token approaches using standard cross-entropy loss can compete with state-of-the-art continuous methods, eliminating the need for complex distribution modeling techniques. This finding opens a promising pathway for future investigations, potentially transforming how researchers conceptualize and implement token-based visual synthesis technologies.
Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.