Cryptocurrency News

Content-Adaptive Tokenization (CAT): A Pioneering Framework for Content-Aware Image Tokenization

2025/01/11 09:03

Researchers from Carnegie Mellon University and Meta have proposed Content-Adaptive Tokenization (CAT), a pioneering framework for content-aware image tokenization.

In AI-driven image modeling, one critical challenge that has yet to be fully addressed is the inability to account for the wide variation in image content complexity. Existing tokenization methods largely employ static compression ratios, treating all images equally regardless of their complexity. As a result, complex images are often over-compressed and lose crucial information, while simpler images remain under-compressed and waste computational resources. These inefficiencies directly affect downstream operations such as image reconstruction and generation, where accurate and efficient representation plays a pivotal role.

Current image tokenization techniques fall short of addressing this variation in complexity. Fixed-ratio approaches, such as resizing images to a standard size, ignore differences in content complexity. Vision Transformers can adapt patch size dynamically, but they rely on image input and lack the flexibility required for text-to-image applications. Other compression techniques, such as JPEG, are designed for traditional media and are not optimized for deep-learning-based tokenization. Recent work such as ElasticTok has explored random token-length strategies but did not consider intrinsic content complexity at training time, leading to inefficiencies in both quality and computational cost.

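To make the fixed-ratio limitation concrete, the following minimal sketch (illustrative only, not code from any of the cited works) resizes every image to the same resolution and cuts it into a fixed patch grid, so a dense chart and a blank sky receive exactly the same token budget; the sizes and function names here are assumptions chosen for the example.

```python
# Minimal sketch of fixed-ratio tokenization (illustrative, not from the paper).
# Every image is resized to 256x256 and cut into 16x16 patches, so the token
# count is always 256 regardless of how simple or complex the content is.
import numpy as np

def fixed_ratio_tokenize(image: np.ndarray, size: int = 256, patch: int = 16) -> np.ndarray:
    """Return a (num_tokens, patch*patch*channels) array of flattened patches."""
    # Nearest-neighbour resize via index sampling (keeps the sketch dependency-free).
    h, w, c = image.shape
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    resized = image[ys][:, xs]                       # (size, size, c)
    grid = size // patch                             # 16 patches per side
    patches = resized.reshape(grid, patch, grid, patch, c).swapaxes(1, 2)
    return patches.reshape(grid * grid, patch * patch * c)

simple_img = np.zeros((480, 640, 3), dtype=np.uint8)                       # blank sky
complex_img = np.random.randint(0, 255, (1024, 1024, 3), dtype=np.uint8)   # dense chart
print(fixed_ratio_tokenize(simple_img).shape[0])    # 256 tokens
print(fixed_ratio_tokenize(complex_img).shape[0])   # 256 tokens -- same budget
```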
To address these limitations, researchers from Carnegie Mellon University and Meta have proposed Content-Adaptive Tokenization (CAT), a pioneering framework for content-aware image tokenization that allocates representation capacity dynamically based on content complexity. The framework uses large language models to assess image complexity from captions and perception-based queries, classifying images into three compression levels: 8x, 16x, and 32x. It then relies on a nested VAE architecture that generates variable-length latent features by dynamically routing intermediate outputs according to image complexity. This adaptive design reduces training overhead and improves image representation quality, overcoming the inefficiencies of fixed-ratio methods. Notably, CAT performs adaptive and efficient tokenization using text-based complexity analysis alone, without requiring image inputs at inference.

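The routing idea can be sketched roughly as follows, under the assumption that the three compression levels correspond to 8x, 16x, and 32x spatial downsampling of the input. The keyword heuristic standing in for the LLM-based complexity assessment, and all function names and thresholds, are hypothetical illustrations rather than the authors' implementation.

```python
# Hedged sketch of content-adaptive compression-level selection
# (assumptions noted above; this is not the authors' implementation).
def choose_compression(caption: str) -> int:
    """Map a caption-derived complexity judgement to a downsampling factor.

    Stand-in heuristic: captions mentioning text, charts, or many objects are
    treated as complex (8x); portraits and plain scenes as easy (32x); else 16x.
    """
    caption = caption.lower()
    if any(k in caption for k in ("chart", "text", "diagram", "crowd", "many")):
        return 8      # complex content: compress least
    if any(k in caption for k in ("portrait", "sky", "plain", "logo")):
        return 32     # simple content: compress most
    return 16         # default middle level

def num_latent_tokens(image_size: int, factor: int) -> int:
    """Token count if the latent grid is (image_size / factor) ** 2."""
    side = image_size // factor
    return side * side

for cap in ("a bar chart with dense text labels",
            "a portrait against a plain background"):
    f = choose_compression(cap)
    print(cap, "->", f, "x,", num_latent_tokens(512, f), "tokens")
# For a 512px image: 8x -> 64*64 = 4096 tokens, 32x -> 16*16 = 256 tokens.
```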
CAT evaluates complexity using captions produced by LLMs, which consider semantic, visual, and perceptual features when determining compression ratios. This caption-based approach is observed to mimic human-perceived importance better than traditional proxies such as JPEG file size and MSE. The adaptive nested VAE design realizes the chosen ratio through channel-matched skip connections that dynamically adjust the latent space across compression levels. Shared parameterization guarantees consistency across scales, and training combines a reconstruction loss, a perceptual loss (LPIPS), and an adversarial loss to reach optimal performance. CAT was trained on a dataset of 380 million images and evaluated on the COCO, ImageNet, CelebA, and ChartQA benchmarks, demonstrating its applicability to different image types.

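A minimal PyTorch-style sketch of the combined training objective described above is shown below; the loss weights, the hinge-style adversarial term, and the passed-in LPIPS and discriminator modules are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a VAE-style tokenizer objective mixing reconstruction,
# perceptual (LPIPS), and adversarial losses (weights are illustrative).
import torch
import torch.nn.functional as F

def tokenizer_loss(x, x_hat, lpips_fn, discriminator,
                   w_rec=1.0, w_perc=1.0, w_adv=0.1):
    """x, x_hat: (B, 3, H, W) tensors in [-1, 1]."""
    rec = F.l1_loss(x_hat, x)             # pixel-space reconstruction error
    perc = lpips_fn(x_hat, x).mean()      # perceptual similarity (e.g. LPIPS module)
    adv = -discriminator(x_hat).mean()    # generator term of a hinge-style GAN loss
    return w_rec * rec + w_perc * perc + w_adv * adv
```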
By adapting compression to content complexity, this approach achieves significant performance improvements in both image reconstruction and generation. For reconstruction, it markedly improves rFID, LPIPS, and PSNR. It delivers a 12% quality improvement on CelebA reconstruction and a 39% improvement on ChartQA, while matching quality on datasets such as COCO and ImageNet using fewer tokens. For class-conditional ImageNet generation, CAT outperforms fixed-ratio baselines with an FID of 4.56 and improves inference throughput by 18.5%. This adaptive tokenization framework sets a new baseline for further improvement.

CAT presents a novel approach to image tokenization by dynamically modulating compression levels based on the complexity of the content. It integrates LLM-based assessments with an adaptive nested VAE, eliminating persistent inefficiencies associated with fixed-ratio tokenization, thereby significantly improving performance in reconstruction and generation tasks. The adaptability and effectiveness of CAT make it a revolutionary asset in AI-oriented image modeling, with potential applications extending to video and multi-modal domains.

News source: www.marktechpost.com

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
