SmolVLMs: Hugging Face Releases the World's Smallest Vision Language Models

2025/01/26 00:21

Recent years have seen a massive increase in the capabilities of machine learning algorithms, which can now perform a wide range of tasks, from making predictions to matching patterns or generating images that match text prompts. To enable them to take on such diverse roles, these models have been given a broad spectrum of capabilities, but one thing they rarely are is efficient.

In the present era of exponential growth in the field, rapid advancements often come at the expense of efficiency. It is faster, after all, to produce a very large kitchen-sink model filled with redundancies than it is to produce a lean, mean inferencing machine.

But as these algorithms mature, more attention is being paid to shrinking them. Even the most useful tools are of little value if they need so many computational resources that they are impractical for real-world applications. As you might expect, the more complex an algorithm is, the harder it is to shrink. That is what makes Hugging Face’s recent announcement so exciting: they have taken an axe to vision language models (VLMs) and released new additions to the SmolVLM family, including SmolVLM-256M, the smallest VLM in the world.

SmolVLM-256M is an impressive example of optimization done right, with just 256 million parameters. Despite its small size, this model performs very well in tasks such as captioning, document-based question answering, and basic visual reasoning, outperforming older, much larger models like the Idefics 80B from just 17 months ago. The SmolVLM-500M model provides an additional performance boost, with 500 million parameters offering a middle ground between size and capability for those needing some extra headroom.
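
To put those parameter counts in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need. It assumes 2 bytes per parameter (16-bit weights), which is an assumption and not something stated in the announcement, and it ignores activations and any other runtime overhead. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Parameter counts as quoted in the article.
models = {
    "SmolVLM-256M": 256e6,
    "SmolVLM-500M": 500e6,
    "Idefics 80B": 80e9,
}

for name, n_params in models.items():
    # Assumes 16-bit weights; real deployments may use other precisions.
    print(f"{name}: ~{weight_memory_gb(n_params):.2f} GB of weights")
```

Under that assumption the 256M model needs roughly half a gigabyte for its weights, while the 80B model needs about 160 GB, which is the gap between a laptop and a multi-GPU server.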

Hugging Face achieved these advancements by refining its approach to vision encoders and data mixtures. The new models adopt the SigLIP base patch-16/512 encoder, which, though smaller than its predecessor, processes images at a higher resolution. This choice aligns with recent trends seen in Apple and Google research, which emphasize higher resolution for improved visual understanding without drastically increasing parameter counts.
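
A quick sketch of why resolution matters for a patch-based encoder: reading "patch-16/512" as 16-pixel patches over a 512-pixel square input, the image is cut into a grid of patches, and the patch count grows with resolution while the encoder's parameter count stays the same. The function name `num_patches` and the 256-pixel comparison are illustrative assumptions, not details from the announcement.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# 512-pixel input with 16-pixel patches: 32 patches per side.
print(num_patches(512, 16))

# Hypothetical lower-resolution input with the same patch size, for comparison.
print(num_patches(256, 16))
```

Doubling the side length quadruples the number of patches the encoder sees, so finer visual detail reaches the language model without adding weights to the encoder.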

The team also employed innovative tokenization methods to further streamline their models. By improving how sub-image separators are represented during tokenization, the models gained greater stability during training and achieved better quality outputs. For example, multi-token representations of image regions were replaced with single-token equivalents, enhancing both efficiency and accuracy.
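
The difference between multi-token and single-token separators can be illustrated with a toy tokenizer. This is only a sketch: the separator string `<row_1_col_1>`, the function `toy_tokenize`, and the splitting rules are hypothetical stand-ins, not the actual SmolVLM tokenizer.

```python
import re


def toy_tokenize(text, special_tokens=()):
    """Split text into toy tokens; strings listed in special_tokens stay whole."""
    specials = set(special_tokens)
    if specials:
        # Try longer special tokens first so a shorter one cannot split them.
        ordered = sorted(specials, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
        chunks = re.split(pattern, text)
    else:
        chunks = [text]

    tokens = []
    for chunk in chunks:
        if chunk in specials:
            tokens.append(chunk)
        else:
            # Fallback: runs of letters, runs of digits, or single symbols.
            tokens.extend(re.findall(r"[A-Za-z]+|\d+|\S", chunk))
    return tokens


separator = "<row_1_col_1>"  # hypothetical sub-image separator

multi = toy_tokenize(separator)
single = toy_tokenize(separator, special_tokens=(separator,))

print(len(multi), multi)
print(len(single), single)
```

Without a dedicated entry the separator breaks into several pieces that the model has to learn to reassemble; registering it as one special token gives each image region a single, unambiguous marker and shortens the sequence.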

In another advance, the data mixture strategy was fine-tuned to emphasize document understanding and image captioning, while maintaining a balanced focus on essential areas like visual reasoning and chart comprehension. These refinements are reflected in the models’ improved benchmark results, which show both the 256M and 500M models outperforming Idefics 80B in nearly every category.
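
One common way to implement a data mixture is weighted sampling over task categories. The sketch below shows the idea only: the category names follow the article, but the weights are invented for illustration, since the article does not give the real proportions, and `sample_sources` is a hypothetical helper.

```python
import random

# Hypothetical mixture weights; the real proportions are not given in the article.
MIXTURE = {
    "document_understanding": 0.40,
    "image_captioning": 0.30,
    "visual_reasoning": 0.15,
    "chart_comprehension": 0.15,
}


def sample_sources(mixture, n, seed=0):
    """Draw n task categories in proportion to their mixture weights."""
    rng = random.Random(seed)  # seeded so the draw is reproducible
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch = sample_sources(MIXTURE, 1000)
counts = {name: batch.count(name) for name in MIXTURE}
print(counts)
```

Raising a category's weight makes it show up more often in training batches, which is how a mixture can lean toward documents and captions while still covering reasoning and charts.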

By demonstrating that small can indeed be mighty, these models pave the way for a future where advanced machine learning capabilities are both accessible and sustainable. If you want to help bring that future into being, go grab these models now. Hugging Face has open-sourced them, and with only modest hardware requirements, just about anyone can get in on the action.
