SmolVLMs: Hugging Face Releases the World's Smallest Vision Language Model

2025/01/26 00:21

Recent years have seen a massive increase in the capabilities of machine learning algorithms, which can now perform a wide range of tasks, from making predictions to matching patterns or generating images that match text prompts. To enable them to take on such diverse roles, these models have been given a broad spectrum of capabilities, but one thing they rarely are is efficient.

In the present era of exponential growth in the field, rapid advancements often come at the expense of efficiency. It is faster, after all, to produce a very large kitchen-sink model filled with redundancies than it is to produce a lean, mean inferencing machine.

But as today's algorithms mature, more attention is turning to slimming them down. Even the most useful tool is of little value if it demands so much computing power that it is impractical for real-world applications. As you might expect, the more complex an algorithm is, the harder it is to shrink. That is what makes Hugging Face's recent announcement so exciting: the company has taken an axe to vision language models (VLMs) and released new additions to its SmolVLM family, including SmolVLM-256M, the smallest VLM in the world.

SmolVLM-256M is an impressive example of optimization done right, with just 256 million parameters. Despite its small size, this model performs very well in tasks such as captioning, document-based question answering, and basic visual reasoning, outperforming older, much larger models like the Idefics 80B from just 17 months ago. The SmolVLM-500M model provides an additional performance boost, with 500 million parameters offering a middle ground between size and capability for those needing some extra headroom.
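
To put these parameter counts in perspective, the weights alone take roughly "parameters x bytes per parameter" of memory. The sketch below assumes 16-bit (fp16/bf16) weights and uses the nominal counts from the model names; it ignores activations, caches and runtime overhead, so real memory use will be higher.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumes 16-bit (fp16/bf16) weights by default; activations, KV cache and
# runtime overhead are NOT included, so actual usage will be higher.


def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of a model's weights in GiB."""
    return num_params * bytes_per_param / 2**30


# Nominal parameter counts taken from the model names in the article.
models = {
    "SmolVLM-256M": 256e6,
    "SmolVLM-500M": 500e6,
    "Idefics 80B": 80e9,
}

for name, params in models.items():
    print(f"{name}: ~{weight_memory_gib(params):.2f} GiB of weights")
```

Under these assumptions SmolVLM-256M needs about half a GiB for its weights and SmolVLM-500M just under 1 GiB, while Idefics 80B needs roughly 149 GiB, well beyond a single consumer GPU.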

Hugging Face achieved these advancements by refining its approach to vision encoders and data mixtures. The new models adopt the SigLIP base patch-16/512 encoder, which, though smaller than its predecessor, processes images at a higher resolution. This choice aligns with recent trends seen in Apple and Google research, which emphasize higher resolution for improved visual understanding without drastically increasing parameter counts.
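
The "16/512" in the encoder's name means 16x16-pixel patches on a 512x512 input. The sketch below works out how many patch embeddings that produces, modeling the usual ViT-style patch embedding (non-overlapping patches, leftover edge pixels dropped). The comparison configuration, a 384x384 input with 14-pixel patches as in SigLIP so400m-patch14-384, is an assumption about the predecessor; the article does not name it.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches per side, total patches) for a square image.

    Models a ViT-style patch embedding: a strided convolution whose kernel
    and stride both equal patch_size, with no padding, so edge pixels that
    do not fill a whole patch are dropped (floor division).
    """
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# SigLIP base patch-16/512, as named in the article.
new_side, new_total = patch_grid(512, 16)
# Assumed predecessor configuration (hypothetical for this comparison).
old_side, old_total = patch_grid(384, 14)

print(f"patch-16/512: {new_side}x{new_side} = {new_total} patches")
print(f"patch-14/384: {old_side}x{old_side} = {old_total} patches")
```

The higher-resolution input yields 1,024 patches versus 729, so the smaller encoder sees more visual detail. How many of those patches ultimately reach the language model depends on any downstream token compression, which the article does not describe.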

The team also adopted new tokenization techniques to streamline the models further. By changing how sub-image separators are represented during tokenization, they made training more stable and improved output quality. For example, the separators marking each image region, which previously took several tokens to represent, were replaced with single-token equivalents, improving both efficiency and accuracy.
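
The article does not give the exact separator strings, so the toy sketch below assumes a hypothetical `<row_R_col_C>` format and uses a crude regex tokenizer as a stand-in for a real subword tokenizer. It only illustrates why mapping each separator to one special token shortens the sequence.

```python
import re


def separators(rows: int, cols: int) -> list[str]:
    """Hypothetical separator strings marking each sub-image in a rows x cols grid."""
    return [f"<row_{r}_col_{c}>" for r in range(1, rows + 1) for c in range(1, cols + 1)]


def naive_tokenize(text: str) -> list[str]:
    """Crude stand-in for a subword tokenizer: letter runs, single digits, punctuation."""
    return re.findall(r"[A-Za-z]+|\d|[^\sA-Za-z\d]", text)


def tokenize_with_specials(text: str, specials: list[str]) -> list[str]:
    """Emit each special string as a single token; split everything else naively."""
    # Longest strings first so a separator is never matched by a shorter prefix.
    ordered = sorted(specials, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(s) for s in ordered) + ")"
    special_set = set(specials)
    tokens: list[str] = []
    for piece in re.split(pattern, text):  # capturing group keeps the separators
        if not piece:
            continue
        if piece in special_set:
            tokens.append(piece)
        else:
            tokens.extend(naive_tokenize(piece))
    return tokens


seps = separators(2, 2)
prompt = "".join(seps)  # separators only; image embeddings omitted

print(len(naive_tokenize(prompt)))                # several tokens per separator
print(len(tokenize_with_specials(prompt, seps)))  # one token per separator
```

With these assumptions the four separators cost 36 tokens under the naive scheme and 4 as special tokens. A real tokenizer would split such strings differently, so only the direction of the saving carries over, not the exact counts.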

In another refinement, the data mixture was retuned to emphasize document understanding and image captioning while keeping a balanced focus on essential areas such as visual reasoning and chart comprehension. These changes show up in the models' benchmark results, where both the 256M and 500M models outperform Idefics 80B in nearly every category.
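
The article does not say how the mixture was implemented. One common approach, assumed here, is to sample each training example's source category according to fixed weights. The category names come from the paragraph above, but the weights in the sketch are hypothetical placeholders, not Hugging Face's published proportions.

```python
import random
from collections import Counter

# Hypothetical mixture weights for illustration only; the article names the
# categories but does not publish the actual proportions.
MIXTURE = {
    "document_understanding": 0.35,
    "image_captioning": 0.35,
    "visual_reasoning": 0.15,
    "chart_comprehension": 0.15,
}


def sample_categories(mixture: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n training-example categories in proportion to the mixture weights."""
    rng = random.Random(seed)  # seeded for reproducible batches
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch = sample_categories(MIXTURE, n=10_000)
counts = Counter(batch)
for name, weight in MIXTURE.items():
    print(f"{name}: target {weight:.2f}, sampled {counts[name] / len(batch):.3f}")
```

Raising a category's weight makes it appear more often in each batch without dropping the others, which is one way to emphasize some skills while keeping a balanced focus on the rest.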

By demonstrating that small can indeed be mighty, these models pave the way for a future where advanced machine learning capabilities are both accessible and sustainable. If you want to help bring that future into being, go grab these models now. Hugging Face has open-sourced them, and with only modest hardware requirements, just about anyone can get in on the action.
