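Researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory have introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer is designed specifically to analyze metagenomic sequences.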
With emerging pandemics posing persistent threats to global health, the need for advanced biosurveillance and pathogen detection systems is becoming increasingly evident. Traditional genomic analysis methods, while effective in isolated cases, often encounter challenges in addressing the complexities of large-scale health monitoring. A significant difficulty lies in identifying and understanding the genomic diversity in environments such as wastewater, which contains a rich mix of microbial and viral DNA and RNA. In this context, the rapid advancements in biological research are highlighting the importance of scalable, accurate, and interpretable models to analyze vast amounts of metagenomic data, aiding in the prediction and mitigation of health crises.
Now, a team of researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory has introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer model is specifically designed to analyze metagenomic sequences. METAGENE-1 is trained on a dataset of over 1.5 trillion DNA and RNA base pairs, obtained from human wastewater samples with next-generation sequencing technologies. It uses a tailored byte-pair encoding (BPE) tokenization strategy to capture the genomic diversity present in these datasets. The model is open-sourced, encouraging collaboration and further advancements in the field.
Technical Highlights and Benefits
METAGENE-1’s architecture draws on modern transformer models, including GPT and Llama families. This decoder-only transformer uses a causal language modeling objective to predict the next token in a sequence based on preceding tokens. Its key features include:
A decoder-only transformer architecture with 7 billion parameters.
Trained on a vast dataset of over 1.5 trillion DNA and RNA base pairs from human wastewater samples.
Employs a BPE tokenization strategy tailored to metagenomic sequences.
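To make the BPE item concrete, here is a minimal sketch of how byte-pair encoding can learn merges from nucleotide reads. It is an illustration only, not METAGENE-1's actual tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`), the toy reads, and the number of merges are all assumptions made for this example.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences; return the most common."""
    counts = Counter()
    for tokens in sequences:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(reads, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    sequences = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(tokens, pair) for tokens in sequences]
    return merges, sequences

# Hypothetical toy reads, not real wastewater data.
toy_reads = ["ACGTACGT", "ACGACG", "TTACGT"]
merges, tokenized = train_bpe(toy_reads, num_merges=2)
print(merges)
print(tokenized)
```

Frequent nucleotide patterns end up as single tokens, so a read is represented by fewer tokens than it has bases. A production tokenizer would learn far more merges from a much larger sample.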
These features enable METAGENE-1 to generate high-quality sequence embeddings and adapt to specific tasks, enhancing its utility in the genomic and public health domains.
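The causal language modeling objective mentioned in this section can be sketched as the average negative log-likelihood of each token given the tokens before it. The snippet below is a toy illustration under stated assumptions: `next_token_probs` is a hypothetical stand-in for a trained model, and `uniform_model` with a four-token vocabulary exists only to make the example runnable.

```python
import math

def causal_lm_loss(token_ids, next_token_probs):
    """Average negative log-likelihood of each token given its prefix.

    `next_token_probs(prefix)` is a hypothetical model interface: it takes the
    preceding token ids and returns a dict of next-token probabilities.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        probs = next_token_probs(token_ids[:t])  # the model sees only the past
        total -= math.log(probs[token_ids[t]])
    return total / (len(token_ids) - 1)

def uniform_model(prefix, vocab_size=4):
    """Placeholder model that assigns equal probability to every token."""
    return {tok: 1.0 / vocab_size for tok in range(vocab_size)}

loss = causal_lm_loss([0, 1, 2, 3, 0], uniform_model)
print(loss)  # equals log(4) for a uniform model over four tokens
```

Training lowers this loss by making the predicted distribution put more probability on the token that actually comes next.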
Results and Insights
The capabilities of METAGENE-1 were assessed using multiple benchmarks, where it demonstrated notable performance. In a pathogen detection benchmark based on human wastewater samples, the model achieved an average Matthews correlation coefficient (MCC) of 92.96, outperforming the other models in the comparison. Because the MCC itself ranges from -1 to 1, this figure is evidently reported scaled by 100. Additionally, METAGENE-1 showed strong results in anomaly detection tasks, effectively distinguishing metagenomic sequences from other genomic data sources.
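For readers unfamiliar with the metric, the Matthews correlation coefficient for a binary classifier is computed from the four confusion-matrix counts. The sketch below uses hypothetical labels (1 for pathogen, 0 for non-pathogen) that are not taken from the benchmark, and a local helper named `matthews_corrcoef` written for this example.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """Binary MCC: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical labels: 1 = pathogen read, 0 = non-pathogen read.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC = {mcc:.2f}, scaled by 100 = {100 * mcc:.2f}")
```

Unlike plain accuracy, the MCC stays informative when the two classes are imbalanced, which is common when pathogen reads are rare in a sample.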
In embedding-based genomic analyses, METAGENE-1 excelled on the Gene-MTEB benchmark, achieving a global average score of 0.59. This performance underscores its adaptability in both zero-shot and fine-tuning scenarios, reinforcing its value in handling complex and diverse metagenomic data.
Conclusion
METAGENE-1 represents a thoughtful integration of artificial intelligence and metagenomics. By leveraging transformer architectures, the model offers practical solutions for biosurveillance and pandemic preparedness. Its open-source release invites researchers to collaborate and innovate, advancing the field of genomic science. As challenges related to emerging pathogens and global pandemics continue, METAGENE-1 demonstrates how technology can play a crucial role in addressing public health concerns effectively and responsibly.
- 2025-01-08 15:05:21