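Researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory have introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer is designed specifically to analyze metagenomic sequences.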
With emerging pandemics posing persistent threats to global health, the need for advanced biosurveillance and pathogen detection systems is becoming increasingly evident. Traditional genomic analysis methods, while effective in isolated cases, often encounter challenges in addressing the complexities of large-scale health monitoring. A significant difficulty lies in identifying and understanding the genomic diversity in environments such as wastewater, which contains a rich mix of microbial and viral DNA and RNA. In this context, the rapid advancements in biological research are highlighting the importance of scalable, accurate, and interpretable models to analyze vast amounts of metagenomic data, aiding in the prediction and mitigation of health crises.
Now, a team of researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory has introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer is designed specifically to analyze metagenomic sequences. It was trained on more than 1.5 trillion DNA and RNA base pairs sequenced from human wastewater samples with next-generation sequencing, and it uses a byte-pair encoding (BPE) tokenization strategy tailored to capture the genomic diversity in that data. The model is open source, which encourages collaboration and further work in the field.
Technical Highlights and Benefits
METAGENE-1’s architecture draws on modern transformer models, including the GPT and Llama families. This decoder-only transformer uses a causal language modeling objective: it predicts the next token in a sequence from the preceding tokens. Its key features include:
A decoder-only transformer architecture with 7 billion parameters.
Trained on a vast dataset of over 1.5 trillion DNA and RNA base pairs from human wastewater samples.
Employs a BPE tokenization strategy tailored to metagenomic sequences.
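To make the BPE item above concrete, here is a minimal sketch of how byte-pair encoding can learn merges over nucleotide strings. It is a generic textbook version, not METAGENE-1's actual tokenizer: the function names `learn_bpe_merges` and `merge_pair`, the toy sequence, and the merge count are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single-base tokens.

    Hypothetical helper: repeatedly merges the most frequent adjacent pair.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every sequence is already a single token
        best_pair, _count = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        corpus = [merge_pair(tokens, best_pair) for tokens in corpus]
    return merges, corpus


# Toy usage on a made-up read; real training data would be far larger.
merges, corpus = learn_bpe_merges(["ACACAC"], num_merges=2)
print(merges)
print(corpus)
```

Frequent multi-base motifs end up as single tokens, so a read is represented by fewer tokens than it has bases. Joining the tokens always reproduces the original read.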
These features enable METAGENE-1 to generate high-quality sequence embeddings and adapt to specific tasks, enhancing its utility in the genomic and public health domains.
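The causal language modeling objective mentioned in this section can be sketched in a few lines. The example below computes the average next-token negative log-likelihood from per-position scores. It is an illustration in plain Python rather than the model's training code, and `causal_lm_loss`, the four-token vocabulary, and the uniform logits are assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities, shifted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(logits_per_position, token_ids):
    """Average negative log-likelihood of each token given the tokens before it.

    logits_per_position[t] holds the vocabulary scores produced after reading
    token_ids[: t + 1]; the prediction target at step t is token_ids[t + 1].
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        losses.append(-math.log(probs[token_ids[t + 1]]))
    return sum(losses) / len(losses)


# Toy usage: a made-up 4-token vocabulary and uniform (uninformed) scores.
token_ids = [0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in token_ids]
print(causal_lm_loss(uniform_logits, token_ids))
```

With uniform scores the loss equals the natural log of the vocabulary size. Training pushes it lower by assigning more probability to the token that actually comes next.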
Results and Insights
METAGENE-1 was evaluated on multiple benchmarks. In a pathogen detection benchmark based on human wastewater samples, the model achieved an average Matthews correlation coefficient (MCC) of 92.96, outperforming the other models evaluated. Because MCC itself ranges from -1 to 1, this figure is presumably reported on a scale multiplied by 100. METAGENE-1 also showed strong results in anomaly detection tasks, effectively distinguishing metagenomic sequences from other genomic data sources.
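For readers unfamiliar with the metric, the Matthews correlation coefficient for a binary task can be computed from the four confusion-matrix counts. The sketch below uses the standard formula. The helper name `mcc` and the toy labels are assumptions and have no connection to the benchmark data.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Returns a value in [-1, 1]; 0.0 is returned when the denominator is zero,
    which is a common convention for degenerate cases.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy usage: 1 = pathogen read, 0 = non-pathogen read (made-up labels).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(mcc(y_true, y_pred))
print(100 * mcc(y_true, y_pred))  # the same score on a 0-100 style scale
```

Unlike plain accuracy, MCC uses all four confusion-matrix cells, so it stays informative when one class is much rarer than the other.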
In embedding-based genomic analyses, METAGENE-1 excelled on the Gene-MTEB benchmark, achieving a global average score of 0.59. This performance underscores its adaptability in both zero-shot and fine-tuning scenarios, reinforcing its value in handling complex and diverse metagenomic data.
Conclusion
METAGENE-1 represents a thoughtful integration of artificial intelligence and metagenomics. By leveraging transformer architectures, the model offers practical solutions for biosurveillance and pandemic preparedness. Its open-source release invites researchers to collaborate and innovate, advancing the field of genomic science. As challenges related to emerging pathogens and global pandemics continue, METAGENE-1 demonstrates how technology can play a crucial role in addressing public health concerns effectively and responsibly.
Check out the Paper, Website, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project.
- 2025-01-08 15:05:21