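Structured generative AI enables generative AI models to produce output in a specific format. The approach prevents syntax errors by restricting token selection to valid options, which ensures executable queries and parseable data structures. In addition, consistent tokenization of punctuation and keywords simplifies the patterns the model must learn, reducing training time and improving accuracy. By leveraging knowledge of the output structure, structured generative AI offers a powerful tool for translating natural language into a variety of structured formats.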
Redefining Generative AI: Embracing Structure for Enhanced Output Precision
Introduction
Generative AI, a transformative technology revolutionizing natural language processing, has made significant strides in generating coherent and grammatically sound text. However, when it comes to producing structured output, such as SQL queries or JSON data, generative AI often falters, succumbing to errors that hinder the execution or parsing of the generated code.
Enter Structured Generative AI
To overcome this limitation, we introduce the concept of "structured generative AI," a technique that constrains the generative process to a predefined format. Because the structure of the output language is known in advance, only legitimate tokens are considered at each generation step, which rules out syntax errors and keeps the output valid.
Mechanism of Token Generation
Generative AI models, such as transformer architectures, generate tokens sequentially, relying on the input and previously generated tokens to determine the next selection. At each step, a classifier assigns probability values to all tokens in the vocabulary, guiding the selection of the next token.
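The selection step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual implementation: the five-token vocabulary and the logit values are made up, and greedy selection (always taking the most probable token) is only one of several possible decoding strategies.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_next_token(vocab, logits):
    """Greedy decoding: return the most probable token and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs


# Hypothetical vocabulary and scores for the step after "SELECT name".
vocab = ["SELECT", "name", ",", "FROM", "banana"]
logits = [0.1, 0.3, 2.0, 1.5, 0.2]

token, probs = pick_next_token(vocab, logits)
print(token)
```

Nothing in this step knows about SQL: if "banana" happened to receive the highest score, it would be selected.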
Constraining Token Generation
Structured generative AI uses knowledge of the output language's structure to limit token generation. Illegitimate tokens, such as incorrect punctuation or invalid keywords, have their scores (logits) set to negative infinity, which drives their probabilities to zero and excludes them from consideration. For instance, if the only valid continuation of "SELECT name" in a given query is a comma, every other token's score is set to negative infinity, ensuring that only a comma can be selected.
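The masking step can be illustrated with a small, self-contained sketch. The vocabulary, the scores, and the helper name `mask_logits` are assumptions made for this example. The point is only that a score of negative infinity becomes a probability of exactly zero after the softmax.

```python
import math

NEG_INF = float("-inf")


def mask_logits(vocab, logits, allowed):
    """Keep the score of every allowed token; set all others to -inf."""
    return [x if tok in allowed else NEG_INF for tok, x in zip(vocab, logits)]


def softmax(logits):
    """Turn scores into probabilities; exp(-inf) evaluates to 0.0."""
    peak = max(logits)  # requires at least one allowed (finite) score
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores in which an invalid token ("banana") ranks highest.
vocab = ["SELECT", "name", ",", "FROM", "banana"]
logits = [0.1, 0.3, 1.0, 1.5, 3.0]

masked = mask_logits(vocab, logits, allowed={",", "FROM"})
probs = softmax(masked)
choice = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
print(choice)
```

The remaining probability mass is redistributed among the allowed tokens, so the model's own preference between them is preserved.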
Implementation with Hugging Face
Hugging Face, a leading provider of pretrained models and tools for natural language processing, offers a convenient way to implement structured generative AI through its "logits processor" feature. This feature allows users to define a custom function that modifies the token probabilities after they have been calculated but before the final selection is made.
Example: SQL Query Generation
To demonstrate the power of structured generative AI, let's consider the task of generating SQL queries from natural language. We initialize a pretrained BART model and define a set of rules that specify which tokens are allowed to follow each other in a valid SQL query.
rules = {'': ['SELECT', 'DELETE'],  # beginning of the generation
         'SELECT': ['name', 'email', 'id'],  # names of columns in our schema
         'DELETE': ['name', 'email', 'id'],
         'name': [',', 'FROM'],
         'email': [',', 'FROM'],
         'id': [',', 'FROM'],
         ',': ['name', 'email', 'id'],
         'FROM': ['customers', 'vendors'],  # names of tables in our schema
         'customers': [''],
         'vendors': ['']}  # end of the generation
Using these rules, we create a logits processor that converts the rules into token IDs and modifies the token probabilities accordingly.
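A dependency-free sketch of such a processor follows. It is not the article's `SQLLogitsProcessor`: the class `ToySQLLogitsProcessor`, the `generate` loop, the one-ID-per-word vocabulary, and the random stand-in for the model's scores are all hypothetical. In the Transformers library, the real counterpart would subclass `LogitsProcessor` and implement `__call__(input_ids, scores)` on tensors, with the IDs supplied by the model's tokenizer.

```python
import random

NEG_INF = float("-inf")

# Transition rules from the article: token -> tokens allowed to follow it.
# The empty string marks both the start and the end of generation.
rules = {
    "": ["SELECT", "DELETE"],
    "SELECT": ["name", "email", "id"],
    "DELETE": ["name", "email", "id"],
    "name": [",", "FROM"],
    "email": [",", "FROM"],
    "id": [",", "FROM"],
    ",": ["name", "email", "id"],
    "FROM": ["customers", "vendors"],
    "customers": [""],
    "vendors": [""],
}

# Hypothetical toy vocabulary: one ID per rule token.
vocab = list(rules)
token_to_id = {tok: i for i, tok in enumerate(vocab)}


class ToySQLLogitsProcessor:
    """Callable shaped like a logits processor, but working on plain lists."""

    def __init__(self, rules, token_to_id):
        # Convert the string rules into ID rules once, up front.
        self.allowed = {
            token_to_id[prev]: {token_to_id[nxt] for nxt in nexts}
            for prev, nexts in rules.items()
        }

    def __call__(self, input_ids, scores):
        # Only tokens that may follow the last generated token keep their score.
        legit = self.allowed[input_ids[-1]]
        return [s if i in legit else NEG_INF for i, s in enumerate(scores)]


def generate(processor, score_fn, boundary_id, max_new_tokens=20):
    """Greedy decoding loop that applies the processor at every step."""
    ids = [boundary_id]
    for _ in range(max_new_tokens):
        scores = processor(ids, score_fn(ids))
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == boundary_id:  # the end marker was generated
            break
    return ids


rng = random.Random(0)


def random_scores(ids):
    """Stand-in for a model: arbitrary scores with no knowledge of SQL."""
    return [rng.random() for _ in vocab]


ids = generate(ToySQLLogitsProcessor(rules, token_to_id), random_scores, token_to_id[""])
tokens = [vocab[i] for i in ids]
print(" ".join(tokens).strip())
```

Even though the scores are random, every adjacent pair of tokens in the result obeys the rules, because any token that breaks them is masked out before selection.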
Results: Enhanced SQL Query Generation
Running the BART model with the logits processor yields significant improvements in the quality of generated SQL queries. The model now adheres to the predefined rules, producing syntactically correct queries that can be executed without errors.
import torch
from transformers import LogitsProcessorList

# tokenizer, pretrained_model and SQLLogitsProcessor are assumed to be defined beforehand
to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)
# restrict every generation step to the tokens the rules allow
logits_processor = LogitsProcessorList([SQLLogitsProcessor(tokenizer)])
out = pretrained_model.generate(
    torch.tensor(tokenized_text["input_ids"]),
    max_new_tokens=20,
    logits_processor=logits_processor)
The Significance of Tokenization
Tokenization, the process of converting text into a sequence of tokens, plays a crucial role in structured generative AI. Consistent tokenization ensures that similar concepts and punctuation are represented by the same token, simplifying the model's learning process. For instance, adding spaces before words and punctuation enhances consistency and reduces the complexity of patterns that the model needs to learn.
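The effect of consistent spacing can be shown with a toy example. Whitespace splitting is used here only as a stand-in for a real subword tokenizer, and the helper names are made up for this sketch. The underlying issue is similar in byte-level BPE tokenizers, where a word with and without a leading space generally maps to different tokens.

```python
import re


def naive_tokens(text):
    """Split on whitespace only, leaving punctuation glued to its neighbors."""
    return text.split()


def spaced_tokens(text):
    """Surround every comma with single spaces before splitting."""
    padded = re.sub(r"\s*,\s*", " , ", text)
    return padded.split()


# Two spellings of the same query that differ only in spacing around the comma.
query_a = "SELECT name, email FROM customers"
query_b = "SELECT name ,email FROM customers"

print(naive_tokens(query_a))
print(naive_tokens(query_b))
print(spaced_tokens(query_a))
print(spaced_tokens(query_b))
```

Without normalization, the same column shows up as `name,` in one query and `name` in the other, so a rule table keyed on tokens would need an entry for every variant. After normalization, both queries yield the same sequence.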
Applications of Structured Generative AI
The applications of structured generative AI extend far beyond SQL query generation. It empowers various tasks, including:
- JSON Data Extraction: Generating structured JSON data from natural language, enabling seamless data parsing and storage.
- Query Generation: Creating executable queries for various database systems, facilitating efficient information retrieval.
- Code Generation: Producing valid code snippets in different programming languages, accelerating software development.
Conclusion
Structured generative AI improves the precision and applicability of generative AI models. By incorporating knowledge of the output language's structure, it rules out syntax errors in the generated output, removing the main obstacle to executing or parsing it. This enables a wide range of applications, allowing users to extract information, generate queries, and produce code more efficiently and accurately.