Redefining Generative AI: Embracing Structure for Enhanced Output Precision

2024/04/19 08:06

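Structured generative AI enables generative models to produce output in a specific format. By restricting token selection to valid options, the approach prevents syntax errors and ensures that queries are executable and data structures are parsable. In addition, consistent tokenization of punctuation and keywords simplifies the patterns the model must learn, which reduces training time and improves accuracy. By exploiting knowledge of the output structure, structured generative AI offers a powerful tool for translating natural language into a variety of structured formats.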

Introduction

Generative AI has made significant strides in producing coherent, grammatically sound text. When asked for structured output such as SQL queries or JSON data, however, it often falters, making errors that prevent the generated code from being executed or parsed.

Enter Structured Generative AI

To overcome this limitation, we introduce "structured generative AI," a technique that constrains the generative process to a predefined format. Using knowledge of the output language's structure, it ensures that only legitimate tokens are considered at each generation step, which virtually eliminates syntax errors and keeps the output valid.

Mechanism of Token Generation

Generative AI models built on transformer architectures generate tokens sequentially, relying on the input and the previously generated tokens to determine the next one. At each step, a classifier assigns a score (logit) to every token in the vocabulary; these scores are converted into probabilities that guide the selection of the next token.
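
As a rough illustration of a single generation step, the sketch below turns a list of scores into probabilities with a softmax and picks the most likely token. The five-token vocabulary and the score values are made up for the example; a real model works over tens of thousands of tokens and may sample rather than pick greedily.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and made-up scores for one generation step.
vocab = ["SELECT", "name", ",", "FROM", "customers"]
logits = [0.2, 1.5, 2.9, 2.1, -0.3]

probs = softmax(logits)
# Greedy selection: take the token with the highest probability.
next_token = vocab[probs.index(max(probs))]
print(next_token)
```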

Constraining Token Generation

Structured generative AI incorporates knowledge of the output language's structure to limit token generation. Illegitimate tokens, such as incorrect punctuation or invalid keywords, have their scores (logits) set to negative infinity, so their probabilities become zero and they are excluded from consideration. For instance, if a valid SQL query requires a comma after "SELECT name," the scores of all other tokens are set to negative infinity, ensuring that only a comma can be selected.
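
To see why masking works, note that a score of negative infinity becomes a probability of exactly zero after the softmax. The sketch below is a minimal pure-Python illustration; the vocabulary, the scores, and the helper name `mask_logits` are assumptions made for the example, not part of any library.

```python
import math

def softmax(logits):
    """Convert scores into probabilities; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mask_logits(logits, allowed_ids):
    """Keep the scores of allowed tokens and set all others to -inf."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

# Hypothetical vocabulary; left alone, the model would prefer "FROM".
vocab = ["SELECT", "name", ",", "FROM", "customers"]
logits = [0.2, 1.5, 0.4, 2.1, -0.3]

# Suppose the grammar allows only a comma at this position.
allowed = {vocab.index(",")}
probs = softmax(mask_logits(logits, allowed))
print(dict(zip(vocab, probs)))
```

Because every disallowed token ends up with zero probability, the outcome is the same whether the next token is chosen greedily or by sampling.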

Implementation with Hugging Face

Hugging Face, a leading provider of pretrained models and tools for natural language processing, offers a convenient way to implement structured generative AI through its "logits processor" feature. This feature lets users define a custom function that modifies the token scores (logits) after the model has calculated them but before the final selection is made.
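
A minimal sketch of such a processor is shown here, assuming the `transformers` and `torch` packages are installed. `LogitsProcessor` and `LogitsProcessorList` are real classes in `transformers`; the subclass `AllowedTokensProcessor` and the token IDs it uses are hypothetical and serve only to show the interface.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class AllowedTokensProcessor(LogitsProcessor):
    """Hypothetical processor that leaves only a fixed set of token IDs selectable."""

    def __init__(self, allowed_ids):
        self.allowed_ids = list(allowed_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # scores has shape (batch_size, vocab_size): one score per vocabulary token.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0.0
        # Allowed tokens keep their score; every other token becomes -inf.
        return scores + mask

# Call the processor directly on dummy tensors; no model is needed for this check.
processors = LogitsProcessorList([AllowedTokensProcessor([2, 5])])
input_ids = torch.tensor([[0, 7]])   # tokens generated so far (made-up IDs)
scores = torch.randn(1, 10)          # made-up scores over a 10-token vocabulary
constrained = processors(input_ids, scores)
print(constrained)
```

In practice the list is passed to `generate` through its `logits_processor` argument, and the processor is then called once per generation step.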

Example: SQL Query Generation

To demonstrate the power of structured generative AI, let's consider the task of generating SQL queries from natural language. We initialize a pretrained BART model and define a set of rules that specify which tokens are allowed to follow each other in a valid SQL query.

rules = {'<s>': ['SELECT', 'DELETE'],  # <s> is BART's start token: beginning of the generation
         'SELECT': ['name', 'email', 'id'],  # names of columns in our schema
         'DELETE': ['name', 'email', 'id'],
         'name': [',', 'FROM'],
         'email': [',', 'FROM'],
         'id': [',', 'FROM'],
         ',': ['name', 'email', 'id'],
         'FROM': ['customers', 'vendors'],  # names of tables in our schema
         'customers': ['</s>'],
         'vendors': ['</s>']}  # </s> is BART's end token: end of the generation

Using these rules, we create a logits processor that converts the rules into token IDs and modifies the token probabilities accordingly.
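
The processor itself is not reproduced in this text, so the following is only a sketch of the idea under simplifying assumptions: a toy word-level vocabulary stands in for the real tokenizer, and a fixed score table (`toy_scores`) stands in for the model. The names `constrain`, `greedy_decode`, and `toy_scores` are hypothetical.

```python
import math

rules = {"<s>": ["SELECT", "DELETE"],
         "SELECT": ["name", "email", "id"],
         "DELETE": ["name", "email", "id"],
         "name": [",", "FROM"],
         "email": [",", "FROM"],
         "id": [",", "FROM"],
         ",": ["name", "email", "id"],
         "FROM": ["customers", "vendors"],
         "customers": ["</s>"],
         "vendors": ["</s>"]}

# Toy vocabulary: each word is one token (a real tokenizer may split words).
vocab = ["<s>", "</s>", "SELECT", "DELETE", "name", "email", "id",
         ",", "FROM", "customers", "vendors"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Convert the rules from strings to token IDs once, up front.
rules_ids = {token_to_id[k]: {token_to_id[t] for t in v} for k, v in rules.items()}

def constrain(input_ids, scores):
    """Set to -inf every token that may not follow the last generated token."""
    allowed = rules_ids.get(input_ids[-1], set())
    return [s if i in allowed else -math.inf for i, s in enumerate(scores)]

def toy_scores(input_ids):
    """Stand-in for a model: fixed scores that favor 'customers' at every step."""
    preferences = {"customers": 5.0, "email": 4.0, "SELECT": 3.0, "FROM": 2.0, "</s>": 1.0}
    return [preferences.get(tok, 0.0) for tok in vocab]

def greedy_decode(score_fn, max_new_tokens=20):
    ids = [token_to_id["<s>"]]
    for _ in range(max_new_tokens):
        scores = constrain(ids, score_fn(ids))
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if vocab[next_id] == "</s>":
            break
    return [vocab[i] for i in ids]

print(" ".join(greedy_decode(toy_scores)))
```

Without the constraint, the stand-in model would emit "customers" at every step; with it, the output follows the rules and ends at the end token.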

Results: Enhanced SQL Query Generation

Running the BART model with the logits processor markedly improves the quality of the generated SQL queries. The model now adheres to the predefined rules, producing syntactically correct queries that can be executed without syntax errors.

import torch
from transformers import LogitsProcessorList

# Assumes tokenizer and pretrained_model (the BART tokenizer and model) and the
# SQLLogitsProcessor class built from the rules above already exist.
to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)
logits_processor = LogitsProcessorList([SQLLogitsProcessor(tokenizer)])
out = pretrained_model.generate(
    torch.tensor(tokenized_text["input_ids"]),
    max_new_tokens=20,
    logits_processor=logits_processor)

The Significance of Tokenization

Tokenization, the process of converting text into a sequence of tokens, plays a crucial role in structured generative AI. Consistent tokenization ensures that the same keyword or punctuation mark is always represented by the same token, which simplifies the model's learning process. A tokenizer such as BART's may map a word to different tokens depending on whether a space precedes it, so adding a space before every word and punctuation mark improves consistency and reduces the complexity of the patterns the model needs to learn.
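
One way to apply this advice is to normalize the target text before tokenizing it. The helper below is a hypothetical preprocessing sketch, not taken from the article: it puts exactly one space before every word and punctuation mark, so that differently spaced inputs end up as the same string.

```python
import re

def normalize_sql(text):
    """Put exactly one space before every word and every punctuation mark."""
    # \w+ matches a word; [^\w\s] matches a single punctuation character.
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return "".join(" " + piece for piece in pieces)

compact = normalize_sql("SELECT name,email FROM customers")
spaced = normalize_sql("SELECT  name , email   FROM customers")
print(repr(compact))
```

Both inputs normalize to the same string, so a tokenizer sees one spelling of each keyword and comma instead of several.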

Applications of Structured Generative AI

The applications of structured generative AI extend far beyond SQL query generation. It empowers various tasks, including:

  • JSON Data Extraction: Generating structured JSON data from natural language, enabling seamless data parsing and storage.
  • Query Generation: Creating executable queries for various database systems, facilitating efficient information retrieval.
  • Code Generation: Producing valid code snippets in different programming languages, accelerating software development.

Conclusion

Structured generative AI substantially improves the precision and applicability of generative AI models. By incorporating knowledge of the output language's structure, it eliminates syntax errors, so the generated code can be parsed and executed. This opens up a wide range of applications, letting users extract information, generate queries, and produce code more efficiently and accurately.
