Structured generative AI enables generative AI models to produce output in a specific format. By restricting token selection to valid options, this approach prevents syntax errors and yields executable queries and parsable data structures. In addition, consistent tokenization of punctuation and keywords simplifies the patterns the model must learn, reducing training time and improving accuracy. By exploiting knowledge of the output's structure, structured generative AI provides a powerful tool for translating natural language into a variety of structured formats.
Redefining Generative AI: Embracing Structure for Enhanced Output Precision
Introduction
Generative AI, a transformative technology revolutionizing natural language processing, has made significant strides in generating coherent and grammatically sound text. However, when it comes to producing structured output, such as SQL queries or JSON data, generative AI often falters, succumbing to errors that hinder the execution or parsing of the generated code.
Enter Structured Generative AI
To overcome this limitation, we introduce the concept of "structured generative AI," a technique that constrains the generative process to predefined formats. By leveraging knowledge of the output language's structure, it ensures that only legitimate tokens are considered at each generation step, virtually eliminating syntax errors and guaranteeing valid output.
Mechanism of Token Generation
Generative AI models, such as transformer architectures, generate tokens sequentially, relying on the input and previously generated tokens to determine the next selection. At each step, a classifier assigns probability values to all tokens in the vocabulary, guiding the selection of the next token.
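As a concrete illustration of this loop, a minimal greedy decoding sketch is shown below. The gpt2 checkpoint is an arbitrary stand-in chosen for brevity (the article's own example uses BART) and the prompt is invented; the point is only that the model scores every vocabulary token and the highest-scoring one is appended and fed back in.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical setup: any causal language model works for illustrating the decoding loop.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

input_ids = tokenizer('SELECT name', return_tensors='pt').input_ids
for _ in range(5):
    logits = model(input_ids).logits[:, -1, :]           # one score per vocabulary token
    next_id = logits.argmax(dim=-1, keepdim=True)        # greedy pick of the most probable token
    input_ids = torch.cat([input_ids, next_id], dim=-1)  # feed it back in for the next step
print(tokenizer.decode(input_ids[0]))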
Constraining Token Generation
Structured generative AI incorporates knowledge of the output language's structure to limit token generation. Illegitimate tokens, such as misplaced punctuation or invalid keywords, have their scores (logits) set to negative infinity, effectively excluding them from consideration. For instance, if a valid SQL query requires a comma after "SELECT name," the scores of all other tokens are set to negative infinity, ensuring that only a comma can be selected.
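A toy sketch of this masking step follows; the four-token vocabulary and the score values are invented purely for illustration.

import torch

# Toy example: scores over a four-token vocabulary [',', 'FROM', 'name', 'email'] (values made up).
scores = torch.tensor([[0.7, 2.3, 1.1, 0.2]])

mask = torch.full_like(scores, float('-inf'))
mask[:, 0] = 0.0                      # only the comma (index 0) is legal after 'SELECT name'
constrained = scores + mask

print(constrained.softmax(dim=-1))    # all probability mass now falls on the comma

Adding negative infinity to a score drives its probability to zero after the softmax, which is why the constrained model can never emit an illegal token.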
Implementation with Hugging Face
Hugging Face, a leading provider of pretrained models and tools for natural language processing, offers a convenient way to implement structured generative AI through its "logits processor" feature. This feature allows users to define a custom function that modifies the token probabilities after they have been calculated but before the final selection is made.
Example: SQL Query Generation
To demonstrate the power of structured generative AI, let's consider the task of generating SQL queries from natural language. We initialize a pretrained BART model and define a set of rules that specify which tokens are allowed to follow each other in a valid SQL query.
rules = {'<s>': ['SELECT', 'DELETE'],          # beginning of the generation
         'SELECT': ['name', 'email', 'id'],    # names of columns in our schema
         'DELETE': ['name', 'email', 'id'],
         'name': [',', 'FROM'],
         'email': [',', 'FROM'],
         'id': [',', 'FROM'],
         ',': ['name', 'email', 'id'],
         'FROM': ['customers', 'vendors'],     # names of tables in our schema
         'customers': ['</s>'],
         'vendors': ['</s>']}                  # end of the generation
Using these rules, we create a logits processor that converts the rules into token IDs and modifies the token probabilities accordingly.
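The article does not reproduce the processor itself, so the sketch below is one way an SQLLogitsProcessor might look, built directly on the rules dictionary above. It assumes a batch size of 1 and that every rule word, prefixed with a space, maps to a single token id in the BART vocabulary; a robust implementation would also handle words that the tokenizer splits into several sub-tokens.

import torch
from transformers import LogitsProcessor

class SQLLogitsProcessor(LogitsProcessor):
    # Sketch: after each generated token, keep only the successors that `rules` allows.
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        specials = {tokenizer.bos_token: tokenizer.bos_token_id,
                    tokenizer.eos_token: tokenizer.eos_token_id}
        self.word_to_id = {}
        for successors in rules.values():
            for word in successors:
                if word in specials:
                    self.word_to_id[word] = specials[word]
                else:
                    # Simplifying assumption: each rule word (with a leading space) is one token.
                    self.word_to_id[word] = tokenizer(' ' + word, add_special_tokens=False)['input_ids'][0]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Assumes batch size 1: look up the last generated word and mask every illegal successor.
        last_word = self.tokenizer.decode([int(input_ids[0, -1])]).strip()
        allowed = rules.get(last_word)
        if allowed is None:                      # no rule for this token: leave the scores untouched
            return scores
        allowed_ids = [self.word_to_id[w] for w in allowed]
        mask = torch.full_like(scores, float('-inf'))
        mask[:, allowed_ids] = 0.0               # every token outside the rule set becomes impossible
        return scores + mask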
Results: Enhanced SQL Query Generation
Running the BART model with the logits processor yields significant improvements in the quality of generated SQL queries. The model now adheres to the predefined rules, producing syntactically correct queries that can be executed without errors.
to_translate = 'customers emails from the us'
words = to_translate.split()
tokenized_text = tokenizer([words], is_split_into_words=True, return_offsets_mapping=True)
logits_processor = LogitsProcessorList([SQLLogitsProcessor(tokenizer)])
out = pretrained_model.generate(
    torch.tensor(tokenized_text["input_ids"]),
    max_new_tokens=20,
    logits_processor=logits_processor)
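To read the constrained output back as text, the generated ids can be decoded; this usage note is an addition, not part of the original snippet.

print(tokenizer.decode(out[0], skip_special_tokens=True))   # e.g. a query of the form 'SELECT ... FROM ...'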
The Significance of Tokenization
Tokenization, the process of converting text into a sequence of tokens, plays a crucial role in structured generative AI. Consistent tokenization ensures that similar concepts and punctuation are represented by the same token, simplifying the model's learning process. For instance, adding spaces before words and punctuation enhances consistency and reduces the complexity of patterns that the model needs to learn.
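This effect is easy to verify with a byte-pair-encoding tokenizer such as BART's (the 'facebook/bart-base' checkpoint below is an assumed example): the same word receives different token ids depending on whether a space precedes it, so normalizing spacing keeps keywords mapped to consistent tokens.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('facebook/bart-base')
print(tok('FROM', add_special_tokens=False)['input_ids'])    # ids when the word starts the string
print(tok(' FROM', add_special_tokens=False)['input_ids'])   # different ids once a leading space is added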
Applications of Structured Generative AI
The applications of structured generative AI extend far beyond SQL query generation. It empowers various tasks, including:
- JSON Data Extraction: Generating structured JSON data from natural language, enabling seamless data parsing and storage.
- Query Generation: Creating executable queries for various database systems, facilitating efficient information retrieval.
- Code Generation: Producing valid code snippets in different programming languages, accelerating software development.
Conclusion
Structured generative AI is a groundbreaking technique that dramatically enhances the precision and applicability of generative AI models. By incorporating knowledge of the output language's structure, structured generative AI eliminates syntax errors and guarantees the executability of generated code. This breakthrough enables a wide range of applications, empowering users to extract information, generate queries, and produce code more efficiently and accurately.