The goal was to extract client names from legal documents using Named Entity Recognition (NER), so that each document could be linked to its client. Here's how I approached the task:
Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:
Approximate client name (e.g., "John Doe")
Precise client name (e.g., "Doe, John A.")
Approximate firm name (e.g., "Doe Law Firm")
Precise firm name (e.g., "Doe, John A. Law Firm")
About 5% of the documents didn't include any identifying entities.
Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.
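As a rough illustration of the 600/55/55 split described above, here is a minimal sketch. The file names, the fixed seed, and the `split_documents` helper are assumptions made for the example; the original write-up does not say how the documents were assigned to each set.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle document paths and cut them into train/validation/test lists."""
    if n_train + n_val + n_test != len(paths):
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDFs.
paths = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(paths)
print(len(train), len(val), len(test))  # 600 55 55
```

Splitting at the document level, as here, keeps every page of a given PDF in a single set.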
Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:
Mark the first token of an entity with a "B-" prefix.
Mark each subsequent token within the same entity with an "I-" prefix.
If a token doesn't belong to any entity, mark it as "O".
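The tagging rules above can be sketched in a few lines. This is only an illustration: it assumes whitespace tokenization and exact token matches, and the `CLIENT` label and the `bio_tags` helper are hypothetical names. The labeling in this project was done manually, precisely because real documents rarely match the spreadsheet text this cleanly.

```python
def bio_tags(tokens, entity_tokens, label="CLIENT"):
    """Tag every exact occurrence of entity_tokens inside tokens with BIO tags."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        window_is_free = all(t == "O" for t in tags[start:start + n])
        if window_is_free and tokens[start:start + n] == entity_tokens:
            tags[start] = f"B-{label}"          # first token of the entity
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = "Counsel for John Doe filed the motion".split()
entity = "John Doe".split()
for token, tag in zip(tokens, bio_tags(tokens, entity)):
    print(token, tag)
```

For this sentence, "John" receives `B-CLIENT`, "Doe" receives `I-CLIENT`, and every other token receives `O`.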
Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.
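To give a sense of the extra preparation a layout-aware model would involve, the sketch below rescales word boxes to a 0-1000 integer grid, which is the convention the Hugging Face LayoutLM documentation describes for its `bbox` input. The page size, the word boxes, and the `normalize_box` helper are assumptions for illustration; none of this was part of the project.

```python
def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box in page units to the 0-1000 integer range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical word boxes on a 612 x 792 point page (US Letter).
words = ["John", "Doe"]
boxes = [(72.0, 100.0, 110.0, 112.0), (114.0, 100.0, 140.0, 112.0)]
normalized = [normalize_box(b, 612, 792) for b in boxes]
for word, box in zip(words, normalized):
    print(word, box)
```

Every token would need such a box extracted from the PDF and kept aligned with its BIO tag, which is the additional data work referred to above.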
While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.
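A small sketch shows why a purely pattern-based approach gets ambiguous quickly. The sentence and the pattern are invented for illustration: the regex finds every name written in the "Lastname, Firstname M." style, but nothing in the pattern itself says which match is the client.

```python
import re

# Naive pattern for names in "Lastname, Firstname M." style (middle initial optional).
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+, [A-Z][a-z]+(?: [A-Z]\.)?")

# Invented sentence: a client, a lawyer, and an opposing party.
text = (
    "Doe, John A. is represented by Smith, Jane B. "
    "in the matter against Roe, Richard."
)

candidates = NAME_PATTERN.findall(text)
print(candidates)  # ['Doe, John A.', 'Smith, Jane B.', 'Roe, Richard']
```

Choosing the client among the three candidates would require further rules about the surrounding context, which is exactly the rule complexity the paragraph above anticipates.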