The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:
Data: I had a collection of legal documents in PDF format. The task was to identify the client mentioned in each document by one of the following identifiers:
- Approximate client name (e.g., "John Doe")
- Precise client name (e.g., "Doe, John A.")
- Approximate firm name (e.g., "Doe Law Firm")
- Precise firm name (e.g., "Doe, John A. Law Firm")
About 5% of the documents didn't include any identifying entities.
Dataset: To develop the model, I used 710 "true" PDF documents, split into three sets: 600 for training, 55 for validation, and 55 for testing.
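The split sizes above can be reproduced with a few lines of Python. This is only a minimal sketch: the post does not say how documents were assigned to each set, so the seeded random shuffle, the `split_documents` helper, and the `doc_000.pdf` style file names are all assumptions made for illustration.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle document paths and cut them into train/validation/test lists.

    The sizes default to the 600/55/55 split described in the text.
    The random shuffle and the seed are assumptions, not the author's method.
    """
    if len(paths) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDF documents.
paths = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(paths)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test set stable across runs, which matters when only 55 documents are held out.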
Labels: I was given an Excel file with the entities extracted as plain text; these had to be labeled manually in the document text. Using the BIO tagging format, I performed the following steps:
- Mark the first token of an entity with the "B-" prefix.
- Mark each subsequent token within the same entity with the "I-" prefix.
- If a token doesn't belong to any entity, mark it as "O".
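The BIO steps above can be sketched in code as follows. This is an illustrative example rather than the labeling procedure actually used: it assumes whitespace tokenization and exact token matches, and the `bio_tag` helper, the `CLIENT` label, and the sample sentence are hypothetical.

```python
def bio_tag(tokens, entity_tokens, label="CLIENT"):
    """Return one BIO tag per token, tagging every exact occurrence of the entity.

    `label` is a hypothetical entity type; the post does not name its labels.
    """
    tags = ["O"] * len(tokens)  # tokens outside any entity stay "O"
    n = len(entity_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == entity_tokens:
            tags[i] = f"B-{label}"  # first token of the entity
            for j in range(i + 1, i + n):
                tags[j] = f"I-{label}"  # remaining tokens of the same entity
            i += n
        else:
            i += 1
    return tags


# Made-up sentence; whitespace splitting is a simplifying assumption.
tokens = "This agreement is made with John Doe of Doe Law Firm".split()
tags = bio_tag(tokens, ["John", "Doe"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In practice, text extracted from PDFs rarely matches a spreadsheet entry token for token, which is consistent with the labels having to be placed manually.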
Alternative Approach: Models like LayoutLM, which also take the bounding boxes of the input tokens into account, could potentially improve performance on the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). Integrating a bounding-box-based model would have required even more time.
While regex and heuristics could in theory identify these simple entities, I expected that approach to be impractical: it would require overly complex rules to pick out the correct entity among other candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model can learn to distinguish the relevant entities, which makes such heuristics unnecessary.
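A small example illustrates the problem with the regex route. The pattern and the sentence below are made up for illustration and are not taken from the project's documents; the point is only that a naive "two capitalized words" rule cannot tell the client from the other people named in the text.

```python
import re

# Naive heuristic: two consecutive capitalized words look like a person's name.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

# Invented sentence with a client, a lawyer, and an opposing party.
text = "John Doe, represented by attorney Jane Smith, filed a claim against Richard Roe."

candidates = NAME_PATTERN.findall(text)
print(candidates)  # ['John Doe', 'Jane Smith', 'Richard Roe']
```

All three names match, so extra rules would be needed to decide which one is the client, and those rules would have to cover every document layout and phrasing.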