The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:
Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:
Approximate client name (e.g., "John Doe")
Precise client name (e.g., "Doe, John A.")
Approximate firm name (e.g., "Doe Law Firm")
Precise firm name (e.g., "Doe, John A. Law Firm")
About 5% of the documents didn't include any identifying entities.
Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.
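A split like the one described can be sketched in a few lines. This is only an illustration: the file names, the fixed seed, and the `split_documents` helper are assumptions, and the source does not say how documents were assigned to each set.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle document paths and cut them into train/validation/test lists."""
    if len(paths) != n_train + n_val + n_test:
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDFs.
paths = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(paths)
print(len(train), len(val), len(test))  # 600 55 55
```

Splitting at the document level keeps all tokens from one PDF in the same set, so the test score is not inflated by near-identical text seen during training.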
Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:
Mark the beginning of an entity with "B-" followed by the entity label.
Continue marking subsequent tokens within the same entity with "I-" followed by the same label.
If a token doesn't belong to any entity, mark it as "O".
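The three tagging rules above can be sketched as a small function. This is a minimal illustration, not the labeling procedure actually used: the `bio_tags` helper, whitespace tokenization, and the `CLIENT` label name are all assumptions, since the source does not give the label names.

```python
def bio_tags(tokens, entity_tokens, label="CLIENT"):
    """Return one BIO tag per token, marking every exact occurrence of entity_tokens."""
    tags = ["O"] * len(tokens)  # default: token belongs to no entity
    n = len(entity_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entity_tokens:
            tags[start] = f"B-{label}"  # first token of the entity
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


# Hypothetical sentence and entity string (label name "CLIENT" is assumed).
tokens = "The claim was filed by John Doe against the city".split()
tags = bio_tags(tokens, "John Doe".split())
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Exact matching like this only finds entities that appear in the text verbatim. An approximate name from the Excel file would not match and would still have to be labeled by hand.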
Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.
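To give a sense of the extra preparation such models need: LayoutLM-style models expect each token's bounding box scaled to integers in the 0 to 1000 range relative to the page size. The sketch below shows only that normalization step. The `normalize_bbox` helper and the example coordinates are assumptions, and extracting the boxes from the PDFs would be additional work not shown here.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) box in page units to the 0-1000 integer range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical token box on a US Letter page (612 x 792 points).
print(normalize_bbox((72.0, 90.0, 130.5, 102.0), 612, 792))
```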
While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.
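A small sketch illustrates the problem with rule-based matching. The pattern and the sample sentence are invented for illustration and are not taken from the project data.

```python
import re

# Naive pattern: two or three consecutive capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+){1,2}\b")

# Hypothetical sentence containing a lawyer, the client, an opposing party, and a judge.
text = (
    "Attorney Jane Roe represents John Doe in case 12345 "
    "against Richard Miles before Judge Mary Major."
)

candidates = NAME_PATTERN.findall(text)
print(candidates)
# ['Attorney Jane Roe', 'John Doe', 'Richard Miles', 'Judge Mary Major']
```

The pattern returns four candidates, and only one of them is the client. Picking the right one would require further rules about surrounding words such as "represents" or "against", and those rules would grow with every new document layout.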