Identifying Clients Associated with Legal Documents

2024/11/19 05:02

The main goal was to identify the client associated with each document through one of several identifiers.

The goal was to extract client names from legal documents using Named Entity Recognition (NER). Here's how I approached the task:

Data: I had a collection of legal documents in PDF format. The task was to identify the clients mentioned in each document using one of the following identifiers:

Approximate client name (e.g., "John Doe")

Precise client name (e.g., "Doe, John A.")

Approximate firm name (e.g., "Doe Law Firm")

Precise firm name (e.g., "Doe, John A. Law Firm")

About 5% of the documents didn't include any identifying entities.

Dataset: For developing the model, I used 710 "true" PDF documents, which were split into three sets: 600 for training, 55 for validation, and 55 for testing.

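The 600/55/55 split described above can be reproduced with a seeded shuffle. This is only a minimal sketch: the article does not say how the documents were assigned to each set, so the random shuffle, the `split_documents` helper, and the `doc_000.pdf` style file names are all assumptions made for illustration.

```python
import random


def split_documents(paths, n_train=600, n_val=55, n_test=55, seed=0):
    """Shuffle document paths and cut them into train/validation/test lists."""
    if n_train + n_val + n_test != len(paths):
        raise ValueError("split sizes must add up to the number of documents")
    shuffled = list(paths)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for the 710 PDF documents.
paths = [f"doc_{i:03d}.pdf" for i in range(710)]
train, val, test = split_documents(paths)
print(len(train), len(val), len(test))
```

Fixing the seed matters with validation and test sets this small, since a different shuffle could noticeably change the reported scores.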

Labels: I was given an Excel file with entities extracted as plain text, which needed to be manually labeled in the document text. Using the BIO tagging format, I performed the following steps:

Mark the beginning of an entity with "B-".

Continue marking subsequent tokens within the same entity with "I-".

If a token doesn't belong to any entity, mark it as "O".

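The three BIO rules above can be sketched in a few lines. The example assumes whitespace tokenization, exact token matching against the entity text from the Excel file, and a single hypothetical label name `CLIENT`; the article specifies none of these details, and the `bio_tags` helper is illustrative rather than the author's labeling code.

```python
def bio_tags(tokens, entity_tokens, label="CLIENT"):
    """Tag every exact occurrence of entity_tokens in tokens using BIO."""
    tags = ["O"] * len(tokens)  # tokens outside any entity stay "O"
    n = len(entity_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] != entity_tokens:
            continue
        if any(tag != "O" for tag in tags[start:start + n]):
            continue  # do not overwrite an occurrence that is already tagged
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, start + n):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


tokens = "The claimant John Doe appeared before the court .".split()
tags = bio_tags(tokens, ["John", "Doe"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Here "John" receives `B-CLIENT`, "Doe" receives `I-CLIENT`, and every other token receives `O`. Real documents would also need fuzzy matching, since the text extracted from a PDF rarely matches the spreadsheet entry character for character.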

Alternative Approach: Models like LayoutLM, which also consider bounding boxes for input tokens, could potentially enhance the performance of the NER task. However, I opted not to use this approach because, as is often the case, I had already spent the majority of the project time on preparing the data (e.g., reformatting Excel files, correcting data errors, labeling). To integrate bounding box-based models, I would have needed to allocate even more time.

While regex and heuristics could theoretically be applied to identify these simple entities, I anticipated that this approach would be impractical, as it would necessitate overly complex rules to precisely identify the correct entities among other potential candidates (e.g., lawyer name, case number, other participants in the proceedings). In contrast, the model is capable of learning to distinguish the relevant entities, rendering the use of heuristics superfluous.

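A small illustration of why a regex alone falls short: a pattern loose enough to find the client's name also returns every other person named in the text. The sample sentence and the `NAME_PATTERN` expression are invented for this sketch and do not come from the project's data.

```python
import re

# Naive pattern: two consecutive capitalized words.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")

text = (
    "The motion was filed by counsel Jane Roe on behalf of John Doe "
    "against Richard Miles."
)

candidates = NAME_PATTERN.findall(text)
print(candidates)
```

The pattern returns the lawyer, the client, and the opposing party alike. Choosing the client among them would require extra rules about surrounding phrases such as "on behalf of", and those rules multiply with every new document layout.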
