
MolE: A Transformer Model for Molecular Graph Learning

2024/11/12 18:04

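We introduce MolE, a transformer-based model for molecular graph learning. MolE works directly with molecular graphs by taking atom identifiers and graph connectivity as input. Atom identifiers are computed by hashing several atomic properties into a single integer, and graph connectivity is provided as a topological distance matrix. MolE uses the Transformer, which has previously been applied to graphs, as its base architecture. The performance of transformers is largely attributed to their extensive use of the self-attention mechanism. In standard transformers, the input tokens are embedded into queries, keys and values $Q,K,V\in {R}^{N\times d}$, which are then used to compute self-attention.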

MolE is a transformer model designed specifically for molecular graphs. It works directly with graphs, taking atom identifiers as input tokens and graph connectivity as relative position information. Atom identifiers are calculated by hashing different atomic properties into a single integer. In particular, this hash contains the following information:


- number of neighboring heavy atoms,

- number of neighboring hydrogen atoms,

- valence minus the number of attached hydrogens,

- atomic charge,

- atomic mass,

- attached bond types,

- and ring membership.

Atom identifiers (also known as atom environments of radius 0) were computed using the Morgan algorithm as implemented in RDKit.
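As a rough illustration of hashing several atomic properties into one integer, the sketch below packs the seven properties listed above into a tuple and hashes it with CRC32. The `AtomProps` type, its field names, and the choice of `zlib.crc32` are assumptions made for this example only. RDKit's Morgan implementation uses its own invariants and hash function, so the integers produced here will not match RDKit's identifiers.

```python
import zlib
from typing import NamedTuple, Tuple


class AtomProps(NamedTuple):
    """Hypothetical container for the properties listed above."""
    heavy_neighbors: int          # number of neighboring heavy atoms
    hydrogen_neighbors: int       # number of neighboring hydrogen atoms
    valence_minus_h: int          # valence minus attached hydrogens
    charge: int                   # atomic charge
    mass: int                     # atomic mass, rounded to an integer here
    bond_types: Tuple[int, ...]   # attached bond orders, sorted
    in_ring: bool                 # ring membership


def atom_identifier(props: AtomProps) -> int:
    """Hash all properties into a single 32-bit integer (illustrative only)."""
    canonical = repr(tuple(props)).encode("utf-8")
    return zlib.crc32(canonical)


# Two methyl-like carbons share an identifier; a hydroxyl oxygen gets another.
methyl_a = AtomProps(1, 3, 1, 0, 12, (1,), False)
methyl_b = AtomProps(1, 3, 1, 0, 12, (1,), False)
hydroxyl_o = AtomProps(1, 1, 1, 0, 16, (1,), False)

same_environment = atom_identifier(methyl_a) == atom_identifier(methyl_b)
different_environment = atom_identifier(methyl_a) != atom_identifier(hydroxyl_o)
```

Because the identifier depends only on an atom's own properties and its immediate bonds, atoms in equivalent local environments map to the same token, which keeps the input vocabulary small.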


In addition to tokens, MolE also takes graph connectivity information as input, which is an important inductive bias since it encodes the relative position of atoms in the molecular graph. In this case, the graph connectivity is given as a topological distance matrix $d$, where $d_{ij}$ corresponds to the length of the shortest path over bonds separating atom $i$ from atom $j$.
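The topological distance matrix can be sketched with a breadth-first search from every atom. The adjacency-list input format and the function name below are assumptions for illustration, not part of MolE's code.

```python
from collections import deque
from typing import Dict, List


def topological_distance_matrix(bonds: Dict[int, List[int]]) -> List[List[int]]:
    """Shortest-path length (in bonds) between every pair of atoms.

    `bonds` maps each atom index 0..N-1 to the indices of its bonded neighbors.
    Unreachable pairs (disconnected fragments) are marked with -1.
    """
    n = len(bonds)
    dist = [[-1] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in bonds[atom]:
                if dist[start][neighbor] == -1:
                    dist[start][neighbor] = dist[start][atom] + 1
                    queue.append(neighbor)
    return dist


# Heavy-atom graph of ethanol (C-C-O): atom 0 is bonded to 1, and 1 to 2.
ethanol = {0: [1], 1: [0, 2], 2: [1]}
d = topological_distance_matrix(ethanol)
# d == [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

In practice, RDKit's `Chem.GetDistanceMatrix` computes the same matrix directly from a molecule object.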


MolE uses a Transformer as its base architecture, which has also been applied to graphs previously. The performance of transformers can be attributed in large part to the extensive use of the self-attention mechanism. In standard transformers, the input tokens are embedded into queries, keys and values $Q,K,V\in {R}^{N\times d}$, which are used to compute self-attention as:

$${H}_{0}={\rm{softmax}}\left(\frac{Q{K}^{T}}{\sqrt{d}}\right)V$$


where ${H}_{0}\in {R}^{N\times d}$ are the output hidden vectors after self-attention, and $d$ is the dimension of the hidden space.


In order to explicitly carry positional information through each layer of the transformer, MolE uses the disentangled self-attention from DeBERTa:
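For reference, DeBERTa's disentangled attention can be written in the notation used here as shown below. This is a sketch based on the DeBERTa formulation, in which the attention score sums a content-to-content, a content-to-position and a position-to-content term and is scaled by $\sqrt{3d}$. The exact indexing of the position terms in MolE is an assumption.

$$a_{i,j}={Q}_{i}^{c}{({K}_{j}^{c})}^{T}+{Q}_{i}^{c}{({K}_{i,j}^{p})}^{T}+{K}_{j}^{c}{({Q}_{i,j}^{p})}^{T}$$

$${H}_{0}={\rm{softmax}}\left(\frac{a}{\sqrt{3d}}\right){V}^{c}$$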


where ${Q}^{c},{K}^{c},{V}^{c}\in {R}^{N\times d}$ are context queries, keys and values that contain token information (used in standard self-attention), and ${Q}_{i,j}^{p},{K}_{i,j}^{p}\in {R}^{N\times d}$ are the position queries and keys that encode the relative position of the $i^{\rm{th}}$ atom with respect to the $j^{\rm{th}}$ atom. The use of disentangled attention makes MolE invariant with respect to the order of the input atoms.
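The order-independence claim can be checked on a toy case. The sketch below is a single-head version with several simplifying assumptions that are not taken from MolE: identity projections (the token vectors serve as context queries, keys and values), one shared position vector per topological distance standing in for both position queries and keys, and a $\sqrt{3d}$ scaling as in DeBERTa. It shows that each atom's hidden vector is unchanged when the input atoms are reordered, provided the distance matrix is permuted the same way.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def disentangled_attention(
    tokens: List[Vector],        # N context vectors; reused as Q^c, K^c, V^c
    dist: List[List[int]],       # N x N topological distance matrix
    pos_table: List[Vector],     # one position vector per distance value
) -> List[Vector]:
    """Single-head disentangled attention with identity projections (toy version)."""
    n, d = len(tokens), len(tokens[0])
    scale = math.sqrt(3 * d)
    output = []
    for i in range(n):
        scores = []
        for j in range(n):
            pos = pos_table[dist[i][j]]          # stands in for Q^p_{i,j} and K^p_{i,j}
            score = (
                dot(tokens[i], tokens[j])        # content-to-content
                + dot(tokens[i], pos)            # content-to-position
                + dot(tokens[j], pos)            # position-to-content
            )
            scores.append(score / scale)
        peak = max(scores)                       # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        output.append(
            [sum(w / total * tokens[j][k] for j, w in enumerate(weights)) for k in range(d)]
        )
    return output


# Three atoms in a chain (0-1-2) with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
pos_table = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]   # indexed by distance 0, 1, 2

hidden = disentangled_attention(tokens, dist, pos_table)

# Reorder the atoms and permute the distance matrix accordingly.
perm = [2, 0, 1]
tokens_p = [tokens[p] for p in perm]
dist_p = [[dist[p][q] for q in perm] for p in perm]
hidden_p = disentangled_attention(tokens_p, dist_p, pos_table)

# Each atom receives the same hidden vector regardless of its input position.
same = all(
    math.isclose(hidden_p[new][k], hidden[old][k], abs_tol=1e-12)
    for new, old in enumerate(perm)
    for k in range(2)
)
```

Strictly speaking, the per-atom outputs are permutation-equivariant: they follow the atoms when the order changes. Any order-independent pooling over atoms then gives a molecule-level representation that is invariant, because position enters only through pairwise distances and never through absolute indices.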


As mentioned earlier, self-supervised pretraining can effectively transfer information from large unlabeled datasets to smaller labeled datasets. Here we present a two-step pretraining strategy. The first step is a self-supervised approach to learn chemical structure representation. For this we use a BERT-like approach in which each atom is randomly selected for masking with a probability of 15%; of the selected tokens, 80% are replaced by a mask token, 10% are replaced by a random token from the vocabulary, and 10% are left unchanged. Unlike BERT, the prediction task is not to predict the identity of the masked token, but to predict the corresponding atom environment (or functional atom environment) of radius 2, meaning all atoms that are separated from the masked atom by two or fewer bonds. It is important to keep in mind that we used different tokenization strategies for inputs (radius 0) and labels (radius 2), and that input tokens do not contain overlapping data of neighboring atoms, to avoid information leakage. This incentivizes the model to aggregate information from neighboring atoms while learning local molecular features. MolE learns via a classification task where each atom environment of radius 2 has a predefined label, in contrast to the Context Prediction approach, where the task is to match the embedding of atom environments of radius 4 to the embedding of context atoms (i.e., surrounding atoms beyond radius 4) via negative sampling.

The second step uses a graph-level supervised pretraining with a large labeled dataset. As proposed by Hu et al., combining node- and graph-level pretraining helps to learn local and global features that improve the final prediction performance. More details regarding the pretraining steps can be found in the Methods section.
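The masking scheme of the first pretraining step can be sketched as follows. The function names, the `MASK_TOKEN` value and the integer token ids are hypothetical. In MolE the prediction target is a predefined class label for the radius-2 atom environment, whereas this sketch only lists which atoms fall inside that environment.

```python
import random
from typing import List, Tuple

MASK_TOKEN = -1  # hypothetical id reserved for the mask token


def mask_atoms(
    tokens: List[int],
    vocabulary: List[int],
    rng: random.Random,
    select_prob: float = 0.15,
) -> Tuple[List[int], List[int]]:
    """Return the corrupted token list and the indices selected for prediction."""
    corrupted = list(tokens)
    selected = []
    for index in range(len(tokens)):
        if rng.random() >= select_prob:
            continue                                   # atom is not selected
        selected.append(index)
        roll = rng.random()
        if roll < 0.8:
            corrupted[index] = MASK_TOKEN              # 80%: mask token
        elif roll < 0.9:
            corrupted[index] = rng.choice(vocabulary)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, selected


def atoms_within_radius(dist: List[List[int]], center: int, radius: int = 2) -> List[int]:
    """Atoms separated from `center` by at most `radius` bonds (center included)."""
    return [j for j, steps in enumerate(dist[center]) if 0 <= steps <= radius]


# Toy molecule: a chain of five atoms, so the distance is the index difference.
tokens = [11, 42, 42, 7, 11]
chain_dist = [[abs(i - j) for j in range(5)] for i in range(5)]

corrupted, selected = mask_atoms(tokens, vocabulary=[7, 11, 42], rng=random.Random(0))
environment_of_first_atom = atoms_within_radius(chain_dist, center=0)   # [0, 1, 2]
```

Note that the input token of a selected atom carries only radius-0 information, so the model has to attend to neighboring atoms to recover the radius-2 environment.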


MolE was pretrained on an ultra-large database of ~842 million molecules from ZINC and ExCAPE-DB using a self-supervised scheme (with an auxiliary loss), followed by supervised pretraining with ~456K molecules (see Methods section for more details). We assess the quality of the molecular embedding by finetuning MolE on a set of downstream tasks. In this case, we use a set of 22 ADMET tasks included in the Therapeutic Data Commons (TDC) benchmark. This benchmark is composed of 9 regression and 13 binary classification tasks on datasets that range from hundreds (e.g., DILI with 475 compounds) to thousands of compounds (such as CYP inhibition tasks with ~13,000 compounds). An advantage of using this benchmark is

