Representation Learning of Biomedical Ontologies using Poincaré Embedding and Application to Genetic Risk Model

김재식

Advisor: 손경아

Affiliation: The Graduate School, Ajou University

Department: Department of Computer Engineering, Graduate School

Publication Year: 2021-08

Publisher: The Graduate School, Ajou University

Keyword: Poincaré ball; Polygenic risk score; Representation learning; Transformer

Description: Thesis (Master's)--The Graduate School, Ajou University: Department of Computer Engineering, 2021. 8

Alternative Abstract: Knowledge manipulation of Gene Ontology (GO) and Gene Ontology Annotation (GOA) relies primarily on vector representations of GO terms and genes. Previous studies have represented GO terms and genes or gene products in Euclidean space, using embedding methods such as Word2Vec-based approaches to turn entities into numeric vectors and measure their semantic similarity. However, embedding large graph-structured data in Euclidean space loses information about latent hierarchies, so the semantics of GO and GOA cannot be captured optimally. Hyperbolic spaces such as the Poincaré ball, on the other hand, are more suitable for modeling hierarchies: because of their negative curvature, distances increase exponentially toward the boundary. In this thesis, we propose hierarchical representations of GO and genes (HiG2Vec), obtained by applying Poincaré embedding, which is specialized in representing hierarchy, through a two-step procedure: GO embedding and gene embedding. Through experiments, we show that our model represents the hierarchical structure better than other approaches and predicts interactions between genes or gene products as well as or better than previous studies. The results indicate that HiG2Vec is superior to other methods both in capturing GO and gene semantics and in data utilization. As an effective downstream application of gene embeddings, we propose TransformerPRS, a deep learning model that uses a transformer module derived from language models, and compare it with the conventional polygenic risk score (PRS), a widely used risk scoring approach that derives a genetic risk for each individual from the sum of risk variants weighted by effect sizes from genome-wide association studies (GWASs).
In the experiments, TransformerPRS initialized with HiG2Vec showed better prediction performance than both TransformerPRS trained from scratch and the conventional PRS. In addition, the self-attention module in a transformer block identified important features and their interactions. Our models can improve genetic risk prediction by providing information on which genes and gene-gene interactions have an important impact on prediction, information that the conventional PRS does not capture.
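The two quantities the abstract leans on, distance in the Poincaré ball and the conventional PRS as a weighted sum, can be sketched as follows. This is a minimal illustration and not code from the thesis. The function names `poincare_distance` and `polygenic_risk_score` are hypothetical. The distance uses the standard formula for the unit Poincaré ball, and the PRS assumes each variant is coded as a risk-allele count (0, 1, or 2) paired with a GWAS effect size.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincaré ball.

    Standard formula: d(u, v) = arcosh(1 + 2*|u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))).
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def polygenic_risk_score(dosages, effect_sizes):
    """Conventional PRS: risk-allele counts weighted by GWAS effect sizes, summed.

    Assumption for illustration: dosages[i] is the risk-allele count (0, 1, 2)
    at variant i and effect_sizes[i] is that variant's effect size.
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


if __name__ == "__main__":
    # The same Euclidean step of 0.05 covers far more hyperbolic distance
    # near the boundary than near the origin.
    print(poincare_distance((0.0, 0.0), (0.05, 0.0)))
    print(poincare_distance((0.90, 0.0), (0.95, 0.0)))
    # Toy individual with three variants (values are made up for illustration).
    print(polygenic_risk_score([0, 1, 2], [0.1, -0.2, 0.3]))
```

Comparing the first two printed values shows why points near the boundary leave room for the many leaves of a hierarchy, while the PRS function makes explicit that the conventional score is purely additive and has no term for interactions between genes.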

Language: eng

URI: https://dspace.ajou.ac.kr/handle/2018.oak/20424

Type: Thesis
