Application of a common data model-based transfer learning method for developing clinical prediction models

김청수

DC Field	Value	Language
dc.contributor.advisor	박래웅	-
dc.contributor.author	김청수	-
dc.date.issued	2024-02	-
dc.identifier.other	33422	-
dc.identifier.uri	https://aurora.ajou.ac.kr/handle/2018.oak/39012	-
dc.description	Thesis (Ph.D.)--Department of Biomedical Sciences, 2024. 2	-
dc.description.abstract	Following the successful development and use of large language models, multimodal foundation models are gaining attention and are being explored in the medical field. Code-based structured data has been relatively underutilized in foundation model development. Although some algorithms have made it possible to build foundation models from such data, few studies have developed foundation models from real-world data sources and evaluated them through transfer learning across heterogeneous databases. In this study, we therefore aimed to develop foundation models using Korean healthcare data and to compare their performance with that of the traditional development process for a drug adverse event prediction model. We also evaluated the feasibility of transfer learning by applying the resulting foundation models to different databases. We developed foundation models using one electronic medical record database (AUSOM) and two health insurance claims data sources (HIRA-ADHD, NHIS-NSC). The CEHR-BERT algorithm was used for pretraining. As the downstream task, we defined a drug adverse event prediction problem: predicting sleep disorder in patients using methylphenidate for attention deficit/hyperactivity disorder. We measured the performance of fine-tuned models with a feed-forward (FF) or bidirectional long short-term memory (BiLSTM) final layer on the downstream task. As comparators, least absolute shrinkage and selection operator logistic regression (LLR) and extreme gradient boosting machine (XGBoost) models were also developed and compared with the fine-tuned models. We then transferred the foundation models to methylphenidate datasets from different data sources and fine-tuned them in the same way for sleep disorder prediction. We also exported the LLR and XGBoost models and applied them to the different datasets for external validation.
We developed three foundation models using the AUSOM, HIRA-ADHD, and NHIS-NSC databases. In the HIRA-ADHD database, the foundation model fine-tuned with BiLSTM showed the highest AUROC (91.31 [90.90-91.73] %) and AUPRC (74.11 %). In the AUSOM database, the LLR model performed best (AUROC 94.44 [91.54-97.34] %; AUPRC 68.89 %), and the model fine-tuned with BiLSTM showed comparable performance (AUROC 94.17 [93.18-95.16] %). In the NHIS-NSC database, the XGBoost model performed best (AUROC 90.20 [88.44-91.97] %; AUPRC 50.52 %). In the comparison between externally fine-tuned models based on the foundation models and externally validated traditional models, the XGBoost model developed on the HIRA-ADHD dataset performed best when applied to AUSOM data (AUROC 95.49 [92.79-98.20] %, AUPRC 78.71 %). However, when the model from NHIS-NSC data was applied to the HIRA-ADHD dataset, the BiLSTM model performed best (AUROC 91.14 [90.74-91.55] %, AUPRC 73.50 %). In this study, we developed foundation models using three healthcare databases in the Republic of Korea, applied them to develop a drug adverse event prediction model, and evaluated their performance. We found that they can outperform traditional machine learning methods in data-rich environments. However, performance after transferring a model to another dataset was not sufficient; further research on few-shot learning is needed.	-
dc.description.tableofcontents	I. Introduction
  A. Background
    1. Clinical prediction model
    2. Methodology for distributed research network and transfer learning
    3. Foundation model in medicine
    4. Observational Medical Outcomes Partnership Common Data Model
    5. Patient-Level Prediction Framework
  B. Objectives
II. Materials and Methods
  A. Data sources
  B. Overall process
  C. Model pretraining
  D. Downstream task
  E. Transfer learning
  F. Sensitivity analysis
III. Results
  A. Basic statistics for patient sequence and pretrained models
  B. Basic statistics for study populations
  C. Model performance
  D. Transfer learning performance
  E. Sensitivity analysis
    1. Epoch in pretraining
    2. Experienced knowledge in pretraining
    3. Training data in fine-tuning
    4. Incidence of outcome in the prediction task
IV. Discussion
  A. Main findings
    1. Pretraining data
    2. Pretraining algorithm
    3. Downstream task and evaluation
    4. Model transfer
    5. Sensitivity analysis
  B. Limitations
  C. Further research
V. Conclusion
References
Appendix
Abstract in Korean	-
dc.language.iso	eng	-
dc.publisher	The Graduate School, Ajou University	-
dc.rights	Ajou University theses are protected by copyright.	-
dc.title	Application of a common data model-based transfer learning method for developing clinical prediction models	-
dc.title.alternative	임상예측모형 개발을 위한 공통데이터모델 기반 전이학습 방법론의 적용	-
dc.type	Thesis	-
dc.contributor.affiliation	The Graduate School, Ajou University	-
dc.contributor.alternativeName	Chungsoo Kim	-
dc.contributor.department	Graduate School, Department of Biomedical Sciences	-
dc.date.awarded	2024-02	-
dc.description.degree	Doctor	-
dc.identifier.url	https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033422	-
dc.subject.keyword	health insurance claims data	-
dc.subject.keyword	common data model	-
dc.subject.keyword	foundation model	-
dc.subject.keyword	transfer learning	-
dc.subject.keyword	electronic medical record	-
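dc.description.alternativeAbstract	Following the successful development and use of large language models, multimodal foundation models are attracting attention. Structured data has been relatively underused in building foundation models, but some recent algorithms have made it possible to develop foundation models from structured data. However, few models have been developed from real-world data, and performance evaluation through transfer learning is lacking. In this study, we therefore developed foundation models by pretraining on Korean healthcare data and compared their performance with existing models in predicting sleep disorder as an adverse effect of methylphenidate, a treatment for attention deficit/hyperactivity disorder. We also sought to evaluate prediction model performance by transferring foundation models generated from several databases to different data.
We developed foundation models using one electronic medical record (AUSOM) and two health insurance claims data sources (HIRA-ADHD, NHIS-NSC). The CEHR-BERT algorithm was used for pretraining. As the downstream prediction problem, we chose drug adverse event prediction and designed a task to predict sleep disorder in methylphenidate users. For the drug adverse event prediction problem in each database (DB), we trained models in which the foundation model was fine-tuned with a single neural network (FF) or a bidirectional long short-term memory (BiLSTM) algorithm. As comparators, we developed the traditionally used Lasso logistic regression (LLR) and extreme gradient boosting machine (XGBoost) models, and measured the performance of each of the four models. We also compared the external validation results obtained by applying the LLR and XGBoost models to DBs other than the one in which they were developed with the results of fine-tuning the pretrained foundation models on external datasets.
Using the CEHR-BERT algorithm, we developed a foundation model for each of the AUSOM, HIRA-ADHD, and NHIS-NSC databases. When applied to the methylphenidate adverse event prediction problem, AUROC (91.31 [90.90-91.73] %) and AUPRC (74.11 %) were highest in the HIRA-ADHD database when the foundation model was fine-tuned with BiLSTM. Among the sleep adverse effect prediction models developed on the AUSOM database, LLR performed best (AUROC 94.44 [91.54-97.34] %; AUPRC 68.89 %), and the BiLSTM fine-tuned model was similar (AUROC 94.17 [93.18-95.16] %). When developed on NHIS-NSC data, the XGBoost model performed best (AUROC 90.20 [88.44-91.97] %; AUPRC 50.52 %). In the comparison of external validation and transfer learning across databases, the XGBoost model developed on HIRA-ADHD data performed best when applied to AUSOM data for external validation (AUROC 95.49 [92.79-98.20] %, AUPRC 78.71 %). However, when the model from NHIS-NSC data was applied to the HIRA-ADHD dataset, the BiLSTM model transferred from the NHIS-NSC foundation model performed best (AUROC 91.14 [90.74-91.55] %, AUPRC 73.50 %).
Through this study, we developed pretrained models using code-based structured data, developed drug adverse event prediction models, and compared them with existing machine learning methods. We found that they can outperform traditional machine learning methods in some data-rich environments. However, performance was not sufficient when models were transferred from large-scale data to small-scale data, and further research on few-shot learning is needed.	-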
