With easy access to a tremendous amount of text data, various studies utilizing text mining have been conducted in the biomedical field. However, most retrieve information solely from the perspective of either diseases or drugs. Going beyond this boundary, we propose an approach that embeds diseases and drugs from biomedical literature, determines direct relationships between them, and identifies possibilities for drug repositioning. To embed both diseases and drugs, we utilize the word2vec algorithm and generate an embedded word vector for each disease and drug. Hierarchical clustering with Ward's method is then applied for categorization. Moreover, we suggest an evaluation measure that compares clusters obtained from the text data with results at the molecular biology level. The proposed method was applied to 17,606,652 MEDLINE abstracts, from which 4,163 diseases and 3,930 drugs were extracted. By examining heterogeneous clusters, in which both diseases and drugs exist, nine candidate drugs were deduced for each disease, in combinations involving 79 diseases and 84 drugs. The results are expected to serve as a baseline for the preliminary selection of candidate drugs for drug repositioning.
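The clustering and candidate-selection steps outlined above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors stand in for word2vec embeddings, the entity names (`disease_a`, `drug_x`, and so on), the helper functions, and the choice of three clusters are all hypothetical, and the naive Ward merge loop is written for readability rather than for a corpus of MEDLINE scale.

```python
from itertools import combinations

# Toy 2-D vectors standing in for word2vec embeddings (hypothetical values).
EMBEDDINGS = {
    "disease_a": (0.0, 0.0),
    "drug_x": (0.1, 0.0),
    "disease_b": (5.0, 5.0),
    "drug_y": (5.1, 5.0),
    "drug_z": (10.0, 0.0),
}
# Hypothetical entity types; in practice these would come from a vocabulary.
ENTITY_TYPE = {
    "disease_a": "disease",
    "disease_b": "disease",
    "drug_x": "drug",
    "drug_y": "drug",
    "drug_z": "drug",
}


def centroid(members, vectors):
    """Mean vector of the named members."""
    dim = len(vectors[members[0]])
    return tuple(
        sum(vectors[m][d] for m in members) / len(members) for d in range(dim)
    )


def ward_cost(a, b, vectors):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a, vectors), centroid(b, vectors)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_clusters(vectors, n_clusters):
    """Naive agglomerative clustering using Ward's merge criterion."""
    clusters = [[name] for name in sorted(vectors)]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose merge adds the least variance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]], vectors),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def candidate_pairs(clusters, entity_type):
    """Disease-drug pairs that share a heterogeneous cluster."""
    pairs = []
    for members in clusters:
        diseases = [m for m in members if entity_type[m] == "disease"]
        drugs = [m for m in members if entity_type[m] == "drug"]
        pairs.extend((d, g) for d in diseases for g in drugs)
    return sorted(pairs)


clusters = ward_clusters(EMBEDDINGS, n_clusters=3)
pairs = candidate_pairs(clusters, ENTITY_TYPE)
print(pairs)
```

With these toy vectors, `drug_z` ends up in a cluster of its own, which is homogeneous and therefore yields no candidates, while each disease is paired with the drug that shares its cluster. In a realistic setting, the embeddings would be learned from the literature and an optimized hierarchical clustering routine would replace the quadratic pair search.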
ACKNOWLEDGMENT The authors gratefully acknowledge support from the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2003474), the BK21 FOUR program of the National Research Foundation of Korea funded by the Ministry of Education (NRF5199991014091), and the Ajou University research fund.