Design Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

CAETANO SERGIO PAULO

Advisor: Kiejin Park

Affiliation: Graduate School, Ajou University

Department: Department of Industrial Engineering, Graduate School

Publication Year: 2023-02

Publisher: The Graduate School, Ajou University

Keyword: Hadoop; Machine Learning; Natural Language Processing; Spark; Text Mining

Description: Thesis (Master's)--Ajou University Graduate School: Department of Industrial Engineering, 2023. 2

Alternative Abstract: Internet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become more popular than traditional paper documents for storing and distributing information. For ordinary users, the problem has become finding the desired information in such an environment. The sheer volume of available data also leads to high computational complexity in both time and memory, which the available hardware may not be able to handle. In 2020 the world saw an unprecedented global spread of coronavirus, and many public and private institutions acted quickly to make resources available, for example by opening big data repositories, so that the discovery process could be faster and more efficient. In this thesis we propose a self-based automatic text summarization approach that merges extensive big data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, whose intersection enables faster knowledge discovery and extraction. The problem consists of finding information in documents that supports or refutes a correlation between respiratory syndrome and weather in the spread of COVID-19, taking social, ethical, and media aspects into consideration. To discover, extract, and then summarize texts from a massive collection of papers, we designed an automatic knowledge discovery and extraction analytical pipeline system based on natural language processing, text mining techniques, and an unsupervised learning mechanism. The input dataset is first tokenized, that is, the text is broken into linguistically meaningful units called tokens. Words that occur frequently but do not contribute to the content of the text are then removed, followed by an N-gram algorithm that helps capture multiword phrases precisely. Next, the LDA model is used to discover the topics hidden in the set of text documents. Hadoop and Spark are then adopted to run the topic model and provide summarized text. Finally, terabyte (TB) scale input datasets were processed in the preprocessing stage.
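
The preprocessing steps named in the abstract (tokenization, removal of frequent low-content words, and N-gram generation) can be illustrated with a minimal single-machine sketch. This is not the thesis implementation: the function names, the tiny stop-word list, and the sample sentence are all hypothetical, and the LDA and Hadoop-Spark stages are not shown.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration);
# a real pipeline would use a much fuller list.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "to", "on"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Hyphenated terms such as "covid-19" are kept as a single token.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n=2):
    """Return all contiguous n-token phrases, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Tokenize, remove stop words, then append n-grams to the unigrams."""
    tokens = remove_stop_words(tokenize(text))
    return tokens + ngrams(tokens, n)


if __name__ == "__main__":
    print(preprocess("The spread of COVID-19 and the weather"))
```

The resulting unigrams and bigrams are the kind of terms a topic model such as LDA would take as input once converted to term counts.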

Language: eng

URI: https://dspace.ajou.ac.kr/handle/2018.oak/24262

Type: Thesis
