Design of a Knowledge Discovery and Extraction Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

Author(s)
CAETANO SERGIO PAULO
Advisor
Kiejin Park
Department
Department of Industrial Engineering, Graduate School
Publisher
The Graduate School, Ajou University
Publication Year
2023-02
Language
eng
Keyword
Hadoop; Machine Learning; Natural Language Processing; Spark; Text Mining
Alternative Abstract
Internet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become the dominant medium for storing and broadcasting information compared with traditional paper documents. In such an environment, the problem for ordinary users is finding the information they want: so much data is available that processing it incurs high computational cost in time and memory, and the available hardware may not be able to handle it. In 2020 the world faced the unprecedented global spread of the coronavirus, and many public and private institutions acted quickly to provide resources, such as open big data repositories, to make the discovery process faster and more efficient.

In this thesis we propose an automatic text summarization approach that merges big data modeling with information mapping and extraction, delivering both top-down and bottom-up searching and browsing, whose intersection enables faster knowledge discovery and extraction. The problem consists of finding documents that support or refute a correlation between respiratory syndrome, weather, and the spread of COVID-19, taking social, ethical, and media considerations into account. We discover, extract, and then summarize texts from a massive collection of documents using natural language processing, text mining techniques, and an unsupervised learning mechanism.

We designed an automatic knowledge discovery and extraction analytical pipeline system. The input dataset is first tokenized, breaking each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are removed; and an N-gram algorithm is then applied to capture multiword phrases precisely.
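The preprocessing steps described above (tokenization, stop-word removal, N-gram extraction) can be sketched in plain Python; the stop-word list and the sample sentence below are invented for illustration and are not taken from the thesis:

```python
import re

# A tiny illustrative stop-word list; a real pipeline would use a full one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Break raw text into lowercase word tokens (linguistically meaningful units)."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop high-frequency words that do not contribute to the content."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n=2):
    """Slide a window of size n over the tokens to capture multiword phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The spread of the respiratory syndrome in cold weather"
tokens = remove_stopwords(tokenize(text))
# tokens: ['spread', 'respiratory', 'syndrome', 'cold', 'weather']
print(ngrams(tokens, 2))
# prints ['spread respiratory', 'respiratory syndrome', 'syndrome cold', 'cold weather']
```

In the actual system these steps would be distributed across a Hadoop/Spark cluster; the sketch only shows the per-document logic.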
Next, an LDA model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and produce the summarized text. Terabyte-scale input datasets were processed in the preprocessing stage.
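The thesis runs LDA on Spark; as an illustration of the underlying topic model only (not the distributed Spark implementation), here is a minimal collapsed Gibbs sampler for LDA over tokenized documents — a toy sketch, with all document data invented for the example:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of word tokens.

    Returns per-document topic counts and per-topic word counts, from which
    the document-topic and topic-word distributions can be estimated.
    """
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]               # doc -> topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic -> word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for di, d in enumerate(docs):                      # random initialization
        zd = []
        for w in d:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                k = z[di][wi]                          # remove token's count
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample topic proportional to p(topic | everything else)
                weights = [(ndk[di][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[di][wi] = k                          # restore counts
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny "documents" about weather and symptoms, invented for illustration.
docs = [["rain", "wind", "rain", "cold"], ["cough", "fever", "cough"]]
ndk, nkw = lda_gibbs(docs, n_topics=2)
```

A production pipeline would instead feed term-count vectors to a distributed LDA implementation (such as Spark MLlib's) so that terabyte-scale corpora can be modeled; the sampler above only conveys what the topic model computes.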
URI
https://dspace.ajou.ac.kr/handle/2018.oak/24262
Appears in Collections:
Graduate School of Ajou University > Department of Industrial Engineering > 3. Theses(Master)

