Ajou University repository

Design Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks
  • CAETANO SERGIO PAULO
Advisor
Kiejin Park
Affiliation
Graduate School, Ajou University
Department
Graduate School, Department of Industrial Engineering
Publication Year
2023-02
Publisher
The Graduate School, Ajou University
Keyword
Hadoop; Machine Learning; Natural Language Processing; Spark; Text Mining
Description
Thesis (Master's)--Ajou University Graduate School: Department of Industrial Engineering, 2023. 2
Alternative Abstract
Internet resources have increased rapidly in recent years owing to advances in technology. Digital documents have become more popular than traditional paper documents for storing and disseminating information. For ordinary users, the problem has become finding the desired information in such an environment. In addition, the sheer volume of available data often causes great difficulty because of high computational complexity (time and memory), and the available hardware may not be able to handle it. In 2020 the world saw an unprecedented global spread of the coronavirus. In response, many public and private institutions acted quickly to make resources available, for example by opening big data repositories, so that the discovery process could become faster and more efficient.

In this thesis we propose a self-based automatic text summarization approach that merges extensive big data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, whose intersection supports faster knowledge discovery and extraction. The problem consists of finding information in the documents that supports or refutes a correlation between respiratory syndrome and weather in the spread of COVID-19, taking social, ethical, and media aspects into consideration. The aim is to discover, extract, and then summarize text from the massive amount of paper documents using natural language processing, text mining techniques, and unsupervised learning.

We designed an automatic knowledge discovery and extraction analytical pipeline system. The input dataset for the proposed system is first generated using tokenization, which breaks a text into linguistically meaningful units called tokens. Words that occur frequently but do not contribute to the content of the text (stop words) are then removed, followed by an N-gram algorithm that helps capture multiword phrases precisely. Next, the LDA (Latent Dirichlet Allocation) model is used to discover the topics hidden in a set of text documents. Hadoop and Spark are adopted to run the topic model and provide the summarized text. Finally, terabyte (TB)-sized input datasets were processed in the preprocessing stage.
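
The preprocessing steps named in the abstract (tokenization, removal of frequent non-content words, N-gram generation) can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis's implementation: the function names, the stop-word list, the tokenization rule, and the sample sentences are all hypothetical, and the Hadoop/Spark distribution and the LDA topic-modeling stage are not shown.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only;
# the abstract does not say which list the thesis uses.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "are", "on", "for", "with"}


def tokenize(text):
    """Lowercase the text and split it into tokens, keeping hyphenated terms whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens):
    """Drop frequent words that carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token phrases as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=2):
    """Tokenize, remove stop words, then append n-grams to the unigram list.

    Assumption: n-grams are built after stop-word removal, so a phrase may
    span a position where a stop word used to be.
    """
    tokens = remove_stop_words(tokenize(text))
    return tokens + ngrams(tokens, n)


# Made-up sample sentences, not taken from the thesis dataset.
docs = [
    "The spread of COVID-19 and the weather.",
    "Weather is correlated with respiratory syndrome.",
]

# Term counts over the whole collection; a bag-of-words table like this is
# the usual kind of input to a topic model such as LDA.
term_counts = Counter(term for doc in docs for term in preprocess(doc))
```

In a distributed setting the same per-document function would typically be mapped over the collection, with the resulting term counts passed to the topic model; how the thesis arranges this on Hadoop and Spark is not stated in the abstract.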
Language
eng
URI
https://dspace.ajou.ac.kr/handle/2018.oak/24262
Type
Thesis
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

File Download

  • There are no files associated with this item.