Design Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

DC Field Value Language
dc.contributor.advisorKiejin Park-
dc.contributor.authorCAETANO SERGIO PAULO-
dc.date.accessioned2025-01-22T06:53:42Z-
dc.date.available2025-01-22T06:53:42Z-
dc.date.issued2023-02-
dc.identifier.other32683-
dc.identifier.urihttps://dspace.ajou.ac.kr/handle/2018.oak/24262-
dc.descriptionThesis (Master's) -- Ajou University Graduate School: Department of Industrial Engineering, 2023. 2-
dc.description.tableofcontents
1 Introduction 1
 1.1 Background Information 2
  1.1.1 Programming Model 2
  1.1.2 Yarn Architecture 2
  1.1.3 Hadoop Distributed File System (HDFS) 3
  1.1.4 Cluster Computing (Spark) 4
  1.1.5 Searching the Text 4
  1.1.6 Natural Language Processing 6
  1.1.7 Capabilities of NLP 6
  1.1.8 Applications of NLP 7
  1.1.9 Text Preprocessing 7
  1.1.10 Techniques for Text Preprocessing 7
  1.1.11 LDA Topic Modeling 8
 1.2 Problem Description 10
2 Proposed Approach 11
 2.1 Knowledge Finding and Extraction 11
  2.1.1 Work Environment 14
 2.2 Keyword Query DataFrame 15
 2.3 LDA Modeling 17
  2.3.1 Loading Data 17
  2.3.2 Import DataFrame from HDFS 18
 2.4 Data Cleaning 18
  2.4.1 Lower Casing and Removing Punctuation 18
 2.5 Preparing Data for LDA Analysis (Data Preprocessing) 18
 2.6 Selection of Several Topics 19
3 Result 20
 3.1 Analyzing Model Results 20
 3.2 Results Exploration 22
4 Conclusion 35
 4.1 Future Studies and Drawbacks 35
5 References 36-
dc.language.isoeng-
dc.publisherThe Graduate School, Ajou University-
dc.rightsAjou University theses are protected by copyright.-
dc.titleDesign Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks-
dc.typeThesis-
dc.contributor.affiliationAjou University Graduate School-
dc.contributor.departmentGraduate School, Department of Industrial Engineering-
dc.date.awarded2023-02-
dc.description.degreeMaster-
dc.identifier.localIdT000000032683-
dc.identifier.urlhttps://dcoll.ajou.ac.kr/dcollection/common/orgView/000000032683-
dc.subject.keywordHadoop-
dc.subject.keywordMachine Learning-
dc.subject.keywordNatural Language Processing-
dc.subject.keywordSpark-
dc.subject.keywordText Mining-
dc.description.alternativeAbstractInternet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become more popular than traditional paper documents for storing and broadcasting information. In such an environment, ordinary users struggle to find the information they want. Moreover, so much data is available that its computational complexity (in time and memory) is often very high, and the available hardware may not be able to handle it. In 2020 the world saw an unprecedented global invasion of coronavirus, and many public and private institutions acted quickly to provide resources, such as open big-data repositories, to make the discovery process faster and more efficient.

We propose in this thesis an automatic text summarization approach that merges extensive big-data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, and intersects them for faster knowledge discovery and extraction. The problem consists of finding, in the documents, information that supports or denies a correlation between respiratory syndrome, weather, and the spread of COVID-19, taking social, ethical, and media aspects into consideration. To this end, texts are discovered, extracted, and summarized from a massive collection of paper documents using natural language processing, text-mining techniques, and an unsupervised learning mechanism.

We designed an automatic knowledge discovery and extraction analytical pipeline system. The input dataset is first tokenized, breaking each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are removed; and an N-gram algorithm is then applied to capture multiword phrases precisely. Next, an LDA model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and provide summarized text. In the preprocessing stage, TB-size input datasets were processed.-
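The thesis itself does not reproduce code in this record, but the preprocessing steps the abstract describes (tokenization, stop-word removal, N-gram extraction) can be sketched in plain Python. The function names and the tiny stop-word list below are illustrative assumptions, not the thesis's actual implementation; in the described pipeline these steps would run inside Spark over HDFS data.

```python
import re

# Illustrative stop-word list; the thesis does not specify which list was used.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to", "that"}

def tokenize(text):
    """Lower-case the text, strip punctuation, and split into word tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> spaces
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that occur frequently but carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Join every n consecutive tokens into one multiword phrase."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "The spread of COVID-19 and the respiratory syndrome in cold weather."
tokens = remove_stop_words(tokenize(doc))
print(tokens)      # content-bearing tokens only
print(ngrams(tokens, 2))  # bigram phrases such as "covid_19"
```

Capturing bigrams such as "covid_19" or "respiratory_syndrome" as single units is what lets the downstream topic model treat multiword phrases precisely rather than as unrelated words.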
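The LDA step can likewise be illustrated with a minimal collapsed Gibbs sampler over toy documents. This is a self-contained teaching sketch, not the thesis's implementation (which runs a topic model on Hadoop/Spark); the hyperparameters, the `toy_lda` name, and the toy corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict

def toy_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents.

    Returns the topic assignment of every token and the topic-word counts.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size

    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # topic totals

    # Random initial topic for every token.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional P(z = t | everything else), up to a constant.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                        # record new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, nkw

docs = [["virus", "spread", "weather"], ["weather", "cold", "virus"],
        ["media", "ethics", "social"], ["social", "media", "report"]]
z, nkw = toy_lda(docs, n_topics=2)
print([max(nkw[k], key=nkw[k].get) for k in range(2)])  # one top word per topic
```

In the pipeline described above, the same model would be fitted at scale with Spark's MLlib over an HDFS-resident corpus rather than with this single-machine sampler.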
Appears in Collections:
Graduate School of Ajou University > Department of Industrial Engineering > 3. Theses(Master)