Design Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks

DC Field Value Language
dc.contributor.advisorKiejin Park-
dc.contributor.authorCAETANO SERGIO PAULO-
dc.date.accessioned2025-01-22T06:53:42Z-
dc.date.available2025-01-22T06:53:42Z-
dc.date.issued2023-02-
dc.identifier.other32683-
dc.identifier.urihttps://dspace.ajou.ac.kr/handle/2018.oak/24262-
dc.descriptionThesis (Master's) -- Ajou University Graduate School: Department of Industrial Engineering, 2023. 2-
dc.description.tableofcontents
1 Introduction 1
 1.1 Background Information 2
  1.1.1 Programming Model 2
  1.1.2 Yarn Architecture 2
  1.1.3 Hadoop Distributed File System (HDFS) 3
  1.1.4 Cluster Computing (Spark) 4
  1.1.5 Searching the Text 4
  1.1.6 Natural Language Processing 6
  1.1.7 Capabilities of NLP 6
  1.1.8 Applications of NLP 7
  1.1.9 Text Preprocessing 7
  1.1.10 Techniques for Text Preprocessing 7
  1.1.11 LDA Topic Modeling 8
 1.2 Problem Description 10
2 Proposed Approach 11
 2.1 Knowledge Finding and Extraction 11
  2.1.1 Work Environment 14
 2.2 Keyword Query DataFrame 15
 2.3 LDA Modeling 17
  2.3.1 Loading Data 17
  2.3.2 Import DataFrame from HDFS 18
 2.4 Data Cleaning 18
  2.4.1 Lower Casing and Removing Punctuation 18
 2.5 Preparing Data for LDA Analysis (Data Preprocessing) 18
 2.6 Selection of Several Topics 19
3 Result 20
 3.1 Analyzing Model Results 20
 3.2 Results Exploration 22
4 Conclusion 35
 4.1 Future Studies and Drawbacks 35
5 References 36-
dc.language.isoeng-
dc.publisherThe Graduate School, Ajou University-
dc.rightsAjou University theses are protected by copyright.-
dc.titleDesign Knowledge Discover and Extractor Analytical Pipeline System for COVID-19 Research Based on Hadoop-Spark Big Data Frameworks-
dc.typeThesis-
dc.contributor.affiliationAjou University Graduate School-
dc.contributor.departmentGraduate School, Department of Industrial Engineering-
dc.date.awarded2023-02-
dc.description.degreeMaster-
dc.identifier.localIdT000000032683-
dc.identifier.urlhttps://dcoll.ajou.ac.kr/dcollection/common/orgView/000000032683-
dc.subject.keywordHadoop-
dc.subject.keywordMachine Learning-
dc.subject.keywordNatural Language Processing-
dc.subject.keywordSpark-
dc.subject.keywordText Mining-
dc.description.alternativeAbstractInternet resources have grown rapidly in recent years owing to advances in technology, and digital documents have become more popular than traditional paper documents for storing and broadcasting information. In such an environment, ordinary users struggle to find the information they want. Moreover, so much data is available that its computational complexity (in time and memory) is often very high, and the available hardware may not be able to handle it. In 2020 the world saw an unprecedented global invasion of coronavirus, and many public and private institutions acted quickly to provide resources, such as open big-data repositories, to make the discovery process faster and more efficient.

We propose in this thesis an automatic text summarization approach that merges extensive big-data modeling, information mapping, and extraction to deliver top-down and bottom-up searching and browsing, and intersects them for faster knowledge discovery and extraction. The problem consists of finding, in the documents, information that supports or denies a correlation between respiratory syndrome, weather, and the spread of COVID-19, taking social, ethical, and media aspects into consideration. To this end, texts are discovered, extracted, and summarized from a massive collection of paper documents using natural language processing, text-mining techniques, and an unsupervised learning mechanism.

We designed an automatic knowledge discovery and extraction analytical pipeline system. The input dataset is first tokenized, breaking each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are removed; and an N-gram algorithm is then applied to capture multiword phrases precisely. Next, an LDA model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and provide summarized text. In the preprocessing stage, TB-size input datasets were processed.-
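The thesis itself does not reproduce code in this record, but the preprocessing steps the abstract describes (tokenization, stop-word removal, N-gram extraction) can be sketched in plain Python. The function names and the tiny stop-word list below are illustrative assumptions, not the thesis's actual implementation; in the described pipeline these steps would run inside Spark over HDFS data.

```python
import re

# Illustrative stop-word list; the thesis does not specify which list was used.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to", "that"}

def tokenize(text):
    """Lower-case the text, strip punctuation, and split into word tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> spaces
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that occur frequently but carry little content."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Join every n consecutive tokens into one multiword phrase."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "The spread of COVID-19 and the respiratory syndrome in cold weather."
tokens = remove_stop_words(tokenize(doc))
print(tokens)      # content-bearing tokens only
print(ngrams(tokens, 2))  # bigram phrases such as "covid_19"
```

Capturing bigrams such as "covid_19" or "respiratory_syndrome" as single units is what lets the downstream topic model treat multiword phrases precisely rather than as unrelated words.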
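The LDA step can likewise be illustrated with a minimal collapsed Gibbs sampler over toy documents. This is a self-contained teaching sketch, not the thesis's implementation (which runs a topic model on Hadoop/Spark); the hyperparameters, the `toy_lda` name, and the toy corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict

def toy_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents.

    Returns the topic assignment of every token and the topic-word counts.
    """
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size

    ndk = [[0] * n_topics for _ in docs]           # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                            # topic totals

    # Random initial topic for every token.
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                        # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional P(z = t | everything else), up to a constant.
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                        # record new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, nkw

docs = [["virus", "spread", "weather"], ["weather", "cold", "virus"],
        ["media", "ethics", "social"], ["social", "media", "report"]]
z, nkw = toy_lda(docs, n_topics=2)
print([max(nkw[k], key=nkw[k].get) for k in range(2)])  # one top word per topic
```

In the pipeline described above, the same model would be fitted at scale with Spark's MLlib over an HDFS-resident corpus rather than with this single-machine sampler.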
Appears in Collections:
Graduate School of Ajou University > Department of Industrial Engineering > 3. Theses(Master)