Internet resources have grown rapidly in recent years owing to advances in technology. Digital documents have become more popular than traditional paper documents for storing and disseminating information. In such an environment, the problem for ordinary users has become finding the information they need. Moreover, so much data is available that processing it often incurs prohibitive computational costs in time and memory, and the available hardware may be unable to handle it. In 2020, the world faced an unprecedented global coronavirus pandemic. In response, many public and private institutions acted quickly to make resources available, for example by opening big data repositories, so that the discovery process could become faster and more efficient.

We propose in this thesis an automatic text summarization system that merges big data modeling with information mapping and extraction, delivering top-down and bottom-up searching and browsing, and their intersection, for faster knowledge discovery and extraction. The problem consists of finding, in the documents, information that supports or refutes a correlation between respiratory syndrome and weather in the spread of COVID-19, taking social, ethical, and media considerations into account. To this end, texts are discovered, extracted, and then summarized from a massive collection of papers using natural language processing, text mining techniques, and an unsupervised learning mechanism.

We designed an automatic knowledge discovery and extraction analytical pipeline. The input dataset is first processed with tokenization, breaking each text into linguistically meaningful units called tokens; words that occur frequently but do not contribute to the content of the text are then removed; and an n-gram algorithm is applied to capture multiword phrases precisely.
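
A minimal sketch of this preprocessing step in Python follows. It uses NLTK as one possible toolkit; the library choice, the English stop-word list, and the bigram setting (n=2) are illustrative assumptions, not details specified in the thesis.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams

nltk.download("punkt", quiet=True)      # tokenizer model
nltk.download("stopwords", quiet=True)  # stop-word lists

def preprocess(text, n=2):
    # Tokenization: break the text into linguistically meaningful units (tokens).
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    # Remove frequent words that contribute little to the content of the text.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # N-grams: join adjacent tokens to capture multiword phrases (bigrams here).
    return tokens + ["_".join(gram) for gram in ngrams(tokens, n)]

print(preprocess("Respiratory syndrome spreading may correlate with the weather."))
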
Next, an LDA (Latent Dirichlet Allocation) model is used to discover the topics hidden in the set of text documents, and Hadoop and Spark are adopted to run the topic model and provide the summarized text. Finally, for preprocessing, input datasets of terabyte (TB) size were processed.
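
The topic-modeling step on Spark could be sketched as below using PySpark's ml API, one natural way to run LDA on a Spark cluster. The toy in-memory corpus, the topic count k=2, and the iteration limit are assumptions made for illustration; in the thesis, the model runs over TB-scale data on Hadoop/Spark.

from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("covid-topic-discovery").getOrCreate()

# A toy stand-in for the preprocessed, TB-scale token lists.
docs = spark.createDataFrame(
    [(["respiratory", "syndrome", "weather", "spreading"],),
     (["media", "ethics", "social", "coverage"],)],
    ["tokens"],
)

# Turn token lists into term-count vectors, then fit LDA to uncover
# the latent topics hidden in the document set.
vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(docs)
model = LDA(k=2, maxIter=10).fit(vectorizer.transform(docs))

# Each discovered topic is reported as its highest-weighted terms, the raw
# material from which summarized text can be assembled.
model.describeTopics(3).show(truncate=False)
spark.stop()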