Ajou University repository

SciPDFindexer: Distributed Information Retrieval system using MapReduce
  • Murtazaev, Aziz
Advisor
Sangyoon Oh
Affiliation
Graduate School, Ajou University
Department
Department of Computer Engineering, Graduate School
Publication Year
2011-08
Publisher
The Graduate School, Ajou University
Description
Thesis (Master's), Graduate School, Ajou University: Department of Computer Engineering, August 2011
Alternative Abstract
Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide sub-second response times, made possible by efficient indexing of pages across the entire Web. Indexing becomes more challenging as the scale grows. Parallel techniques such as the MapReduce framework can help make large-scale indexing efficient. We target the problem of large-scale indexing of documents with a specific structure. We propose SciPDFindexer, a system for indexing and querying scientific papers in PDF using the MapReduce programming model in a distributed system. Unlike Web search engines, our target domain is scientific papers, which have a predefined structure: title, abstract, sections, and references. The proposed system parses a large number of scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. Our contributions are a distributed indexing scheme suited to the structure of scientific articles and a corresponding fully functional implementation that includes parsing, indexing, and querying logic. We show how our scheme differs from the distributed indexing scheme described in the original MapReduce paper. We also describe each part of the system in detail; in particular, besides the indexing scheme, we present our PDF parsing logic for scientific articles and the ranking model used in our system. We conducted three types of experimental evaluation on a cluster of nodes. First, we show that our distributed indexing scheme parallelizes efficiently and scales as nodes are added. Second, we identify the optimal MapReduce parameters for our system under the given conditions. Third, we show that our querying system provides sub-second response times for queries of various lengths.
Language
eng
URI
https://dspace.ajou.ac.kr/handle/2018.oak/10064
Type
Thesis
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
