Ajou University repository

Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl
  • TESSEMA BETHEL MELESSE
Citations (SCOPUS): 0

dc.contributor.advisor: Tae-Sun Chung
dc.contributor.author: TESSEMA BETHEL MELESSE
dc.date.issued: 2024-02
dc.identifier.other: 33564
dc.identifier.uri: https://aurora.ajou.ac.kr/handle/2018.oak/38863
dc.description: Thesis (Master's), Department of Artificial Intelligence, 2024. 2
dc.description.abstract: Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. (Illustrative sketches of the index-filtering and QLoRA steps follow this record.)
dc.description.tableofcontents:
1 Introduction
  1.1 Problem Definition
  1.2 Motivation and Significance
  1.3 Research Questions
  1.4 Proposed Method
  1.5 Contribution
  1.6 Organization
2 Related Works
  2.1 Multilingual Large Language Models
  2.2 Large Multilingual or Monolingual Datasets
  2.3 Common Crawl and Dataset Extraction
  2.4 Deduplication
  2.5 Low Resource Model Adaptation
3 Methods
  3.1 Data Collection Framework
    3.1.1 Index Filtering
    3.1.2 Extracting WARC Files
    3.1.3 Text Extraction
    3.1.4 Deduplication
  3.2 Low Resource Model Adaptation
4 Experimental Settings and Implementation Details
  4.1 Languages and Benchmark Datasets and Dataset Collection
    4.1.1 Dataset Collection
    4.1.2 Compute Requirements
    4.1.3 Languages
    4.1.4 Benchmark Datasets
  4.2 Models and Model Adaptation Settings
    4.2.1 Models
    4.2.2 Model Adaptation
  4.3 Evaluation Settings
    4.3.1 Language Modeling Evaluation
    4.3.2 Downstream Evaluation
      4.3.2.1 Question Answering Task
      4.3.2.2 Few Shot Prompting Evaluation
    4.3.3 Evaluation Metrics
5 Performance Evaluation
  5.1 Data Collection Evaluation
    5.1.1 UnifiedCrawl Amharic
    5.1.2 UnifiedCrawl for other Languages
    5.1.3 Dataset Comparison with other Datasets
  5.2 Method Evaluation
    5.2.1 Language Modeling Evaluation
    5.2.2 Downstream Few Shot Prompting
6 Ablation Studies
  6.1 Comparison with Full Finetuning
  6.2 Comparison with Training from Scratch
  6.3 Comparison on Downstream Supervised Training
7 Limitations and Future Works
8 Conclusion
dc.language.iso: eng
dc.publisher: The Graduate School, Ajou University
dc.rights: Ajou University theses are protected by copyright.
dc.title: Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl
dc.type: Thesis
dc.contributor.affiliation: Ajou University Graduate School
dc.contributor.department: Department of Artificial Intelligence, Graduate School
dc.date.awarded: 2024-02
dc.description.degree: Master
dc.identifier.url: https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033564
dc.subject.keyword: Adapters
dc.subject.keyword: Common Crawl
dc.subject.keyword: LLMs
dc.subject.keyword: Large Language Models
dc.subject.keyword: LoRA
dc.subject.keyword: Low-Resource Languages
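The data-collection step the abstract describes reduces to two cheap operations: filter Common Crawl's index by detected content language, then use each surviving record's byte offsets to fetch only the matching pages from the WARC archives, which is why the whole corpus can be mined with minimal compute. A minimal sketch of the filtering half, assuming the public CDX index API and client-side matching on the index's "languages" field; the crawl ID, URL pattern, and function name below are illustrative, not the thesis's pipeline, which processes full index dumps at scale:

```python
"""Sketch of the index-filtering idea: keep only Common Crawl index
records whose detected content language matches a target low-resource
language (Amharic, ISO 639-3 "amh")."""
import json
import requests

CRAWL = "CC-MAIN-2023-50"  # one crawl snapshot (illustrative)
INDEX_API = f"https://index.commoncrawl.org/{CRAWL}-index"
TARGET_LANG = "amh"        # Amharic

def matching_records(url_pattern="*.et", limit=100):
    """Yield index records whose detected languages include TARGET_LANG."""
    params = {"url": url_pattern, "output": "json", "limit": str(limit)}
    resp = requests.get(INDEX_API, params=params, timeout=60)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        record = json.loads(line)
        # Index records list detected languages as comma-separated codes.
        if TARGET_LANG in record.get("languages", "").split(","):
            yield record

if __name__ == "__main__":
    for rec in matching_records(limit=20):
        # filename/offset/length point into a WARC file, so the extraction
        # step can fetch just these bytes with an HTTP range request
        # instead of downloading whole crawl archives.
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```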
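For the adaptation step, QLoRA keeps the base model's weights frozen in 4-bit precision and trains only small low-rank adapter matrices, which is what holds VRAM usage to consumer-hardware levels. A minimal sketch using the Hugging Face transformers/peft/bitsandbytes stack; the model name, rank, and target modules below are illustrative assumptions rather than the thesis's reported settings:

```python
"""Sketch of QLoRA adaptation: load a multilingual LLM with 4-bit
quantized weights and attach trainable low-rank adapters."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "bigscience/bloom-560m"  # small multilingual base model (illustrative)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # freeze base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)  # tokenizes the collected text
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters receive gradients
```

Because only the adapter parameters are trainable, optimizer-state memory scales with the adapter size rather than with the full model.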


File Download

  • There are no files associated with this item.