Ajou University repository

Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl
  • TESSEMA BETHEL MELESSE
Citations (SCOPUS): 0

dc.contributor.advisor: Tae-Sun Chung
dc.contributor.author: TESSEMA BETHEL MELESSE
dc.date.issued: 2024-02
dc.identifier.other: 33564
dc.identifier.uri: https://aurora.ajou.ac.kr/handle/2018.oak/38863
dc.description: Thesis (Master's), Department of Artificial Intelligence, 2024. 2
dc.description.abstract: Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. (Illustrative sketches of the index-filtering and QLoRA steps follow this record.)
dc.description.tableofcontents:
1 Introduction
  1.1 Problem Definition
  1.2 Motivation and Significance
  1.3 Research Questions
  1.4 Proposed Method
  1.5 Contribution
  1.6 Organization
2 Related Works
  2.1 Multilingual Large Language Models
  2.2 Large Multilingual or Monolingual Datasets
  2.3 Common Crawl and Dataset Extraction
  2.4 Deduplication
  2.5 Low Resource Model Adaptation
3 Methods
  3.1 Data Collection Framework
    3.1.1 Index Filtering
    3.1.2 Extracting WARC Files
    3.1.3 Text Extraction
    3.1.4 Deduplication
  3.2 Low Resource Model Adaptation
4 Experimental Settings and Implementation Details
  4.1 Languages and Benchmark Datasets and Dataset Collection
    4.1.1 Dataset Collection
    4.1.2 Compute Requirements
    4.1.3 Languages
    4.1.4 Benchmark Datasets
  4.2 Models and Model Adaptation Settings
    4.2.1 Models
    4.2.2 Model Adaptation
  4.3 Evaluation Settings
    4.3.1 Language Modeling Evaluation
    4.3.2 Downstream Evaluation
      4.3.2.1 Question Answering Task
      4.3.2.2 Few Shot Prompting Evaluation
    4.3.3 Evaluation Metrics
5 Performance Evaluation
  5.1 Data Collection Evaluation
    5.1.1 UnifiedCrawl Amharic
    5.1.2 UnifiedCrawl for other Languages
    5.1.3 Dataset Comparison with other Datasets
  5.2 Method Evaluation
    5.2.1 Language Modeling Evaluation
    5.2.2 Downstream Few Shot Prompting
6 Ablation Studies
  6.1 Comparison with Full Finetuning
  6.2 Comparison with Training from Scratch
  6.3 Comparison on Downstream Supervised Training
7 Limitations and Future Works
8 Conclusion
dc.language.iso: eng
dc.publisher: The Graduate School, Ajou University
dc.rights: Ajou University theses are protected by copyright.
dc.title: Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl
dc.type: Thesis
dc.contributor.affiliation: Ajou University Graduate School
dc.contributor.department: Department of Artificial Intelligence, Graduate School
dc.date.awarded: 2024-02
dc.description.degree: Master
dc.identifier.url: https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033564
dc.subject.keyword: Adapters
dc.subject.keyword: Common Crawl
dc.subject.keyword: LLMs
dc.subject.keyword: Large Language Models
dc.subject.keyword: LoRA
dc.subject.keyword: Low-Resource Languages
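The data-collection step the abstract describes reduces to two cheap operations: filter Common Crawl's index by detected content language, then use each surviving record's byte offsets to fetch only the matching pages from the WARC archives, which is why the whole corpus can be mined with minimal compute. A minimal sketch of the filtering half, assuming the public CDX index API and client-side matching on the index's "languages" field; the crawl ID, URL pattern, and function name below are illustrative, not the thesis's pipeline, which processes full index dumps at scale:

```python
"""Sketch of the index-filtering idea: keep only Common Crawl index
records whose detected content language matches a target low-resource
language (Amharic, ISO 639-3 "amh")."""
import json
import requests

CRAWL = "CC-MAIN-2023-50"  # one crawl snapshot (illustrative)
INDEX_API = f"https://index.commoncrawl.org/{CRAWL}-index"
TARGET_LANG = "amh"        # Amharic

def matching_records(url_pattern="*.et", limit=100):
    """Yield index records whose detected languages include TARGET_LANG."""
    params = {"url": url_pattern, "output": "json", "limit": str(limit)}
    resp = requests.get(INDEX_API, params=params, timeout=60)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        record = json.loads(line)
        # Index records list detected languages as comma-separated codes.
        if TARGET_LANG in record.get("languages", "").split(","):
            yield record

if __name__ == "__main__":
    for rec in matching_records(limit=20):
        # filename/offset/length point into a WARC file, so the extraction
        # step can fetch just these bytes with an HTTP range request
        # instead of downloading whole crawl archives.
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])
```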
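For the adaptation step, QLoRA keeps the base model's weights frozen in 4-bit precision and trains only small low-rank adapter matrices, which is what holds VRAM usage to consumer-hardware levels. A minimal sketch using the Hugging Face transformers/peft/bitsandbytes stack; the model name, rank, and target modules below are illustrative assumptions rather than the thesis's reported settings:

```python
"""Sketch of QLoRA adaptation: load a multilingual LLM with 4-bit
quantized weights and attach trainable low-rank adapters."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL = "bigscience/bloom-560m"  # small multilingual base model (illustrative)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # freeze base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)
tokenizer = AutoTokenizer.from_pretrained(MODEL)  # tokenizes the collected text
model = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters receive gradients
```

Because only the adapter parameters are trainable, optimizer-state memory scales with the adapter size rather than with the full model.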


File Download

  • There are no files associated with this item.