| DC Field | Value | Language |
|---|---|---|
| dc.contributor.advisor | Tae-Sun Chung | - |
| dc.contributor.author | TESSEMA BETHEL MELESSE | - |
| dc.date.issued | 2024-02 | - |
| dc.identifier.other | 33564 | - |
| dc.identifier.uri | https://aurora.ajou.ac.kr/handle/2018.oak/38863 | - |
| dc.description | Thesis (Master's) -- Department of Artificial Intelligence, 2024. 2 | - |
| dc.description.abstract | Large language models (LLMs) underperform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that leveraging this data to fine-tune multilingual LLMs via efficient adapter methods (QLoRA) significantly boosts performance on low-resource languages, while minimizing VRAM usage. Our experiments show large improvements in language modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improve LLMs for low-resource languages using consumer hardware. Illustrative sketches of the data-collection and adapter fine-tuning steps follow this record. | - |
| dc.description.tableofcontents | 1 Introduction 1 <br> 1.1 Problem Definition 2 <br> 1.2 Motivation and Significance 3 <br> 1.3 Research Questions 6 <br> 1.4 Proposed Method 7 <br> 1.5 Contribution 9 <br> 1.6 Organization 10 <br> 2 Related Works 11 <br> 2.1 Multilingual Large Language Models 11 <br> 2.2 Large Multilingual or Monolingual Datasets 12 <br> 2.3 Common Crawl and Dataset Extraction 13 <br> 2.4 Deduplication 14 <br> 2.5 Low Resource Model Adaptation 15 <br> 3 Methods 17 <br> 3.1 Data Collection Framework 17 <br> 3.1.1 Index Filtering 18 <br> 3.1.2 Extracting WARC Files 19 <br> 3.1.3 Text Extraction 20 <br> 3.1.4 Deduplication 20 <br> 3.2 Low Resource Model Adaptation 21 <br> 4 Experimental Settings and Implementation Details 23 <br> 4.1 Languages, Benchmark Datasets, and Dataset Collection 23 <br> 4.1.1 Dataset Collection 23 <br> 4.1.2 Compute Requirements 24 <br> 4.1.3 Languages 24 <br> 4.1.4 Benchmark Datasets 25 <br> 4.2 Models and Model Adaptation Settings 25 <br> 4.2.1 Models 25 <br> 4.2.2 Model Adaptation 26 <br> 4.3 Evaluation Settings 26 <br> 4.3.1 Language Modeling Evaluation 27 <br> 4.3.2 Downstream Evaluation 27 <br> 4.3.2.1 Question Answering Task 27 <br> 4.3.2.2 Few Shot Prompting Evaluation 28 <br> 4.3.3 Evaluation Metrics 28 <br> 5 Performance Evaluation 29 <br> 5.1 Data Collection Evaluation 29 <br> 5.1.1 UnifiedCrawl Amharic 30 <br> 5.1.2 UnifiedCrawl for Other Languages 31 <br> 5.1.3 Dataset Comparison with Other Datasets 31 <br> 5.2 Method Evaluation 33 <br> 5.2.1 Language Modeling Evaluation 33 <br> 5.2.2 Downstream Few Shot Prompting 34 <br> 6 Ablation Studies 35 <br> 6.1 Comparison with Full Finetuning 35 <br> 6.2 Comparison with Training from Scratch 37 <br> 6.3 Comparison on Downstream Supervised Training 37 <br> 7 Limitations and Future Works 39 <br> 8 Conclusion 41 | - |
| dc.language.iso | eng | - |
| dc.publisher | The Graduate School, Ajou University | - |
| dc.rights | Ajou University theses are protected by copyright. | - |
| dc.title | Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl | - |
| dc.type | Thesis | - |
| dc.contributor.affiliation | The Graduate School, Ajou University | - |
| dc.contributor.department | Department of Artificial Intelligence, Graduate School | - |
| dc.date.awarded | 2024-02 | - |
| dc.description.degree | Master | - |
| dc.identifier.url | https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033564 | - |
| dc.subject.keyword | Adapters | - |
| dc.subject.keyword | Common Crawl | - |
| dc.subject.keyword | LLMs | - |
| dc.subject.keyword | Large Language Models | - |
| dc.subject.keyword | LoRA | - |
| dc.subject.keyword | Low-Resource Languages | - |
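
The abstract outlines a three-step data-collection pipeline (index filtering, WARC extraction, text extraction; Sections 3.1.1-3.1.4 in the table of contents). Below is a minimal sketch of that flow, reconstructed from the abstract alone: it queries Common Crawl's public columnar URL index for pages whose detected language is Amharic, fetches only the matching WARC byte ranges, and strips the HTML down to plain text. The crawl snapshot, the language code, and the tooling (DuckDB, warcio, trafilatura) are assumptions for illustration, not confirmed details of the thesis's implementation.

```python
# Illustrative sketch of the pipeline described in the abstract:
# (1) filter the Common Crawl columnar URL index by detected language,
# (2) fetch only the matching WARC byte ranges, (3) extract plain text.
# Snapshot, language code, and tooling are assumptions, not the thesis's
# confirmed stack.
import io

import duckdb
import requests
import trafilatura
from warcio.archiveiterator import ArchiveIterator

INDEX = ("s3://commoncrawl/cc-index/table/cc-main/warc/"
         "crawl=CC-MAIN-2023-50/subset=warc/*.parquet")

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Public bucket; depending on the DuckDB version, anonymous/unsigned S3
# access may need extra configuration.
con.execute("SET s3_region='us-east-1'")

# Step 1: index filtering -- keep rows whose detected languages include
# Amharic ('amh', ISO 639-3); content_languages is a comma-separated list.
rows = con.execute(f"""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM read_parquet('{INDEX}')
    WHERE content_languages LIKE '%amh%'
    LIMIT 10
""").fetchall()

# Steps 2-3: HTTP Range requests pull single records, never whole WARCs;
# trafilatura strips navigation/boilerplate HTML down to the main text.
for url, warc, offset, length in rows:
    resp = requests.get(
        f"https://data.commoncrawl.org/{warc}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        timeout=60,
    )
    for record in ArchiveIterator(io.BytesIO(resp.content)):
        if record.rec_type == "response":
            html = record.content_stream().read().decode("utf-8", "ignore")
            text = trafilatura.extract(html)
            if text:
                print(url, text[:120])
```

The Range requests are what keep compute and bandwidth minimal: only the records matching the index filter are ever downloaded, never full WARC files. A deduplication pass (Section 3.1.4) would follow; its method is not specified here, so it is omitted.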
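For the adaptation step, the abstract names QLoRA: the frozen base model is quantized to 4 bits and only low-rank adapter matrices are trained. A minimal sketch using the Hugging Face transformers/peft/bitsandbytes stack follows; the model name, LoRA rank, and target modules are illustrative assumptions rather than the thesis's reported configuration.

```python
# Minimal sketch of the adaptation step: load a multilingual causal LM
# in 4-bit (QLoRA) and train only low-rank adapters, keeping VRAM low.
# Model name, rank, and target modules are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "facebook/xglm-564M"  # assumption: any multilingual causal LM

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, as in QLoRA
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only adapter weights require gradients
# From here, standard causal-LM training (e.g. transformers.Trainer) on the
# collected monolingual corpus fine-tunes just the adapters.
```

Because gradients flow only through the adapter matrices while the base weights sit in 4-bit NF4, peak VRAM stays within consumer-GPU limits, matching the abstract's claim of minimizing VRAM usage.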