Ajou University repository

Democratizing LLM Adaptation via Monolingual Datasets from Common Crawl
  • TESSEMA BETHEL MELESSE
Citations (SCOPUS)
0

Advisor
Tae-Sun Chung
Affiliation
Graduate School, Ajou University
Department
Department of Artificial Intelligence, Graduate School
Publication Year
2024-02
Publisher
The Graduate School, Ajou University
Keyword
Adapters; Common Crawl; LLMs; Large Language Models; LoRA; Low-Resource Languages
Description
Master's thesis, Department of Artificial Intelligence, February 2024
Abstract
Large language models (LLMs) under-perform on low-resource languages due to limited training data. We present a method to efficiently collect text data for low-resource languages from the entire Common Crawl corpus. Our approach, UnifiedCrawl, filters and extracts Common Crawl using minimal compute resources, yielding monolingual datasets much larger than previously available sources. We demonstrate that fine-tuning multilingual LLMs on this data via efficient adapter methods (QLoRA) significantly boosts performance on the low-resource language while minimizing VRAM usage. Our experiments show large improvements in language-modeling perplexity and an increase in few-shot prompting scores. Our work and released source code provide an affordable approach to improving LLMs for low-resource languages using consumer hardware.
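
As a rough illustration of the extraction step described above, the sketch below queries Common Crawl's columnar URL index for pages detected as a single target language and then fetches one matching page by HTTP byte range. The crawl snapshot, the choice of Amharic as the target language, and the use of DuckDB and warcio are illustrative assumptions, not necessarily the exact pipeline of the thesis.

    import duckdb
    import requests
    from warcio.archiveiterator import ArchiveIterator

    # Illustrative choices: one crawl snapshot and Amharic (ISO 639-3 "amh")
    # as an example low-resource language.
    CRAWL = "CC-MAIN-2023-50"
    LANG = "amh"

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs; SET s3_region='us-east-1';")

    # The columnar cc-index has one row per capture, including the WARC file
    # name and byte range needed to fetch just that record later. Matching
    # content_languages exactly keeps pages detected as Amharic only.
    rows = con.execute(f"""
        SELECT url, warc_filename, warc_record_offset, warc_record_length
        FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl={CRAWL}/subset=warc/*.parquet')
        WHERE content_languages = '{LANG}'
        LIMIT 100
    """).fetchall()

    # Each WARC record is gzipped independently, so a Range request retrieves
    # a single page without downloading the whole multi-gigabyte archive.
    url, filename, offset, length = rows[0]
    resp = requests.get(
        f"https://data.commoncrawl.org/{filename}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
        stream=True,
    )
    for record in ArchiveIterator(resp.raw):
        html = record.content_stream().read()
        print(url, len(html))

Because only index rows and individual byte ranges are transferred, this style of extraction stays within modest compute and bandwidth budgets, which matches the abstract's claim of minimal resource usage.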
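The adapter fine-tuning step can likewise be sketched with Hugging Face transformers, peft, and bitsandbytes. The base model, LoRA rank, and target modules below are assumed values for illustration, not the thesis's reported configuration.

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Assumed multilingual base model; any multilingual causal LM could
    # stand in here.
    MODEL = "facebook/xglm-564M"

    # 4-bit NF4 quantization keeps the frozen base weights small enough for
    # consumer GPUs; only the LoRA adapter weights are trained.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb)
    model = prepare_model_for_kbit_training(model)

    # Assumed LoRA hyperparameters; target_modules names the attention
    # projections used in XGLM-style architectures.
    lora = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # prints the small trainable fraction

Training then proceeds as ordinary causal-language-model fine-tuning on the collected monolingual corpus, with VRAM dominated by the quantized frozen weights rather than optimizer state.
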
Language
eng
URI
https://aurora.ajou.ac.kr/handle/2018.oak/38863
Journal URL
https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000033564
File Download

  • There are no files associated with this item.