Fine-Tuning Open LLMs for Native Malayalam Cultural Understanding: A Comparative Study
Compares fine-tuning of open LLMs (Llama, Qwen, Gemma) on the Malayalam culture corpus, evaluating language fluency and cultural-knowledge accuracy against the base models and existing multilingual baselines.
Abstract
We present a comparative study of fine-tuning open large language models (Llama, Qwen, and Gemma) on our Malayalam culture corpus. We evaluate each fine-tuned model on language fluency, cultural-knowledge accuracy, and contextual understanding of Kerala-specific concepts, comparing it against its base configuration and against existing multilingual baselines.
1. Introduction
General-purpose LLMs trained on internet-scale data often lack depth in low-resource languages. Malayalam cultural knowledge, including art-form terminology, historical context, and idiomatic expressions, is poorly represented in their training data. Fine-tuning on domain-specific corpora can bridge this gap.