fine-tuning, llm, malayalam, llama, qwen, gemma | Draft

Fine-Tuning Open LLMs for Native Malayalam Cultural Understanding: A Comparative Study

Compares Llama, Qwen, and Gemma models fine-tuned on the Malayalam culture corpus, evaluating language fluency and cultural-knowledge accuracy against the base models and existing multilingual baselines.

Hrudu Shibu

Abstract

We present a comparative study of fine-tuning three open large language models (Llama, Qwen, and Gemma) on our Malayalam culture corpus. We evaluate each fine-tuned model on language fluency, cultural-knowledge accuracy, and contextual understanding of Kerala-specific concepts, comparing it against its base configuration and against existing multilingual baselines.

1. Introduction

General-purpose LLMs trained on internet-scale data often lack depth in low-resource languages. Malayalam cultural knowledge, including art-form terminology, historical context, and idiomatic expressions, is poorly represented in that training data. Fine-tuning on a domain-specific corpus can help close this gap.

2. Base Models

2.1 Llama (Meta)

2.2 Qwen (Alibaba)

2.3 Gemma (Google)

3. Fine-Tuning Approach

3.1 Data Preparation

3.2 Training Configuration

3.3 Instruction Tuning

4. Evaluation Methodology

4.1 Language Fluency Metrics

4.2 Cultural Knowledge Accuracy

4.3 Contextual Understanding

5. Results

6. Discussion

7. Conclusion

References