Mikav
dataset · corpus · malayalam · cultural-heritage · open-data · Draft

Building an Open Malayalam Culture Corpus: Collection, Cleaning, and Licensing of Low-Resource Heritage Data

Hrudu Shibu

Abstract

This paper documents the methodology for sourcing, cleaning, and licensing Malayalam text and cultural heritage data, including manuscripts, oral history transcriptions, and festival and art records, into an open, reusable dataset. We address three challenges specific to low-resource languages: IP ownership of cultural materials, quality control in digitisation, and building community trust for data contribution.

1. Introduction

Malayalam is spoken by over 38 million people, yet it remains underrepresented in large-scale NLP datasets. Cultural heritage data, such as art form documentation, festival records, and oral histories, is particularly scarce in digital, machine-readable formats. This paper presents our approach to building an open corpus that preserves the authenticity of source material while enabling AI applications.

2. Data Sources

2.1 Manuscript Collections

  • Palm leaf manuscripts (digitised)
  • Historical literary works in public domain
  • Government cultural archives

2.2 Oral History

  • Transcriptions of practitioner interviews
  • Recorded narratives from cultural institutions
  • Community-submitted oral traditions

2.3 Festival and Art Records

  • Temple festival documentation
  • Art form technique descriptions
  • Regional cultural calendars
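
To make the three source categories above concrete, the following is a minimal sketch of how a single corpus entry might be represented and serialised. It is illustrative only: the class names, the field names (`record_id`, `source_type`, `license`), and the JSON Lines output are assumptions made for this example, not the schema used by the corpus. The `SOURCE_TYPES` mapping simply restates the lists in Section 2.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class SourceCategory(Enum):
    """The three source categories listed in Section 2."""
    MANUSCRIPT = "manuscript"
    ORAL_HISTORY = "oral_history"
    FESTIVAL_ART = "festival_art"


# Source types per category, copied from the lists in Section 2.
SOURCE_TYPES = {
    SourceCategory.MANUSCRIPT: [
        "palm leaf manuscript (digitised)",
        "public domain literary work",
        "government cultural archive",
    ],
    SourceCategory.ORAL_HISTORY: [
        "practitioner interview transcription",
        "recorded narrative from cultural institution",
        "community-submitted oral tradition",
    ],
    SourceCategory.FESTIVAL_ART: [
        "temple festival documentation",
        "art form technique description",
        "regional cultural calendar",
    ],
}


@dataclass
class CorpusRecord:
    """Hypothetical record layout; every field name here is an assumption."""
    record_id: str
    category: SourceCategory
    source_type: str
    text: str
    license: str = "unknown"  # placeholder until a licence is assigned

    def __post_init__(self) -> None:
        # Reject source types that do not belong to the declared category.
        if self.source_type not in SOURCE_TYPES[self.category]:
            raise ValueError(
                f"{self.source_type!r} is not a known {self.category.value} source"
            )

    def to_json(self) -> str:
        row = asdict(self)
        row["category"] = self.category.value
        # ensure_ascii=False keeps Malayalam script readable in the output.
        return json.dumps(row, ensure_ascii=False)


# Placeholder entry; the text is a single Malayalam word, not real corpus data.
record = CorpusRecord(
    record_id="demo-0001",
    category=SourceCategory.ORAL_HISTORY,
    source_type="practitioner interview transcription",
    text="മലയാളം",
)
line = record.to_json()
print(line)
```

Writing one JSON object per line keeps the format easy to stream and diff. Validating the source type when the record is created is one way to catch mislabelled contributions early, although a real pipeline may need a richer provenance model than this sketch shows.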

3. Collection Methodology

4. Cleaning Pipeline

5. Licensing Framework

6. Challenges and Limitations

7. Conclusion and Future Work

References