Building an Open Malayalam Culture Corpus: Collection, Cleaning, and Licensing of Low-Resource Heritage Data
Documents the methodology for sourcing, cleaning, and licensing Malayalam text and cultural heritage data (manuscripts, oral history, festival and art records) as an open, reusable dataset, with attention to IP ownership and low-resource data challenges.
Abstract
This paper documents the methodology for sourcing, cleaning, and licensing Malayalam text and cultural heritage data, including manuscripts, oral history transcriptions, and festival and art records, and for releasing it as an open, reusable dataset. We address three challenges that arise when working with a low-resource language: IP ownership of cultural materials, quality control in digitisation, and building community trust for data contribution.
1. Introduction
Malayalam, despite being spoken by over 38 million people, remains underrepresented in large-scale NLP datasets. Cultural heritage data, such as art form documentation, festival records, and oral histories, is particularly scarce in digital, machine-readable formats. This paper presents our approach to building an open corpus that preserves the authenticity of the source material while enabling AI applications.
2. Data Sources
2.1 Manuscript Collections
- Palm leaf manuscripts (digitised)
- Historical literary works in the public domain
- Government cultural archives
2.2 Oral History
- Transcriptions of practitioner interviews
- Recorded narratives from cultural institutions
- Community-submitted oral traditions
2.3 Festival and Art Records
- Temple festival documentation
- Art form technique descriptions
- Regional cultural calendars
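One way to keep the three source categories above distinguishable downstream is to attach provenance and licence metadata to every text item. The Python sketch below is illustrative only and is not the schema used for the corpus: the names `SourceType`, `CorpusRecord`, `make_record`, and `malayalam_ratio`, the field names, and the licence string are all assumptions. It applies Unicode NFC normalisation and reports the share of characters that fall in the Malayalam Unicode block (U+0D00 to U+0D7F) as a rough script check.

```python
import unicodedata
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    # Categories mirror Sections 2.1 to 2.3; the string values are illustrative.
    MANUSCRIPT = "manuscript"
    ORAL_HISTORY = "oral_history"
    FESTIVAL_ART = "festival_art"


@dataclass(frozen=True)
class CorpusRecord:
    # Hypothetical record layout; every field name here is an assumption.
    record_id: str
    source_type: SourceType
    text: str
    license_id: str   # e.g. "public-domain"; the vocabulary is not defined in the paper
    contributor: str  # archive, institution, or community contributor


def malayalam_ratio(text: str) -> float:
    """Return the share of non-whitespace characters in the Malayalam block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(1 for c in chars if "\u0d00" <= c <= "\u0d7f")
    return in_block / len(chars)


def make_record(record_id: str, source_type: SourceType, raw_text: str,
                license_id: str, contributor: str) -> CorpusRecord:
    # NFC gives canonically equivalent character sequences a single form,
    # so the same word typed or OCR'd differently compares equal.
    text = unicodedata.normalize("NFC", raw_text).strip()
    return CorpusRecord(record_id, source_type, text, license_id, contributor)


record = make_record(
    "ms-0001", SourceType.MANUSCRIPT, "  മലയാളം  ",
    "public-domain", "example-archive",
)
print(record.source_type.value, malayalam_ratio(record.text))
```

A low `malayalam_ratio` could flag items that are mostly transliteration, English commentary, or OCR noise for manual review. Any threshold would need to be chosen empirically for each source category.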