Underdocumented language data corpus construction

Gabriela Caballero, UC San Diego & Lucien Carroll, Cisco

Growing concern about language endangerment, together with recent technological innovations, has driven increased development of language corpora built from primary data obtained through field research. However, linguistic analyses of underdocumented languages generally provide few details about the content and structure of the corpora on which they are based, and little contextualization of them. Kwaras was designed to fill this gap: it is a searchable interface that links WAV audio files, time-aligned annotations produced with ELAN (Sloetjes & Wittenburg 2008), and document metadata. This tutorial gives an overview of how ELAN and Kwaras work, and gives participants hands-on experience in annotating and assembling corpora for the description and analysis of phonological phenomena using these tools.
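
To make the idea of linking audio to time-aligned annotations concrete, the following is a minimal sketch of reading an ELAN annotation file (.eaf, an XML format) with Python's standard library. It is not the Kwaras implementation. The sample document, the media path, the tier name, and the function name read_aligned_annotations are all hypothetical and chosen only for illustration; the element and attribute names (TIME_SLOT, TIER, ALIGNABLE_ANNOTATION, and so on) follow the EAF format.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written EAF-like document (hypothetical content, for illustration only).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <HEADER TIME_UNITS="milliseconds">
    <MEDIA_DESCRIPTOR MEDIA_URL="file:///recordings/story01.wav" MIME_TYPE="audio/x-wav"/>
  </HEADER>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="3200"/>
  </TIME_ORDER>
  <TIER TIER_ID="transcription">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2" TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_aligned_annotations(eaf_text):
    """Return (media_url, tier_id, start_ms, end_ms, text) for each aligned annotation."""
    root = ET.fromstring(eaf_text)

    # The header records which media file the annotations refer to.
    media = root.find("HEADER/MEDIA_DESCRIPTOR")
    media_url = media.get("MEDIA_URL") if media is not None else None

    # Time slots map symbolic IDs to offsets (milliseconds) into the recording.
    # A slot may lack TIME_VALUE, in which case its offset is stored as None.
    slots = {}
    for slot in root.iter("TIME_SLOT"):
        value = slot.get("TIME_VALUE")
        slots[slot.get("TIME_SLOT_ID")] = int(value) if value is not None else None

    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((media_url, tier.get("TIER_ID"), start, end, text))
    return rows


if __name__ == "__main__":
    for row in read_aligned_annotations(SAMPLE_EAF):
        print(row)
```

Each resulting row pairs a stretch of the recording with its annotation text, which is the kind of record a searchable corpus interface can index and play back.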

The following forthcoming paper describes the main features of Kwaras:

Caballero, Gabriela, Lucien Carroll & Kevin Mach. (to appear). Accessing, managing and mobilizing an ELAN-based language documentation corpus: the Kwaras and Namuti tools. Language Documentation and Conservation.