Corpora and Archival Collections

A central part of linguistic research is the creation of corpora and other collections of language material, ideally accessible through a long-term preservation archive. Archived language corpora and documentary collections may be used for decades or even centuries, for academic research, language revitalization, and other purposes we cannot guess today; such material is the longest-lasting academic research product of many linguists. Current Berkeley faculty and students (and very recent PhD recipients) have been involved in creating the corpora and archival collections below.

Corpora of English speech

The UC Berkeley Up Project is a speech corpus and phonetic longitudinal study based on the "Up" series of documentary films by director Michael Apted, showing a set of individuals at seven-year intervals over a period of 42 years. Recordings in the corpus comprise 250 utterances; 11 speakers, 9 of whom have utterances from each represented age; 21,328 word tokens; 27,921 vowel tokens; and 41,284 consonant tokens. The corpus is described by Susanne Gahl, Emily Cibelli, Kathleen Hall and Ronald Sprouse, "The 'Up' corpus: A corpus of speech samples across adulthood", Corpus Linguistics and Linguistic Theory 10 (2014) 315–328.

  • [Ohio English, 2000] M. A. Pitt, L. Dilley, Keith Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier. 2007. Buckeye Corpus of Conversational Speech. 2nd release. Columbus, OH: Department of Psychology, Ohio State University. http://www.buckeyecorpus.osu.edu/.

The Buckeye Corpus of conversational speech contains high-quality recordings from 40 speakers in Columbus, Ohio, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer). Software for searching the transcription files is currently being written. The corpus is FREE for noncommercial uses.

Language documentation: Archival collections 

Linguists currently at Berkeley have mostly archived language documentation materials in the California Language Archive (UC Berkeley), the Archive of the Indigenous Languages of Latin America (University of Texas at Austin), and the Endangered Languages Archive (SOAS, University of London).

Scope and content: 133 recordings of Apache, Cahuilla, Chemehuevi, Havasupai, Mojave, Navajo, and Yavapai songs.

Scope and content: The Papers consist primarily of Leanne Hinton's notes and related documents and recordings from linguistics field methods classes held at the University of California, Berkeley and the University of California, San Diego. This includes materials for Navajo, Quechua, Ashaninka Campa, Hopi, Q'anjob'al, K'ichean, Mixtec, Yowlumne Yokuts, Paraguayan Guaraní, and Yucatec Maya. Also included are materials related to the Yahi Translation Project.

Scope and content: Linguistic field recordings: linguistic data; stories; ethnographic data; songs. Spanish glosses provided; some translation, discussion and explanations in Spanish.

  • [Ayacucho Quechua (quy), South Bolivian Quechua (quh), 1979] Robert Aronowitz, Karen Beeman, Ditri Daza, Leanne Hinton, Tom Larsen, Laura Runi, Margarida Salomao, Mark Seligman, David Leedom Shaul. Berkeley Field Methods: Quechua Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL 2015-02, http://dx.doi.org/doi:10.7297/X2CZ354M.

Scope and content: This collection consists of 12 digitized audio recordings focusing largely on the recitation of basic words and phrases, as well as some more complex sentences. It also includes dialogic and poetic texts.

Scope and content: This collection consists of audio recordings and scanned copies of field notes that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year. Some texts are included.

  • [Caquinte (cot), 2011-] Zachary O'Hagan, Antonina Salazar Torres, Joy Salazar Torres, Emilia Sergio Salazar, Miguel Sergio Salazar. Caquinte Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-13, http://cla.berkeley.edu/collection/11102.

Scope and content: Audio recordings of elicitation sessions, and of autobiographical and traditional narrative texts; field notes; ancillary documents.

Scope and content: Audio recordings of lexical and grammatical elicitation sessions, sociolinguistic surveys, narratives, and songs.

Scope and content: This collection consists of Guébie materials collected by Hannah Sande from October 2013 through July 2015 in the United States, Canada, and Côte d'Ivoire. Materials include sound recordings (e.g., of grammatical and lexical elicitation sessions and narratives with translations), field notes, and other relevant documentation. The content of elicitation sessions is described in the Description field.

Scope and content: 109 recordings of songs, discussion, and narrative.

Scope and content: Audio recordings and notes from fieldwork covering adverbial clauses and tense-aspect.

Scope and content: This collection was produced by the Iquito Language Documentation Project of the University of Texas at Austin. It consists of audio and video recordings, most of which have restricted access levels.

Scope and content: The collection consists mainly of field recordings made by Berkeley faculty and students with Karuk elders as well as younger language learners and second-language speakers. Most of the items in the collection are organized as follows: recordings made on a single research trip (on one or more days) are bundled together as digital assets of a single item. One item in the collection contains grant applications (e.g. for a National Science Foundation grant); another item contains handouts and posters from conference presentations by Berkeley project participants. The field recordings include a wide range of texts, text types, and methodologies (elicitation, free texts, responses to stimuli, discussion of legacy recordings); they cover a variety of linguistic topics (phonetics, phonology, morphology, syntax, and semantics).

  • [Kashaya (kju), 1989-1990] Eugene Buckley, David Gamon, Kira Hall, Leanne Hinton, Milton "Bun" Lucas, Robert L. Oswalt. Berkeley Field Methods: Kashaya Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-17, http://cla.berkeley.edu/collection/11106.

Scope and content: This collection consists of 32 digitized audio recordings that derive from elicitation sessions conducted during class meetings held throughout the course of the academic year.

  • [Kwak'wala (kwk), 2014-2015] Violet Bracic, Mildred Child, Ruby Dawson Cranmer, Katie Sardinha. Katie Sardinha's Kwak'wala Fieldwork Collection. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-11, http://dx.doi.org/doi:10.7297/X2K0727F.

Scope and content: The materials document Katie Sardinha's fieldwork on Kwak'wala during the period of 2014-2015, working with various Kwakwaka'wakw elders to study a range of semantic and syntactic topics. The collection includes original audio recordings and transcriptions of the recordings. Each transcription also contains a list of vocabulary encountered in the elicitation session. Main topics covered include the linguistic expression of causation, unaccusativity, psych verbs, intensifiers, and short narratives.

Scope and content: 33 recordings.

Scope and content: This collection includes primary materials (e.g., audio and video recordings), derived products (e.g., transcriptions and translations), and linguistic analyses of Máíhĩ̵̀kì produced by the Máíhĩ̵̀kì Project, which was launched in June 2010, and is currently ongoing (as of September 2015). File bundle 2013-02.141 contains an index that indicates the file bundle location of each media file and each of its associated annotation files as of September 13, 2015.

Scope and content: Nine notebooks.

Scope and content: This resource is derived from six notebooks containing hand-written texts authored by José Vargas Pereira (1 notebook, 4 of its texts in this deposit) and Haroldo Vargas Pereira (5 notebooks, 166 of their texts in this deposit) between February and May 2011. All the texts are written in Matsigenka with line-by-line free translations in Spanish (Peruvian Castellano). The authors are brothers; both are fluent in Matsigenka and Spanish. The texts include (1) traditional Matsigenka narratives, (2) more recent historical narratives, (3) auto-ethnographic descriptions of Matsigenka culture and practices, and (4) personal narratives. These notebooks (plus three more by Haroldo Vargas Pereira that are not included in this deposit) were solicited of the authors by Michael and Beier in order to help develop a corpus of Matsigenka texts, and were paid for as part of Michael's research project "Documentation and Analysis of Matsigenka, an endangered Amazonian Language", funded by a Hellman Family Faculty Fund Award in 2010. Typing of the hand-written texts was done primarily by Beier. The typed texts were reviewed line by line by either Michael or Beier in consultation with the authors, and were corrected as necessary. The reviewed and corrected texts were then parsed in FLEx (Fieldworks Language Explorer) by either Michael or O’Hagan.

Scope and content: Audio recordings of elicitation sessions targeting lexicon, grammar, and history; field notes.

Scope and content: This collection consists of audio recordings with transcriptions and translations.

  • [Nez Perce (nez), 1960-2014] Haruo Aoki, Agnes Moses, Sam Watters, Harry Wheeler, Ida James Wheeler, Elisabeth P. Wilson. Haruo Aoki Papers on the Nez Perce Language. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-12, http://cla.berkeley.edu/collection/11101.

Scope and content: These papers document the linguistic work of Haruo Aoki on the Nez Perce language, including materials related to his original fieldwork as well as materials he derived from other researchers’ recordings of Nez Perce. Aoki conducted fieldwork on Nez Perce during the summers of 1960 through 1972 at Kooskia and Kamiah, Idaho, during which time his primary consultants were Harry Wheeler, Ida James Wheeler, and Elizabeth P. Wilson. Included in this collection are Aoki’s original field notes and notebooks from this time period, containing vocabulary and elicited sentences; also included are grammatical notes, word lists, and research articles he derived from these materials. The collection also includes Haruo Aoki’s transcriptions, with glosses, of Nez Perce texts that were originally recorded by Sven Liljeblad and Deward E. Walker in 1966-1967. The primary consultants for these texts were Agnes Moses, Sam Watters, and Elizabeth P. Wilson. In addition to original work on Nez Perce, a range of other materials related to Aoki’s professional activities, personal life, and linguistic interests are also included in the collection. Of biographical relevance, the collection includes Aoki’s autobiography, correspondence relating to some of Aoki’s professional activities, and papers Aoki wrote on non-linguistic topics, specifically English and English literature and critical writing concerning Japanese cultural heritage (in Japanese). As well, the collection includes a large amount of material that was gathered from outside sources such as museums, societies, and libraries by Aoki throughout his research on various language families. These obtained materials include: papers, photocopies of notebooks, and historical documents on Nez Perce and other Sahaptian languages; primary materials on Molalla; Edward Sapir’s Takelma note cards; and materials concerning comparative work on Na-Dene and Sino-Tibetan. 
Finally, the collection includes Aoki’s work on a previously undescribed Nagasaki dialect of Japanese, including a set of notebooks and a research manuscript.

Scope and content: Linguistic field recordings: linguistic data; stories; ethnographic data; songs; additional ethnographic or ethnohistorical texts; conversation; reminiscences; untitled texts. Some portions have English glosses.

Collaborative research project stemming from a 2012-2013 UC Berkeley field methods course on the related Atlantic language Seereer, taught by Professor Peter Jenks. Baier and Merrill were students in the course.

  • [Omagua (omg), 2003-] Lazarina Cabudivo Tuisima, Manuel Cabudivo Tuisima, Amelia Huanaquiri Tuisima, Arnaldo Huanaquiri Tuisima, Alicia Huanío Cabudivo, Lino Huanío Cabudivo, Lev Michael, Zachary O'Hagan, Clare S. Sandy, Tammy Stark, Vivian Wauters. Materials of the Omagua Documentation Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-01, http://cla.berkeley.edu/collection/11088.

Scope and content: Audio recordings of elicitation sessions and narrative texts; field notes; written narrative texts; derivative products (e.g., theses, dictionary drafts, conference handouts, etc.); preliminary grammatical descriptions; FLEx back-ups; historical and genealogical materials; grant proposals and budgets; personal correspondence; research products on colonial-era Old Omagua (OOMG) and Proto-Omagua-Kokama (POK).

  • [Omurano (omu), 2011-2013] Rafael Inuma Macusi, Simón Inuma Manizari, Teolinda Inuma Vela, Jorge Macusi Nuribe, José Manuel Macusi Nuribe, Juan Macusi Nuribe, Francisco Murayari Macusi, Zachary O'Hagan. Omurano Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-14, http://dx.doi.org/doi:10.7297/X29K488M.

Scope and content: Audio recordings, scanned field notes, and photographs that derive from interviews and elicitation sessions concerning the Omurano language and the regional history of the Urituyacu river basin.

  • [Omurano (omu), Urarina (ura), 2013] Oscar Inuma Macusi, Juana Macusi Manizari, Jorge Macusi Nuribe, Zachary O'Hagan, Elsa Vela Clemente, Ignacio Vela Nuribe. Songs from the Urituyacu River. Survey of California and Other Indian Languages, UC Berkeley, SCL 2016-05, http://dx.doi.org/doi:10.7297/X2GH9FZD.

Scope and content: Twelve songs, some with multiple versions, in Urarina and Omurano, from five singers.

Scope and content: This collection is made up of a series of bundles that minimally contain media files (audio and/or video recordings) and metadata. ELAN files and PDF documents are also included. 

  • [Q'anjob'al (kjb), 1986-1987] Natasha Beery, Norine Berenz, David J. Costa, Ernesto Diaz-Couder, Leanne Hinton, Jeanie Lerner, Birch Moonwomon, Tony Moy, Rafael Pascual, Jean Perry. Berkeley Field Methods: Kanjobal Sound Recordings. Berkeley Language Center, UC Berkeley, 2016-01, http://dx.doi.org/doi:10.7297/X2639MQQ.

Scope and content: This collection consists of 121 digitized audio recordings that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year.

Scope and content: Linguistic field recordings: linguistic data; ethnographic data. Glosses provided in Spanish; discussion and explanations in Spanish.

Scope and content: Audio recordings of elicitation sessions; field notes; grant proposal and final report; collection guide.

  • [Southeastern Pomo (pom), 2006-2007] Eugenia Antic, Charles Chang, Thera Crane, Donna Fenton, Hannah Haynie, Leanne Hinton, Jisup Hong, Shira Katseff, Loretta Kelsey, Russell Lee-Goldman, Lindsey Newbold, Marta Piqueras-Brunet, Yao Yao. Berkeley Field Methods: Southeastern Pomo Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL LA 252, http://dx.doi.org/doi:10.7297/X2JD4TTV.

Scope and content: This collection consists of 188 audio recordings that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year.

Scope and content: Audio recordings of elicitation sessions targeting lexicon, grammar, and history; texts; field notes; previously published materials.

Scope and content: Primary materials (e.g., audio recordings), derived products (e.g., transcriptions and translations), and analyses of Ticuna.

Scope and content: Linguistic field recordings: linguistic data. Some English glosses provided.