A central part of linguistic research is the creation of corpora and other collections of language material, ideally accessible through a long-term preservation archive. Archived language corpora and documentary collections may be used for decades or even centuries, for academic research, language revitalization, and other purposes we cannot guess today; such material is the longest-lasting academic research product of many linguists. Current Berkeley faculty and students have been involved in creating the corpora and archival collections below.
Corpora of English speech
- [British English, 1977-1998] Emily Cibelli, Susanne Gahl, Kathleen Hall, Ronald Sprouse. "Up" Corpus. Department of Linguistics, UC Berkeley. http://linguistics.berkeley.edu/~up/.
The UC Berkeley Up Project is a speech corpus and phonetic longitudinal study based on the "Up" series of documentary films by director Michael Apted, showing a set of individuals at seven-year intervals over a period of 42 years. Recordings in the corpus comprise 250 utterances; 11 speakers, 9 of whom have utterances from each represented age; 21,328 word tokens; 27,921 vowel tokens; and 41,284 consonant tokens. The corpus is described by Susanne Gahl, Emily Cibelli, Kathleen Hall and Ronald Sprouse, "The 'Up' corpus: A corpus of speech samples across adulthood", Corpus Linguistics and Linguistic Theory 10 (2014) 315-328.
- [Ohio English, 2000] M. A. Pitt, L. Dilley, Keith Johnson, S. Kiesling, W. Raymond, E. Hume, and E. Fosler-Lussier. 2007. Buckeye Corpus of Conversational Speech. 2nd release. Columbus, OH: Department of Psychology, Ohio State University. http://www.buckeyecorpus.osu.edu/.
The Buckeye Corpus of Conversational Speech contains high-quality recordings from 40 speakers in Columbus, Ohio, conversing freely with an interviewer. The speech has been orthographically transcribed and phonetically labeled. The audio and text files, together with time-aligned phonetic labels, are stored in a format for use with speech analysis software (Xwaves and Wavesurfer). Software for searching the transcription files is currently being written. The corpus is free for noncommercial use.
Language documentation: Archival collections
Linguists currently at Berkeley have mostly archived language documentation materials in the California Language Archive (UC Berkeley), the Archive of the Indigenous Languages of Latin America (University of Texas at Austin), and the Endangered Languages Archive (SOAS, University of London).
- [multiple languages, 1964] Leanne Hinton. Southwest Indian sound recordings. Phoebe A. Hearst Museum of Anthropology, UC Berkeley, PHM 19, http://cla.berkeley.edu/collection/11019.
Scope and content: 133 recordings of Apache, Cahuilla, Chemehuevi, Havasupai, Mojave, Navajo, and Yavapai songs.
- [multiple languages, 1972-2004] Leanne Hinton. Leanne Hinton Papers on Indigenous Languages of the Americas. Survey of California and Other Indian Languages, UC Berkeley, SCL Hinton, http://cla.berkeley.edu/collection/26.
Scope and content: The Papers consist primarily of Leanne Hinton's notes and related documents and recordings from linguistics field methods classes held at the University of California, Berkeley and the University of California, San Diego. This includes materials for Navajo, Quechua, Ashaninka Campa, Hopi, Q'anjob'al, K'ichean, Mixtec, Yowlumne Yokuts, Paraguayan Guaraní, and Yucatec Maya. Also included are materials related to the Yahi Translation Project.
- [Ashaninka (cni), 1983] Abel Chapay, Leanne Hinton. Campa sound recordings. Berkeley Language Center, UC Berkeley, LA 161, http://cla.berkeley.edu/collection/10173.
Scope and content: Linguistic field recordings: linguistic data; stories; ethnographic data; songs. Spanish glosses provided; some translation, discussion and explanations in Spanish.
- [Ayacucho Quechua (quy), South Bolivian Quechua (quh), 1979] Robert Aronowitz, Karen Beeman, Ditri Daza, Leanne Hinton, Tom Larsen, Laura Runi, Margarida Salomao, Mark Seligman, David Leedom Shaul. Berkeley Field Methods: Quechua Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL 2015-02, http://dx.doi.org/doi:10.7297/X2CZ354M.
Scope and content: This collection consists of 12 digitized audio recordings focusing largely on the recitation of basic words and phrases, as well as some more complex sentences. It also includes dialogic and poetic texts.
- [Aymara, 2014-2015] Kenneth Baclawski, Spencer Lamoureux, Herman H. Leung, Lev Michael, Zachary O'Hagan, Alfonso Otaegui, Nicholas Rolle, Kamala Russell, Hannah Sande, Eva Schinzel, Amalia Horan Skilton, Hector Zapana Almanza. Berkeley Field Methods: Aymara. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-10, http://dx.doi.org/doi:10.7297/X2S180HS.
Scope and content: This collection consists of audio recordings and scanned copies of field notes that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year. Some texts are included.
- [Caquinte (cot), 2011-] Zachary O'Hagan, Antonina Salazar Torres, Joy Salazar Torres, Emilia Sergio Salazar, Miguel Sergio Salazar. Caquinte Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-13, http://cla.berkeley.edu/collection/11102.
Scope and content: Audio recordings of elicitation sessions, and of autobiographical and traditional narrative texts; field notes; ancillary documents.
- [CiShingini (asg), 2015] Geoff Bacon, Seth Ango Liatu, Ishaya Musa, Nicholas Rolle, Mark Sunday, Joshua Zaure. CiShingini (Agwara Kambari) Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2015-05, http://dx.doi.org/doi:10.7297/X23N21JC.
Scope and content: Audio files, field notes, metadata spreadsheets, tone database.
- [Havasupai (yuf), 1964-1985] Dan Hanna, Leanne Hinton. Havasupai sound recordings. Berkeley Language Center, UC Berkeley, LA 189, http://cla.berkeley.edu/collection/10022.
Scope and content: 109 recordings of songs, discussion, and narrative.
- [Iquito (iqu), 2002-2004] Christine Beier, Lev Michael. Iquito Language Documentation Project Collection. Archive of the Indigenous Languages of Latin America, University of Texas at Austin, https://www.ailla.utexas.org/islandora/object/ailla%3A124377.
Scope and content: This collection was produced by the Iquito Language Documentation Project of the University of Texas at Austin. It consists of audio and video recordings, most of which have restricted access levels.
- [Karuk (kyh), 2010-] LuLu Alexander, Tamara Alexander, Sonny Davis, Andrew Garrett, Erik H. Maier, Line Mikkelsen, Crystal Richardson, Clare Sandy, Vina Smith, Florrine Super, Charlie Thom Sr. Materials of the Berkeley Karuk Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2017-04, http://cla.berkeley.edu/collection/11150.
Scope and content: The collection consists mainly of field recordings made by Berkeley faculty and students with Karuk elders as well as younger language learners and second-language speakers. Most of the items in the collection are organized as follows: recordings made on a single research trip (on one or more days) are bundled together as digital assets of a single item. One item in the collection contains grant applications (e.g. for a National Science Foundation grant); another item contains handouts and posters from conference presentations by Berkeley project participants. The field recordings include a wide range of texts, text types, and methodologies (elicitation, free texts, responses to stimuli, discussion of legacy recordings); they cover a variety of linguistic topics (phonetics, phonology, morphology, syntax, and semantics).
- [Kashaya (kju), 1989-1990] Eugene Buckley, David Gamon, Kira Hall, Leanne Hinton, Milton "Bun" Lucas, Robert L. Oswalt. Berkeley Field Methods: Kashaya Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-17, http://cla.berkeley.edu/collection/11106.
Scope and content: This collection consists of 32 digitized audio recordings that derive from elicitation sessions conducted during class meetings held throughout the course of the academic year.
- [Karipuna (kuq) and Uru-Eu-Wau-Wau (urz), 2017-] Wesley dos Santos, Aripã Karipuna, Katika Karipuna, Boakara Uru-Eu-Wau-Wau, Pajajup Uru-Eu-Wau-Wau, Mandá Uru-Eu-Wau-Wau, Boreá Uru-Eu-Wau-Wau. Kawahiva Language Documentation Archive. Survey of California and Other Indian Languages, UC Berkeley, SCL 2019-06, http://dx.doi.org/doi:10.7297/X2P26W9H.
Scope and content: Video and audio recordings of elicitation sessions, and of traditional narrative texts, personal life stories and songs; field notes.
- [Lahu (lhu), 1965-1966] James Matisoff. Lahu sound recordings. Berkeley Language Center, UC Berkeley, LA 50, http://cla.berkeley.edu/collection/10138.
Scope and content: 33 recordings.
- [Máíhĩ̵̀kì (ore), 2009-2015] Christine Beier, Stephanie Farmer, Greg Finley, Elizabeth Goodrich, Lev Michael, Kelsey Neely, Grace Neveu, Amalia Horan Skilton, John Sylak. Materials of the Berkeley Máíhĩ̵̀kì Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2013-02, http://dx.doi.org/doi:10.7297/X2DR2SGD.
Scope and content: This collection includes primary materials (e.g., audio and video recordings), derived products (e.g., transcriptions and translations), and linguistic analyses of Máíhĩ̵̀kì produced by the Máíhĩ̵̀kì Project, which was launched in June 2010, and is currently ongoing (as of September 2015). File bundle 2013-02.141 contains an index that indicates the file bundle location of each media file and each of its associated annotation files as of September 13, 2015.
- [Matsigenka (mcb), 2011] Christine Beier, Lev Michael, Haroldo Vargas Pereira, José Vargas Pereira. Materials of the Berkeley Matsigenka Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2013-03, http://cla.berkeley.edu/collection/11087.
Scope and content: Nine notebooks.
- [Matsigenka (mcb), 2011] Christine Beier, Lev Michael, Zachary O'Hagan, Haroldo Vargas Pereira, José Vargas Pereira. Matsigenka Texts (2011). Archive of the Indigenous Languages of Latin America, University of Texas at Austin, https://www.ailla.utexas.org/islandora/object/ailla%3A134948.
Scope and content: This resource is derived from six notebooks containing hand-written texts authored by José Vargas Pereira (1 notebook, 4 of its texts in this deposit) and Haroldo Vargas Pereira (5 notebooks, 166 of their texts in this deposit) between February and May 2011. All the texts are written in Matsigenka with line-by-line free translations in Spanish (Peruvian Castellano). The authors are brothers; both are fluent in Matsigenka and Spanish. The texts include (1) traditional Matsigenka narratives, (2) more recent historical narratives, (3) auto-ethnographic descriptions of Matsigenka culture and practices, and (4) personal narratives. These notebooks (plus three more by Haroldo Vargas Pereira that are not included in this deposit) were solicited of the authors by Michael and Beier in order to help develop a corpus of Matsigenka texts, and were paid for as part of Michael's research project "Documentation and Analysis of Matsigenka, an endangered Amazonian Language", funded by a Hellman Family Faculty Fund Award in 2010. Typing of the hand-written texts was done primarily by Beier. The typed texts were reviewed line by line by either Michael or Beier in consultation with the authors, and were corrected as necessary. The reviewed and corrected texts were then parsed in FLEx (Fieldworks Language Explorer) by either Michael or O’Hagan.
- [Métchif (crg), 1977-1987] Eva Herman, Veronica Jerome, Richard A. Rhodes, Rosie Swenson, George Wilke. Métchif Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL 2018-01, http://dx.doi.org/doi:10.7297/X2CZ35B9.
Scope and content: Audio files containing recordings of elicitation and texts in Métchif. Item descriptions were taken from metadata provided in the .csv file in Item 2018-01.001; question marks in the Item descriptions reflect question marks in the original metadata file.
- [Murato, 2011] Juan Cahuaza Mucushúa, Zachary O'Hagan. Murato Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2016-11, http://cla.berkeley.edu/collection/11141.
Scope and content: Audio recordings of elicitation sessions targeting lexicon, grammar, and history; field notes.
- [Nafaanra (nfr), 2017-] Job Kwabena Ababio, James Anane, Sampson Kwasi Attah, Karee Garvin, Charles Munufie. Nafaanra Documentation Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2017-11, http://dx.doi.org/doi:10.7297/X2V98672.
Scope and content: This collection consists of Nafaanra materials collected by Karee Garvin from June 2017 through August 2018 in Banda Ahenkro, Ghana. Materials include sound recordings and field notes. Each file bundle contains grammatical elicitation, lexical elicitation, narratives, or phonetic word list data. The content of individual elicitation sessions is described in the Description field. The respective contents of the field notes from each file bundle can also be found in the continuous notebook file, Field Notes: 2017-11.002.
- [Nanti (cox), 2002-2005] Christine Beier, Lev Michael. Nanti Collection of Christine Beier and Lev Michael. Archive of the Indigenous Languages of Latin America, University of Texas at Austin, https://www.ailla.utexas.org/islandora/object/ailla%3A124377.
Scope and content: This collection consists of audio recordings with transcriptions and translations.
- [Nez Perce (nez), 1960-2014] Haruo Aoki, Agnes Moses, Sam Watters, Harry Wheeler, Ida James Wheeler, Elizabeth P. Wilson. Haruo Aoki Papers on the Nez Perce Language. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-12, http://cla.berkeley.edu/collection/11101.
Scope and content: These papers document the linguistic work of Haruo Aoki on the Nez Perce language, including materials related to his original fieldwork as well as materials he derived from other researchers’ recordings of Nez Perce. Aoki conducted fieldwork on Nez Perce during the summers of 1960 through 1972 at Kooskia and Kamiah, Idaho, during which time his primary consultants were Harry Wheeler, Ida James Wheeler, and Elizabeth P. Wilson. Included in this collection are Aoki’s original field notes and notebooks from this time period, containing vocabulary and elicited sentences; also included are grammatical notes, word lists, and research articles he derived from these materials. The collection also includes Haruo Aoki’s transcriptions, with glosses, of Nez Perce texts that were originally recorded by Sven Liljeblad and Deward E. Walker in 1966-1967. The primary consultants for these texts were Agnes Moses, Sam Watters, and Elizabeth P. Wilson. In addition to original work on Nez Perce, a range of other materials related to Aoki’s professional activities, personal life, and linguistic interests are also included in the collection. Of biographical relevance, the collection includes Aoki’s autobiography, correspondence relating to some of Aoki’s professional activities, and papers Aoki wrote on non-linguistic topics, specifically English and English literature and critical writing concerning Japanese cultural heritage (in Japanese). As well, the collection includes a large amount of material that was gathered from outside sources such as museums, societies, and libraries by Aoki throughout his research on various language families. These obtained materials include: papers, photocopies of notebooks, and historical documents on Nez Perce and other Sahaptian languages; primary materials on Molalla; Edward Sapir’s Takelma note cards; and materials concerning comparative work on Na-Dene and Sino-Tibetan. 
Finally, the collection includes Aoki’s work on a previously undescribed Nagasaki dialect of Japanese, including a set of notebooks and a research manuscript.
- [Nez Perce (nez), 1959-1961] Haruo Aoki. Nez Perce sound recordings. Berkeley Language Center, UC Berkeley, LA 70, http://cla.berkeley.edu/collection/10151.
Scope and content: Linguistic field recordings: linguistic data; stories; ethnographic data; songs; additional ethnographic or ethnohistorical texts; conversation; reminiscences; untitled texts. Some portions have English glosses.
- [Northern Paiute (pao), 2005-2006] Molly Babel, Grace Dick, Leona Cluette Dick, Andrew Garrett, Joyce Glazier, Erin Haynes, Michael Houser, Morris Jack, Reiko Kataoka, Fanny Liu, Elaine Lundy, Nicole Marcus, Edith McCann, Edna Mega Dick McDonald, Meg McDonald, Ruth Rouvier, Madeline Stevens, Angela Strom-Weber, Maziar Toosarvandani. Berkeley Field Methods: Northern Paiute. Survey of California and Other Indian Languages, UC Berkeley, SCL 2018-02, http://dx.doi.org/doi:10.7297/X2251GBJ.
Scope and content: This collection consists of audio recordings that derive from fieldwork in Bridgeport and Coleville, CA. The contents of the recordings include lexical elicitation, grammatical elicitation, and some texts.
- [Omagua (omg), 2003-] Lazarina Cabudivo Tuisima, Manuel Cabudivo Tuisima, Amelia Huanaquiri Tuisima, Arnaldo Huanaquiri Tuisima, Alicia Huanío Cabudivo, Lino Huanío Cabudivo, Lev Michael, Zachary O'Hagan, Clare S. Sandy, Tammy Stark, Vivian Wauters. Materials of the Omagua Documentation Project. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-01, http://cla.berkeley.edu/collection/11088.
Scope and content: Audio recordings of elicitation sessions and narrative texts; field notes; written narrative texts; derivative products (e.g., theses, dictionary drafts, conference handouts, etc.); preliminary grammatical descriptions; FLEx back-ups; historical and genealogical materials; grant proposals and budgets; personal correspondence; research products on colonial-era Old Omagua (OOMG) and Proto-Omagua-Kokama (POK).
- [Omurano (omu), 2011-2013] Rafael Inuma Macusi, Simón Inuma Manizari, Teolinda Inuma Vela, Jorge Macusi Nuribe, José Manuel Macusi Nuribe, Juan Macusi Nuribe, Francisco Murayari Macusi, Zachary O'Hagan. Omurano Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2014-14, http://dx.doi.org/doi:10.7297/X29K488M.
Scope and content: Audio recordings, scanned field notes, and photographs that derive from interviews and elicitation sessions concerning the Omurano language and the regional history of the Urituyacu river basin.
- [Omurano (omu), Urarina (ura), 2013] Oscar Inuma Macusi, Juana Macusi Manizari, Jorge Macusi Nuribe, Zachary O'Hagan, Elsa Vela Clemente, Ignacio Vela Nuribe. Songs from the Urituyacu River. Survey of California and Other Indian Languages, UC Berkeley, SCL 2016-05, http://dx.doi.org/doi:10.7297/X2GH9FZD.
Scope and content: Twelve songs, some with multiple versions, in Urarina and Omurano, from five singers.
- [Panará (kre), 2016] Bernat Bardagil-Mas, Myriam Lapierre. A Digital Documentation of Panará. Endangered Languages Archive, SOAS, University of London, https://wurin.lis.soas.ac.uk/Collection/MPI945311.
Scope and content: This collection is made up of a series of bundles that minimally contain media files (audio and/or video recordings) and metadata. ELAN files and PDF documents are also included.
- [Q'anjob'al (kjb), 1986-1987] Natasha Beery, Norine Berenz, David J. Costa, Ernesto Diaz-Couder, Leanne Hinton, Jeanie Lerner, Birch Moonwomon, Tony Moy, Rafael Pascual, Jean Perry. Berkeley Field Methods: Kanjobal Sound Recordings. Berkeley Language Center, UC Berkeley, 2016-01, http://dx.doi.org/doi:10.7297/X2639MQQ.
Scope and content: This collection consists of 121 digitized audio recordings that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year.
- [San Miguel El Grande Mixtec (mig), 1985] Margarita Cuevas Cortes, Leanne Hinton, Monica Ann Macaulay. Mixtec sound recordings. Berkeley Language Center, UC Berkeley, LA 177, http://cla.berkeley.edu/collection/10111.
Scope and content: Linguistic field recordings: linguistic data; ethnographic data. Glosses provided in Spanish; discussion and explanations in Spanish.
- [South Bolivian Quechua (quh), 2016-2017] Margaret Cychosz, Efrain Escobar, Dmetri Hayes, Myriam Lapierre, Tyler Lau, Lev Michael, Julia Nee, Emily Remirez. Berkeley Field Methods: South Bolivian Quechua. Survey of California and Other Indian Languages, UC Berkeley, SCL 2016-13, http://cla.berkeley.edu/collection/11143.
Scope and content: Audio recordings of elicitation sessions, as well as accompanying notes. Content includes lexical and grammatical elicitation as well as texts. Some texts are transcribed in ELAN, and ELAN transcriptions are included.
- [South Bolivian Quechua (quh), 2017] Julia Nee, Fridda Ramos. Path in South Bolivian Quechua. Survey of California and Other Indian Languages, UC Berkeley, SCL 2017-03, http://cla.berkeley.edu/collection/11149.
Scope and content: Audio recordings of elicitation sessions; field notes; grant proposal and final report; collection guide.
- [Southeastern Pomo (pom), 2006-2007] Eugenia Antic, Charles Chang, Thera Crane, Donna Fenton, Hannah Haynie, Leanne Hinton, Jisup Hong, Shira Katseff, Loretta Kelsey, Russell Lee-Goldman, Lindsey Newbold, Marta Piqueras-Brunet, Yao Yao. Berkeley Field Methods: Southeastern Pomo Sound Recordings. Survey of California and Other Indian Languages, UC Berkeley, SCL LA 252, http://dx.doi.org/doi:10.7297/X2JD4TTV.
Scope and content: This collection consists of 188 audio recordings that derive from elicitation sessions conducted during biweekly class meetings held throughout the course of the academic year.
- [Taushiro (trr), 2015-] Amadeo García García, Zachary O'Hagan. Taushiro Field Materials. Survey of California and Other Indian Languages, UC Berkeley, SCL 2016-09, http://cla.berkeley.edu/collection/11139.
Scope and content: Audio recordings of elicitation sessions targeting lexicon, grammar, and history; texts; field notes; previously published materials.
- [Yakima (yak), 1961] Haruo Aoki, Donald Umtuch. Northern Sahaptin sound recordings. Berkeley Language Center, UC Berkeley, LA 71, http://cla.berkeley.edu/collection/10085.
Scope and content: Linguistic field recordings: linguistic data. Some English glosses provided.