Gopala Anumanchipalli (UCSF), Josh Chartier (UCSF, Berkeley), & Edward Chang (UCSF) - "Synthesizing speech directly from the human brain"
Neurological conditions that impair one's ability to speak are debilitating. In this talk, I will detail our efforts to create technology that translates cortical activity into speech. I will begin by sharing some insights into the neural mechanisms underlying speech production and how they relate to vocal tract behavior during continuous speech articulation. I will then make a case for reconciling the neurological, articulatory, and acoustic representations of speech, which forms the basis of our biomimetic strategy towards neural speech synthesis. The neural data used in this study come from epilepsy patients undergoing electrocorticography recording, and the behavioral data (vocal tract kinematics) come from Electromagnetic Midsagittal Articulography.
Georgia Zellou, Michelle Cohn, and Bruno Ferenc Segedin (UC Davis) - Talking Tech: How does voice-AI influence human speech?
It's a new digital era: humans are now interfacing with technology using spoken language. We are regularly talking to voice-activated artificially intelligent (AI) personal assistants, such as Siri and Alexa, that spontaneously and more naturalistically produce interactive speech. Human speech patterns toward these new voice-AI interlocutors serve as a test of our scientific understanding of speech communication, language use, and even linguistic change. In this research program, we explore how interactions with voice-AI can influence human speech patterns, both for a single individual and, potentially, as a source of sound change across speech communities.
In this talk we present two case studies of how voice-AI influences human speech patterns. First, we address how human-AI interactions shape speech variation in resolving linguistic misunderstandings. Second, we investigate patterns of phonetic alignment when talking to voice-AI devices and, specifically, how this varies by speaker characteristics, such as age and gender.
Jeremy Steffman (UCLA) - Prosody mediates listeners' perception of durational cues
The temporal structure of speech is highly variable, and it's well known that listeners compute the duration of temporal cues relative to surrounding context in speech perception, allowing for flexible interpretation of duration based on how slow/fast surrounding speech is. This sort of process, known as speech rate normalization, is seen as playing an important role in speech processing given that many cues to contrasts in language are temporal (e.g., VOT, contrastive vowel length, etc.). In the classic view, speech rate normalization is seen as an automatic, domain-general process. In this talk I'll argue that speech rate normalization is in fact more flexible, and can interact with language patterns in perception. Specifically, I'll consider prosodic structure as a source of durational variability (e.g., phrase-final/accentual lengthening), and argue that listeners are sensitive to these sorts of prosodic factors in their interpretation of durational cues. This issue will be surveyed with three recent experiments, each exploring a different aspect of prosodic organization. In these experiments I show that prosodic factors can effectively override otherwise expected speech rate effects and that listeners further incorporate prosody into their computation of speech rate itself. Results will be discussed in terms of their implications for listeners' processing of durational cues, and the role language patterns/experience may play.
Meng Yang (UCLA) - Language experience and auditory enhancement in perceptual cue-weighting
Listeners come to associate acoustic-phonetic cues to speech categories either because they have experience with the co-variation between cues or because the cues share an auditory effect (i.e. are enhancing). I present results from a series of cue-weighting experiments which show how these factors, experience and enhancement, constrain listeners' ability to shift their attention between speech cues that either are enhancing (pitch and breathiness) or are not enhancing (pitch and vowel duration). Two groups of listeners differing on their language experience with these cue pairs took part in these studies: English listeners who have no experience with either cue pair, and Hani listeners who have experience with both cue pairs in the same contrast.
Ernesto Gutiérrez (UC Berkeley) - The production of coronal stops by Spanish-English bilinguals: Acoustic measurements of the dental and alveolar voiced stops
The present study investigates the effects of word frequency, cognate status, and jargon on the production of /d/ in the speech of Spanish-English bilingual speakers. The production of coronal stops by Spanish monolinguals has been shown to differ in place of articulation from the production of the corresponding phones by English monolinguals, with a dental and alveolar articulation, respectively. I examined the acoustic properties of voiced coronal stops, as produced by 11 L1-Spanish, L2-English bilinguals. Relative intensity and the four spectral moments (i.e. center of gravity, standard deviation, skewness, and kurtosis) of d-initial Spanish and English words were analyzed. Mixed-effects linear regression models showed a difference in relative intensity between voiced coronal stops in Spanish and English words, as well as a difference in skewness and kurtosis as a function of language, which I argue to be suggestive of two separate places of articulation of coronal stops for Spanish-English bilinguals (one per language).
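The four spectral moments named in the abstract above can be illustrated with a minimal sketch. It assumes a power spectrum is already available as paired lists of frequencies and power values, and it treats the normalised power as a probability distribution over frequency. The function name, the toy spectrum, and the choice to report excess kurtosis (fourth standardised moment minus 3) are assumptions for illustration; the study's own measurement settings are not given in the abstract.

```python
import math


def spectral_moments(freqs, power):
    """Return (centre of gravity, standard deviation, skewness, excess kurtosis).

    freqs: frequencies in Hz; power: non-negative power at each frequency.
    The power values are normalised to sum to 1 and used as weights.
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    weights = [p / total for p in power]

    # First moment: the power-weighted mean frequency.
    cog = sum(f * w for f, w in zip(freqs, weights))

    def central(k):
        # k-th central moment around the centre of gravity.
        return sum(((f - cog) ** k) * w for f, w in zip(freqs, weights))

    variance = central(2)
    if variance == 0:
        # All energy at one frequency: higher moments are undefined.
        return cog, 0.0, math.nan, math.nan

    sd = math.sqrt(variance)
    skewness = central(3) / sd ** 3
    excess_kurtosis = central(4) / variance ** 2 - 3.0
    return cog, sd, skewness, excess_kurtosis


# Hypothetical three-bin spectrum, for illustration only (not data from the study).
toy_freqs = [1000.0, 2000.0, 3000.0]
toy_power = [6.0, 3.0, 1.0]
cog, sd, skewness, kurtosis = spectral_moments(toy_freqs, toy_power)
```

In this toy spectrum most of the energy sits in the lowest bin with a tail toward higher frequencies, so the skewness comes out positive.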
Kate Lindsey (Stanford) - Is Ende reduplication phonological copying or morphological doubling?
Ende infinitival verbs are an interesting puzzle for the Dual Theory of reduplication (Inkelas 2008), which distinguishes phonological and morphological doubling as formally and functionally distinct phenomena. In Ende, infinitival reduplication is sensitive to phonological structure (monosyllabic verb roots reduplicate, multisyllabic verb roots do not reduplicate) and to morphological structure (mono-morphemic verb roots reduplicate, multimorphemic verb roots do not). The shape of the reduplicant may be phonologically-conditioned (CV template, TETU patterns) or morphologically-conditioned (total reduplication, no TETU patterns). In this talk, I will contrast the two potential analyses, showing that neither a strictly phonological nor a strictly morphological analysis can account for all the data, and suggest an alternative mixed approach.
This paper proposes to subsume Syntax-Prosody Match Theory under General Correspondence Theory, which distinguishes purely existential MAX/DEP constraints (requiring nothing but the existence of a correspondent in the output/input, which can be rather different from the input element) from IDENT and other faithfulness constraints. Exact correspondence (preservation of edges, no deletion, no insertion, uniqueness of mapping, order preservation, etc.) is enforced by Syntax-Prosody and Prosody-Syntax Alignment and by standard Faithfulness. The empirical topic is the impossibility of phrase-final enclisis in English (*I don't know where Tom's vs. Tom's here) and its proper explanation.
Meg Cychosz (UCB) - The Lexical advantage: Kids learn words, not sounds
A critical question in phonological theory is how speech representations develop throughout childhood. In a traditional view, children acquire phonology from the bottom up, beginning with the production and perception of individual sounds. Eventually, children learn to string these phones together to construct words and build a lexicon (Berent, 2013; Dinnsen & Gierut, 2008; Jakobson, 1941/1968; Jusczyk et al., 2002). Alternative accounts suggest that children construct speech representations gradually, by generalizing over language chunks such as words or syllables (Edwards et al., 2004; Ferguson & Farwell, 1975; Gathercole et al., 1991; Metsala & Walley, 1998). If children do rely on words to construct phonological representations, we should anticipate an interaction of speech production with children’s vocabulary size and language input. Specifically, children with larger vocabularies, who receive more environmental stimuli in the input, should have more abstract segmental representations. We test this hypothesis in four- and five-year-old children who completed nonword and real word repetition tasks. The children produced the real words more accurately, and with less coarticulation, than the nonwords. Performance interacted with children’s vocabulary size and the number of words heard in the input, which we take as evidence for the primacy of the lexicon in early phonological development.
Yulia Oganian (UCSF) - A Temporal landmark for syllabic representations of continuous speech in human superior temporal gyrus
The most salient acoustic features in the speech signal are the peaks and valleys that define the amplitude envelope. Perceptually, the envelope is necessary for speech comprehension. Yet the neural computations that represent the envelope, and their linguistic implications, are heavily debated. A widely held theory is that the amplitude envelope underlies segmentation of speech into syllabic units (e.g. /seg/-/men/-/ta/-/tion/), as speech amplitude peaks in syllabic centers (vowels) and reaches local minima around syllabic boundaries. In contrast, animal studies suggest that neural encoding of the speech envelope selectively represents timepoints of rapid increases in the envelope, or the continuous moment-by-moment envelope itself. I will describe a series of three experiments using high-density human intracranial recordings that address this debate. In two experiments, participants listened to natural speech at regular and slowed speech rates. Neural responses at all speech rates were driven by onset edges in the speech signal, i.e. local peaks in the first derivative of the speech envelope. A follow-up experiment with amplitude-modulated non-speech tones confirmed this result: neural responses were time-locked to onsets of amplitude modulation ramps and larger for faster amplitude rises. Finally, acoustic analysis of natural speech revealed that 1) auditory edges reliably cue the information-rich transition between the consonant onset and vowel nucleus of syllables, and 2) the sharpness of edges, encoded in the magnitude of neural responses, cues lexical stress. In summary, our findings establish that encoding of auditory edges in human STG underlies the perception of the temporal structure of speech.
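The notion of an onset edge as a local peak in the first derivative of the envelope can be sketched as follows. This is a minimal illustration, not the analysis pipeline used in the study: it assumes the amplitude envelope has already been extracted and sampled at a known rate, approximates the derivative with a first difference, and uses a hypothetical function name, threshold parameter, and toy envelope.

```python
def onset_edges(envelope, sample_rate, min_rate=0.0):
    """Return a list of (time_in_seconds, rate_of_rise) for each onset edge.

    An edge is a local maximum of the envelope's first difference that
    exceeds min_rate (in amplitude units per second).
    """
    # deriv[i] is the rate of change between sample i and sample i + 1.
    deriv = [(b - a) * sample_rate for a, b in zip(envelope, envelope[1:])]

    edges = []
    for i in range(1, len(deriv) - 1):
        is_rising = deriv[i] > min_rate
        # On a plateau of equal values, only the first sample counts as the peak.
        is_peak = deriv[i] > deriv[i - 1] and deriv[i] >= deriv[i + 1]
        if is_rising and is_peak:
            edges.append((i / sample_rate, deriv[i]))
    return edges


# Hypothetical envelope with two syllable-like bumps (arbitrary amplitude units).
toy_envelope = [0, 0, 1, 5, 9, 10, 9, 5, 2, 1, 2, 4, 5, 5, 4]
edges = onset_edges(toy_envelope, sample_rate=100)
```

Each returned rate value corresponds to the sharpness of the edge, the quantity that the abstract relates to the magnitude of neural responses; the steeper first bump in the toy envelope yields the larger value.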
Daniel Silverman (San Jose State University) - Evolution of the speech code: Higher order symbolism and the grammatical Big Bang
As our ancestors innovatively juxtaposed one meaning-bearing sound to another, a huge increase in the inventory of speech sounds was triggered. Still, sporadic semantic ambiguity required deeper structural analyses in order for listeners to extract intended meanings, culminating in the emergence of compositional, post-compositional, and ultimately hierarchically-arranged and recursive constituent structures. These primordial pressures and their yielded structures, in remarkably similar function and form, continue to constrain, shape, and change the speech code to this very day. The early juxtaposition of two meaning-bearing sounds was thus both necessary and sufficient for full-blown grammatical complexity to evolve, triggering a grammatical “Big Bang”.
Santiago Barreda-Castanon (UC Davis) - Speech perception and apparent-speaker characteristics
The perception of speech sounds may be inherently related to the perception of apparent speaker characteristics in a way that facilitates both processes. I will present a general sketch of how the joint consideration of the speaker and the utterance might work, and suggest that this behavior may have an ecological basis. Finally, I will discuss the similarities between vowel normalization and perceptual constancy in the visual domain, and the way that speakers appear to facilitate speech perception by restricting the sorts of variation that they produce.
Katherine Demuth (Macquarie University) - Resolving Variation: Listeners, Learners & Grammar
Researchers have long been aware of the 'invariance' problem, where listeners and learners must determine underlying representations from variable surface forms. Some of this variation may be contextually induced, or may be a result of different speakers, different dialects or different speaking conditions. Much of this research has focussed on the level of segments/phonemes. Much less is known about how similar types of variability are dealt with by children in the morphological domain. This talk will explore some of these challenges, reporting on recent findings exploring listeners' and learners' sensitivity to allomorphic variation and more. These results suggest that young learners are highly sensitive to all kinds of variation, rapidly constructing robust linguistic representations despite surface variation.
Khalil Iskarous (USC) - The Space of Optimality Theories
When trying to understand a deductive scientific theory, and the predictions of that theory, it’s important to appreciate the space in which that theory lives. Knowing the larger space of theories allows one to determine if the predictions made are special to the deductive theory under consideration alone, or to a whole number of theories living in the same space. If a whole number of similar theories in the space predict similar outcomes, we can determine a deeper deductive theory. Optimality Theory, for instance, has been embedded in larger spaces which include Harmonic Serialist theories and Harmonic Grammar theories. This has helped in understanding the original theory. In this presentation, I will embed OT in a larger space that contains mathematical physical theories, the space of dynamical systems. In the Newtonian approach to these theories, time (discrete or continuous) is a crucial variable, just as in transformational phonology. In the Leibnizian approach, on the other hand, time does not enter, just as in OT. These two approaches, and their equivalence for many problems in mathematical physics, will be introduced from scratch. From within this dynamical space, we will see that OT requires us to consider two related dynamical systems: 1) one where the dynamical variable is a representation/candidate; 2) one where the dynamical variable is the salience of a constraint. This will allow the consideration of theories where constraints are not atomic and local, but distributed throughout a representation. The moral of the story will be that there exist OT-theoretic approaches where the computational and representational aspects of phonology are organically integrated.
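As background to the point that neighbouring theories in the same space can make different predictions, the sketch below contrasts how classic OT and Harmonic Grammar pick a winner from the same violation profiles. It is a toy illustration only and does not implement the dynamical-systems formulation proposed in the talk; the candidate names, constraint names, ranking, and weights are all hypothetical.

```python
def ot_winner(candidates, ranking):
    """Classic OT: strict domination.

    Violation profiles are compared constraint by constraint in ranking
    order, so no number of lower-ranked violations can outweigh a single
    higher-ranked one.
    """
    def profile(name):
        return tuple(candidates[name].get(c, 0) for c in ranking)
    return min(candidates, key=profile)


def hg_winner(candidates, weights):
    """Harmonic Grammar: the candidate with the lowest weighted sum of
    violations wins, so lower-weighted violations can add up."""
    def penalty(name):
        return sum(w * candidates[name].get(c, 0) for c, w in weights.items())
    return min(candidates, key=penalty)


# Hypothetical violation counts for two candidates and two constraints.
toy_candidates = {
    "cand_a": {"C1": 1},  # one violation of the higher constraint
    "cand_b": {"C2": 3},  # three violations of the lower constraint
}

ot_choice = ot_winner(toy_candidates, ranking=["C1", "C2"])
hg_choice = hg_winner(toy_candidates, weights={"C1": 2.0, "C2": 1.0})
```

With these made-up numbers the two evaluators disagree: strict ranking selects cand_b, while the weighted sum (2.0 against 3.0) selects cand_a.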
Various (UC Berkeley) - Practice talks for AMP/NWAV/CILLA
Alice Shen (UC Berkeley) - Switch costs in Mandarin-English bilingual auditory comprehension
Previous bilingual research has reported a switch cost in the recognition of spoken Spanish-English code-switched words, though this cost can be modulated by the presence of anticipatory phonetic cues such as shifts in VOT or intonation (Piccinini & Garellek, 2014; Fricke et al., 2016) or by dominant language (Olson, 2017). This study investigates the roles of anticipatory phonetic cues and dominant language in whether auditory comprehension of Mandarin and English code-switches necessarily incurs processing costs. Code-switched target words were spliced from originally code-switched sentences into originally unilingual sentences in two eye-tracking experiments to test the effects of withholding anticipatory phonetic cues on code-switched word recognition. Bilingual participants’ language dominance scores were calculated via the Bilingual Language Profile (Birdsong et al., 2012). Results suggest that the ease of processing a code-switched word is dependent on the language of the word and the listener’s dominant language. Acoustic analysis of both speakers’ stimuli suggests that the functionality of f0 as an anticipatory phonetic cue might depend on the speaker’s tone-specific patterns of tonal coarticulation.
Andrew Cheng (UC Berkeley) - A cross-linguistic comparison of back vowels in English-Korean bilinguals
In this talk, I will share some findings from an investigation of back vowels in a pool of English-Korean bilinguals, focusing on evidence for and against a "shared phonological system" in certain speakers.
October 28 (postponed)
Maho Morimoto (UC Santa Cruz) - Preservation of liquid geminates in Japanese loanwords from Italian
Loanwords from Italian are one of the few contexts in which liquid geminates are attested in Japanese phonology. In this talk, I summarize the occurrence patterns of liquid geminates in this context, as not all of the liquid geminates in the source language survive adaptation. I then discuss EMA and acoustic data of geminated liquids in Japanese to examine the realization of this marginal length contrast.
John Harris (UCL) - How much of what phonologists know about do speakers know? The learnability of a simple, regular, unnatural sound pattern in English
There is a well-established collection of speaker-independent methods for discovering phonotactic patterns in languages, e.g. comparative reconstruction, phonological analysis, and computational learning. There is also an increasingly varied collection of experimental methods for ascertaining how much of this patterning is actually internalised by speaker-hearers. In seeking to determine what makes a phonotactic pattern learnable or not, researchers have focused on a variety of factors, including phonological regularity, productivity, naturalness, and formal simplicity. Experimental studies have investigated various permutations of these factors, with results that are more or less surprising. For example, speakers have been shown to have internalised and to be able to productively apply (a) patterns that are regular, simple and natural (e.g. wug tests of English -s) but also (b) patterns that are irregular, relatively complex and not synchronically natural, such as English velar softening (e.g. Ohala 1974, Pierrehumbert 2006).
In this paper, we examine the English phonotactic pattern where consonants following /aw/ are restricted to coronals; hence tout, but not */tawk/, */tawp/ (e.g. Halle & Clements 1983). The pattern (‘awT’) is pretty regular, more so than velar softening. It is general, in that it affects a large swath of the lexicon. It is formally quite simple, arguably more so than the -s pattern. And it is not natural. It is the synchronically accidental outcome of a series of largely unrelated sound changes; each of the changes might be natural, but their cumulative effect is not. Moreover, the pattern is readily overturned in closely related Germanic languages (cf. German Raum, taub, Rauch, saugen, schaukeln, or Scots cowp, bowk).
We report the results of a non-word judgement experiment designed to test the extent to which native speakers of English have tacit knowledge of the awT pattern. Native British English listeners were presented with non-word stimuli containing the diphthongs /aw/ (MOUTH), /ow/ (GOAT), /ij/ (FLEECE), followed by a range of consonants, and were asked to rate how English-like they sounded. The selection of the non-words was controlled for lexical neighbourhood density, weighted by frequency.
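One common way to operationalise frequency-weighted lexical neighbourhood density is sketched below. The abstract does not say which measure was used, so everything here is an assumption: a neighbour is taken to be any lexicon entry that differs from the non-word by one segment substitution, deletion, or insertion, and each neighbour contributes log10(frequency + 1). The transcriptions and counts in the toy lexicon are invented for illustration and are not data from the study.

```python
import math


def is_neighbour(a, b):
    """True if b differs from a by exactly one segment substitution,
    deletion, or insertion. Words are tuples of segment labels."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # Deleting one segment from the longer form must yield the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def weighted_density(nonword, lexicon):
    """Sum of log10(frequency + 1) over all lexical neighbours of nonword."""
    return sum(
        math.log10(freq + 1)
        for word, freq in lexicon.items()
        if is_neighbour(nonword, word)
    )


# Hypothetical lexicon: segment tuples mapped to made-up frequency counts.
toy_lexicon = {
    ("t", "aw", "t"): 99,    # tout
    ("t", "aw", "n"): 999,   # town
    ("sh", "aw", "t"): 9,    # shout
    ("k", "ae", "t"): 50,    # cat
}

density = weighted_density(("t", "aw", "k"), toy_lexicon)
```

Representing words as tuples of segments keeps a diphthong such as /aw/ as a single unit, so the non-word /tawk/ counts tout and town as neighbours but not shout or cat.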
The question of whether speakers have implicit knowledge of a given phonotactic pattern can be approached in two stages: (a) do they have any tacit awareness of the pattern at all and, if so, (b) is the awareness commensurate with the pattern being stored as a grammatical rule? Broadly speaking, the results of the rating experiment show weak evidence of an awareness of awT but little or no evidence that this reflects grammaticalised knowledge. That is, to the extent that speakers have any tacit inkling of the pattern at all, it is probably not encapsulated in anything like a phonologist’s rule or constraint. Where a coronal preference is detectable, it is not specific to /aw/. Moreover, it is influenced by onset size and lexical neighbourhood factors, which suggests subjects were making on-the-fly judgements of how much the non-words resemble real words.
We conclude that awT is a case where phonologists know more about a phonotactic pattern than speakers know. In the light of our results, we consider whether this should be attributed to the fact that awT is not natural (cf. Hayes & White 2013) or to other factors, such as that it is not involved in alternations. The study has a cautionary tale to tell phonologists: before building a formal account of any given phonological pattern, we need to be confident that it has indeed been internalised by native speakers.
Hannah Sande (Georgetown) - Ebrié cross-word nasal harmony
Lexical models of phonology evaluate words or sub-word units, with only exceptionless 'post-lexical' phonology applying after words have been concatenated. Such models struggle to account for phrasal phonology, or phonological alternations that cross word boundaries, in particular when such phenomena are sensitive to the identity of morphemes present. There have been many phrasal tone processes of this type reported (see Sande et al. 2019 for an overview and analysis), though very few cross-word segmental processes are attested. I present data collected with three speakers of Ebrié (Kwa) in Côte d'Ivoire showing a cross-word nasal harmony process that affects both consonants and vowels: [àká ɓà lé ɓá], 'Aka will not come'; [à̃ mà̃ né̃ má̃], 'She will not come'. I demonstrate that 1) only certain morphemes containing nasal features can trigger cross-word harmony, and 2) nasal spreading can be blocked by phonological obstacles as well as syntactic or prosodic ones. Word-based models cannot account for this type of morpheme-specific cross-word phonology, so I propose that a prosodic or phase-based spell-out approach be adopted instead.
Gasper Begus (University of Washington) - "Generative Adversarial Networks, phonetics, phonology, and sound change"
Applications of deep learning have recently seen exponential growth in computational cognitive science and linguistics, but the vast majority of models operate exclusively on syntactic, semantic, or symbolic levels. In this talk, I propose that phonetic and phonological learning can be modeled as a dependency between random space and data generated by the Generative Adversarial Networks (Goodfellow et al. 2014, Radford et al. 2015, Donahue et al. 2019). The advantage of this approach is that the networks are trained on raw acoustic data in a completely unsupervised manner with no pre-assumed levels of abstraction and that phonetic and phonological learning are thus modeled simultaneously. I argue that the network learns an allophonic distribution -- the distribution of aspiration in English. The learning is, however, imperfect, and the network occasionally generates innovative outputs that violate the training data, but closely resemble imperfect learning in L1 acquisition.
I additionally propose a method for uncovering the network's internal representations and argue that the network learns to encode phonetic and phonological information in its latent space. For example, the proposed method identifies variables in the latent space that have parallels in phonetic and phonological features; by manipulating a specific variable, we can actively force certain sounds (such as [s]) in the output and control their amplitudes and spectral properties. The talk will also discuss how the network’s architecture and innovative outputs resemble and differ from linguistic behavior in language acquisition, speech disorders, and speech errors.