Computational and Experimental Methods

Bleaman publishes in American Speech

May 5, 2021

Congrats to Isaac Bleaman and Dan Duncan (Newcastle University) on the publication of their article "The Gettysburg Corpus: Testing the proposition that all tense /æ/s are created equal" in American Speech. Read it here!

Beguš publishes in Neural Networks

April 21, 2021

Congrats to Gašper Beguš on the publication of his article "CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks" in Neural Networks! Click here to download the article (Open Access).

Beguš speaks at ICBS

February 23, 2021

Gašper Beguš will be giving a seminar talk at UC Berkeley's Institute of Cognitive and Brain Sciences on Friday, March 5, from 11:10am to 12pm. The title of his talk is "Modeling Language with Generative Adversarial Networks" and the abstract is below. Click here for more details. Congrats, Gašper!

Can we build models of language acquisition from raw acoustic data in an unsupervised manner? Can deep convolutional neural networks learn to generate speech using linguistically meaningful representations? In this talk, I will argue that language acquisition can be modeled with Generative Adversarial Networks (GANs) and that such modeling has implications both for the understanding of language acquisition and for the understanding of how neural networks learn internal representations. I propose a technique that allows us to wug-test neural networks trained on raw speech. I further propose an extension of the GAN architecture in which learning of meaningful linguistic units emerges from a requirement that the networks output informative data. With this model, we can test what the networks can and cannot learn, how their biases match human learning biases (by comparing behavioral data with networks’ outputs), how they represent linguistic structure internally, and what GANs' innovative outputs can teach us about productivity in human language. This talk also makes a more general case for probing deep neural networks with raw speech data, as dependencies in speech are often better understood than those in the visual domain and because behavioral data on speech acquisition are relatively easily accessible.

Beguš speaks at UC Davis PhonLab

November 4, 2020

Gašper Beguš will be speaking at the UC Davis PhonLab on Friday, November 6, at 10am on the topic "Encoding linguistic meaning into raw audio data with deep neural networks."

Bacon joins Google

October 8, 2020

Congrats to Geoff Bacon, who recently filed his dissertation "Evaluating linguistic knowledge in neural networks" and has just taken up a position as a computational linguist at Google!

Nichols colloquium

October 8, 2020

The 2020-2021 colloquium series kicks off this coming Monday, October 12, with a talk by Johanna Nichols (UC Berkeley), held via Zoom. The talk is entitled "Proper measurement of linguistic complexity (and why it matters)", and the abstract is as follows:

Hypotheses involving linguistic complexity generate interesting research in a variety of subfields – typology, historical linguistics, sociolinguistics, language acquisition, cognition, neurolinguistics, language processing, and others. Good measures of complexity in various linguistic domains are essential, then, but we have very few and those are mostly single-feature (chiefly size of phoneme inventory and morphemes per word in text).

In other ways as well what we have is not up to the task. The kind of complexity that is favored by certain sociolinguistic factors is not what is usually surveyed in studies invoking the sociolinguistic work. Phonological and morphological complexity are very strongly inversely correlated and form opposite worldwide frequency clines, yet surveys of just one or the other, or both lumped together, are used to support cross-linguistic generalizations about the distribution of complexity writ large. Complexity of derivation, syntax, and lexicon is largely unexplored. Measuring the complexity of polysynthetic languages in the right terms has not been seriously addressed.

This paper proposes a tripartite metric (enumerative, transparency-based, and relational), using a set of different assays across different parts of the grammar and lexicon, that addresses these problems and should help increase the grammatical sophistication of complexity-based hypotheses and choice of targets for computational extraction of complexity levels from corpora. Meeting current expectations of sustainability and replicability, the set is reusable, revealing, reasonably granular, and (at least mostly) amenable to computational implementation. I demonstrate its usefulness to typology and historical linguistics with some cross-linguistic and within-family surveys.

Beguš speaks at MIT

September 15, 2020

Gašper Beguš will be giving a talk at the CompLang group at MIT on Tuesday, September 22, at 5pm EDT (2pm Pacific) over Zoom (password: "Language"). Here are the title and abstract:

Modeling Language with Generative Adversarial Networks

In this talk, I argue that speech acquisition can be modeled with deep convolutional networks within the Generative Adversarial Networks framework. A proposed technique for retrieving internal representations that are phonetically or phonologically meaningful (Beguš 2020) allows us to model several processes in speech and compare outputs of the models both behaviorally and in terms of representation learning. The networks not only represent phonetic units with discretized representations (resembling the phonemic level), but also learn to encode phonological processes (resembling rule-like computation). I further propose an extension of the GAN architecture in which learning of meaningful linguistic units emerges from a requirement that the networks output informative data. I briefly present five case studies (allophonic learning, lexical learning, reduplication, iterative learning, and artificial grammar experiments) and argue that correspondence between single latent variables and meaningful linguistic content emerges. The key strategy to elicit the underlying linguistic values of latent variables is to manipulate them well outside of the training range; this allows us to actively force desired features in the output and test what types of dependencies deep convolutional networks can and cannot learn.

The advantage of this proposal is that speech acquisition is modeled in an unsupervised manner from raw acoustic data and that deep convolutional networks output not replicated, but innovative data. These innovative outputs are structured, linguistically interpretable, and highly informative. Training networks on speech data thus not only informs models of language acquisition, but also provides insights into how deep convolutional networks learn internal representations. I will also make a case that higher levels of representation such as morphology, syntax and lexical semantics can be modeled from raw acoustic data with this approach and outline directions for further experiments.

Bleaman to appear in Frontiers in Artificial Intelligence

April 23, 2020

Congrats to Isaac Bleaman, whose article "Implicit standardization in a minority language community: Real-time syntactic change among Hasidic Yiddish writers" has been accepted for publication in Frontiers in Artificial Intelligence. The article will appear in the section Language and Computation as part of the research topic Computational Sociolinguistics. Read the abstract here!