How can deep neural networks encode information that corresponds to words in human speech into raw acoustic data? This paper proposes two neural network architectures for modeling unsupervised lexical learning from raw acoustic inputs: ciwGAN (Categorical InfoWaveGAN) and fiwGAN (Featural InfoWaveGAN). These combine Deep Convolutional GAN architecture for audio data (WaveGAN; Donahue et al., 2019) with the information theoretic extension of GAN – InfoGAN (Chen et al., 2016) – and propose a new latent space structure that can model featural learning simultaneously with a higher level classification and allows for a very low-dimension vector representation of lexical items. In addition to the Generator and Discriminator networks, the architectures introduce a network that learns to retrieve latent codes from generated audio outputs. Lexical learning is thus modeled as emergent from an architecture that forces a deep neural network to output data such that unique information is retrievable from its acoustic outputs. The networks trained on lexical items from the TIMIT corpus learn to encode unique information corresponding to lexical items in the form of categorical variables in their latent space. By manipulating these variables, the network outputs specific lexical items. The network occasionally outputs innovative lexical items that violate training data, but are linguistically interpretable and highly informative for cognitive modeling and neural network interpretability. Innovative outputs suggest that phonetic and phonological representations learned by the network can be productively recombined and directly paralleled to productivity in human speech: a fiwGAN network trained on suit and dark outputs innovative start, even though it never saw start or even a [st] sequence in the training data. We also argue that setting latent featural codes to values well beyond training range results in almost categorical generation of prototypical lexical items and reveals underlying values of each latent code. Probing deep neural networks trained on well understood dependencies in speech bears implications for latent space interpretability and understanding how deep neural networks learn meaningful representations, as well as potential for unsupervised text-to-speech generation in the GAN framework.
March 19, 2021
Beguš, G. (2021). CiwGAN and fiwGAN: Encoding information in acoustic data to model lexical learning with Generative Adversarial Networks. Neural Networks, 139, 305–325. https://doi.org/https://doi.org/10.1016/j.neunet.2021.03.017