We propose a new model for speaker-independent vowel recognition which uses the flexibility of the dynamic linking that results from the synchronization of oscillating neural units. The system consists of an input layer and three neural layers, which are referred to as the A-, B- and C-centers. The input signals are a time series of linear prediction (LPC) spectrum envelopes of auditory signals. At each time-window within the series, the A-center receives input signals and extracts local peaks of the spectrum envelope, i.e., formants, and encodes them into local groups of independent oscillations. Speaker-independent vowel characteristics are embedded as a connection matrix in the B-center according to statistical data of Japanese vowels. The associative interaction in the B-center and reciprocal interaction between the A- and B-centers selectively activate a vowel as a global synchronized pattern over two centers. The C-center evaluates the synchronized activities among the three formant regions to give the selective output of the category among the five Japanese vowels. Thus, a flexible ability of dynamical linking among features is achieved over the three centers. The capability in the present system was investigated for speaker-independent recognition of Japanese vowels. The system demonstrated a remarkable ability for the recognition of vowels very similar to that of human listeners, including misleading vowels. In addition, it showed stable recognition for unsteady input signals and robustness against background noise. The optimum condition of the frequency of oscillation is discussed in comparison with stimulus-dependent synchronizations observed in neurophysiological experiments of the cortex.
The authors present an oscillating recurrent neural network model for the recognition of Japanese vowels. The model consists of 4 layers: 1) an input layer which provides pre-processed frequency information, 2) an oscillatory hidden layer with local inhibition, 3) another oscillatory hidden layer with long-range inhibition and 4) a readout layer which classifies vowels using a winner-takes-all mechanism. Layers 1-3 each contain 32 units, where each unit is associated with one input frequency. The output layer contains one unit for each of the 5 vowels. Its readout multiplies weighted sums of layer 3 activities, so the output is also oscillatory. Each oscillatory unit in layers 2 and 3 consists of an excitatory element coupled with an inhibitory element; the pair oscillates, or becomes silent, depending on the input. The long-range connections in layer 3 are set manually based on known correlations between the formants (characteristic frequencies) of the individual vowels.
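To make the multiplicative readout concrete, here is a minimal sketch of how I read it, not the paper's actual equations. It assumes three formant regions per vowel (as in the abstract), binary readout weights, and sinusoidal layer 3 activity. The unit indices in `TOY_FORMANT_UNITS` and the helpers `toy_weights` and `toy_activity` are hypothetical placeholders, not the statistical vowel data used by the authors.

```python
import numpy as np

N_UNITS = 32                       # units per layer, as in the summary above
VOWELS = ["a", "i", "u", "e", "o"]

# HYPOTHETICAL unit indices, one per formant region (F1, F2, F3).
# Placeholders for illustration only, not the paper's vowel statistics.
TOY_FORMANT_UNITS = {
    "a": (7, 12, 25),
    "i": (2, 22, 29),
    "u": (3, 13, 24),
    "e": (5, 19, 26),
    "o": (5, 8, 27),
}


def toy_weights():
    """Binary readout weights of shape (n_vowels, 3 regions, N_UNITS)."""
    w = np.zeros((len(VOWELS), 3, N_UNITS))
    for v, vowel in enumerate(VOWELS):
        for r, unit in enumerate(TOY_FORMANT_UNITS[vowel]):
            w[v, r, unit] = 1.0
    return w


def toy_activity(vowel, in_phase=True, n_steps=600, cycles=6):
    """Layer 3 activity (n_steps, N_UNITS): the vowel's three units oscillate.

    With in_phase=False the three units are shifted by a third of a cycle
    each, i.e. they are active but not synchronized.
    """
    theta = 2 * np.pi * cycles * np.arange(n_steps) / n_steps
    x = np.zeros((n_steps, N_UNITS))
    for r, unit in enumerate(TOY_FORMANT_UNITS[vowel]):
        shift = 0.0 if in_phase else r * 2 * np.pi / 3
        x[:, unit] = 0.5 * (1 + np.sin(theta + shift))
    return x


def readout(activity, weights):
    """Product over formant regions of weighted sums; shape (n_steps, n_vowels)."""
    sums = np.einsum("tn,vrn->tvr", activity, weights)  # weighted sum per region
    return sums.prod(axis=2)                            # multiply across regions


def classify(activity, weights):
    """Winner-takes-all on the time-averaged oscillatory output (my assumption)."""
    mean_out = readout(activity, weights).mean(axis=0)
    return VOWELS[int(np.argmax(mean_out))]
```

Because the output is a product, a vowel unit responds only when all three of its formant regions are active at the same time. In this toy setup, `classify(toy_activity("i"), toy_weights())` picks `"i"`, and the same three units driven out of phase give a markedly smaller mean output. That is the sense in which such a readout rewards synchrony. Averaging over time before taking the maximum is my own choice here; the paper's decision criterion is not stated.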
In experiments the authors show that the classification of their network is robust across different speakers (14 men, 5 women, 5 girls, 5 boys): only 6 out of 145 trials were misclassified. However, they do not report exactly what their criterion for classification performance was. Remember that the output is oscillatory, and in the examples shown, alternative vowels sometimes produce bumps in the output during the time course of a vowel. They also report robustness to imperfect stimuli (formants varying within a vowel) and to noise (a superposition of 12 different conversations), but only single examples are shown.
Without being able to tell what the state of the art in neural networks was in 1994, I guess the main contribution of the paper is that it shows that vowel recognition may be robustly implemented using oscillatory networks. At least from today’s perspective the suggested network is a bad solution to the technical problem of vowel recognition, and even alternative algorithms at the time were probably better at it (there’s a hint in one of the paragraphs in the discussion). The paper is a good example of what was wrong with neural network research at the time: the models feel pretty arbitrary. Are the units in the network defined and connected the way they are only because these were the parameters that worked? Most probably. At least here the connectivity is partly determined by some knowledge of how the frequencies produced by vowels relate, but many other parameters appear to be chosen arbitrarily. Respect to the person who made it work. However, the results section is rather weak. They only tested one example of a spoken vowel per person and they don’t define classification performance clearly. I guess you could argue that it is a proof of concept of a possible biological implementation, but then again it is still unclear how this can be properly related to real networks in the brain.