Flexible vowel recognition by the generation of dynamic coherence in oscillator neural networks: speaker-independent vowel recognition.

Liu, F., Yamaguchi, Y., and Shimizu, H.
Biol Cybern, 71:105–114, 1994


We propose a new model for speaker-independent vowel recognition which uses the flexibility of the dynamic linking that results from the synchronization of oscillating neural units. The system consists of an input layer and three neural layers, which are referred to as the A-, B- and C-centers. The input signals are a time series of linear prediction (LPC) spectrum envelopes of auditory signals. At each time-window within the series, the A-center receives input signals and extracts local peaks of the spectrum envelope, i.e., formants, and encodes them into local groups of independent oscillations. Speaker-independent vowel characteristics are embedded as a connection matrix in the B-center according to statistical data of Japanese vowels. The associative interaction in the B-center and reciprocal interaction between the A- and B-centers selectively activate a vowel as a global synchronized pattern over two centers. The C-center evaluates the synchronized activities among the three formant regions to give the selective output of the category among the five Japanese vowels. Thus, a flexible ability of dynamical linking among features is achieved over the three centers. The capability in the present system was investigated for speaker-independent recognition of Japanese vowels. The system demonstrated a remarkable ability for the recognition of vowels very similar to that of human listeners, including misleading vowels. In addition, it showed stable recognition for unsteady input signals and robustness against background noise. The optimum condition of the frequency of oscillation is discussed in comparison with stimulus-dependent synchronizations observed in neurophysiological experiments of the cortex.


The authors present an oscillating recurrent neural network model for the recognition of Japanese vowels. The model consists of 4 layers: 1) an input layer which provides pre-processed frequency information, 2) an oscillatory hidden layer with local inhibition, 3) another oscillatory hidden layer with long-range inhibition and 4) a readout layer which classifies vowels using a winner-takes-all mechanism. Layers 1-3 each contain 32 units, where each unit is associated with one input frequency. The output layer contains one unit for each of the 5 vowels, and the readout mechanism is based on the multiplication of weighted sums of layer 3 activities, such that the output is also oscillatory. Each oscillatory unit in layers 2 and 3 consists of an excitatory element coupled with an inhibitory element; together they oscillate, or become silent, depending on the input. The long-range connections in layer 3 are determined manually based on known correlations between the formants (characteristic frequencies) of the individual vowels.
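To get an intuition for why a multiplicative readout rewards synchrony, here is a toy sketch. It is not the paper's model: the rectified sinusoids standing in for the summed layer 3 activity of each formant region, the 40 Hz frequency, and the names `region_activity` and `readout` are all assumptions made for illustration.

```python
import math


def region_activity(t, freq_hz, phase):
    """Rectified sinusoid standing in for the summed activity of one formant region (assumption)."""
    return max(0.0, math.sin(2 * math.pi * freq_hz * t + phase))


def readout(phases, freq_hz=40.0, duration=1.0, dt=0.001):
    """Time-averaged product of the region activities (hypothetical readout, not the paper's equations)."""
    n_steps = int(duration / dt)
    total = 0.0
    for i in range(n_steps):
        t = i * dt
        product = 1.0
        for phase in phases:
            product *= region_activity(t, freq_hz, phase)
        total += product
    return total / n_steps


# Three regions oscillating in phase versus spread evenly over the cycle.
in_sync = readout([0.0, 0.0, 0.0])
out_of_sync = readout([0.0, 2 * math.pi / 3, 4 * math.pi / 3])

print(f"in sync: {in_sync:.3f}, out of sync: {out_of_sync:.3f}")
```

With the phases spread evenly over the cycle, at least one region is silent at every moment, so the product vanishes, while the in-phase case gives a clearly positive output. A winner-takes-all stage would then simply pick the vowel unit with the largest such output.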

In experiments the authors show that the classification of their network is robust across different speakers (14 men, 5 women, 5 girls, 5 boys): only 6 out of 145 trials were misclassified. However, they do not report what exactly their criterion for classification performance was (remember that the output is oscillatory, and in the examples shown, alternative vowels sometimes produce bumps in the output during the time course of a vowel). They also report robustness to imperfect stimuli (formants varying within a vowel) and to noise (superposition of 12 different conversations), but only single examples are shown.

Without being able to tell what the state of the art in neural networks was in 1994, I guess the main contribution of the paper is that it shows that vowel recognition may be robustly implemented using oscillatory networks. At least from today’s perspective the suggested network is a bad solution to the technical problem of vowel recognition, and even alternative algorithms at the time were probably better at it (there’s a hint in one of the paragraphs in the discussion). The paper is a good example of what was wrong with neural network research at the time: the models give the feeling that they are pretty arbitrary. Are the units in the network defined and connected the way they are only because these were the parameters that worked? Most probably. At least here the connectivity is partly determined through some knowledge of how the frequencies produced by vowels relate, but many other parameters appear to be chosen arbitrarily. Respect to the person who made it work. However, the results section is rather weak. They only tested one example of each spoken vowel per person and they don’t define classification performance clearly. I guess you could argue that it is a proof of concept for a possible biological implementation, but then again it is still unclear how this can be properly related to real networks in the brain.

Tuning properties of the auditory frequency-shift detectors.

Demany, L., Pressnitzer, D., and Semal, C.
J Acoust Soc Am, 126:1342–1348, 2009


Demany and Ramos [(2005). J. Acoust. Soc. Am. 117, 833-841] found that it is possible to hear an upward or downward pitch change between two successive pure tones differing in frequency even when the first tone is informationally masked by other tones, preventing a conscious perception of its pitch. This provides evidence for the existence of automatic frequency-shift detectors (FSDs) in the auditory system. The present study was intended to estimate the magnitude of the frequency shifts optimally detected by the FSDs. Listeners were presented with sound sequences consisting of (1) a 300-ms or 100-ms random “chord” of synchronous pure tones, separated by constant intervals of either 650 cents or 1000 cents; (2) an interstimulus interval (ISI) varying from 100 to 900 ms; (3) a single pure tone at a variable frequency distance (Delta) from a randomly selected component of the chord. The task was to indicate if the final pure tone was higher or lower than the nearest component of the chord. Irrespective of the chord’s properties and of the ISI, performance was best when Delta was equal to about 120 cents (1/10 octave). Therefore, this interval seems to be the frequency shift optimally detected by the FSDs.


If you present 5 tones simultaneously, people cannot tell whether a subsequently presented tone was one of the 5 tones, or lay in the middle between any 2 of the 5 tones. On the other hand, people can judge whether a subsequently presented tone lay above or below any one of the 5 tones. This paper investigates the dependency of this effect on how much the subsequent tone lay above or below one of the 5 (here actually 6) tones (frequency shift), on how much the 6 tones were separated (Iv) and on the interstimulus interval (ISI) between the first set of tones and the subsequent tone. The authors replicated the mentioned previous findings and presented data suggesting that there is an optimal frequency shift at which subjects performed best in the task. They argue that this is at roughly 120 cents.
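For readers who do not think in cents: an octave is 1200 cents, so a shift of c cents corresponds to a frequency ratio of 2^(c/1200). The short sketch below converts between cents and frequency ratios. The function names and the 1000 Hz example tone are illustrative assumptions and are not taken from the paper.

```python
import math


def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval in cents (1200 cents = 1 octave)."""
    return 2 ** (cents / 1200)


def ratio_to_cents(ratio):
    """Interval in cents corresponding to a frequency ratio."""
    return 1200 * math.log2(ratio)


def shift_frequency(freq_hz, cents):
    """Frequency reached when moving up (positive) or down (negative) by `cents`."""
    return freq_hz * cents_to_ratio(cents)


# The reported optimum of about 120 cents, applied to a hypothetical 1000 Hz tone.
ratio_120 = cents_to_ratio(120)
shifted = shift_frequency(1000.0, 120)
print(f"120 cents = ratio {ratio_120:.4f}; 1000 Hz -> {shifted:.1f} Hz")
```

So the optimal shift of roughly 120 cents is a frequency change of about 7%, a bit more than a semitone (100 cents).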

I have several remarks about the analysis. First of all, the number of subjects in the two experiments is very low (7 and 4, each including the first author). While in experiment 1 the curves of d-prime over subjects look relatively consistent, this is not the case for larger ISIs in experiment 2. The main flaw of the analysis is that their suggestion of an optimal frequency shift of 120 cents is based on fitting an exponential function to 4, 5, or 6 data points, to which they also add an artificial baseline data point at d-prime=0 for frequency shift=0. The data point as such makes sense, as the judgement of a subject whether the shift was up or down must be random when the true shift was actually 0. Still, it feels wrong to include an artificial data point in the analysis. In the end, especially for large ISIs, the optimal frequency shifts estimated in this way vary so much between individual subjects that it seems pointless to conclude anything about the mean over (4) subjects.
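As a reminder of why d-prime=0 is the natural value for a zero shift, here is a minimal sketch of the standard signal-detection formula d' = z(H) - z(F). I assume here that "up" responses to upward shifts count as hits and "up" responses to downward shifts count as false alarms; the paper's exact convention may differ, and the function name is my own.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index d' = z(H) - z(F), with z the inverse standard normal CDF.

    Rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    have no finite z-score and make inv_cdf raise an error.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Pure guessing: "up" is answered equally often for both shift directions.
chance = d_prime(0.5, 0.5)

# Hypothetical listener who answers "up" on 80% of upward and 20% of downward shifts.
good = d_prime(0.8, 0.2)

print(f"chance: {chance:.2f}, hypothetical listener: {good:.2f}")
```

A subject who guesses has equal hit and false alarm rates and therefore d'=0, which is what the added baseline point encodes. It is a theoretical value, though, not a measurement.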

Sam actually tried to replicate the original finding on which this paper is based. He commented that it was hard to replicate in a large group of subjects and that he found differences between musicians and non-musicians (which shouldn’t exist for something that belongs to really basic hearing abilities). He also noted that subjects were generally quite bad at this task and that he found it impossible to make the task easier while maintaining that the 6 initial tones cannot be perceived individually.

The authors of the paper seem to repeatedly use subjects who perform particularly well in these tasks in their experiments.

It was noted in the group meeting that this research could be linked better to, e.g., the mismatch negativity literature, which is also concerned with the detection of deviations. In response, Sam pointed to the publication containing the original findings.