Download Encoding Sound Timbre in the Auditory System

IETE Journal of Research Vol 49, No 2, March-April, 2003, pp Encoding Sound Timbre in the Auditory System SHIHAB SHAMMAI Neural Systems Laboratory, Electrical and Computer Engineering Department, Institute for Systems Research, University of Maryland College Park, College Park, MD 20742 In a complex acoustic environment, several sound sources may simultaneously change their loudness, location, timbre, and pitch. Yet humans like many other animals are able to integrate effortlessly the multitude of cues arriving at their ears, and to derive coherent percepts and judgments about the different attributes of each source. This facility to analyze an auditory scene is conceptually based on a multi-stage process in which sound is first analyzed in terms of a relatively few perceptually significant attributes (the alphabet of auditory perception), followed by higher level integrative processes that organize and group the extracted attributes according to specific context-sensitive rules (the syntax of auditory perception) [1]. The sound received at the two ears is processed for attributes including source location, acoustic ambience, and source attributes such as tone and pitch, timbre and intensity. Decades of physiological and psychoacoustical studies [2,3] have revealed elegant strategies at various stages of the mammalian auditory system for the representation of the signal cues underlying auditory perception. This information has facilitated the development of biophysical models, mathematical abstractions, and computational algorithms of the early and central auditory stages with the aim of capturing the functionality, robustness, and enormous versatility of the auditory system [4]. Numerous groups have implemented such algorithms in software and hardware, and have evaluated them by comparing their performance to human performance and against a range of robustness and flexibility requirements. Furthermore, these auditory-inspired processing strategies have been utilized in a wide range of applications including acoustic diagnostic monitoring systems for machines and manufacturing processes, battlefield acoustic signal analysis, sound analysis and recognition systems, robust detection and recognition of multiple interacting faults, and detection and recognition of underwater transients and weak signals in low signal-to-noise ratio (SNR) in acoustically-cluttered environments [5,6]. We shall briefly review the auditory encoding of various sound attributes to illustrate the above ideas. We shall specifically focus on the percept of sound timbre: what acoustic cues are most intimately correlated with it? How are they represented at various stages of the auditory pathway? And how the abstracted auditory signal processing algorithms and representations can be applied to measure speech intelligibility to describe musical timbre and to analyze complex auditory scenes ? Indexing terms: xxxxxx AUDITORY ANATOMY AND PHYSIOLOGY When sound arrives to the ears as pressure waves in air or water, it causes the eardrum to vibrate. It in turn transmits the vibrations to the cochlea of the inner ear via an attached chain of three tiny bones situated in the middle ear. The cochlea is the key hearing organ [3]. It’s an elongated fluid filled cavity with elaborate sensory cells that are embedded into exquisitely sensitive membranes that extend along its entire length. The cochlea performs two fundamental functions (Fig1). First, it converts the sound pressure wave into a spatially ordered pattern of membrane vibrations. Specifically, the mechanics of the cochlear membranes Paper No 193-B; Copyright  2003 by the IETE. gradually change their electro-mechanical properties along its length in a manner that causes different places to vibrate best (or be tuned) to different frequencies. Consequently, sounds with very fast frequencies cause large membrane vibrations near the entrance of the cochlea, whereas slow frequencies drive best the end of the cochlea. In this way, the cochlea effectively separates the different components of the sound according to their frequency, sending them off to different places and creating a frequency-organized axis known as the tonotopic axis of the cochlea. Each complex sound, therefore, creates a unique spatial pattern of strong and weak vibrations along the tonotopic axis that reflects the amplitudes of its different frequency components – or its frequency spectrum. The second 2 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 Fig 1 A schematic model of the early stages in auditory processing. Cochlear Analysis: Sound enters the cochlea via the eardrum and the middle ear, initiating travelling wave displacement patterns on the basilar membrane. Vibrations due to low frequencies propagate and achieve their maximum amplitude further down the cochlea (red pattern) compared to high frequencies (green pattern). This mapping of sound frequency components onto different places along the cochlea creates the tonotopically ordered (spatial) axis of the auditory system. Basilar membrane vibrations are transduced into spatiotemporal responses on the auditorynerve by an array of hair cells distributed along the length of the cochlea. Auditory-Nerve Responses: The spatiotemporal response patterns on the auditory nerve due a two-tone stimulus (300 and 600 Hz). The ordinate represents the tonotopic axis (labeled by the CF at each location). The response at each CF represents the instantaneous probability of firing in the nerve fiber at that CF. Note that each component in the stimulus initiates a localized travelling wave pattern that abruptly ends creating a prominent discontinuity near the appropriate CF (marked by the arrow heads to the right of the panel). The stimulus responses depicted here are at a low sound level such that it does not saturate the nerve responses. The response amplitudes are therefore strongest near the CFs of the two tones, resulting in clear peaks in the average response curve (red plot to the right). function of the cochlea is to transform these vibrations into neural patterns on the auditory nerve to be later interpreted by the brain. This is accomplished by more than 3000 specialized sensory cells (called the hair cells) distributed along the cochlea. Hair cells possess channels that open and close rapidly modulating electric current flow into them. The currents initiate a cascade of electrochemical events culminating in neural signals on the auditory nerve that faithfully encode the phase (or are phase-locked to the time-course) of the vibrations at each point up to fairly high frequencies (4000 Hz in some mammals, and 9000 Hz in some birds). Since stronger vibrations also lead to more vigorous neural response, the auditory nerve in effect encodes the spectrum of the sound both by the level and by the phase-locked structure of the responses along the tonotopic axis. The first neural structure beyond the auditory nerve is called the cochlear nucleus. It consists of several anatomically elaborate subdivisions that receive parallel direct projections from the nerve. Multiple pathways emerge from the cochlear nucleus up through the midbrain and thalamus to the auditory cortex, each passing through different neural structures, repeatedly converging onto and diverging from other pathways along the way. This complexity reflects the rich and varied auditory percepts extracted from the sound, the integration of these percepts into a whole auditory sensation, and its final fusion with vision and other sensory modalities and with motor actions. While there is still much to be learned about exactly how all these neural pathways and structures process sound, it is nevertheless clear which signal cues the nervous system must extract so as to give rise to a few important auditory percepts that include loudness, pitch, timbre, and sound location. The acoustic spectrum extracted early in the auditory pathway at the cochlear nucleus, the first stage beyond the auditory nerve. It is then projected to the auditory cortex via a tonotopically organized pathway through the midbrain and thalamus. Much is known about the representation of the spectral profile in the early stages of cochlear filtering, the auditory nerve, and some subdivisions of the cochlear nucleus and the binaural Superior Olivary Complex of nuclei. Beyond that, however, the response properties and functional organization of the Inferior Colliculus, Medial Geniculate Body, and the cortex become more vague. As with other cortical sensory areas, the auditory cortex is subdivided, with a primary auditory field (AI) in the center, surrounded by a belt of secondary areas that are distinguishable both anatomically and physiologically [7]. The responses in AI have been recorded and analyzed for a wide range of acoustic stimuli, both natural and artificial, spectrally narrow and broad, species specific and otherwise [8-13]. SHIHAB SHAMMA : ENCOLDING SOUND TIMBRE Insights into the temporal and spectral selectivity of AI cells has been gained using a stimulus called ripples [14,15], which are broadband sounds with a sinusoidally modulated spectro-temporal envelope (Fig 2). In their construction and use, they serve the same function of regular sinusoids in measuring the transfer function of linear filter, except that the cortical filters are spectrally and temporally selective, ie, are two dimensional (and hence are the test ripple stimuli). AI cells respond well to ripples, and are usually selective to a narrow range of ripple parameters that reflect details of their transfer functions. By compiling a complete description of the responses of a unit to all ripple densities and velocities it is possible to inverse Fourier transform the transfer function and compute the SpectroTemporal Response Field (STRF) and hence characterize both the cells’ spectral, as well as dynamic response selectivity. Examples of STRFs recorded in AI are shown in Fig 3. They exhibit a wide range of spectral bandwidths and temporal dynamics, and hence provide indirect support to the existence of different response maps mentioned earlier stimuli [8,10]. It has also been shown that the shape of a unit’s STRF remains unchanged whether it is measured with one ripple at a time, or multiple ripples simultaneously, e.g, using the reverse-correlation method [15]. It remains to be seen, however, how this extensive variety of STRFs IN THE AUDITORY SYSTEM arises, and the relationship between them and the morphology and connectivity of different cell types in AI. COMPUTATIONAL AUDITORY MODELS The computational auditory model that we have utilized in our studies is one based on neurophysiological, biophysical, and psychoacoustical findings at various stages of the auditory system (see references [16-18] for a detailed description). It consists of the following basic stages (see Fig 4): 1. An early auditory stage which models the transformation of the acoustic signal into an internal neural representation referred to as an auditory spectrogram. This stage provides all the input to the following stages of cortical spectral analysis, pitch extraction, and binaural interactions; 2. A cortical multiscale analysis stage, which analyzes the spectrogram to estimate the content of its spectro-temporal features, specifically its spectral and temporal modulations, using a bank of modulation selective filters mimicking those described in the mammalian primary auditory cortex [19,20]. Fig 2 The ripple stimulus. (a) The “moving” ripple spectral profile. It consists of 100 tones per octave, equally spaced spanning 5 octaves, with a sinusoidal envelope that parametrized by peak density(Ω) in units of cycles/octave; and the constant velocity of travel (ω) in Hz. The spectrogram of such a ripple is shown to the bottom emphasizing the dynamic nature of the sinusoidal and broadband envelope of the spectrum. (b) Auditory Spectrogram of the sentence /come home right away/ spoken by a male. In principle, it can be decomposed by a Fourier transform into a collection of spectro-temporal sinusoidal functions called ripples. (c) Spectrograms of ripples with velocities ω = 4 to 12 Hz in two directions and ripple density Ω = 0.2–0.6 cycle/octave. Ripples over a wider range are typically used to measure the transfer functions of a unit. 4 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 Fig 3 Examples of STRFs in AI of the awake ferret. Red (Blue)) color indicates regions of strongly excitatory (suppressed) responses. The STRFs display a wide range of properties from temporally fast to slow, spectrally sharp to broad, with symmetric or asymmetric inhibition, and direction-selectivity or not. The early auditory stages Computing the auditory spectrogram (such as that shown in Fig 2b) consists of a sequence of three operations mimicking the early stages of auditory processing: (i) A frequency analysis stage that consists of a bank of bandpass filters equally spaced on a logarithmic frequency axis. The model employs 24 filters/octave over a 5 octave range. Details of the filter shapes and bandwidth are available in [16,17]. (ii) Filter outputs are half-wave rectified, and then lowpass filtered mimicking the action of the hair cells [21]. (iii) A first difference operation is applied across the channel array (mimicking the effect of a lateral inhibitory network [21,22], followed by a half-wave rectifier and Fig 4 Schematic of cortical multiscale decomposition. (a) Auditory spectrogram of a sound stimulus consisting of two groups of unresolved harmonics. (b) Schematic of a bank of cortical STRFs as spectrotemporal modulation selective filters. The filters vary their parameters along the scale and rate axes. The basic outline of one such a cortical filter is shown below. Not shown is the distribution of the filters along the tonotopic axis. (c) the representation of the output activity from the three dimensional bank of filters. Each spectrogram would generate a unique distribution pattern of activity along the three axes. SHIHAB SHAMMA : ENCOLDING SOUND TIMBRE a short-term integrator. A key component of the early auditory model is the nonlinear compressive stage in the hair cell model. When fully compressed, the cochlear filter outputs are turned into square waves that essentially preserve only the zerocrossings of the response. We have demonstrated in an earlier study [17,18] that these (1) zero-crossings are sufficient to preserve all speech information, (2) that we can reconstruct a very good quality version of the original signal given these waveforms, and (3) that the resulting spectrogram (as in Fig 5 ) is more enhanced and it displays a 6 dB SNR enhancement relative to the input signal. This significant advantage is added to the fact that the signal representation becomes completely independent of the overall level of the input signal, and hence no AGC stages are needed anymore. IN THE AUDITORY SYSTEM The cortical multiscale analysis stage The findings of a wide variety of STRFs covering a range of bandwidths, asymmetry, and CFs (Fig 3) suggests that they may as a population perform a multiscale analysis of their input spectral profile, as explained below [20]. Specifically, the cortical stage estimates the spectral and temporal modulation content of the auditory spectrogram by a bank of modulation selective filters. Each filter is broadly tuned over a range of temporal modulation rates (ω) and spectral resolutions or scales (Ω), and is centered at a specific CF along the tonotopic axis as illustrated in Fig 4. Consequently, a filter has a spectro-temporal impulse response (STRF) similar to that shown in the enlarged panel in Fig 4b. The filter output is computed by a convolution of its STRF with the input auditory spectrogram, i.e, it produces a modified spectrogram. Note that the spectral and temporal cross-sections of an STRF are typical of a Fig 5 The representation and extraction of stimulus periodicity of a harmonic complex. (a) The auditory spectra of two harmonic series of a 125 Hz fundamental. The low order harmonics are well resolved along the tonotopic axis. These harmonics evoke a pitch sensation at the fundamental frequency of the series (i.e, at 125 Hz). The high order harmonics (>8th harmonic or 1 kHz) are unresolved and they instead evoke a pattern that “beats” at the difference frequency of 125 Hz. These harmonics evoke a weaker pitch (called the “residue pitch”) at the beating frequency (i.e, 125 Hz in this case). (Left) The harmonic complex here contains all 40 harmonics and evokes a strong pitch at 125 Hz. (Right) The harmonic complex here lacks the lowest three harmonics (125, 250, 375 Hz); Nevertheless, it still evokes a strong pitch at the “missing fundamental” frequency of 125 Hz. (b) Two algorithms for extracting periodicity pitch. (Left) Schematic illustrates an auto-correlogram implementation. It presumes the existence of organized delay lines to compute the auto-correlation of the responses from each auditory nerve channel (or fiber) prior to computing the pitch. (Right) Schematic illustrates a template matching algorithm. It is spatial (spectral) in character, and presumes the existence of harmonic templates in the brain that are matched to incoming spectra so as to measure the pitch. 6 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 bandpass impulse response in having alternating excitatory (positive) and inhibitory (negative) fields. Consequently, the output is large only if the spectrogram modulations are commensurate with the rate, scale, and direction of the STRF, i.e, each filter will respond best to a narrow range of spectro-temporal modulations. A map of the responses across the filterbank, therefore, provides a unique characterization of the spectrogram, one that is sensitive to the spectral shape and dynamics over the entire stimulus. The operations used to compute the overall model outputs are as follows. For the early auditory stages (Figs 1, 3b), the auditory spectrogram is computed through a sequence of five stages representing the cochlear filter (y1), hair cell (y2), lateral inhibition (y3 , y4), and final spectrogram output (y5) : y3 (t,x) = ∂x y2 (t, x) *x ν(x) y1(t, x) = s(t) *t h(t; x) y4(t, x) = max (y3 (t, x), 0) y2(t, x) = g(∂t y1) (t,x)) *t ω(t) y5(t, x) = y4(t, x) *t µ(t; τ) where, s(t) is the sound, h(t) is the cochlear filter impulse response, and w(t), µ(t), υ(t) are various integration windows. For the cortical output (Fig 6), the STRF is denoted by: !"# = g!"# (t; Rc , θc) . h!"# (x; Ωc , φc) where g(.) and h(.) represent the temporal and spectral impulse response dimensions of the STRF, each parameterized by a center scale and rate, and their respective phases: h!"# (x; Ωc ,φc) = h(x; Ωc) cos φc + h (x; Ωc) sin φc g!"# (t; Rc , θc) = g(t; Rc) cos θc + g(t; Rc ) sin θc The final output r(.) is then computed from a convolution of the STRF and the auditory spectrogram y(.), r(t,x;Rc,θc,Ωc,θc = y(t,x) *xt [g!"# (t;Rc ,θc) . h!"# (x;Ωc ,φc)] = y(t,x) *xt [g.h cos φc cos φc + g.Àh cos φc sin φc+ g.h sin φc cos φc + g.Àh sin φc sin φc] which has a magnitude and phase. To display this output, we usually collapse one axis resulting in a threedimensional output as illustrated in Fig 4c. Finally the range of parameter values used in the model is based on data from cortical physiology and human psychoacoustics employing spectrally and temporally modulated stimuli [20]. Specifically, we assume that human subjects are primarily listening to scales of 0.25-8 cyc/oct, rates of 2-512 Hz, and frequencies up to 8 kHz. We have already computed the overall sensitivity of the total auditory model to spectral and temporal modulations and compared them favorably to psychoacoustical data from human subjects [20,23-24]. All stages of this model have been implemented in a MATLAB environment, with a wide variety of computational and graphical modules to allow the user the flexibility of constructing any appropriate sequence of operations. The package also contains demos and help files for users, together with default parameter settings making it easy learn for the new user. We have been making this software available for download to many of our colleagues in the auditory and speech processing community through our website at /www.isr.umd.edu/CAAR/ under “Publications”. BASIC PERCEPTUAL ATTRIBUTES OF SOUND There are at least four basic attributes of sound: its loudness, pitch, location and timbre. We summarize here the first three percepts and their neural bases. Timbre is addressed in more detail in the following sections. Loudness is perhaps the most intuitive of the above percepts. It is normally associated with increasing the volume (amplitude or intensity) of the sound. But there is another physical dimension that correlates strongly with loudness and that is the range of frequencies that make up the sound, or its bandwidth [2]. Loudness, therefore, can be approximately viewed as mediated by the total volume of neural activity on the auditory nerve, and hence increasing the activity either by raising sound intensity or bandwidth leads to a louder sound. The sensation of pitch is also a readily understood attribute of sound that is normally associated with musical scales and melodies, or with the low and high voices of males and females. Pitch strongly correlates with the repetition rates or frequencies in a sound. However, unlike loudness, the neural basis of pitch is a much more contentious topic, largely because “pitch” is an imprecise term that is ascribed to multiple sensations having distinct origins, and most likely different neural mechanisms. For example, the pitch of a pure tone is directly related to its frequency, and is felt over a very broad range of approximately 50 Hz to 20,000 Hz; from a neural perspective, this percept is readily encoded by the location of best tone-evoked response along the tonotopic axis. A second example is so-called rattle-pitch, a relatively weak sensation that is correlated with the modulation rate of the amplitude of a noise or a tone. This pitch is typically heard only up to a few hundred hertz (<400 Hz), and is likely mediated by neural responses (on the auditory nerve and beyond) that are explicitly phase-locked to the modulation rate. The final and most salient sensation of pitch is that of musical instruments and voices. This percept exists over a SHIHAB SHAMMA : ENCOLDING SOUND TIMBRE IN THE AUDITORY SYSTEM Fig 6 Schematic of the STMIT computation. (a) The clean and noisy speech signals are given as inputs to the auditory model. Their outputs are normalized by the base signals as explained in [50]. The right panel shows the cortical output of both clean and noisy inputs. These cortical patterns are then used to compute the MTF and STMI. (b) Effect of White Noise and Reverberation on the Global MTF. (TOP) The global (clean) MTF of the auditory model computed from all ripples, summarized by the rate-scale plot (i.e, collapsing the frequency axis x). The bottom three rows of figures illustrate the progressive distortion of the global MTF (rate-scale plot) with increasing levels of white noise, reverberation, and combined additive white noise and reverberation [50]. 8 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 moderate range of frequencies (<4000 Hz), and is exclusively associated with harmonic sounds composed of frequencies that are integer multiples of a common fundamental frequency (Fig 5). An interesting fact about this pitch is that its value remains that of the fundamental frequency even if the fundamental component in the sound is missing (and hence the common description of this percept as the “pitch of the missing fundamental”). That is, the pitch value is derived from the harmonic relationship between the components, and not simply from the fundamental frequency per se. The neural basis of this percept remains uncertain. One plausible theory proposes that the brain stores (or learns) harmonic templates of all pitch values, and the percept is derived according to which templates match best the spectrum of the incoming sound [25-28]. Another theory holds that the pitch is computed directly from the incoming sound without resort to any templates, and as such it does not explicitly distinguish between this percept and the “rattle” pitch described earlier [29-30]. Both these theories (and many other variations) can account for most relevant psychoacoustical findings; the major missing piece in the pitch puzzle remains the lack of firm biological understanding of the mechanisms underlying pitch processing in general. Finally we consider the perception of sound location in space, an auditory function that is critical for survival, as in escaping predators, following prey, and finding mating partners. Despite enormous variability in the nature of cues and mechanisms involved in this task, there are some principles common to most species. First is the use of differences between the sound impinging on the two ears, especially the differences in the time-of-arrival and in the sound intensity [31-32]. Specifically, when a sound source is centered in front of the head, it arrives to the two ears simultaneously. If it moves to the right, the path to the right (relative to the left) ear shortens, and hence the sound arrives sooner to the right ear by a fraction of a thousandth of a second, a detectable difference for many animals (especially those with larger heads). An analogous disparity in sound level occurs when a source is closer to one ear, especially for high frequency sounds where the head shadow is more effective. Most animals have evolved neural mechanisms to detect, extract, and utilize these interaural differences to locate the sound source. For example, there are specialized coincidence cells in all mammals and birds that receive phase-locked inputs from the two ears, and are tuned to detect a particular time-delay between them. In barn owls, such cells are highly organized topographically so as to create a best time-delay axis. Another commonly found kind of cell is one tuned to detect specific level differences between the two ears. The second important localization principle concerns the use of special spectral cues (from one or two ears) to locate the elevation of a sound source or to characterize the acoustic environment [33]. These spectral cues are usually introduced by auxiliary structures such as the pinnea (the external ear), shoulders, and nearby walls and floors. In the case of the pinnea, its highly convoluted cavities function as mini-resonators that absorb or amplify certain sound frequencies depending on the direction of sound arrival. This is highly useful because when a source is located on the midline, the two paths to the ears are equal regardless of source elevation, and hence the only way to localize it is based on pinnea-originated spectral cues. In mammals, there is some evidence that specialized neural pathways have evolved as early as the cochlear nucleus to detect these unique cues and process them in conjunction with the binaural cues. Another useful function of certain spectral cues is the information they convey about the reverberant qualities of a room, and hence indirectly its size and material structure. This issue is extremely important in the architectural design of music halls and auditoriums, and in the assessment of the quality of communication channels and equipment. AUDITORY REPRESENTATION OF TIMBRE Timbre is the most difficult of all sound attributes to characterize uniquely and systematically the way we can do that for pitch, loudness, and location [2]. The difficulty stems from the broad nature of cues that affect this percept, ranging from the spectral shape, its temporal modulations, and onset dynamics. Therefore, the availability of a quantitative descriptor of the auditory spectrum and its dynamics as in multiscale cortical model outlined above allows us to utilize timbre in a wide range of applications. They include speech recognition (where timbre is the key attribute of a speech token to be recognized), and musical synthesize and analysis where the accurate characterization of timbre is crucial for rendering the sound of a specific instrument. The cortical multiscale decomposition along the spectral dimension is in fact analogous to the well-known cepstral representation [34] with two fundamental distinctions: the cortical representation is local (rather than global) and is complex (i.e, it includes the phase information). Details of this and other connections between the cortical representation and other commonly used ASR features can be found in [19]. Along the temporal dimension, the cortical analysis generates an analogous decomposition into “rate” coefficients that are also local and complex. In fact, the two dimensions are decomposed simultaneously by the STRF described earlier reproducing a complex output function with a magnitude and phase. This representation can then be used as are the cepstral coefficients in ASR systems, in voice recognition, or in the assessment of speech intelligibility as we discuss next. Quantitative assessment of speech intelligibility The perception of speech is critically dependent on the faithful representation of spectral and temporal SHIHAB SHAMMA : ENCOLDING SOUND TIMBRE modulations in the auditory spectrogram [35-42]. Therefore, an intelligibility index which reflects the integrity of these modulations can be effective regardless of the source of the degradation. Currently, The Articulation Index (AI) and Speech Transmission Index (STI) are the most widely used predictors of speech intelligibility [43-49], and have proven to be extremely valuable in a wide range of applications ranging from architectural designs to vocoder characterization [47-48]. However, both indices are based on specific assumptions of the noise conditions underlying the loss of intelligibility, and hence do not yield accurate results in noisy conditions that involve nonlinear and spectrotemporally inseparable distortions such as phase-jitter [50]. In contrast, a Spectro-Temporal Modulation Index (STMI) can be derived based on the cortical model that takes into account both the spectral and temporal dimensions of the speech signal explicitly. Unlike the AI and STI, it requires no assumptions to be made about the speech signal or the disruptive noise, and hence can in principle be applied in any detrimental circumstance. Psychoacoustic tests have been carried out to confirm the correspondence between the STMI values and actual intelligibility results from human subjects [50]. The STMI is defined as the amount of changes in the multiscale cortical output as distortions are applied to the sound signal. Implied in such a definition is the existence of a clean reference output that is relevant to the task (Fig 6a). For instance, if the task is to characterize a channel (e.g, a recording or transmission medium, a room, or a vocoder), then one can measure how modulations of different spectral scale and temporal rates are affected by noise, reverberation, or other forms of interference during transmission through the channel (Fig 6b). This is IN THE AUDITORY SYSTEM equivalent to characterizing the channel by its spectrotemporal modulation transfer function (MTF) under the specific noise conditions, and noting how it differs from the clean MTF of the model. Alternatively, the task may be to measure the intelligibility of a noisy token of speech or other complex sounds without reference to the specific clean version of the token. In this case, one can still compare the deterioration in the modulations of the test signal to those expected from a similar (not identical) clean sample (or template). Details of the mathematical derivation of the STMI, and results of intelligibility tests with a wide variety of distortions are available in [50]. Multiscale clustering of musical timbres An example of the utility of the cortical representations is in the non-supervised hierarchical clustering of musical timbres according to their spectral similarities at each scale. Such an algorithm is described in detail [51] and is depicted schematically in Fig 7a. Two types of insights are gained from this kind of analysis: (1) The clustering of the spectral profiles at the same resolution is based on features that “belong together”. This, together with the hierarchical organization between features at different resolutions, is potentially very useful in isolating new shape features that may not be directly related to the source (e.g, the vocal tract shape for speech). (2) It is also likely that applying the clustering algorithm on spectral profiles from different databases (e.g, speech, music, animal vocalizations, and natural environmental sounds) will yield general principles or types of features that underlie these classifications regardless of the exact nature of the database. To illustrate the nature of the results from such an approach, we computed the multiscale representation of a large number of spectral profiles of musical sounds from Fig 7 Clustering of Musical instruments based on their multiscale spectral representation. (s) Schematic of the multiresolution TSVQ algorithm by splitting cells based on different resolution data. (b) Example of results of TSVQ with sounds generated by a variety of musical instruments. 10 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 several instruments playing arbitrary notes. Then, at each resolution, this signal space is partitioned into different clusters or cells (or collection of data vectors) which are determined by repeated application of the Linde-BuzoGray (LBG) algorithm [51]. The algorithm is first applied to the coarsest resolution of the data vectors, which results in relatively few clusters or classes which correspond to types of instruments or modes of resonance (Fig 7b). However, it is the clustering performed next at the finer resolution which yields the newest insights. Specifically, each cluster (or equivalence class of coarse representations) is split next with another round of repeated applications of the LBG algorithm, as illustrated in Fig 7a. Thus, a treestructure emerges at which each new layer corresponds exactly to partitions based on the next finer resolution data. Rules and criteria for performing this partitioning have already been developed in detail, and have been applied in a variety of engineering tasks such as the analysis of ship radar-returns [51]. Auditory Scene Analysis The grouping and segregation of different sound streams are key ingredients in our ability to detect and follow one speaker in the presence of numerous interfering voices, or in the organization of a complex auditory scene containing many competing sound sources. It is known that several cues play a critical role in this process [1]. For instance, we tend to group together all harmonics of a common fundamental on the correct assumption that natural sounds tend to be harmonic [52-53]. Another important cue is the synchronicity of the onsets of all components belonging to one voice, and to a lesser extent their common offsets [1]. Another set of well known grouping principles concerns the tendency of human listeners to stream together a sequence of tones as they become temporally and spectrally closer to each other [1]. A final useful perceptual ability of the auditory system is the so-called “Continuity Illusion” in which a temporally briefly interrupted speech signal is perceived as continuous if the gap is filled with relatively loud white noise [54]. Can we explain (and hence exploit) these auditory abilities using the multiscale cortical model, or are these heuristic rules due to more complex cognitive phenomena. The multiscale cortical model might potentially yield its biggest reward in providing a deeper understanding of streaming phenomena of sound sequences into different rhythms and tracks. There is now accumulating evidence that spectral proximity and overlap among the sound elements is only one among many factors that govern this perceptual process. Others include the dynamics, the speed of the sequence, and the pitch of its elements. Perhaps the most efficient way to characterize all these factors is to view them as influencing the “perceptual distance” between the sequence notes. To the extent that the cortical model takes into account all spectral and temporal cues in its representation of sound timbre, it can be used to predict the relevant perceptual distances and hence whether streaming is likely to occur, and under what conditions it can be manipulated. Specifically, we predict that the degree to which two sounds stream into different tracks is directly proportional to the distance between their cortical representation – or more loosely speaking, their timbre differences, and that pitch differences per se may only play an indirect role through the inevitable spectral differences they cause. That is why it is readily possible to stream two different complexes at the same pitch. Finally, extending this approach to speech signals, it is possible to have these streaming rules and cues integrated into algorithms that can “stream” multi-talker speech into separate streams based on voice timbres, pitch, and de-synchronized onsets and offsets. Besides their obvious applications in the design of robust speech recognition systems, these algorithms may shed light on closely related questions such as: Why does speech comprehension suffer much more from competing speech than from white noise? And why does it dramatically recover when the competing speech is separated from the target speech along different perceptual dimensions such as pitch and spatial location? [55]. SUMMARY Apart from the four basic percepts discussed here, there are other auditory attributes and tasks in the animal world that we have not touched upon, and that are as important to these animals as the perception of sound timbre, pitch, and location to us. These include the ultrasonic echolocation in bats and dolphins that enable them to locate prey and avoid obstacles in cluttered environments, the infrasonic (very low frequency) communication signals among many terrestrial animals such as elephants, and the unique auditory adaptations of many animals that help them deal with the limitations of their small size (insects) or their aquatic environments (fish). Finally, we have not considered the key questions of how the auditory system assembles all these disparate percepts into an integrated whole that identifies the sound source as an entity amidst the clutter of other simultaneous sound sources in the environment. And, how do auditory percepts become integrated with visual, motor, and other neural processes of attention and memory so as to give rise to the typical active auditory behaviors that we normally associate with this amazing modality. Answers to these and countless other auditory mysteries are the promise of future exciting research. REFERENCES 1. Bregman, A S, Auditory scene analysis: The perceptual organization of sound. Cambridge, Massachusetts: MIT Press, 1991. SHIHAB SHAMMA : ENCOLDING SOUND TIMBRE 2. Moore B, An Introduction of the Psychology of Hearing, Academic Press, London, 1989. 3. Pickles J O, An Introduction to the Physiology of Hearing, Academic Press, 1988. 4. Auditory Computations, Ed by H Hawkins, E T McMullen, A Popper & R Fay, Springer Verlag, pp 221-270. 5. Center for Auditory and Acoustic Research www.isr.umd.edu/CAAR IN THE AUDITORY SYSTEM 20. T Chi, Y Gao, M C Guyton, P Ru, & SA Shamma, Spectrotemporal modulation transfer functions and speech intelligibility. The Journal of the Acoustical Society of America, 106(5), pp 2719-2732, 1999. 21. S A Shamma, Methods of Neuronal modeling, chapter: Spatial and temporal processing in the auditory system, MIT Press, second edition, pp 411-460, 1998. 6. S Greenberg, The ear as a speech analyzer. J Phonetics, pp 139-150, 1988. 22. S A Shamma, Speech Processing in the Auditory System: II. Lateral inhibition and the central processing of speech evoked activity in the auditory nerve, J Acoust Soc Am, 78, pp 1622-1632, 1985. 7. Clarey J et al, Physiology of thalamus and cortex, The Mammalian Auditory Pathway: Neurophysiolog, Ed D Webster, et al, Springer Verlag, pp 232-334,1992. 23. N Viemeister, Temporal modulation transfer functions based upon modulation thresholds. J Acoust Soc Am, 66(5), pp 1364-1380, 1979. 8. Shamma S, J Fleshman & P Wiser, Response Area Organization in the Ferret Primary Auditory Cortex, J Neurophys, 69(2), pp 367-383,1993. 24. D Green, Frequency’ and the detection of spectral shape change Auditory Frequency Selectivity. Edited by B C J Moore and RD Patterson. Plenum Press, Cambridge, pp pp 351-359, 1986. 9. Evans, E & Whitfield I, Classification of unit responses in auditory cortex of the unanesthetized and unrestrained cat, J Physiol 171, pp 476-493, 1964. 10. Schreiner C & Urbas J, Representation of Amplitude Modulation in the Auditory Cortex of the Cat. I: The anterior field, Hear Res, 21, pp 227-241, 1988. 11. Middlebrooks J C et al, Binaural response-specific bands in primary auditory cortex of the cat: topographical organization orthogonal to isofrequency contours, Brain Res, 181, pp 31-48, 1980. 12. Nelken I, Versnel H, Responses to linear and logarithmic frequency-modulated sweeps in ferret primary auditory cortex, Eur J Neurosci, 12(2): pp 549-562, 2000. 13. Wang X, et al, Representation of Species-Specific Vocalizations in the Primary Auditory Cortex of Common Marmosets: Temporal and Spectral Characteristics. J Neurophysiol, 74, pp 2685-2706,1995. 14. N Kowalski, et al, Analysis of dynamic spectra in ferret primary auditory cortex: I. Characteristics of single unit responses to moving ripple spectra, J Neurophysiol, 76(5) 3503-3523, 1996. 15. Klein D J, et al, Robust spectro-temporal reverse correlation for the auditory system: Optimizing stimulus design, J Comput Neuroscience, 9, pp 85-111,1999 16. R. Lyon, and S. A. Shamma. (1996) Auditory Computation, volume 6 of Springer Handbook of Auditory Research, chapter: Auditory representations of timbre and pitch, Springer-Verlag, New York, Inc, pp 221-270, 1996. 17. K Wang & S A Shamma, Self-normalization and noiserobustness in early auditory representations. IEEE Transactions on Speech and Audio Processing, 2(3), pp 421-435, 1994. 18. X Yang, K Wang & S A Shamma, Auditory representations of acoustic signals. IEEE Transactions on Information Theory, 38(2), pp 824-839, 1992. 19. K Wang & S A Shamma, Spectral shape analysis in the central auditory system. IEEE transactions on Speech and Audio Processing, 3(5), pp 382-395, 1995. 25. Goldstein J, An optimum processor theory for the central formation of pitch of complex tones, J Acoust Soc Am, 54, pp 1496-1516, 1973. 26. Terhardt E, Calculating Virtual Pitch, Hearing Res 1, pp 155-182, 1979. 27. F Wightman, A Pattern Transformation Model of Pitch, J Acoust Soc Am, 54, pp 397-406, 1973. 28. S A Shamma & Klein D, The case of the missing pitch templates: How harmonic templates emerge in the early auditory system, J Acoust Soc Am, 107, pp 2631-2644, 2000. 29. Licklieder J, A Duplex Theory of Pitch Perception, Experientia 7, pp 128-133, 1951. 30. Slaney M & Lyon R, On the importance of time - A temporal representation of sound. In M Cooke, S Beet & M Crawford (Eds) Visual Representations of Speech Signals, J Wiley and Sons, Sussex England, 1993. 31. Jeffress A, A place theory of sound localization, J Comp Physiol Psych, 61, pp 468-486, 1948. 32. Durlach N & Colburn S, Binaural phenomena in Handbook of Perception, Ed by E C Carterette and M P Friedman, pp 365-466, 1978. 33. J Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, Revised Edition (MIT Press, Cambridge, MA), 1997. 34. Rabiner L, Schafer R, Digital processing of speech signals (Prentice Hall, Englewood Cliffs, NJ), 1978. 35. R Drullman, J Festen & R Plomp, Effect of envelope smearing on speech perception, The Journal of the Acoustical Society of America, 95(2): pp 1053-1064, 1994. 36. T Dau, D Püschel & A Kohlrausch, A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. The Journal of the Acoustical Society of America, 99(6): pp 3615-3622, 1999. 37. H Hermansky & N Morgan, RASTA processing of speech. IEEE transactions on speech and audio processing, 2(4): pp 12 IETE JOURNAL OF RESEARCH, Vol 49, No 2, 2003 578-589, 1994. 38. R Shannon, F-G Zeng, J Wygonski, V Kamath & M Ekelid, Speech recognition with primarily temporal cues. Science, (270): pp 303-304, 1995. 39. S Greenberg, T Arai & R. Silipo, Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, in press, 1998 40. T Arai, M Pavel, H Hermansky & C Avendano. () Intelligibility of speech with filtered time trajectories of spectral envelopes, Proceedings of ICSLP, pp 2490-2492, 1996. 41. S Greenberg & T Arai, Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the Internatioal Congress on Acoustics, Seattle, pp 2677-2678, 1998. 42. K Saberi & D R Perrott, Cognitive restoration of reversed speech, Nature, 398, p 760, 29 April 1999. 43. American National Standards Institute, New York. American national standard methods for calculation of the speech intelligibility index. ANSI S3.5, 1997. 44. T Houtgast & H J M Steeneken, Predicting speech intelligibility in rooms from the modulation transfer function, I. General room acoustics, Acoustica, 46: 60-72, 1980. 45. K D Kryter, Methods for the calculation and use of the articulation index, The Journal of the Acoustical Society of America, 34(11), pp 1689-1697, 1962. 46. JS Bradley, Predictors of speech intelligibility in rooms. The Journal of the Acoustical Society of America, 80(3), pp 837-845, 1986. 47. T Houtgast & H J M Steeneken, A review of the MTF concept in room acoustics and its use for estimating speech intelligibility in auditoria. The Journal of the Acoustical Society of America, 77(3), pp 1069-1077, 1985. 48. H J M Steeneken, & T Houtgast, A physical method for measuring speech-transmission quality. The Journal of the Acoustical Society of America, 67(1), pp 318-326, 1979. 49. Sussman, E, Ceponiene, R, Shestakova, A, Naatanen, R & Winkler I, Auditory stream segregation processes operate similarly in school-aged children and adults, Hear Res, 153, pp 108-114, 2001. 50. Mounya Elhilali, Taishih Chi, & Shihab A Shamma, A Spectro-Temporal Modulation Index (STMI) for assessment of speech intelligibility, Speech Communication (in press), 2002. 51. J Baras & S Wolk, Efficient organization of large ship radar databases using wavelets and structured vector quantization, Proc of the 27th Asilomar Conference on Signals, Systems, and Computers, 1993. 52. Culling JF, Darwin CJ, Role of timbre in the segregation of simultaneous voices with intersecting F0 contours, Percept Psychophys, 54(3), pp 303-309, 1993. 53. Assmann, P F, Fundamental frequency and the intelligibility of competing voices, Proceedings of the 14th International Congress of Phonetic Sciences, pp 179-182, 1999. 54. Bashford J, Meyers M Brubaker B& Warren R, Illusory continuity of interrupted speech: Speech rates determines durational limits, J Acoust Soc Am, 84(5), pp 1635-1638, 1988. 55. Brungart D S, Informational and energetic masking effects in the perception of two simultaneous talkers, J Acoust Soc Am, 109, pp 1101-1109, 2001. AUTHOR Shihab Shamma received his BS degree in 1976 from Imperial College, in London, UK He received his MS and PhD degrees in Electrical Engineering from Stanford University in 1977 and 1980, respectively. Prof Shamma received his MA in Slavic Languages and Literature in 1980 from the same institution. Dr Shamma has been a member of the University of Maryland faculty since 1984. He has been associated with the Systems Research Center since its inception in 1985, and received a joint appointment in 1990. Prof Shamma also holds a joint appointment with the University of Maryland Institute for Advanced Computer Studies. Prof Shamma has also worked at the National Institutes of Health and Stanford University. Prof. Shamma’s research interests include biological aspects of speech analysis and neural signal processing and has a number of publications to his credit.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Encoding Sound Timbre in the Auditory System