This support page provides additional information and demonstrations related to the paper:
ANALYSIS/SYNTHESIS OF HARMONIC SOUNDS BASED ON AM/FM SCALING OF A PROTOTYPE SIGNAL
submitted to IEEE Transactions on Audio, Speech and Language Processing.
Abstract: The paper describes an enhancement of the harmonic sinusoidal modeling technique
that extends its ability to efficiently represent sounds that are not purely tonal.
Reconstruction accuracy is improved by introducing additional
modulating components into the partial instantaneous frequencies and instantaneous amplitudes.
For this purpose, a classical heterodyne analysis technique is combined with principal component
analysis of partials. The sound is represented by a discrete harmonic envelope and a narrowband
prototype signal that carries the modulations. We show by experiments that the model is capable of
reproducing transients and a significant part of mechanical noise in musical sounds and yields
good subjective quality at an SNR of 15 dB to 26 dB.
In the paper we propose a model for harmonic audio signals that may be used in object-based low bit rate coding. The high-level, low-resolution spectral characteristics of the signal are represented by a two-dimensional harmonic envelope (HE, shown below, middle), a complex-valued structure similar (but not identical) to the data delivered by Beauchamp's harmonic phase vocoder and Serra's harmonic sinusoidal model. High temporal resolution is provided by the second part of the model, the prototype signal (below, right).
Advantages: The fundamental advantage of this representation is that its two parts are much less complex than the original signal and can be encoded very efficiently. In the paper we describe the analysis and synthesis process in detail. On this web page we demonstrate that our model accurately represents the significant acoustic features of the sound: its pitch, texture, timbre, and tonal/noisy character. In a traditional sinusoidal model (SM) this would require a high data rate (a small time shift between consecutive frames, or an excessively high number of partials), which is prohibitive in compression applications. In our model the data rate is low (partial control parameters are typically subsampled 1:500 or 1:1000). However, the inclusion of the prototype signal (which is mostly narrow-band) allows efficient representation of the dominant low-level fluctuations of the instantaneous parameters, the instantaneous frequency (IF) and instantaneous amplitude (IA), which are responsible for the acoustic features mentioned above.
How it works: The core of the system is a nearly phase-coherent heterodyne analyzer (the big grey box in the
scheme shown above) consisting of a bank of complex harmonic oscillators, individual multipliers and lowpass filters.
The harmonic oscillators operate at integer multiples of the instantaneous fundamental frequency (IF0[n])
so that each individual channel deals with a single harmonic partial, yielding its baseband representation
(the complex envelope) at the corresponding output. A single output signal for the first partial of the glissando sound
is shown below, left.
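One channel of such an analyzer can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the function name `heterodyne_partial`, the Hann-windowed FIR lowpass, its length, and the cutoff at half the mean fundamental are all choices made for this example.

```python
import numpy as np

def heterodyne_partial(x, if0_hz, k, fs, taps=255):
    """Shift the k-th harmonic of x to baseband and lowpass-filter it."""
    # Instantaneous phase of the fundamental: running sum of IF0[n].
    phi0 = 2.0 * np.pi * np.cumsum(if0_hz) / fs
    # Multiply by a complex oscillator running at k times the fundamental.
    mixed = x * np.exp(-1j * k * phi0)
    # Assumed lowpass: Hann-windowed sinc, cutoff at half the mean fundamental.
    cutoff = 0.5 * np.mean(if0_hz) / fs          # cycles per sample
    n = np.arange(taps) - (taps - 1) / 2
    h = 2.0 * cutoff * np.sinc(2.0 * cutoff * n) * np.hanning(taps)
    h /= h.sum()                                  # unity gain at DC
    # Factor 2 makes the envelope magnitude equal the partial amplitude.
    return 2.0 * np.convolve(mixed, h, mode="same")

# Synthetic glissando with two harmonics of known amplitude and phase.
fs = 8000
if0 = np.linspace(200.0, 220.0, fs)
phi0 = 2.0 * np.pi * np.cumsum(if0) / fs
x = np.cos(phi0) + 0.5 * np.cos(2 * phi0 + 0.3)
c1 = heterodyne_partial(x, if0, 1, fs)
c2 = heterodyne_partial(x, if0, 2, fs)
```

Away from the signal edges, `abs(c1)` stays close to 1 and `c2` close to 0.5·exp(j0.3), i.e. each channel isolates the complex envelope of its own partial.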
Each of the complex-valued output signals from the analyzer is subsequently decimated (1:R). This low-resolution
information is stored in the Harmonic Envelope (a matrix in which each row corresponds to one partial and each
column corresponds to one decimated time sample). A single row of the HE obtained for the glissando sound is shown below,
in the middle.
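The layout of the HE matrix can be illustrated with a short sketch. The block averaging used here as an anti-aliasing step is an assumption made for this example (the decimation filter is not specified on this page), and `harmonic_envelope` is a hypothetical name.

```python
import numpy as np

def harmonic_envelope(baseband, R):
    """Decimate K baseband partials by R; rows are partials, columns are frames."""
    baseband = np.asarray(baseband, dtype=complex)
    K, N = baseband.shape
    frames = N // R                               # drop an incomplete last block
    blocks = baseband[:, :frames * R].reshape(K, frames, R)
    # Assumed anti-aliasing: average each block of R samples.
    return blocks.mean(axis=2)

# Three constant complex envelopes, 4000 samples each, decimated 1:500.
bb = np.vstack([np.full(4000, 1 + 2j), np.full(4000, 0.5j), np.zeros(4000)])
he = harmonic_envelope(bb, 500)                   # shape (3, 8)
```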
For each of the baseband signals, its high-frequency residual (the remainder after subtracting the upsampled version of
the decimated signal) is obtained (above, right). The residuals of a single sound are subjected to Principal Component
Analysis (PCA). The aim of this analysis is to identify and extract the common residual amplitude modulation that enhances
the representation stored in the HE.
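The residual computation for one partial might look like the sketch below. Block averaging for decimation and linear interpolation for upsampling are assumptions made for this example; the interpolator is not described on this page.

```python
import numpy as np

def residual(baseband_k, R):
    """High-frequency remainder of one baseband partial after 1:R decimation."""
    c = np.asarray(baseband_k, dtype=complex)
    N = len(c)
    frames = N // R
    # Coarse version: one value per block of R samples (assumed block mean).
    coarse = c[:frames * R].reshape(frames, R).mean(axis=1)
    # Upsample by linear interpolation between block centres (assumed).
    centres = np.arange(frames) * R + (R - 1) / 2
    n = np.arange(N)
    up = np.interp(n, centres, coarse.real) + 1j * np.interp(n, centres, coarse.imag)
    return c - up

# A slow ramp plus a fast ripple: the residual keeps only the ripple.
n = np.arange(1000)
ripple = 0.1 * np.cos(2 * np.pi * 0.05 * n)
slow = 1.0 + 0.001 * n
res = residual(slow + ripple, 100)
```

Between the first and last block centres the slow ramp is removed exactly, so the residual coincides with the ripple there.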
PCA represents the collection of input residual signals ak[n] as a linear combination of other signals,
which we may consider a local orthogonal basis. Only one of these signals (the one corresponding to the principal eigenvector
of the covariance matrix) is preserved. We denote it by a0[n]. In order to reverse the linear transformation
in the decoder we also need a short vector of complex-valued scaling constants, G1.
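One way to obtain a real-valued a0[n] together with complex-valued constants G1 is sketched below. The construction is an assumption made for this example, not necessarily the one used in the paper: the real and imaginary parts of the residuals are stacked into one real data matrix, and the principal eigenvector of its covariance matrix is split back into complex scaling constants. The residuals are assumed to be zero-mean.

```python
import numpy as np

def principal_modulation(residuals):
    """Rank-1 PCA approximation: residuals[k, n] ~ G1[k] * a0[n], with a0 real."""
    A = np.asarray(residuals, dtype=complex)
    K = A.shape[0]
    # Assumed construction: stack real and imaginary parts as 2K real rows.
    S = np.vstack([A.real, A.imag])
    C = S @ S.T / S.shape[1]          # covariance (rows assumed zero-mean)
    w, V = np.linalg.eigh(C)          # eigenvalues in ascending order
    v = V[:, -1]                      # principal eigenvector
    a0 = v @ S                        # real-valued principal component
    G1 = v[:K] + 1j * v[K:]           # complex scaling constants
    return a0, G1

# Residuals that share one real modulation m[n] with complex gains g[k].
rng = np.random.default_rng(0)
m = rng.standard_normal(2000)
g = np.array([1.0 + 0.5j, -0.3 + 0.2j, 0.1 - 0.7j])
A = g[:, None] * m[None, :]
a0, G1 = principal_modulation(A)
approx = G1[:, None] * a0[None, :]    # decoder-side reversal of the transform
```

The sign of the eigenvector is arbitrary, but it cancels in the product G1[k]·a0[n], so the decoder-side approximation is unaffected.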
The final prototype signal is a product of AM and FM terms. The FM term of this signal is defined by IF0[n],
and the AM term is defined by the real-valued PCA output offset by an arbitrary constant b > max|a0[n]|, which
prevents the amplitude from going below zero (over-modulation). This simple trick allows both terms to be
separated during demodulation. An illustrative example below (left) shows the idea (not to scale). Real examples
of the prototype IF and IA are also shown below (middle and right).
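The construction of the prototype can be sketched as below. The function name, the real-valued cosine carrier, and the 10% safety margin used to pick b are assumptions made for this example; a0[n] is assumed not to be identically zero.

```python
import numpy as np

def prototype(a0, if0_hz, fs, margin=1.1):
    """Prototype signal: offset AM term times an FM carrier driven by IF0[n]."""
    a0 = np.asarray(a0, dtype=float)
    # Offset chosen so that b > max|a0[n]| (the margin is an assumed choice).
    b = margin * np.max(np.abs(a0))
    # FM term: carrier phase is the running sum of IF0[n].
    phi0 = 2.0 * np.pi * np.cumsum(if0_hz) / fs
    return (b + a0) * np.cos(phi0), b

fs = 8000
n = np.arange(fs)
a0 = 0.2 * np.sin(2 * np.pi * 3 * n / fs)     # toy common modulation
if0 = np.full(fs, 200.0)                       # constant 200 Hz fundamental
p, b = prototype(a0, if0, fs)
```

Because b + a0[n] never changes sign, the envelope of `p` carries a0[n] and its phase carries IF0[n], which is what makes the two terms separable later.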
Reconstruction is a straightforward process. The reconstructed sound is generated by means of additive
synthesis, as in a conventional sinusoidal model. Individual partials are synthesized as a product of AM and FM
modulations and are composed of two complex exponentials. The first exponential is the upsampled corresponding row of the HE. The second
exponential is obtained by appropriate scaling of the IF and IA recovered from the prototype. The IF is scaled simply by the partial
order, k. The IA is scaled by the corresponding element of the scaling vector, G1. This vector is obtained
during PCA in the encoder (see above).
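A rough sketch of the additive synthesis loop is given below. How exactly the scaled IA is combined with the upsampled HE row is not spelled out on this page; the sketch assumes the fine modulation G1[k]·a0[n] is added back to the upsampled envelope, mirroring the way the residual is described as a remainder after subtraction. Linear interpolation for upsampling and the function name `synthesize` are likewise assumptions.

```python
import numpy as np

def synthesize(he, R, a0, G1, phi0):
    """Sum of K partials built from HE rows, scaled a0[n], and scaled phase."""
    K, frames = he.shape
    N = len(phi0)
    n = np.arange(N)
    centres = np.arange(frames) * R + (R - 1) / 2
    out = np.zeros(N)
    for k in range(1, K + 1):
        row = he[k - 1]
        # Upsample the HE row by linear interpolation (assumed interpolator).
        env = np.interp(n, centres, row.real) + 1j * np.interp(n, centres, row.imag)
        # Assumed combination: add the common modulation scaled by G1[k].
        env = env + G1[k - 1] * a0
        # Carrier: prototype phase scaled by the partial order k.
        out += np.real(env * np.exp(1j * k * phi0))
    return out

N, R = 2000, 100
phi0 = 2 * np.pi * 0.01 * np.arange(N)
he = np.array([[1.0 + 0j] * 20, [0.5 + 0j] * 20])
plain = synthesize(he, R, np.zeros(N), np.zeros(2, dtype=complex), phi0)
modulated = synthesize(he, R, np.full(N, 0.1), np.array([1.0 + 0j, 0j]), phi0)
```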
For reconstruction, the IF and IA have to be recovered from the prototype signal. This may be performed quite reliably using the standard
Gabor method based on the Hilbert transform, since the prototype is narrow-band and mono-component. Strictly speaking, we do not
estimate the instantaneous frequency (IF), which would involve phase unwrapping. For synthesis of partials we simply measure the
instantaneous phase of the prototype, φ0[n], and scale it by k = 1, 2, ..., Kmax.
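The demodulation step can be sketched with an FFT-based analytic signal. The function names are hypothetical and the test signal is a toy example. The sketch also shows why no unwrapping is needed: since k is an integer, exp(j·k·φ0[n]) is the same whether φ0[n] is wrapped or not.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (discrete Hilbert transform method)."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    if N % 2 == 0:
        h[N // 2] = 1.0
        h[1:N // 2] = 2.0
    else:
        h[1:(N + 1) // 2] = 2.0
    return np.fft.ifft(X * h)         # negative frequencies are zeroed

def demodulate(prototype):
    """Return the instantaneous amplitude and wrapped instantaneous phase."""
    z = analytic_signal(prototype)
    return np.abs(z), np.angle(z)

def partial_carrier(phase0, k):
    """Carrier of the k-th partial; wrapped phase suffices for integer k."""
    return np.exp(1j * k * phase0)

# Toy prototype: slowly varying positive envelope on a single carrier.
N = 4000
n = np.arange(N)
env = 1.5 + 0.3 * np.cos(2 * np.pi * 5 * n / N)
phase_true = 2 * np.pi * 200 * n / N
ia, phase0 = demodulate(env * np.cos(phase_true))
carrier3 = partial_carrier(phase0, 3)
```

Subtracting the known offset b from the recovered amplitude `ia` gives back a0[n].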
For demonstrations, please select an item from the menu on the left.