This paper describes the requirements, concept, and general structure of the audio compression scheme employed in the iTVP interactive television project. A perceptual coder is proposed, based on a psychoacoustic model and a spectro-temporal signal decomposition performed by a hybrid filterbank. The coder structure is derived from the standard MPEG Layer III technique. Several ideas for improving performance at low bit rates are introduced, and sinusoidal-plus-noise modeling is proposed as a means of bandwidth extension. Software implementation issues related to real-time performance are discussed at the end of the paper.