Computer audition
Computer audition (CA), or machine listening, is the general field of study of algorithms and systems for audio interpretation by machines. Because the notion of what it means for a machine to "hear" is very broad and somewhat vague, computer audition attempts to bring together several disciplines that originally dealt with specific problems or had a concrete application in mind. The engineer Paris Smaragdis, interviewed in Technology Review, describes these systems as "software that uses sound to locate people moving through rooms, monitor machinery for impending breakdowns, or activate traffic cameras to record accidents."
Inspired by models of human audition, CA deals with questions of representation, transduction, grouping, use of musical knowledge, and general sound semantics, with the aim of having a computer perform intelligent operations on audio and music signals. Technically, this requires a combination of methods from signal processing, auditory modelling, music perception and cognition, pattern recognition, and machine learning, as well as more traditional artificial intelligence methods for musical knowledge representation.
Applications
Just as computer vision differs from image processing, computer audition differs from audio engineering in that it deals with understanding audio rather than processing it. It also differs from machine speech understanding, since it deals with general audio signals such as natural sounds and musical recordings.
Applications of computer audition vary widely and include sound search, genre recognition, acoustic monitoring, music transcription, score following, audio texture, music improvisation, and emotion in audio, among others.
Related disciplines
Computer audition overlaps with the following disciplines:
- Music information retrieval: methods for search and analysis of similarity between music signals.
- Auditory scene analysis: understanding and description of audio sources and events.
- Computational musicology and mathematical music theory: use of algorithms that employ musical knowledge for analysis of music data.
- Computer music: use of computers in creative musical applications.
- Machine musicianship: audition-driven interactive music systems.
Areas of study
The study of CA can be roughly divided into the following sub-problems:
- Representation: signal and symbolic. This aspect deals with time-frequency representations, both in terms of notes and spectral models, including pattern playback and audio texture.
- Feature extraction: sound descriptors, segmentation, onset, pitch and envelope detection, chroma, and auditory representations.
- Musical knowledge structures: analysis of tonality, rhythm, and harmonies.
- Sound similarity: methods for comparison between sounds, sound identification, novelty detection, segmentation, and clustering.
- Sequence modeling: matching and alignment between signals and note sequences.
- Source separation: methods of grouping of simultaneous sounds, such as multiple pitch detection and time-frequency clustering methods.
- Auditory cognition: modeling of emotions, anticipation and familiarity, auditory surprise, and analysis of musical structure.
- Multi-modal analysis: finding correspondences between textual, visual, and audio signals.
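As an illustration of the sequence modeling item above (matching and alignment between signals and note sequences), the following is a minimal sketch using dynamic time warping. The source does not name this technique; it is chosen here as one common alignment method. The function name `dtw_align` and the use of MIDI note numbers as input are assumptions made for the example.

```python
def dtw_align(a, b):
    """Align two numeric sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching a[i] to b[j]. Hypothetical helper for illustration.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between elements
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b[j-1] is reused
                                 cost[i][j - 1],      # b advances, a[i-1] is reused
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Assumed example data: a score as MIDI note numbers, and a performance
# in which some notes span more than one analysis frame.
score = [60, 62, 64, 65]
performed = [60, 60, 62, 64, 64, 65]
total_cost, path = dtw_align(score, performed)
print(total_cost, path)
```

In a score-following setting, the second sequence would typically be a frame-wise pitch estimate rather than clean note numbers, and the local distance would be chosen accordingly.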
Representation issues
Because audio signals usually comprise multiple sound sources, it is hard to devise a parametric representation for general audio, unlike speech signals, which can be efficiently described in terms of specific models. Parametric audio representations usually use filter banks or sinusoidal models to capture multiple sound parameters, sometimes increasing the representation size in order to capture internal structure in the signal. Additional types of data relevant to computer audition are textual descriptions of audio content, such as annotations and reviews, and visual information in the case of audio-visual recordings.
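To make the idea of a sinusoidal model concrete, here is a minimal sketch that describes one audio frame by the frequencies and magnitudes of its strongest spectral peaks. It assumes a Hann window and a plain discrete Fourier transform written with the standard library; the function names, frame length, and sample rate are illustrative choices, not part of any particular system.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the DFT bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def sinusoidal_peaks(frame, sample_rate, max_peaks=3):
    """Return up to max_peaks (frequency_hz, magnitude) pairs, strongest first."""
    n = len(frame)
    # Hann window to reduce spectral leakage before the transform.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n) for t in range(n)]
    mags = dft_magnitudes([x * w for x, w in zip(frame, window)])
    # A peak is a bin larger than its left neighbour and not smaller than its right one.
    peaks = [(mags[k], k * sample_rate / n)
             for k in range(1, len(mags) - 1)
             if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    peaks.sort(reverse=True)
    return [(freq, mag) for mag, freq in peaks[:max_peaks]]


# Assumed test signal: two sinusoids (500 Hz and 1500 Hz) sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 500 * t / sample_rate)
         + 0.5 * math.sin(2 * math.pi * 1500 * t / sample_rate)
         for t in range(256)]
print(sinusoidal_peaks(frame, sample_rate, max_peaks=2))
```

Repeating this over successive frames gives a compact time-varying description. A practical system would use a fast Fourier transform and interpolate peak positions between bins.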
Features
Describing the contents of general audio signals usually requires extracting features that capture specific aspects of the signal. Generally speaking, the features can be divided into signal or mathematical descriptors, such as energy and descriptions of spectral shape; statistical characterizations, such as change or novelty detection; and special representations that are better adapted to the nature of musical signals or the auditory system, such as logarithmic growth of sensitivity in frequency or octave invariance.
Since parametric models in audio usually require very many parameters, the features are used to summarize properties of multiple parameters in a more compact or salient representation.
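The feature families described above can be sketched with one small example each: frame energy as a signal descriptor, spectral centroid as a description of spectral shape, spectral flux as a simple change or novelty measure, and pitch class as an octave-invariant representation on a logarithmic frequency scale. The function names and the 440 Hz tuning reference are assumptions for illustration. The spectral functions expect magnitude spectra computed elsewhere.

```python
import math


def frame_energy(frame):
    """Mean squared amplitude of a frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def spectral_centroid(mags, freqs):
    """Magnitude-weighted mean frequency, a one-number summary of spectral shape."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(f * m for f, m in zip(freqs, mags)) / total


def spectral_flux(prev_mags, mags):
    """Sum of positive magnitude increases between two consecutive frames.

    Large values indicate spectral change, for example a note onset.
    """
    return sum(max(m - p, 0.0) for p, m in zip(prev_mags, mags))


def pitch_class(freq_hz, ref_hz=440.0):
    """Map a frequency to a pitch class 0..11 (0 = C), discarding the octave.

    Uses a logarithmic frequency scale with A4 = ref_hz (assumed tuning).
    """
    midi = 69 + 12 * math.log2(freq_hz / ref_hz)
    return round(midi) % 12


# Frequencies one octave apart fall into the same pitch class.
print(pitch_class(220.0), pitch_class(440.0), pitch_class(880.0))
```

Each function reduces many raw values (samples or spectral bins) to a single number, which is the sense in which features summarize multiple parameters in a more compact form.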