Music Source Separation

Music Source Separation (MSS), also known as Stem Separation, Demixing, Audio Source Separation, or Unmixing, is a technique for separating one audio track into multiple audio tracks by targeting elements of the mixed material using Music Information Retrieval (MIR). MSS is a branch of Signal Separation, which was established in the mid-1990s as a technology to reconstruct one or more source signals from mixtures of them. The process is generally used by music professionals to separate existing recordings in order to improve the balance of the mix, remix, or remaster. It is also used when no multitrack or session files of the sound recording are available, which makes it necessary to rely on tools that can separate stems from a single audio file.
Early audio source separation for commercial purposes produced files that were non-destructively separated, so that the resulting tracks could be recombined and, when all played simultaneously, sound exactly like the original without introducing issues.
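The non-destructive property described above can be checked with a simple "null test": sum the separated tracks and compare the result with the original. The following is a minimal sketch under stated assumptions, not any vendor's method. The function names `sum_stems` and `null_test` are hypothetical, and the short lists of numbers are toy data standing in for real audio samples.

```python
import math

def sum_stems(stems):
    """Add equal-length stems sample by sample to rebuild a mixture."""
    return [sum(samples) for samples in zip(*stems)]

def null_test(original, stems):
    """Return the largest absolute sample difference between the original
    mixture and the sum of its stems (0.0 means a perfect reconstruction)."""
    rebuilt = sum_stems(stems)
    return max(abs(o - r) for o, r in zip(original, rebuilt))

# Toy signals standing in for real audio (illustrative data only).
n = 64
vocals = [0.5 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
drums = [0.25 if t % 16 == 0 else 0.0 for t in range(n)]
original = [v + d for v, d in zip(vocals, drums)]

error = null_test(original, [vocals, drums])
print(f"max reconstruction error: {error}")
```

A separation that is not non-destructive would show a clearly non-zero error in this test.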
The technology has a wide variety of applications outside of music, including teaching, forensics, speech separation, live sound cancellation, audio restoration, and VR/AR.

How AI Stem Separation Generally Works

This process of reverse engineering stems from mastered tracks relies on training models to identify targets in mixtures. Millions of real isolated stems from project files are used to update the parameters of models so that they generate estimates of the final output from mixtures. Large multitrack datasets are built from the provided isolated stems, and the mixtures are further adjusted to enlarge the dataset so that the models train to a higher degree of accuracy. Initially, providers offered stem separation online because it enabled the use of powerful computational systems; now there are many options for processing on a local system because of optimizations in the processing approach. There are also CPU developments that support neural workflows, which facilitate the faster processing needed for the highest-fidelity stem separation with lower time requirements.
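One common way to "adjust mixtures" so that a dataset yields more training examples is to rescale each isolated stem by a random gain and sum the results into a new mixture. The sketch below illustrates that idea only; it is an assumption about how such augmentation might look, not a description of any particular provider's pipeline. The function name `make_training_example`, the gain range, and the toy sample values are all hypothetical.

```python
import random

def make_training_example(stems, rng, gain_range=(0.5, 1.5)):
    """Scale each stem by a random gain and sum them into a new mixture.

    Returns (mixture, targets): the mixture is the model input and the
    scaled stems are the targets the model should learn to estimate.
    """
    targets = {}
    for name, samples in stems.items():
        gain = rng.uniform(*gain_range)
        targets[name] = [gain * s for s in samples]
    mixture = [sum(values) for values in zip(*targets.values())]
    return mixture, targets

# Toy isolated stems (illustrative data only, four samples each).
stems = {
    "vocals": [0.1, 0.2, -0.1, 0.0],
    "bass": [0.3, -0.3, 0.3, -0.3],
    "drums": [0.5, 0.0, 0.0, 0.5],
    "other": [0.0, 0.1, 0.2, 0.1],
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
examples = [make_training_example(stems, rng) for _ in range(3)]
```

Each call produces a different balance of the same stems, so a single multitrack session can contribute many distinct mixture and target pairs.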

AI Stem Separation in Sync Music

A growing number of companies enable both music publishers and clients to use stem separation technologies for their project needs. This is especially useful for removing vocals from mixtures. These tools give editors and agents in the film and TV music industry the ability to quickly adjust and contour songs without reaching out to providers, which would cause delays. This improves the chances of a usage, because a common issue with sync placements is that certain kinds of sounds interfere too much with the application of the underscore. It also gives sync professionals the ability to take a track in unexpected directions and otherwise enhance the mix for the purpose of the placement.

AI Stem Separation for the DJ

Quick stem separation is a good match for the professional DJ looking to create unique mashups. Generally, tracks are rendered into stems by placing the desired songs into the appropriate folder. When a song is selected, the basic four stem groupings are available, and in some cases individual parts can be triggered on pads for live performance.

Notable Case Studies of AI Stem Separation

Disney Music Group made use of stem separation technologies to enhance its back catalog of recordings. Beatles recordings were split and enhanced with stem separation technologies, and the engineers involved in this process also helped to advance the development of the technology. Numerous classic hit songs have been restored using stem separation.

Stems vs AI Stem Separations

In the recording industry, "stems" has meant files bounced during the mixing process, generally a collection of like sounds grouped as a "stem". Stems exported from the original project files can provide a large number of audio files for multiple purposes. These kinds of files generally offer better overall quality and the ability to further isolate project material without introducing artifacts.
AI Stem Separations have generally produced material that is best suited to volume adjustments, further effect processing, or production. These kinds of stems have generally come in the basic four groupings of vocals, bass, drums, and other. New approaches and deeper training of models have made it possible to isolate additional material beyond the basic four groupings. However, these kinds of separations generally have spectral anomalies, blend in additional sounds, or change some quality of the original targeted sound.
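The volume adjustments mentioned above amount to scaling each stem by a gain and summing the results. The sketch below shows this for the basic four groupings, using the standard conversion from decibels to a linear factor. The function names `db_to_linear` and `remix` and the sample values are hypothetical, chosen only for illustration.

```python
def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)

def remix(stems, gains_db):
    """Sum the stems after applying a per-stem gain in dB.

    Stems that are not listed in gains_db are left at 0 dB (unchanged).
    """
    length = len(next(iter(stems.values())))
    out = [0.0] * length
    for name, samples in stems.items():
        gain = db_to_linear(gains_db.get(name, 0.0))
        for i, sample in enumerate(samples):
            out[i] += gain * sample
    return out

# Toy four-stem separation (illustrative data only).
stems = {
    "vocals": [0.4, -0.4, 0.4, -0.4],
    "bass": [0.2, 0.2, -0.2, -0.2],
    "drums": [0.5, 0.0, 0.0, 0.5],
    "other": [0.1, 0.0, -0.1, 0.0],
}

unchanged = remix(stems, {})                     # same balance as the separation
quieter_vocals = remix(stems, {"vocals": -6.0})  # vocals at roughly half amplitude
```

Because artifacts and bleed live inside each stem, large gain changes tend to expose them more than small ones.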

Sound Design with AI Stem Separation Tools

Using AI and other methodologies to target specific kinds of sounds has enabled a new method of sound design based on spectral separation, through new kinds of editing tools such as those in SpectraLayers and RipX. Components such as transient information and time-based information can be instantly unmixed into full tracks of unconventional sound creations. Groove shadows and other sound production dubbing techniques are easily achievable by revealing new timbres and structures based on spectral selections, thanks to advancements in tools that support stem manipulation.

Noise Reduction and AI Stem Separation

Aside from advanced noise reduction methods based on learning noise profiles, an inverted approach is also possible: removing known source targets, such as the basic four and those covered by specialized models, can leave only the noise as a separate track, depending on the ensemble. That noise track can then be removed. Noise may also end up on only a single stem, in which case that stem can be targeted exclusively with noise reduction profiles so that the entire mix does not need to be processed.
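The inverted approach can be sketched as a subtraction: whatever remains after the separated stems are taken out of the mixture is treated as the noise track. This is a minimal illustration that assumes an idealized, perfect separation; with real tools the residual would also contain separation artifacts. The function names and toy values are hypothetical.

```python
def sum_tracks(tracks):
    """Add equal-length tracks sample by sample."""
    return [sum(values) for values in zip(*tracks)]

def residual(mixture, separated_stems):
    """Return what is left of the mixture after subtracting every stem."""
    known = sum_tracks(separated_stems)
    return [m - k for m, k in zip(mixture, known)]

def subtract(track, other):
    """Subtract one track from another sample by sample."""
    return [a - b for a, b in zip(track, other)]

# Toy data: two musical sources plus low-level hiss (illustrative only).
vocals = [0.2, -0.2, 0.2, -0.2]
drums = [0.5, 0.0, 0.0, 0.5]
hiss = [0.01, -0.02, 0.015, -0.005]
mixture = [v + d + h for v, d, h in zip(vocals, drums, hiss)]

noise_track = residual(mixture, [vocals, drums])  # approximately the hiss
cleaned = subtract(mixture, noise_track)          # mixture without the noise track
```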

Karaoke (Vocal Remover) and AI Stem Separation

One of the most popular use cases of stem separation is creating an instrumental version of a song when one is not known to exist or is not available. Dozens of sites use the technology to attract users who want to make instrumental versions of their favorite songs.

RipX and the Melodyne Approach to Stem Separation

The RipX DAW is a distinctive take on the concept of stem separation because of its note-based, harmonic audio-visual structure, branded as the "Rip Audio Format". The system provides a stem separation tool that breaks a single file down into several tracks, with notes representing the audio of each track. These notes are highly adjustable, and the system includes specialized tools for working with the notes and the spectral aspects of the captures. Each note or note part can have its own effects applied. Because of this notation, the sound of a track can easily be swapped for an entirely different one. The stem is not only separated but also transcribed to MIDI, making it possible to play it back as a MIDI sequence and thereby drive instruments. The notation used by the Rip Audio format resembles the Melodyne architecture of note extraction from audio; these notes, however, function as MIDI and audio simultaneously. RipX is a kind of DAW built around stem separation and the Rip Audio format, in which the audio and MIDI worlds merge, supported by new kinds of tools for this paradigm.

Stem Mastering Tool

Native Instruments created a specialized tool called "Stem Creator Tool" for working with four-part stem tracks. It is well suited to the DJ world, as digital DJ consoles and Native Instruments hardware like Traktor and Maschine use the four-track stem structure. The tool enables quick mastering and saving of files in a "stem" archival format. It is free to use and provides the essential mastering effects applicable to stem-based audio.

Example Approaches and Methodologies Employed

Deep Learning

Neural Networks
Convolutional Neural Networks
Recurrent Neural Networks and Transformers
Source Separation Algorithms

Signal Processing Techniques with AI Integration

Short-time Fourier transform (STFT)
Independent Component Analysis (ICA)
Non-negative Matrix Factorization (NMF)
Computational Auditory Scene Analysis (CASA)
Repetition-based methods
Masking-based approaches
End-to-end approaches
Hybrid approaches
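
As an illustration of how two items in the list above fit together, the following sketch applies a masking-based approach on top of a simplified STFT. It makes several assumptions that should be read as such: the STFT uses non-overlapping rectangular frames (practical systems use overlapping windowed frames), and the soft ratio mask is an "oracle" computed from the known sources, whereas a trained model would have to predict the mask from the mixture alone. All function names are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Discrete Fourier transform of one frame (direct O(n^2) form)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Inverse DFT, keeping the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]

def stft(signal, frame_size):
    """Simplified STFT: non-overlapping frames, no window function."""
    return [
        dft(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, frame_size)
    ]

def istft(frames):
    """Invert the simplified STFT by concatenating the inverted frames."""
    signal = []
    for spectrum in frames:
        signal.extend(idft(spectrum))
    return signal

def ratio_mask(target_frames, other_frames, eps=1e-12):
    """Soft mask in [0, 1]: share of each bin's magnitude owned by the target."""
    return [
        [abs(t) / (abs(t) + abs(o) + eps) for t, o in zip(target_frame, other_frame)]
        for target_frame, other_frame in zip(target_frames, other_frames)
    ]

def apply_mask(mask, mixture_frames):
    """Multiply every time-frequency bin of the mixture by its mask value."""
    return [
        [m * x for m, x in zip(mask_frame, mix_frame)]
        for mask_frame, mix_frame in zip(mask, mixture_frames)
    ]

# Toy sources: a low tone and a quieter high tone (illustrative data only).
frame_size = 32
length = 128
low = [math.sin(2 * math.pi * 2 * t / frame_size) for t in range(length)]
high = [0.5 * math.sin(2 * math.pi * 6 * t / frame_size) for t in range(length)]
mixture = [a + b for a, b in zip(low, high)]

# Oracle mask for the low tone, applied to the mixture spectrogram.
mask = ratio_mask(stft(low, frame_size), stft(high, frame_size))
estimate = istft(apply_mask(mask, stft(mixture, frame_size)))
max_error = max(abs(e - s) for e, s in zip(estimate, low))
```

The two toy tones occupy different frequency bins, so the mask separates them almost perfectly. Real instruments overlap in frequency, which is one source of the bleed and artifacts associated with masking.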

Supporting Developments

Ensemble-based approaches
Leveraging large datasets
Text-based source separation
Conv-TasNet
Wave-U-Net
Mapping-based Methods
SynthSOD

Known Issues

The time it takes to analyze and separate the sound means that mixtures generally need pre-rendering; otherwise there is a delay in processing. AI-based stem separation produces artifacts and doesn't always assign target instruments correctly. There may also be spectral bleed from part to part. It is easy to compromise the mixed structure of the full work by adjusting certain elements in isolated parts. A rapid volume modulation similar to tremolo may also appear in certain kinds of separations. The order in which the source audio is processed, the applications used and their sequence, and the kind of mix and master can all affect the outcome of a separation. The process can pick up spectral anomalies, which may need to be merged into different tracks. A stem separation of specialized instruments may need to be reprocessed until the desired balance of the captured target sound is achieved. Specialized audio editing tools exist for further cleaning up the results of stem separation.