The Data Science & Engineering group at the Wikimedia Foundation recently had its Fall hackathon. I proposed and led a small project related to neural re-synthesis and generating raw audio using free audio files found in the Wikimedia Commons. For those who aren’t familiar, Wikimedia Commons is a media repository of open images, sounds, videos and other media. There are tons of cool and interesting audio files publicly available there for use in datasets: https://commons.wikimedia.org/wiki/Category:Audio_files

The project idea went like this:

  • Create a dataset of interesting sounds that fall into a couple of categories (ex: music/nature/human/animal/interior/exterior)
  • Write scripts that will randomly combine these audio files and sample the latent spaces of their combined embeddings to create new machine-generated audio files

Dataset Curation

First we needed to pull down a collection of wav files from Commons. fkaelin wrote a script that built a 90 GB dataset of wav files hosted on Commons and stored it on HDFS: https://gitlab.wikimedia.org/fab/research-ml/-/blob/fk/swift/notebooks/wav.ipynb

The team explored various categories and narrowed them down to a few themes that seemed interesting to mash up. There was a small one-off dataset published on the analytics server that had sounds from 1951 and the folk music of Småland: https://analytics.wikimedia.org/published/datasets/one-off/wav/example/
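
If you want to build a similar dataset yourself, here is a minimal sketch (not the actual notebook linked above) of listing the audio files in a Commons category and resolving their download URLs through the public MediaWiki API. The category name, helper names and limits are just placeholders.

import requests

API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "audio-mashup-demo/0.1"}  # be polite to the API

def list_audio_files(category="Category:Audio_files", limit=50):
    """Return the titles of files in a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "file",
        "cmlimit": limit,
        "format": "json",
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [m["title"] for m in resp.json()["query"]["categorymembers"]]

def file_url(title):
    """Resolve a File: title to its direct download URL."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    return next(iter(pages.values()))["imageinfo"][0]["url"]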

Preprocessing

Next we wanted to convert the audio files to embeddings and randomly interpolate those embeddings together. First we converted all the audio files to mono, then used Magenta’s NSynth encoder to embed each file. We then “cross-faded” the different encodings together, fading one out and another in at random intervals, and sampled this latent space before feeding it to our model for synthesis.
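
The encoding step leaned on Magenta’s NSynth “fastgen” utilities. A helper along these lines loads a mono clip and runs it through the WaveNet encoder; this is a sketch loosely following the standard NSynth tutorial rather than our exact notebook, and the checkpoint path is a placeholder.

from magenta.models.nsynth import utils
from magenta.models.nsynth.wavenet import fastgen

CHECKPOINT = 'wavenet-ckpt/model.ckpt-200000'  # pretrained NSynth WaveNet checkpoint

def load_encoding(fname, sample_length=100000, sr=16000):
    """Load a clip (mono, 16 kHz) and encode it into the NSynth latent space."""
    audio = utils.load_audio(fname, sample_length=sample_length, sr=sr)
    encoding = fastgen.encode(audio, CHECKPOINT, sample_length)
    return audio, encoding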

Our mashup function looked something like this:

def mashup(fname1, fname2, sample_length=100000):
    print('mashing up two files')
    # Encode both clips into the NSynth latent space.
    audio1, encoding1 = load_encoding(fname1, sample_length=sample_length)
    audio2, encoding2 = load_encoding(fname2, sample_length=sample_length)
    # Blend the two encodings by fading one out while the other fades in.
    mashed_encodings = cross_fade(encoding1, encoding2)
    return mashed_encodings
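
The cross_fade helper is conceptually just a linear fade between the two encodings along the time axis. A minimal sketch, assuming encodings shaped (batch, time, channels) as the NSynth encoder produces:

import numpy as np

def cross_fade(encoding1, encoding2):
    """Fade encoding1 out while fading encoding2 in along the time axis."""
    length = encoding1.shape[1]
    fade_out = np.linspace(1.0, 0.0, length).reshape(1, -1, 1)
    fade_in = np.linspace(0.0, 1.0, length).reshape(1, -1, 1)
    return encoding1 * fade_out + encoding2 * fade_in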

Model

We used a WaveNet autoencoder to generate raw audio from the interpolated embeddings. We used Magenta’s NSynth model, as it was the fastest way to get started without training a model from scratch. One major downside of this approach is the lo-fi quality of the generated audio, although fidelity was not a primary goal for this project. When we first started generating audio using CPU only, it took roughly 6 minutes to generate 1 second of audio, so the team started looking into ways to speed this up.
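
Decoding the blended latents back into a waveform was a single fastgen call, roughly like the sketch below; the checkpoint path and output filename are placeholders.

from magenta.models.nsynth.wavenet import fastgen

# Decode the blended latent back into raw audio with the pretrained WaveNet
# decoder. This is the slow step: on CPU it took roughly 6 minutes of wall
# time per second of generated audio.
fastgen.synthesize(
    mashed_encodings,
    save_paths=['mashup.wav'],
    checkpoint_path='wavenet-ckpt/model.ckpt-200000',
    samples_per_save=10000)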

Technical Challenges

The Wikimedia Foundation has a strong commitment to using free & open source software (FOSS), and we decided to honor that commitment and do this entire project on WMF hardware. This meant we were unable to use Nvidia GPUs due to licensing, but we did have access to a number of AMD GPUs. Magenta requires TensorFlow as a dependency, which is notorious for being difficult to run on AMD hardware, so we had to get creative.

elukey got ROCm working on our machines so that TensorFlow could see the AMD GPUs: https://phabricator.wikimedia.org/T287267
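
With a ROCm build of TensorFlow installed (for example the tensorflow-rocm package), a quick sanity check that the GPU is actually visible looks something like this; the exact call depends on your TensorFlow version.

import tensorflow as tf

# On TensorFlow 1.x (which the NSynth code targets) this prints True when the
# ROCm-backed GPU is visible; on TensorFlow 2.x, use
# tf.config.list_physical_devices('GPU') instead.
print(tf.test.is_gpu_available())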

Synthesis

After solving our GPU problems, we managed to produce some raw audio and then randomly stitched it all together using a basic sequencer. The results are pretty interesting, like something out of a sci-fi horror movie; you can hear it for yourself here: