Abstract

Deep learning is an appealing solution for Automatic Synthesizer Programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging because they are typically non-differentiable. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Our method trains a neural network to map synthesizer presets onto a perceptually informed embedding space defined by a pretrained audio model. This process creates a differentiable neural proxy for a synthesizer by leveraging the audio representations learned by the pretrained model. We evaluate the representations learned by various pretrained audio models in the context of neural-based ASP (nASP) and assess the effectiveness of several neural network architectures (feedforward, recurrent, and transformer-based) in defining neural proxies. We evaluate the proposed method using both synthetic and hand-crafted presets from three popular software synthesizers and assess its performance on a synthesizer sound matching downstream task. Encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for nASP systems.

Sound Matching Examples

To demonstrate the integration of a preset encoder into an nASP pipeline and assess its potential benefits, we evaluated its performance on a Synthesizer Sound Matching (SSM) downstream task using Dexed and Diva, on both in-domain and out-of-domain sounds.

The objective of the SSM task is to infer the set of synthesizer parameters that best approximates a given target audio. The preset encoder is used to compute a perceptual loss between its representation of the predicted preset and the representation of the target audio produced by a pretrained audio model (here, mn20 from EfficientAT).
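As a minimal sketch of how such a perceptual loss could be computed: both the predicted preset and the target audio are mapped into the shared embedding space, and a distance between the two embeddings serves as the loss. The function names, embedding values, and choice of mean squared error below are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def perceptual_loss(preset_embedding, target_embedding):
    """Mean squared error between two embeddings in the shared
    perceptual space (a common distance choice; the distance used
    in the actual pipeline may differ)."""
    preset_embedding = np.asarray(preset_embedding, dtype=float)
    target_embedding = np.asarray(target_embedding, dtype=float)
    return float(np.mean((preset_embedding - target_embedding) ** 2))

# Toy stand-ins (hypothetical values) for the two model outputs:
#   preset_encoder(predicted_preset) -> pred_emb
#   audio_model(target_audio)        -> target_emb  (e.g. mn20)
pred_emb = [0.1, 0.4, -0.2]
target_emb = [0.0, 0.5, -0.1]
loss = perceptual_loss(pred_emb, target_emb)
```

Because both embeddings live in the same space, this loss is differentiable with respect to the predicted preset through the preset encoder, which is what makes the proxy usable for gradient-based sound matching.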

Inspired by Masuda &amp; Saito (2023), we evaluated three loss scheduling configurations (PLoss, Mix, and Switch) to assess the benefit of adding a perceptual loss.
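One plausible reading of the three configurations, sketched below, is that PLoss uses the perceptual loss alone, Mix blends a parameter loss with the perceptual loss, and Switch transitions from parameter loss to perceptual loss partway through training. The weight values and switch point are illustrative assumptions, not the values used in the experiments.

```python
def loss_weights(config, step, total_steps, switch_frac=0.5):
    """Return (parameter-loss weight, perceptual-loss weight) at a
    given training step for each scheduling configuration.

    Assumed semantics (hypothetical, for illustration only):
      "ploss"  -> perceptual loss only
      "mix"    -> fixed blend of parameter and perceptual losses
      "switch" -> parameter loss first, then perceptual loss
    """
    if config == "ploss":
        return 0.0, 1.0
    if config == "mix":
        return 0.5, 0.5
    if config == "switch":
        if step < switch_frac * total_steps:
            return 1.0, 0.0
        return 0.0, 1.0
    raise ValueError(f"unknown config: {config}")
```

The total training loss at each step would then be `w_param * parameter_loss + w_percept * perceptual_loss`, with the weights drawn from the chosen schedule.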

In-domain sounds

Dexed

[Audio examples, one per configuration: Target, Dexed PLoss, Dexed Mix, Dexed Switch]

Diva

[Audio examples, one per configuration: Target, Diva PLoss, Diva Mix, Diva Switch]

Out-of-domain sounds

The following audio target examples are taken from the validation set of the NSynth dataset.

[Audio example grid. Columns (NSynth instrument families): Bass elec., Bass synth., Brass ac., Flute synth., Guitar elec., Keyboard elec., Mallet ac., Reed ac., String ac., Vocal ac. Rows: Target, Dexed PLoss, Dexed Mix, Dexed Switch, Diva PLoss, Diva Mix, Diva Switch]