Abstract
Deep learning appears as an appealing solution for Automatic Synthesizer Programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging due to their potential non-differentiability. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Our method trains a neural network to map synthesizer presets onto a perceptually informed embedding space defined by a pretrained audio model. This process creates a differentiable neural proxy for a synthesizer by leveraging the audio representations learned by the pretrained model. We evaluate the representations learned by various pretrained audio models in the context of neural-based ASP (nASP) and assess the effectiveness of several neural network architectures – including feedforward, recurrent, and transformer-based models – in defining neural proxies. We evaluate the proposed method using both synthetic and hand-crafted presets from three popular software synthesizers and assess its performance on a synthesizer sound matching downstream task. Encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for neural-based ASP systems.
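As a rough illustration of the proxy-training idea described above, the PyTorch-style sketch below trains a preset encoder to match the pretrained model's embedding of the rendered audio. All names (`preset_encoder`, `render`, `audio_model`) are hypothetical placeholders, and the L1 embedding loss is an assumption, not necessarily the paper's exact choice.

```python
import torch
import torch.nn.functional as F

# Hypothetical components (names are illustrative, not from the paper's code):
#   preset_encoder : trainable network mapping preset parameter vectors to embeddings
#   render         : non-differentiable call into the software synthesizer
#   audio_model    : frozen pretrained audio model defining the embedding space

def proxy_training_step(preset_encoder, audio_model, render, presets, optimizer):
    # The synthesizer is only called outside the gradient path, so its
    # non-differentiability is irrelevant to training the proxy.
    with torch.no_grad():
        audio = render(presets)          # (batch, num_samples)
        target_emb = audio_model(audio)  # perceptual target embeddings

    pred_emb = preset_encoder(presets)   # differentiable preset embeddings
    # L1 distance in embedding space is an assumption.
    loss = F.l1_loss(pred_emb, target_emb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```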
Sound Matching Examples
To demonstrate the integration of a preset encoder into an nASP pipeline and assess its potential benefits, we evaluated its performance on a Synthesizer Sound Matching (SSM) downstream task using Dexed and Diva, on both in-domain and out-of-domain sounds.
The objective of the SSM task is to infer the set of synthesizer parameters that best approximates a given target audio. The preset encoder is used to compute a perceptual loss between its representation of the predicted preset and the representation of the target audio produced by a pretrained audio model (here, we used mn20 from EfficientAT).
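The sketch below shows how this perceptual loss, together with the parameter loss used in the scheduling configurations that follow, could be computed for one batch. The function and model names are illustrative, and the L1 distances are assumptions under this sketch, not the paper's confirmed implementation.

```python
import torch
import torch.nn.functional as F

def ssm_losses(estimator, preset_encoder, audio_model, target_audio, target_preset):
    # `estimator` predicts a preset from the target audio; `preset_encoder` is
    # the frozen neural proxy; `audio_model` is the frozen pretrained model
    # (mn20 from EfficientAT in our experiments). All names are illustrative.
    pred_preset = estimator(target_audio)

    # Parameter loss: direct regression on normalized synthesizer parameters
    # (L1 is an assumption).
    param_loss = F.l1_loss(pred_preset, target_preset)

    # Perceptual loss: distance between the proxy's embedding of the predicted
    # preset and the pretrained model's embedding of the target audio.
    with torch.no_grad():
        target_emb = audio_model(target_audio)
    perceptual_loss = F.l1_loss(preset_encoder(pred_preset), target_emb)

    return param_loss, perceptual_loss
```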
Inspired by Masuda & Saito (2023), we evaluated three loss scheduling configurations to assess the benefit of adding a perceptual loss (a code sketch of these schedules follows the list):

- PLoss. Only the parameter loss is used; this configuration serves as a baseline.
- Mix. The parameter loss is applied alone for the first 200 epochs, the perceptual loss is gradually introduced over the next 200 epochs, and the estimator network is trained for the remaining 200 epochs using both parameter and perceptual losses.
- Switch. Similar to Mix, but the schedule fully transitions from parameter loss to perceptual loss, so the estimator network is trained exclusively with the perceptual loss during the final 200 epochs.
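The sketch below illustrates how these schedules could be implemented as per-epoch loss weights. The 600-epoch layout (200 parameter-only, 200 ramp, 200 final) is inferred from the description above, and the linear ramp shape is an assumption.

```python
def loss_weights(epoch, schedule, ramp_start=200, ramp_end=400):
    """Return (w_param, w_perceptual) for a given epoch and schedule."""
    if schedule == "PLoss":
        return 1.0, 0.0
    # Ramp factor: 0 before epoch 200, 1 after epoch 400, linear in between.
    alpha = min(max((epoch - ramp_start) / (ramp_end - ramp_start), 0.0), 1.0)
    if schedule == "Mix":
        return 1.0, alpha          # perceptual loss joins the parameter loss
    if schedule == "Switch":
        return 1.0 - alpha, alpha  # full handover to the perceptual loss
    raise ValueError(f"Unknown schedule: {schedule}")
```

The total loss for an epoch is then `w_param * param_loss + w_perceptual * perceptual_loss`, with the two weights taken from this schedule.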
In-domain sounds
Dexed
[Audio examples: Target presets and their Dexed PLoss, Dexed Mix, and Dexed Switch reconstructions.]
Diva
[Audio examples: Target presets and their Diva PLoss, Diva Mix, and Diva Switch reconstructions.]
Out-of-domain sounds
The following target audio examples are taken from the validation set of the NSynth dataset.
[Audio examples for ten NSynth targets (Bass elec., Bass synth., Brass ac., Flute synth., Guitar elec., Keyboard elec., Mallet ac., Reed ac., String ac., Vocal ac.), each with: Target, Dexed PLoss, Dexed Mix, Dexed Switch, Diva PLoss, Diva Mix, and Diva Switch reconstructions.]