
Roi Benita, Michael Elad, Joseph Keshet
Technion - Israel Institute of Technology

Abstract: Diffusion models have recently been shown to be relevant for high-quality speech generation. Most work has focused on generating spectrograms, which then require a subsequent model (i.e., a vocoder) to convert the spectrogram to a waveform. This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform. The proposed model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one. Hence, our model can effectively synthesize speech of unlimited duration while preserving high-fidelity synthesis and temporal coherence. We implemented the proposed model for unconditional and conditional speech generation, where the latter can be driven by an input sequence of phonemes, amplitudes, and pitch values. Working on the waveform directly has some empirical advantages. Specifically, it allows the creation of local acoustic behaviors, like vocal fry, which make the overall waveform sound more natural. Furthermore, the proposed diffusion model is stochastic rather than deterministic; therefore, each inference generates a slightly different waveform variation, enabling an abundance of valid realizations. Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
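
The frame-by-frame generation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_sampler` is a hypothetical stand-in that draws Gaussian noise where the real model would run a conditioned reverse diffusion process, and the names `frame_len` and `overlap` are assumptions made for this example. The sketch also assumes that the overlapping samples are simply carried over from the previous frame and only the new samples are appended; how DiffAR actually handles the overlap region is not stated on this page.

```python
import random


def toy_sampler(context, frame_len, rng):
    """Hypothetical stand-in for the diffusion sampler.

    Returns a frame of `frame_len` samples whose first len(context)
    samples are the given context; the rest are Gaussian noise.
    """
    new_samples = [rng.gauss(0.0, 1.0) for _ in range(frame_len - len(context))]
    return list(context) + new_samples


def generate(num_frames, frame_len, overlap, sampler=toy_sampler, seed=None):
    """Stitch `num_frames` overlapping frames into one waveform."""
    if num_frames < 1 or not 0 <= overlap < frame_len:
        raise ValueError("need num_frames >= 1 and 0 <= overlap < frame_len")
    rng = random.Random(seed)
    # The first frame has no previous frame to condition on.
    waveform = sampler([], frame_len, rng)
    for _ in range(num_frames - 1):
        # Condition on the tail of what has been generated so far.
        context = waveform[len(waveform) - overlap:]
        frame = sampler(context, frame_len, rng)
        # Append only the samples that extend past the overlap.
        waveform.extend(frame[overlap:])
    return waveform


if __name__ == "__main__":
    wave = generate(num_frames=5, frame_len=400, overlap=200, seed=0)
    print(len(wave))  # frame_len + (num_frames - 1) * (frame_len - overlap)
```

Because every new frame depends only on the last `overlap` samples, the loop can run for any number of frames, which is the sense in which the duration is unbounded. Using a different seed gives a different waveform, mirroring the stochastic behavior the abstract mentions.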

Official open-source implementation: GitHub repository
Accepted for publication at ICLR 2024 (OpenReview)

Unconditional generation

Audio samples, one per setting of \(\left(L, L_0\right)\): \(\left(400, 200\right)\), \(\left(500, 250\right)\), \(\left(1000, 500\right)\)

Conditional generation

Text prompts (one column per prompt):
1. Personal relations
2. An examination of certain construction work appearing in the background of this photograph revealed that the picture was taken between March 8
3. Who was bound to report any deficiencies and abuses he might find at his periodical visits. The Secretary of State might go further.
4. At the last interrogation in November Oswald admitted to Postal Inspector Holmes that he had rented post office box 2915 Dallas

Systems compared (one row of audio samples per system): WaveGrad 2, FastSpeech 2, DiffAR (200), DiffAR (1000)

Ablation Study

Text prompts (one column per prompt):
1. and the general insufficiency was such
2. Boxes and cases were stacked behind him.
3. which had come to rest on a stack of paper.
4. which carry the major responsibility for supplying information about potential threats

Systems compared (one row of audio samples per system): Ground truth, DiffAR-E (200), DiffAR (200), DiffAR (1000), DiffAR+P (200)

Stochasticity and Controllability

Columns: generated examples, energy, pitch
Rows (one per model): DiffAR-E (200), DiffAR (200), DiffAR+P (200)

Vocal fry

Here is an example illustrating DiffAR's capability to produce vocal fry.

Columns: generated examples, spectrogram, waveform
Rows (one per system): WaveGrad 2, FastSpeech 2, DiffAR (200)

Here are examples of full sentences synthesized by DiffAR that include vocal fry.

which had come to rest on a stack of paper.
There was a deep wound just over the ear, the skull was fractured and there were several other blows and wounds on the head
Break the rolls apart from one another and eat warm. They are also good cold and if the directions be followed implicitly very good always.
Little more remains to be said about Robson. He appears to have accepted his position, and to have at once resigned himself to his fate.
At the last interrogation in November Oswald admitted to Postal Inspector Holmes that he had rented post office box 2915, Dallas.
who seldom let a session go by without visiting Newgate.
which carry the major responsibility for supplying information about potential threats
A formal and thorough description of the responsibilities of the advance agent is now in preparation by the service

DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation
