Created 02/04/2020 at 08:35PM

Overview

Mellotron is an end-to-end speech synthesis model released by NVIDIA at the end of October 2019, aimed mainly at prosody style transfer and music generation (singing voice synthesis, SVS). Unlike [1, 2], Mellotron offers finer-grained control over prosody and pitch, and its training remains fully end-to-end: it can produce singing even when the training set contains no singing data, and it learns the alignment between pitch and text without any manual alignment.

Key Points

Mellotron's biggest innovation is feeding F0 information into the TTS model. F0, i.e. the fundamental frequency, mainly characterizes intonation; in musical performance or singing it is the main element that distinguishes pitch and therefore determines the melody. In addition, Mellotron can take an attention map from an external source, which makes rhythm transfer possible.
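
The repository extracts F0 with the YIN algorithm (it ships its own yin.py). Below is a minimal sketch of producing a frame-level F0 track to condition on, using librosa's YIN implementation as a stand-in; the file name, frequency range, and frame/hop sizes are illustrative assumptions, not the repo's settings.

import librosa
import numpy as np

# Load audio at the rate the acoustic model expects (22.05 kHz for Tacotron2-style models).
wav, sr = librosa.load("sample.wav", sr=22050)

# Frame-level F0 via YIN. Mellotron uses its own yin.py; librosa.yin is a stand-in here,
# and fmin/fmax/frame_length/hop_length are illustrative values.
f0 = librosa.yin(wav, fmin=80.0, fmax=880.0, sr=sr,
                 frame_length=1024, hop_length=256)

# Unvoiced-frame handling is omitted in this sketch.
# Shape the track as (batch, 1, n_frames): one F0 value per frame, matching the f0s tensor
# consumed by Decoder.forward below.
f0s = f0[np.newaxis, np.newaxis, :].astype(np.float32)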

Conditional Modeling

Mellotron defines five conditioning variables: the text T, the speaker identity S, the pitch contour P (the frame-level F0 track), the rhythm R (the alignment between text and audio, i.e. the attention map), and a latent variable Z for factors that are not explicitly observed (in this implementation the GST embedding plays this role).

The Mel spectrogram is predicted from these five variables, so the whole model can be written as P(mel | T, S, P, R, Z; θ). The paper also explains why two latent variables are needed: R is used mainly to learn the alignment and is what makes rhythm transfer possible, while Z captures factors that are hard to describe explicitly, which improves both synthesis and transfer quality.
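
To spell out the transfer setting (this is my notation, not an equation copied from the paper): during training, P and R are extracted from the target recording itself, while at transfer time they are taken from a different source performance, keeping the target's text and speaker:

training: P(mel | T, S, P_target, R_target, Z; θ)
transfer: P(mel | T, S, P_source, R_source, Z; θ)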

Model Architecture

The model is built on GST-Tacotron2. The paper contains no architecture diagram, so let's go straight to the code.

# Code starts at: https://github.com/NVIDIA/mellotron/blob/master/model.py#L596
class Tacotron2(nn.Module):
    # ...

    def forward(self, inputs):
        # f0s were obtained from the target audio via the YIN algorithm
        inputs, input_lengths, targets, max_len, \
            output_lengths, speaker_ids, f0s = inputs
        input_lengths, output_lengths = input_lengths.data, output_lengths.data

        embedded_inputs = self.embedding(inputs).transpose(1, 2)
        embedded_text = self.encoder(embedded_inputs, input_lengths)
        embedded_speakers = self.speaker_embedding(speaker_ids)[:, None]

        # GST module: a global style embedding computed from the target mels
        embedded_gst = self.gst(targets)
        embedded_gst = embedded_gst.repeat(1, embedded_text.size(1), 1)
        embedded_speakers = embedded_speakers.repeat(1, embedded_text.size(1), 1)

        # concatenate text, GST, and speaker embeddings along the feature dimension
        encoder_outputs = torch.cat(
            (embedded_text, embedded_gst, embedded_speakers), dim=2)

        mel_outputs, gate_outputs, alignments = self.decoder(
            encoder_outputs, targets, memory_lengths=input_lengths, f0s=f0s)

        mel_outputs_postnet = self.postnet(mel_outputs)
        mel_outputs_postnet = mel_outputs + mel_outputs_postnet

        return self.parse_output(
            [mel_outputs, mel_outputs_postnet, gate_outputs, alignments],
            output_lengths)

    # ...
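
To make the conditioning concrete, here is a small standalone trace of the concatenation in forward above; the feature dimensions are made-up placeholders, not the repo's hyperparameters. The GST and speaker embeddings are one vector per utterance, repeated across the token axis and concatenated with the text encoding on the feature axis.

import torch

# Placeholder sizes: batch 2, 40 input tokens, 512-d text encoding,
# 256-d GST embedding, 128-d speaker embedding.
B, T_in = 2, 40
embedded_text = torch.rand(B, T_in, 512)      # encoder output, one vector per token
embedded_gst = torch.rand(B, 1, 256)          # one style vector per utterance
embedded_speakers = torch.rand(B, 1, 128)     # one speaker vector per utterance

# Repeat the per-utterance vectors across the token axis, then concatenate on the feature axis.
embedded_gst = embedded_gst.repeat(1, T_in, 1)
embedded_speakers = embedded_speakers.repeat(1, T_in, 1)
encoder_outputs = torch.cat((embedded_text, embedded_gst, embedded_speakers), dim=2)
print(encoder_outputs.shape)                  # torch.Size([2, 40, 896])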

# Decoder excerpt; code starts at: https://github.com/NVIDIA/mellotron/blob/master/model.py#L415
class Decoder(nn.Module):
    # ...

    def forward(self, memory, decoder_inputs, memory_lengths, f0s):
        """ Decoder forward pass for training
        PARAMS
        ------
        memory: Encoder outputs
        decoder_inputs: Decoder inputs for teacher forcing. i.e. mel-specs
        memory_lengths: Encoder output lengths for attention masking.
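        f0s: frame-level F0 features, concatenated with each decoder input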
        RETURNS
        -------
        mel_outputs: mel outputs from the decoder
        gate_outputs: gate outputs from the decoder
        alignments: sequence of attention weights from the decoder
        """

        decoder_input = self.get_go_frame(memory).unsqueeze(0)
        decoder_inputs = self.parse_decoder_inputs(decoder_inputs)
        decoder_inputs = torch.cat((decoder_input, decoder_inputs), dim=0)
        decoder_inputs = self.prenet(decoder_inputs)

        # audio features
        f0_dummy = self.get_end_f0(f0s)
        f0s = torch.cat((f0s, f0_dummy), dim=2)
        f0s = F.relu(self.prenet_f0(f0s))
        f0s = f0s.permute(2, 0, 1)

        self.initialize_decoder_states(
            memory, mask=~get_mask_from_lengths(memory_lengths))

        mel_outputs, gate_outputs, alignments = [], [], []
        while len(mel_outputs) < decoder_inputs.size(0) - 1:

            # --- note: the F0 features are concatenated with the decoder input at every step --- #
            if len(mel_outputs) == 0 or np.random.uniform(0.0, 1.0) <= self.p_teacher_forcing:
                decoder_input = torch.cat((decoder_inputs[len(mel_outputs)],
                                           f0s[len(mel_outputs)]), dim=1)
            else:
                decoder_input = torch.cat((self.prenet(mel_outputs[-1]),
                                           f0s[len(mel_outputs)]), dim=1)


            mel_output, gate_output, attention_weights = self.decode(
                decoder_input)
            mel_outputs += [mel_output.squeeze(1)]
            gate_outputs += [gate_output.squeeze()]
            alignments += [attention_weights]

        mel_outputs, gate_outputs, alignments = self.parse_decoder_outputs(
            mel_outputs, gate_outputs, alignments)

        return mel_outputs, gate_outputs, alignments

    # ...
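
The F0 path through the decoder is easy to misread, so here is a standalone trace of the tensor shapes in the loop above. The layer standing in for self.prenet_f0 and all sizes are assumptions for illustration, not the repo's actual layer or hyperparameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up sizes: batch 2, 100 decoder steps, 256-d mel prenet output, 1-d F0 prenet output.
B, n_frames, prenet_dim, prenet_f0_dim = 2, 100, 256, 1

# f0s enter Decoder.forward as (B, 1, n_frames): one F0 value per target mel frame.
# (The repo also appends a dummy end frame via get_end_f0 so step counts match; omitted here.)
f0s = torch.rand(B, 1, n_frames)

# Stand-in for self.prenet_f0; the actual layer type and size are the repo's, not shown here.
prenet_f0 = nn.Conv1d(1, prenet_f0_dim, kernel_size=1)

f0s = F.relu(prenet_f0(f0s))   # (B, prenet_f0_dim, n_frames)
f0s = f0s.permute(2, 0, 1)     # (n_frames, B, prenet_f0_dim): one slice per decoder step

# At step t, the prenet'd mel frame and the F0 slice are concatenated on the feature axis,
# so the attention RNN sees both the previous frame and the target pitch.
mel_prenet_frame = torch.rand(B, prenet_dim)
decoder_input = torch.cat((mel_prenet_frame, f0s[0]), dim=1)
print(decoder_input.shape)     # torch.Size([2, 257])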

Experiments

Readers can find the experimental results here; the authors mainly ran the following experiments:

Summary

Aside

With a heavy course load I had not followed speech synthesis for a while, and since I may start an internship again soon, I am using the break to catch up on papers. A few classmates and I also spent some time building a GPU-sharing platform, MistGPU. I rented a machine with two 1080 Ti cards there to train my earlier FastSpeech reimplementation; about 24 hours of training gave fairly good results, for a total cost of roughly 8 yuan/hour × 50% discount × 24 hours = 96 yuan. New users also get an 8-yuan voucher and another 8 yuan per invited user, so the actual cost is very low. I hope friends who need GPUs will give it a try.

References

[1] RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” arXiv preprint arXiv:1803.09047, 2018.

[2] Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” arXiv preprint arXiv:1803.09017, 2018.