MockingBird: poor attention with a different speaker encoder

gxwragnw · posted 2022-11-02 · category: Other
Follow (0) | Answers (2) | Views (195)

First, thanks to babysor for this work!
I noticed that the speaker encoder used here is GE2E, whose performance falls far behind the SOTA, so I replaced the GE2E encoder with an ECAPA-TDNN model. One difference between GE2E and ECAPA-TDNN is the embedding dimension: 192 for ECAPA-TDNN versus 256 for GE2E. I changed the speaker embedding and batch size parameters in hparams.py and used synthesizer_train.py to train the Tacotron synthesizer. The parameters I used are as follows:

# each tuple is (reduction factor r, learning rate, step to reach, batch size)
tts_schedule = [(2, 1e-3, 10_000, 32),
                (2, 5e-4, 15_000, 32),
                (2, 2e-4, 20_000, 32),
                (2, 1e-4, 30_000, 32),
                (2, 5e-5, 40_000, 32),
                (2, 1e-5, 60_000, 32),
                (2, 5e-6, 160_000, 32),
                (2, 3e-6, 320_000, 32),
                (2, 3e-6, 640_000, 32)]
speaker_embedding_size = 192
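One thing worth checking when swapping encoders: the GE2E encoder L2-normalizes its embeddings to unit length, while raw ECAPA-TDNN outputs are generally not unit-norm, which changes the scale of the conditioning the synthesizer sees. A minimal sketch of normalizing an embedding before feeding it to the synthesizer (numpy; the 192-dim random vector is a stand-in for a real ECAPA-TDNN output):

```python
import numpy as np

def normalize_embedding(embed: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """L2-normalize a speaker embedding to unit length, matching GE2E's output convention."""
    return embed / (np.linalg.norm(embed) + eps)

# A raw ECAPA-TDNN embedding typically has arbitrary scale; simulate one here.
raw_embed = np.random.randn(192).astype(np.float32) * 7.0
embed = normalize_embedding(raw_embed)
print(np.linalg.norm(embed))  # ~1.0
```

If the synthesizer was tuned against unit-norm embeddings, feeding it unnormalized vectors of much larger magnitude is one plausible source of training instability.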

However, after training Tacotron for 200k steps, the loss is 0.53 but the attention plot is blank. The mel output saved every 500 steps is similar to the ground truth, yet speech synthesized with the 200k checkpoint (.pt file) is poor, although it does sound similar to the target speaker. It is very strange.

Has anyone met the same problem? Or do I need to change other parameters when I change the dimension of the speaker embedding?
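As a quick numeric check on a "blank" alignment: a healthy Tacotron attention matrix is sharply diagonal, while a failed one is close to uniform. A small sketch scoring this (numpy; the matrix shapes and the metric itself are my own illustration, assuming attention of shape decoder steps × encoder steps):

```python
import numpy as np

def attention_sharpness(attn: np.ndarray) -> float:
    """Mean of the per-decoder-step maximum attention weight.
    ~1.0 for a sharp diagonal alignment; ~1/num_encoder_steps for a uniform (blank) one."""
    return float(attn.max(axis=1).mean())

T_dec, T_enc = 400, 120

# Uniform (failed) attention: every decoder step attends everywhere equally.
blank = np.full((T_dec, T_enc), 1.0 / T_enc)

# Idealized diagonal attention: each decoder step attends to one encoder step.
diag = np.zeros((T_dec, T_enc))
diag[np.arange(T_dec), np.arange(T_dec) * T_enc // T_dec] = 1.0

print(attention_sharpness(blank))  # ≈ 0.0083 (1/120)
print(attention_sharpness(diag))   # 1.0
```

Logging a score like this during training makes it easy to see whether attention ever starts to form, or collapses to uniform from the beginning.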

owfi6suc 1#

Before jumping to Taco2, how's your result regarding w/ training the encoder?

wqsoz72f 2#

> Before jumping to Taco2, how's your result regarding w/ training the encoder?

The speaker encoder I trained was tested on the VoxCeleb2 (vox2) test set and achieved an EER of about 0.6.
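For reference, EER (equal error rate) is the operating point where the false-acceptance rate equals the false-rejection rate over verification trials. A minimal sketch of computing it from a set of trial scores (numpy; the score distributions here are synthetic, not vox2 results):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Equal Error Rate: sweep thresholds over the scores and return the point
    where false-acceptance rate (FAR) and false-rejection rate (FRR) cross."""
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    best_eer, best_gap = 1.0, np.inf
    for t in np.sort(np.unique(scores)):
        accepted = scores >= t
        far = np.sum(accepted & (labels == 0)) / n_neg   # impostor trials accepted
        frr = np.sum(~accepted & (labels == 1)) / n_pos  # target trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return float(best_eer)

# Synthetic, well-separated scores: target trials score high, impostor trials low.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.8, 0.05, 500), rng.normal(0.2, 0.05, 500)])
labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
print(compute_eer(scores, labels))
```

Note that EER is usually reported as a percentage, so it is worth confirming whether "0.6" here means 0.6% (strong, near-SOTA for ECAPA-TDNN) or a fraction of 0.6 (which would indicate a broken encoder).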
