First, thanks to babysor for this work!
I noticed that the speaker encoder used in this project is GE2E, whose performance falls far behind the state of the art, so I replaced it with an ECAPA-TDNN model. One difference between the two is the embedding dimension: 192 for ECAPA-TDNN versus 256 for GE2E. I changed the speaker embedding size and batch size parameters in hparams.py and followed synthesizer_train.py to train the Tacotron synthesizer. The parameters I used are as follows:
tts_schedule = [(2, 1e-3,  10_000, 32),
                (2, 5e-4,  15_000, 32),
                (2, 2e-4,  20_000, 32),
                (2, 1e-4,  30_000, 32),
                (2, 5e-5,  40_000, 32),
                (2, 1e-5,  60_000, 32),
                (2, 5e-6, 160_000, 32),
                (2, 3e-6, 320_000, 32),
                (2, 3e-6, 640_000, 32)]
speaker_embedding_size = 192
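To make it easier to see which settings are in effect at a given step, here is a minimal sketch that looks up the active stage of the schedule. It assumes each tuple is laid out as (r, lr, max_step, batch_size) and that training uses the first stage whose max_step has not yet been reached; the helper name `active_stage` is hypothetical and not part of the repository.

```python
# Schedule copied from the post.
# Assumed tuple layout: (r, lr, max_step, batch_size).
tts_schedule = [(2, 1e-3, 10_000, 32),
                (2, 5e-4, 15_000, 32),
                (2, 2e-4, 20_000, 32),
                (2, 1e-4, 30_000, 32),
                (2, 5e-5, 40_000, 32),
                (2, 1e-5, 60_000, 32),
                (2, 5e-6, 160_000, 32),
                (2, 3e-6, 320_000, 32),
                (2, 3e-6, 640_000, 32)]


def active_stage(schedule, step):
    """Hypothetical helper: return the first stage whose max_step
    is still ahead of `step`, or None if the schedule is exhausted."""
    for stage in schedule:
        _r, _lr, max_step, _batch_size = stage
        if step < max_step:
            return stage
    return None


stage = active_stage(tts_schedule, 200_000)
if stage is not None:
    r, lr, max_step, batch_size = stage
    print(f"step 200k: r={r}, lr={lr}, batch_size={batch_size}, "
          f"stage ends at step {max_step}")
```

Under that reading of the schedule, step 200k falls in the (2, 3e-6, 320_000, 32) stage, so the learning rate has already decayed to 3e-6 by the time the blank attention plot was observed.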
However, after training Tacotron for 200k steps, the loss is 0.53 but the attention plot is blank. The mel output saved every 500 steps is similar to the ground truth. Speech synthesized with the 200k-step .pt checkpoint is poor, although it does sound like the speaker used for synthesis. This is strange.
Has anyone run into the same problem? Or do I need to change other parameters when I change the speaker embedding dimension?
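One sanity check that could be run on the precomputed embeddings before training is sketched below. It only verifies that a vector has the configured length and finite values, and reports its L2 norm so the scale can be compared against embeddings from the original encoder. The helper name `check_embedding` and the toy vector are assumptions for illustration; they are not taken from the repository or from real ECAPA-TDNN output.

```python
import math

speaker_embedding_size = 192  # value set in hparams.py in this post


def check_embedding(embed, expected_size=speaker_embedding_size):
    """Hypothetical helper: validate one embedding vector and
    return its L2 norm."""
    if len(embed) != expected_size:
        raise ValueError(
            f"expected {expected_size} values, got {len(embed)}")
    if not all(math.isfinite(x) for x in embed):
        raise ValueError("embedding contains NaN or inf")
    return math.sqrt(sum(x * x for x in embed))


# Toy vector standing in for a real speaker embedding (assumption).
toy_embedding = [0.5] * 192
print("L2 norm:", check_embedding(toy_embedding))
```

A length mismatch would point to a leftover 256-dimensional setting somewhere, while a very different norm from the original embeddings might suggest that the new vectors need normalizing before they are fed to the synthesizer.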
2 answers
owfi6suc1#
Before jumping to Taco2, what results did you get from training the encoder?
wqsoz72f2#
The speaker encoder I trained was tested on the vox2 test set and achieved an EER of about 0.6.