Asynchronous Voice Anonymization by Learning from Speaker-Adversarial Speech
Authors: Rui Wang, Liping Chen, Kong Aik Lee, Zhen-Hua Ling
Abstract: This paper focuses on asynchronous voice anonymization, wherein machine perception of speaker
attributes within a speech utterance is obscured while human perception is preserved. We propose to transfer the
voice protection capabilities of speaker-adversarial speech to speaker embedding, thereby facilitating the
modification of speaker embedding extracted from original speech to generate anonymized speech. Experiments
conducted on the LibriSpeech dataset demonstrated that, compared to the speaker-adversarial utterances, the generated
anonymized speech exhibits improved transferability and voice-protection capability. Furthermore, the proposed
method enhances the human perception preservation capability of anonymized speech within the generative asynchronous
voice anonymization framework.
1. Supplementary materials
1.1 Comparison of Mel-spectrograms from original and speaker-adversarial speech utterances
Fig. 2 illustrates the Mel-spectrograms of an original speech utterance and its speaker-adversarial variants, generated with 5, 10, and
20 iterations, respectively. The comparison reveals that as the number of iterations increases, the perturbation effect on the
Mel-spectrograms becomes more pronounced. This indicates that the perturbation is captured by the Mel-spectrogram, thereby validating the loss function
defined in (3).
Fig. 2: Mel-spectrograms of an original speech utterance and its speaker-adversarial variants, generated
with 5, 10, and 20 iterations
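The speaker-adversarial utterances discussed above are produced by iterative gradient-based attacks such as MI-FGSM. As a rough, generic sketch of such an attack (not the paper's implementation; `grad_fn`, `eps`, `alpha`, and the momentum schedule are illustrative assumptions):

```python
import numpy as np

def mi_fgsm(x, grad_fn, eps=0.01, alpha=0.001, iters=10, mu=1.0):
    """Generic MI-FGSM sketch: accumulate a momentum term over
    L1-normalized gradients and take signed steps, projected onto
    an eps-ball around the original input x."""
    g = np.zeros_like(x)      # momentum accumulator
    x_adv = x.copy()
    for _ in range(iters):
        grad = grad_fn(x_adv)                            # dLoss/dx at x_adv
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x_adv = x_adv + alpha * np.sign(g)               # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)         # stay in eps-ball
    return x_adv
```

With a fixed per-iteration step size, running more iterations accumulates a larger perturbation until the eps-ball boundary is reached, which is consistent with the trend observed in Fig. 2.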
1.2 Supplementary experiments
1.2.1 Effects of iteration number K
Extended experiments were performed to investigate the effect of the iteration number K used in speaker-adversarial speech generation
on the performance of asynchronous voice anonymization, with K set to 10 and 20. Evaluations were conducted in both the ignorant and
informed scenarios, following their respective configurations outlined in Sections V-B-1) and V-B-2) of the manuscript. In addition, the human
perception evaluation was performed following the configuration presented in Section V-C-2) of the manuscript. With these results, Tables
I and II are expanded into Tables V and VI, respectively. The human perception preservation rates are presented in Table VII.
TABLE V: The EERs (%) obtained in the ignorant scenario. The results include those obtained on the original, regenerated, and
speaker-adversarial (Adv) speech using MI-FGSM, NI-FGSM, and GRA methods, as well as the corresponding anonymized speech generated
with our proposed (Pro) method. The evaluations conducted on ECAPA-TDNN (ECAPA), ResNet (Res), and i-vector (ivec) are presented. In
this table, Pro represents the proposed method in which the speaker-adversarial perturbations were generated with 15 iterations,
as configured in the manuscript. Pro(K=10) and Pro(K=20) denote the proposed method in which the speaker-adversarial perturbations were
generated with 10 and 20 iterations, respectively.
MI-FGSM:
| Model | Ori  | Regen | Adv  | Pro  | Pro(K=10) | Pro(K=20) |
|-------|------|-------|------|------|-----------|-----------|
| ECAPA | 0.39 | 2.96  | 7.96 | 3.34 | 2.82      | 3.93      |
| Res   | 0.38 | 3.02  | 1.17 | 3.51 | 2.60      | 4.54      |
| ivec  | 0.66 | 2.55  | 1.48 | 2.89 | 2.30      | 3.19      |

NI-FGSM:
| Model | Ori  | Regen | Adv  | Pro  | Pro(K=10) | Pro(K=20) |
|-------|------|-------|------|------|-----------|-----------|
| ECAPA | 0.39 | 2.96  | 7.20 | 3.92 | 3.33      | 4.57      |
| Res   | 0.38 | 3.02  | 1.85 | 4.23 | 3.22      | 5.36      |
| ivec  | 0.66 | 2.55  | 1.33 | 2.59 | 2.40      | 3.11      |

GRA:
| Model | Ori  | Regen | Adv  | Pro  | Pro(K=10) | Pro(K=20) |
|-------|------|-------|------|------|-----------|-----------|
| ECAPA | 0.39 | 2.96  | 7.46 | 3.76 | 3.16      | 4.44      |
| Res   | 0.38 | 3.02  | 1.78 | 4.17 | 3.10      | 4.69      |
| ivec  | 0.66 | 2.55  | 1.41 | 2.82 | 2.61      | 3.44      |
As presented in Table V, the performance of the proposed method with varying numbers of iterations (K=10 and K=20) demonstrates that
the number of iterations significantly impacts the anonymization effectiveness. For all three evaluation models, higher EERs were obtained
as K increased from 10 to 20. This indicates that increasing the iteration number K achieves stronger obscuration of the
machine-discernible speaker attributes in the ignorant evaluation scenario.
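The EERs reported throughout these tables come from automatic speaker verification trials. As a minimal sketch of how an EER can be computed from verification scores (the score arrays and threshold sweep here are illustrative, not the evaluation toolkit used in the paper):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """EER (%): the operating point where the false-rejection rate of
    target (same-speaker) trials equals the false-acceptance rate of
    nontarget (different-speaker) trials."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)          # sweep thresholds over sorted scores
    labels = labels[order]
    fr = np.cumsum(labels) / labels.sum()                  # miss rate
    fa = 1 - np.cumsum(1 - labels) / (1 - labels).sum()    # false-alarm rate
    idx = np.argmin(np.abs(fr - fa))    # closest crossing point
    return 100 * (fr[idx] + fa[idx]) / 2
```

A higher EER means the verification model finds it harder to match anonymized trials to the original speaker, i.e., stronger voice protection.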
TABLE VI: The EERs (%) on the original, speaker-adversarial, and anonymized speech generated with our proposed method in the informed
scenario. The adversarial utterances (Adv) obtained with MI-FGSM, NI-FGSM, and GRA are presented, together with the corresponding
anonymized utterances generated by the proposed method (Pro). Pro represents the proposed method in which the speaker-adversarial
perturbations were generated with 15 iterations, as configured in the manuscript. Pro(K=10) and Pro(K=20) denote the proposed method
in which the speaker-adversarial perturbations were generated with 10 and 20 iterations, respectively.
Ori: 3.53

| Method  | Adv  | Pro  | Pro(K=10) | Pro(K=20) |
|---------|------|------|-----------|-----------|
| MI-FGSM | 3.83 | 5.73 | 4.83      | 6.69      |
| NI-FGSM | 3.86 | 5.44 | 4.90      | 6.32      |
| GRA     | 4.03 | 5.13 | 5.94      | 6.23      |
From Table VI, it can be observed that higher EERs were obtained as K increased from 10 to 20. This demonstrates that a higher K
achieves stronger obscuration of the machine-discernible speaker attributes in the informed evaluation scenario.
Table VII: Human perception preservation rates (%) on the Regen-Pro, Regen-Pro(K=10), and Regen-Pro(K=20) pairs.

| Regen-Pro | Regen-Pro(K=10) | Regen-Pro(K=20) |
|-----------|-----------------|-----------------|
| 93.50     | 95.50           | 85.00           |
The results in Table VII indicate that as K increases from 10 to 20, the human perception preservation rate declines. This suggests that a
higher K degrades the human perception preservation of the proposed asynchronous voice anonymization method. Considering the
machine perception protection capabilities shown in Tables V and VI, K was set to 15 in our proposed method to achieve a trade-off
between human perception preservation and machine perception protection.
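The preservation rates in Tables VII and VIII are percentages of listening pairs judged to come from the same speaker. A minimal sketch of this computation, assuming each listener judgment is encoded as 1 (same speaker) or 0 (different speaker):

```python
def preservation_rate(judgments):
    """Human perception preservation rate (%): the fraction of evaluated
    utterance pairs that listeners judged as the same speaker.

    judgments: iterable of 1 (same speaker) / 0 (different speaker),
    one entry per evaluated pair."""
    judgments = list(judgments)
    return 100.0 * sum(judgments) / len(judgments)
```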
1.2.2 Subjective evaluations with original speech as reference
Next, subjective evaluations were conducted with the original speech as the reference to measure the human perception preservation capability.
The human perception preservation rates obtained on the Original(Ori)-Regenerated(Regen), Ori-Proposed(Pro), and Regen-Pro pairs
are presented in Table VIII. From the comparison, it can be observed that:
1) The regenerated speech obtained a 76.50% human perception preservation rate when compared with the original speech, indicating the
influence introduced by the YourTTS model. As a result, the regenerated speech was utilized as the reference in our subjective evaluation,
with results presented in Table III (b) of our manuscript.
2) The anonymized speech demonstrated a lower human perception preservation rate when compared to the original speech than when compared to
the regenerated speech, i.e., 54.00% vs. 93.50%.
Table VIII: Human perception preservation rates (%) on the Ori-Regen, Ori-Pro, and Regen-Pro pairs.

| Ori-Regen | Ori-Pro | Regen-Pro |
|-----------|---------|-----------|
| 76.50     | 54.00   | 93.50     |
1.2.3 Application of original speech for perceptual loss calculation in (4)
Finally, experiments were conducted by replacing the regenerated speech with the original speech for the perceptual loss calculation in (4).
Experiments were conducted in both the ignorant and informed scenarios, following their respective configurations
outlined in Sections V-B-1) and V-B-2) of the manuscript. In addition, the human perception evaluation was performed following the
configuration presented in Section V-C-2) of the manuscript. With these results, Tables I and II are expanded into Tables IX and X,
respectively. The human perception preservation rates are presented in Table XI.
TABLE IX: The EERs (%) obtained in the ignorant scenario. The results include those obtained on the original, regenerated, and
speaker-adversarial (Adv) speech using MI-FGSM, NI-FGSM, and GRA methods, as well as the corresponding anonymized speech generated
with our proposed (Pro) method. The evaluations conducted on ECAPA-TDNN (ECAPA), ResNet (Res), and i-vector (ivec) are presented.
Pro(ori) indicates that the original speech was used in the perceptual loss computation.
MI-FGSM:
| Model | Ori  | Regen | Adv  | Pro  | Pro(ori) |
|-------|------|-------|------|------|----------|
| ECAPA | 0.39 | 2.96  | 7.96 | 3.34 | 3.54     |
| Res   | 0.38 | 3.02  | 1.17 | 3.51 | 4.53     |
| ivec  | 0.66 | 2.55  | 1.48 | 2.89 | 2.87     |

NI-FGSM:
| Model | Ori  | Regen | Adv  | Pro  | Pro(ori) |
|-------|------|-------|------|------|----------|
| ECAPA | 0.39 | 2.96  | 7.20 | 3.92 | 4.22     |
| Res   | 0.38 | 3.02  | 1.85 | 4.23 | 5.23     |
| ivec  | 0.66 | 2.55  | 1.33 | 2.59 | 2.90     |

GRA:
| Model | Ori  | Regen | Adv  | Pro  | Pro(ori) |
|-------|------|-------|------|------|----------|
| ECAPA | 0.39 | 2.96  | 7.46 | 3.76 | 4.20     |
| Res   | 0.38 | 3.02  | 1.78 | 4.17 | 4.51     |
| ivec  | 0.66 | 2.55  | 1.41 | 2.82 | 3.20     |
TABLE X: The EERs (%) on the original, speaker-adversarial, and anonymized speech generated with our proposed method in the informed
scenario. The adversarial utterances (Adv) obtained with MI-FGSM, NI-FGSM, and GRA are presented, together with the corresponding
anonymized utterances generated by the proposed method (Pro). Pro(ori) indicates that the original speech was used in the perceptual loss
computation.
Ori: 3.53

| Method  | Adv  | Pro  | Pro(ori) |
|---------|------|------|----------|
| MI-FGSM | 3.83 | 5.73 | 6.40     |
| NI-FGSM | 3.86 | 5.44 | 6.32     |
| GRA     | 4.03 | 5.13 | 6.14     |
Table XI: Human perception preservation rates (%) on the Ori-Pro, Ori-Pro(ori), Regen-Pro, and Regen-Pro(ori) pairs.

| Ori-Pro | Ori-Pro(ori) | Regen-Pro | Regen-Pro(ori) |
|---------|--------------|-----------|----------------|
| 54.00   | 66.50        | 93.50     | 81.00          |
As shown in Tables IX and X, using the original speech in the perceptual loss computation achieved higher EERs in both the ignorant and
informed evaluation scenarios than using the regenerated speech. Furthermore, as shown in Table XI, Pro(ori) achieved a higher human
perception preservation rate than Pro when the original speech was used as the reference. These results demonstrate that the original speech
is applicable for the perceptual loss computation in our proposed strategy of learning the voice protection capability from
speaker-adversarial speech.
However, as indicated by the results presented in Table VIII, the regenerated speech was applied as the reference for the human perception
preservation evaluation to ensure a fair comparison. When the regenerated speech was used as the reference, the anonymized speech
generated with the original speech for the perceptual loss calculation exhibited a lower human perception preservation rate than that
generated with the regenerated speech, at 81.00% versus 93.50%. Thus, to explore the optimal performance of the proposed strategy of
generating asynchronously anonymized speech by learning from speaker-adversarial speech, the regenerated speech was used for the perceptual
loss computation in the manuscript.
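The reference swap studied in this subsection can be sketched as follows. This is a hypothetical illustration, not the exact form of (4) in the manuscript: the L1 Mel-spectrogram distance and the weighting term `lam` are assumptions, and only the choice of `mel_ref` (regenerated vs. original speech) changes between the two configurations compared in Tables IX-XI.

```python
import numpy as np

def perceptual_loss(mel_gen, mel_ref):
    """L1 distance between the Mel-spectrogram of the generated speech
    and that of the reference (regenerated or original) speech."""
    return float(np.mean(np.abs(mel_gen - mel_ref)))

def total_loss(mel_gen, mel_ref, speaker_loss, lam=1.0):
    """Hypothetical weighted sum of a speaker-obscuring term and the
    perceptual term; passing the regenerated or the original
    Mel-spectrogram as mel_ref switches between the Pro and Pro(ori)
    configurations."""
    return speaker_loss + lam * perceptual_loss(mel_gen, mel_ref)
```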