Abstract
Recent work on adversarial perturbation has shown its efficacy for voice privacy protection. This paper further explores the impact of speaker adversarial perturbation on downstream automatic speech recognition (ASR). Specifically, the perturbation is generated by attacking a speaker embedding extractor in an untargeted manner and is added to the original speech to produce the adversarial version. Additionally, we examine the efficacy of incorporating supervision from an ASR model into the perturbation generation process. Experiments were conducted on the LibriSpeech dataset with two ASR models of different robustness levels. First, the results showed a decline in ASR performance caused by the speaker adversarial perturbation, indicating that the perturbation negatively affects speech recognition. With supervision from the ASR model during perturbation generation, this impact on speech recognition could be mitigated. Moreover, the ASR model with the lower robustness level provided a better constraint for generating perturbations than the one with the higher robustness level.
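The perturbation procedure summarized above can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: random linear maps `W` and `A` stand in for the speaker embedding extractor and the ASR model, the untargeted objective is the cosine similarity to the original embedding, the ASR supervision is a squared penalty on the change in the ASR stand-in's output, and the perturbation is bounded in the L-infinity norm and optimized with signed gradient steps. The function names and hyperparameters (`perturb`, `lam`, `eps`, `step`) are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_grad(W, x_adv, e_ref):
    """Gradient of cos(W @ x_adv, e_ref) with respect to x_adv."""
    e = W @ x_adv
    ne, nr = np.linalg.norm(e), np.linalg.norm(e_ref)
    c = e @ e_ref / (ne * nr)
    grad_e = e_ref / (ne * nr) - c * e / ne**2
    return W.T @ grad_e

def perturb(x, W, A=None, lam=0.0, eps=0.05, step=0.005, iters=200, seed=0):
    """Untargeted attack on a toy speaker extractor W (assumed linear).

    Minimizes cos(W(x + delta), W x), optionally plus lam * ||A delta||^2,
    where A is a toy stand-in for an ASR model, subject to |delta| <= eps.
    """
    rng = np.random.default_rng(seed)
    e_ref = W @ x
    # Random start: the cosine gradient vanishes exactly at delta = 0.
    delta = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        g = cosine_grad(W, x + delta, e_ref)
        if A is not None:
            # Gradient of the ASR-consistency penalty lam * ||A delta||^2.
            g = g + lam * 2.0 * (A.T @ (A @ delta))
        # Signed descent step, then projection onto the L-infinity ball.
        delta = np.clip(delta - step * np.sign(g), -eps, eps)
    return x + delta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal(64)        # toy "speech" vector
    W = rng.standard_normal((16, 64))  # toy speaker embedding extractor
    A = rng.standard_normal((8, 64))   # toy ASR model
    x_plain = perturb(x, W)                # speaker attack only
    x_sup = perturb(x, W, A=A, lam=1.0)    # with ASR supervision
    print("speaker cosine, no ASR term:", cosine(W @ x_plain, W @ x))
    print("speaker cosine, ASR term:   ", cosine(W @ x_sup, W @ x))
    print("ASR output change, no ASR term:", np.linalg.norm(A @ (x_plain - x)))
    print("ASR output change, ASR term:   ", np.linalg.norm(A @ (x_sup - x)))
```

In this sketch, calling `perturb` without `A` corresponds loosely to a speaker-only attack, and passing `A` with a positive `lam` corresponds loosely to an ASR-supervised one. The real systems use neural speaker and ASR models, so the sketch conveys only the structure of the objective.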
Samples' Information
Here, we showcase the adversarial utterances generated by the basic adversarial model B, the adversarial model C supervised by ASR^C, and the adversarial model A supervised by ASR^A. We randomly selected one sentence from each of 3 male and 3 female speakers in the LibriSpeech test-clean set and used these original samples to generate the adversarial samples. In this context, 'original' denotes the original speech, and 'adv^B', 'adv^C', and 'adv^A' denote the adversarial speech generated by models B, C, and A, respectively.