Investigation of the machine and human perception inconsistency in speaker embedding for asynchronous voice anonymization

Based on speech generation frameworks that disentangle speaker attributes and represent them with an embedding, asynchronous voice anonymization can be accomplished by modifying the speaker embedding derived from the original speech. However, the inconsistency between machine and human perceptions of the speaker attribute within the speaker embedding remains unexplored, limiting performance in asynchronous voice anonymization. To this end, this study investigates this inconsistency within the subspaces of the speaker embedding. Experiments conducted on the FACodec and Diff-HierVC speech generation models revealed a subspace whose removal alters machine perception of the speaker attribute while preserving its human perception. Building on these findings, an asynchronous voice anonymization method is developed, achieving a 100% human perception preservation rate while modifying the machine-discernible speaker attribute.
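The core operation described above, removing an identified subspace from the speaker embedding, can be sketched as an orthogonal projection. This is a minimal illustration, not the paper's implementation: the embedding dimension, the subspace dimension, and the orthonormal `basis` matrix are all hypothetical stand-ins for the subspace the experiments identify.

```python
import numpy as np

def remove_subspace(embedding: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project `embedding` onto the orthogonal complement of span(basis).

    `basis` is assumed to have orthonormal columns spanning the subspace
    whose removal changes the machine-perceived speaker attribute while
    preserving human perception.
    """
    return embedding - basis @ (basis.T @ embedding)

rng = np.random.default_rng(0)
emb = rng.standard_normal(256)  # stand-in for a speaker embedding
# Hypothetical 8-dimensional subspace with an orthonormal basis via QR.
basis = np.linalg.qr(rng.standard_normal((256, 8)))[0]

anon = remove_subspace(emb, basis)
# No component of the anonymized embedding remains in the removed subspace.
assert np.allclose(basis.T @ anon, 0.0, atol=1e-10)
```

The modified embedding would then be fed back through the generation model (e.g., FACodec or Diff-HierVC) to synthesize the anonymized speech.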

Audio samples

For each of the following utterances, the demo provides the original speech, the regenerated speech, and anonymized variants obtained by modifying the primary (scaling values 20, 50, 80), secondary (45, 90, 110), and residual (10, 20, 40) subspaces of the speaker embedding:

2428-83699-0002.wav
5694-64029-0028.wav
84-121123-0000.wav
6319-64726-0010.wav

[Audio players not reproduced in this text version.]