Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets.
In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar’s effectiveness in detecting various deepfake techniques, with an equal error rate (EER) as low as 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, the benchmark we constructed for anti-deepfake and anti-content recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
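The two metrics quoted above can be illustrated with a minimal sketch in plain Python. This is not SafeEar's evaluation code: the function names, the score convention (a higher score means "more likely bona fide"), and the simple threshold sweep used to approximate the EER are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumption: a higher score means the detector believes the audio is bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

A low EER means the detector separates bona fide from spoofed audio well, while a WER near or above 100% means a recognizer recovers almost none of the reference words, which is the desired outcome for content protection.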
In this part, we demonstrate SafeEar's multilingual content protection against content recovery adversaries (CRAs) through the audio samples below: SafeEar (Shuffled Acoustic Tokens) against CRA1 and CRA2, and SafeEar* (Acoustic Tokens) against CRA3. SafeEar's effectiveness in semantic-acoustic information decoupling is also demonstrated by decomposing the Original Audio into Semantic Tokens and SafeEar* (Acoustic Tokens). Although the original audio samples from different speakers in various languages sound distinct, they sound identical at the Semantic Tokens level. In other words, SafeEar preserves the semantic information across different speakers and languages in VQ1, devoid of acoustic details.
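The text here does not specify how the acoustic tokens are shuffled. As a hedged illustration only, the sketch below permutes token frames inside fixed-size windows, which scrambles local temporal order while keeping each frame's acoustic content intact. The function name `shuffle_within_windows`, the `window` parameter, and the per-window permutation are hypothetical and should not be read as SafeEar's actual procedure.

```python
import random


def shuffle_within_windows(frames, window, seed=None):
    """Return a copy of `frames` with the order permuted inside each window.

    Hypothetical helper: `frames` is any sequence of per-frame acoustic tokens,
    and `window` is the number of consecutive frames shuffled together.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = []
    for start in range(0, len(frames), window):
        chunk = list(frames[start:start + window])
        rng.shuffle(chunk)      # permute frames only within this window
        shuffled.extend(chunk)
    return shuffled
```

Under this assumption, every frame stays inside its original window, so coarse timing is preserved while the frame order a recognizer would rely on is disrupted.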
CVoiceFake is a comprehensive multilingual audio deepfake dataset aimed at advancing cross-language deepfake detection research. It includes English, Chinese, German, French, and Italian audio samples sourced from CommonVoice, together with ground-truth transcriptions, making it valuable for deepfake detection and content protection studies. In line with the construction of the English-based WaveFake [1], ASVspoof 2019 [2], and ASVspoof 2021 [3], we employ five representative synthesis methods (Parallel WaveGAN, Multi-band MelGAN, Style MelGAN, Griffin-Lim, and WORLD) so that CVoiceFake closely mimics real-world deepfake strategies. These methods were selected for their ability to replicate original bonafide audio with minimal perceptible differences, resulting in highly convincing deepfakes. Each method contributes unique qualities to the dataset, ranging from high-fidelity audio production and stable multilingual training to fine control over speech nuances, thus addressing different aspects of deepfake generation and detection.
[1] Frank, Joel, and Lea Schönherr. "WaveFake: A data set to facilitate audio deepfake detection." arXiv preprint arXiv:2111.02813 (2021).
[2] Wang, Xin, et al. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language 64 (2020): 101114.
[3] Yamagishi, Junichi, et al. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection." arXiv preprint arXiv:2109.00537 (2021).