SafeEar: Content Privacy-Preserving Audio Deepfake Detection

(ACM CCS 2024)

Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
Zhejiang University, Tsinghua University


Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, the audio deepfake, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets.


In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to build a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments on four benchmark datasets demonstrate SafeEar’s effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, the benchmark we construct for anti-deepfake and anti-content-recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
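The two metrics quoted above, EER and WER, can be illustrated with a minimal sketch. This is not the evaluation code behind the reported numbers. It assumes that a higher detector score means "more likely bonafide", that the EER is approximated by sweeping the observed scores as thresholds, and that words are split on whitespace. The function names `equal_error_rate` and `word_error_rate` are hypothetical.

```python
from typing import Sequence


def equal_error_rate(bonafide_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> float:
    """Approximate the EER: the operating point where the false acceptance
    rate (spoof accepted) equals the false rejection rate (bonafide rejected).
    Assumption: a higher score means 'more likely bonafide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Bonafide samples scored below the threshold are falsely rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # Spoofed samples scored at or above the threshold are falsely accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.
    The result can exceed 1.0 when the hypothesis has many insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under these definitions, a low EER means the detector separates bonafide from spoofed audio well, while a high WER for a content recovery adversary means its transcript shares few words with the ground truth.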

Our Pipeline

Part 1. Multilingual Content Protection

In this part, we demonstrate SafeEar's multilingual content protection against content recovery adversaries (CRA) through the audio samples below: SafeEar (Shuffled Acoustic Tokens) against CRA1 and CRA2, and SafeEar* (Acoustic Tokens) against CRA3. The samples also demonstrate SafeEar's effectiveness in decoupling semantic and acoustic information, by decomposing the Original Audio into Semantic Tokens and SafeEar* (Acoustic Tokens). While original audio samples from different speakers in various languages sound distinct, they sound identical at the Semantic Tokens level. In other words, SafeEar preserves the semantic information across different speakers and languages in VQ1, devoid of acoustic details.
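As a rough illustration of what "Shuffled Acoustic Tokens" could mean in practice, the sketch below permutes token frames inside consecutive fixed-size windows. This page does not describe how the shuffling is actually performed, so the window-based scheme, the `window` parameter, and the function name `shuffle_frames` are all assumptions made for illustration. The presumed intent is that each frame keeps its acoustic content while the temporal order needed to recover words is disrupted.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def shuffle_frames(frames: Sequence[T], window: int,
                   seed: Optional[int] = None) -> List[T]:
    """Permute frames independently inside consecutive windows.

    Hypothetical scheme: frames never leave their own window, so coarse
    timing is kept while the fine-grained order within a window is lost.
    """
    if window < 1:
        raise ValueError("window must be a positive number of frames")
    rng = random.Random(seed)
    shuffled: List[T] = []
    for start in range(0, len(frames), window):
        chunk = list(frames[start:start + window])
        rng.shuffle(chunk)  # in-place permutation of this window only
        shuffled.extend(chunk)
    return shuffled


# Usage: ten placeholder token frames, shuffled in windows of four frames.
tokens = list(range(10))
protected = shuffle_frames(tokens, window=4, seed=0)
```

Every frame still appears exactly once in the output, so frame-level statistics are unchanged. Only the order inside each window differs.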

English Language Samples

SafeEar (Shuffled Acoustic Tokens) | SafeEar* (Acoustic Tokens) | Semantic Tokens | Original

Chinese Language Samples

SafeEar (Shuffled Acoustic Tokens) | SafeEar* (Acoustic Tokens) | Semantic Tokens | Original

German Language Samples

SafeEar (Shuffled Acoustic Tokens) | SafeEar* (Acoustic Tokens) | Semantic Tokens | Original

French Language Samples

SafeEar (Shuffled Acoustic Tokens) | SafeEar* (Acoustic Tokens) | Semantic Tokens | Original

Italian Language Samples

SafeEar (Shuffled Acoustic Tokens) | SafeEar* (Acoustic Tokens) | Semantic Tokens | Original

Part 2. Multilingual CVoiceFake Dataset

CVoiceFake is a comprehensive multilingual audio deepfake dataset aimed at advancing cross-language deepfake detection research. It includes English, Chinese, German, French, and Italian audio samples sourced from CommonVoice and is accompanied by ground-truth transcriptions, making it valuable for both deepfake detection and content protection studies. In line with the construction of the English-based WaveFake [1], ASVspoof 2019 [2], and ASVspoof 2021 [3], we employ five representative synthesis methods (Parallel WaveGAN, Multi-band MelGAN, Style MelGAN, Griffin-Lim, and WORLD) so that CVoiceFake closely mimics real-world deepfake strategies. These methods were selected for their ability to replicate original bonafide audio with minimal perceptible differences, resulting in highly convincing deepfakes. Each method contributes unique qualities to the dataset, ranging from high-fidelity audio production and stable multilingual training to fine control over speech nuances, thus addressing different aspects of deepfake generation and detection.

[1] Frank, Joel, and Lea Schönherr. "WaveFake: A data set to facilitate audio deepfake detection." arXiv preprint arXiv:2111.02813 (2021).
[2] Wang, Xin, et al. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language 64 (2020): 101114.
[3] Yamagishi, Junichi, et al. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection." arXiv preprint arXiv:2109.00537 (2021).

English Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech | Parallel WaveGAN | Multi-band MelGAN | Style MelGAN | Griffin-Lim | WORLD | Transcriptions

Durell was formerly a leading video games developer.

She searched until she found him.

See close back compressed vowel.

Chinese Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech | Parallel WaveGAN | Multi-band MelGAN | Style MelGAN | Griffin-Lim | WORLD | Transcriptions




German Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech | Parallel WaveGAN | Multi-band MelGAN | Style MelGAN | Griffin-Lim | WORLD | Transcriptions

Ihr Name leitet sich von Alberto Barton, dem Entdecker des Erregers, ab.

Die Rippen sind mit den Seiten der Rumpfwirbel verbunden.

Willst du auch ein Spiegelei haben?

French Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech | Parallel WaveGAN | Multi-band MelGAN | Style MelGAN | Griffin-Lim | WORLD | Transcriptions

Puisqu'il ne pleuvait pas nous n'étions pas allés au cinéma.

Route de Cassejoie, Faverges-de-la-Tour

On entend un branle-bas sourd dans la chambre.

Italian Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech | Parallel WaveGAN | Multi-band MelGAN | Style MelGAN | Griffin-Lim | WORLD | Transcriptions

Il peso della pubblicità indiretta è sempre maggiore anche in Italia.

Ma non è un sogno incoerente.

In seguito scrisse altri film.