SafeEar: Content Privacy-Preserving Audio Deepfake Detection

(ACM CCS 2024)

Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
Zhejiang University, Tsinghua University

Abstract

Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets.

In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar’s effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. Simultaneously, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as shown by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content recovery evaluation provides a basis for future research in audio privacy preservation and deepfake detection.
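For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be computed from detector scores. It is not SafeEar's evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector where higher = more bonafide.

    Sweeps every observed score as a candidate threshold and keeps the one
    where the false acceptance rate (FAR) and false rejection rate (FRR)
    are closest; the EER is reported as their average at that point.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # FRR: bonafide samples wrongly rejected (scored below the threshold).
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # FAR: spoofed samples wrongly accepted (scored at or above it).
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): one bonafide and one spoofed sample overlap.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoofed = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(bonafide, spoofed)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means the detector separates bonafide and spoofed audio more cleanly; an EER of 0 corresponds to a threshold that makes no errors of either kind.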

Our Pipeline

Part 1. Multilingual content protection

In this part, we demonstrate SafeEar's multilingual content protection against content recovery adversaries (CRA) with the audio samples below: SafeEar (Shuffled Acoustic Tokens) against CRA1 and CRA2, and SafeEar* (Acoustic Tokens) against CRA3. The samples also demonstrate SafeEar's effectiveness in semantic-acoustic information decoupling, since each Original Audio is decomposed into Semantic Tokens and SafeEar* (Acoustic Tokens). While original audio samples from different speakers in various languages sound distinct, they sound identical at the Semantic Tokens level. In other words, SafeEar preserves the semantic information across different speakers and languages in VQ1, devoid of acoustic details.
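This page does not spell out how the acoustic tokens are shuffled. As a rough illustration only, the sketch below assumes that token frames are permuted within fixed-size windows, which scrambles the temporal order within each window while leaving each window's set of tokens unchanged. The function name `shuffle_within_windows`, the `window` parameter, and the integer "frames" are hypothetical and not taken from SafeEar.

```python
import random


def shuffle_within_windows(frames, window, seed=None):
    """Permute frames inside consecutive windows of length `window`.

    `frames` is any sequence of per-frame acoustic tokens. The input is
    not modified; a new list is returned. The last window may be shorter.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(frames), window):
        chunk = list(frames[start:start + window])
        rng.shuffle(chunk)  # reorder frames inside this window only
        shuffled.extend(chunk)
    return shuffled


# Hypothetical example: 10 frames, shuffled in windows of 4 frames.
tokens = list(range(10))
print(shuffle_within_windows(tokens, window=4, seed=0))
```

Under this assumption, every frame stays inside its original window, so the overall composition of the token sequence is kept while the frame order a speech recognizer would depend on is disrupted.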

English Language Samples

SafeEar (Shuffled Acoustic Tokens)	SafeEar* (Acoustic Tokens)	Semantic Tokens	Original

Chinese Language Samples

SafeEar (Shuffled Acoustic Tokens)	SafeEar* (Acoustic Tokens)	Semantic Tokens	Original

German Language Samples

SafeEar (Shuffled Acoustic Tokens)	SafeEar* (Acoustic Tokens)	Semantic Tokens	Original

French Language Samples

SafeEar (Shuffled Acoustic Tokens)	SafeEar* (Acoustic Tokens)	Semantic Tokens	Original

Italian Language Samples

SafeEar (Shuffled Acoustic Tokens)	SafeEar* (Acoustic Tokens)	Semantic Tokens	Original

Part 2. Multilingual CVoiceFake Dataset

CVoiceFake is a comprehensive multilingual audio deepfake dataset aimed at advancing cross-language deepfake detection research. It includes English, Chinese, German, French, and Italian audio samples sourced from CommonVoice and is accompanied by ground-truth transcriptions, making it valuable for both deepfake detection and content protection studies. In line with the construction of the English-based WaveFake [1], ASVspoof 2019 [2], and ASVspoof 2021 [3], we employ five representative synthesis methods (Parallel WaveGAN, Multi-band MelGAN, Style MelGAN, Griffin-Lim, and WORLD) so that CVoiceFake closely mimics real-world deepfake strategies. These methods were selected for their ability to replicate original bonafide audio with minimal perceptible differences, resulting in highly convincing deepfakes. Each method contributes unique qualities to the dataset, ranging from high-fidelity audio production and stable multilingual training to fine control over speech nuances, thus addressing different aspects of deepfake generation and detection.

[1] Frank, Joel, and Lea Schönherr. "WaveFake: A data set to facilitate audio deepfake detection." arXiv preprint arXiv:2111.02813 (2021).
[2] Wang, Xin, et al. "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech." Computer Speech & Language 64 (2020): 101114.
[3] Yamagishi, Junichi, et al. "ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection." arXiv preprint arXiv:2109.00537 (2021).
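The ground-truth transcriptions are what make content protection measurable: an adversary's recovered text can be scored against them with the word error rate (WER). The sketch below is a minimal illustration of that metric, assuming whitespace tokenization and exact word matching; it is not the evaluation code behind this page, and the function name `word_error_rate` and the example hypothesis are hypothetical. For languages written without spaces, such as Chinese, scoring is usually done at the character level instead.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Counts substitutions, insertions, and deletions with a standard
    Levenshtein dynamic program over whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Reference taken from the English samples; the hypothesis is made up.
reference = "She searched until she found him."
hypothesis = "She searched until he found him."
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

Because insertions count as errors, WER can exceed 100% when the recovered text is longer than the reference; values close to or above 100% indicate that essentially none of the spoken content was recovered.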

English Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech	Parallel WaveGAN	Multi-band MelGAN	Style MelGAN	Griffin-Lim	WORLD	Transcriptions
						Durell was formerly a leading video games developer.
						She searched until she found him.
						See close back compressed vowel.

Chinese Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech	Parallel WaveGAN	Multi-band MelGAN	Style MelGAN	Griffin-Lim	WORLD	Transcriptions
						其中大多数提到的是真正的人类的语言
						卢瓦尔河畔加奈人口变化图示
						狭长的骨针与每个巨大的壳针相连接

German Language Samples (from Multilingual CVoiceFake Dataset)

Bonafide Speech	Parallel WaveGAN	Multi-band MelGAN	Style MelGAN	Griffin-Lim	WORLD	Transcriptions
						Ihr Name leitet sich von Alberto Barton, dem Entdecker des Erregers, ab.
						Die Rippen sind mit den Seiten der Rumpfwirbel verbunden.
						Willst du auch ein Spiegelei haben?