[20250303 Yi Zhu] Generalizing Audio Deepfake Detection via Style-Linguistics Alignment Pretraining
Information on this media
Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized by generative AI models. Existing ADD models struggle to generalize to unseen attacks, with a large performance discrepancy between in-domain and out-of-domain data. In this work, we introduce a new ADD model that explicitly exploits the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on real samples only to learn the style-linguistics dependency in the real class. The learned features are then combined with standard pretrained acoustic features (e.g., Wav2vec) to train a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model's decision.
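To make the two-stage recipe described in the abstract concrete, below is a minimal, hypothetical sketch of how such a pipeline could be wired up in PyTorch. All names (SmallEncoder, SlimStyleClassifier, alignment_loss, mismatch_score), the cosine-similarity alignment objective, and the feature dimensions are illustrative assumptions for exposition, not the authors' actual SLIM implementation.

```python
# Illustrative sketch of a SLIM-style two-stage pipeline (assumptions, not the
# paper's exact architecture): stage 1 aligns style and linguistic embeddings
# of real speech only; stage 2 freezes the encoders and trains a real/fake
# classifier on the aligned features plus pooled acoustic (e.g., Wav2vec) features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallEncoder(nn.Module):
    """Placeholder encoder mapping frame-level features to one utterance embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> frame-wise projection -> mean pool over time
        return self.net(x).mean(dim=1)


def alignment_loss(style_emb: torch.Tensor, ling_emb: torch.Tensor) -> torch.Tensor:
    """Stage 1: pull style and linguistic embeddings of *real* speech together."""
    return 1.0 - F.cosine_similarity(style_emb, ling_emb, dim=-1).mean()


def mismatch_score(style_emb: torch.Tensor, ling_emb: torch.Tensor) -> torch.Tensor:
    """Per-sample style-linguistics mismatch, usable to explain a decision."""
    return 1.0 - F.cosine_similarity(style_emb, ling_emb, dim=-1)


class SlimStyleClassifier(nn.Module):
    """Stage 2: frozen encoders; classifier head on SLIM + acoustic features."""
    def __init__(self, style_enc, ling_enc, acoustic_dim: int, emb_dim: int = 256):
        super().__init__()
        self.style_enc, self.ling_enc = style_enc, ling_enc
        for p in list(self.style_enc.parameters()) + list(self.ling_enc.parameters()):
            p.requires_grad = False  # encoders stay frozen in stage 2
        self.head = nn.Linear(2 * emb_dim + acoustic_dim, 2)  # real vs fake logits

    def forward(self, style_feats, ling_feats, acoustic_feats):
        s = self.style_enc(style_feats)
        l = self.ling_enc(ling_feats)
        a = acoustic_feats.mean(dim=1)  # pooled pretrained acoustic features
        return self.head(torch.cat([s, l, a], dim=-1))


if __name__ == "__main__":
    # Toy shapes only; real inputs would come from SSL front-ends, not random tensors.
    style_enc, ling_enc = SmallEncoder(in_dim=80), SmallEncoder(in_dim=80)

    # Stage 1: self-supervised alignment pretraining on real samples only.
    real_style, real_ling = torch.randn(4, 100, 80), torch.randn(4, 100, 80)
    loss = alignment_loss(style_enc(real_style), ling_enc(real_ling))

    # Stage 2: binary real/fake classification with frozen encoders.
    model = SlimStyleClassifier(style_enc, ling_enc, acoustic_dim=768)
    logits = model(real_style, real_ling, torch.randn(4, 100, 768))

    # Mismatch score offers a per-sample explanation signal (higher = more mismatch).
    scores = mismatch_score(style_enc(real_style), ling_enc(real_ling))
    print(loss.item(), logits.shape, scores.shape)
```

In this sketch, the explanation signal comes directly from the cosine distance between the style and linguistic embeddings, mirroring the abstract's claim that the learned features quantify the (mis)match between style and linguistic content in a sample.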