Emotional privacy-preserving of speech based on generative adversarial networks

Abstract

Consumer electronic devices with voice assistants are becoming increasingly popular in modern smart homes. However, uploading unprocessed speech data, which may contain sensitive attributes, directly to a cloud server poses a significant risk to user privacy. To address this issue, this paper proposes PSEGAN, a privacy-enhancing model based on generative adversarial networks (GANs) that protects the emotional content of speech. The model aims to prevent the inference of emotional attributes while maintaining the accuracy and utility of speech features. PSEGAN combines three modules: (1) a pre-trained speaker matcher, which imposes generative constraints on the model during training so that the generated speech retains the information needed for speaker recognition; (2) attribute adversarial networks, which generate perturbed speech that transforms emotional attributes while preserving the utility of the speech; and (3) Gated Recurrent Networks (GRN), which capture both long- and short-term dependencies in speech signals. PSEGAN addresses the utility loss seen in traditional GAN-based speech privacy preservation methods. Experimental results on the RAVDESS dataset show that PSEGAN reduces emotion recognition accuracy by 80.7%, while speaker recognition accuracy decreases by only 1.1%. These findings demonstrate that PSEGAN effectively mitigates the leakage of emotional attributes while maintaining high utility.
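The trade-off described in the abstract, hiding emotion while keeping the speaker recognizable, can be illustrated with a minimal sketch of a combined generator objective. This is not the authors' loss function. The function names, the use of a target emotion label, the cosine distance between speaker embeddings, and the weight `lam` are all assumptions made for illustration only.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the integer class labels.
    n = probs.shape[0]
    return float(-np.log(probs[np.arange(n), labels] + 1e-12).mean())

def cosine_distance(a, b):
    # Mean (1 - cosine similarity) between paired embedding rows.
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12
    return float((1.0 - num / den).mean())

def generator_objective(emotion_logits, target_emotion,
                        spk_emb_generated, spk_emb_original, lam=1.0):
    """Hypothetical combined loss (an assumption, not the paper's formula).

    privacy term: an emotion classifier should read the generated speech
                  as `target_emotion` rather than the true emotion.
    utility term: a speaker-embedding model should give the generated
                  speech an embedding close to that of the original.
    """
    privacy = cross_entropy(softmax(emotion_logits), target_emotion)
    utility = cosine_distance(spk_emb_generated, spk_emb_original)
    return privacy + lam * utility
```

In this sketch, a larger `lam` favors speaker-recognition utility over emotion obfuscation, and a smaller `lam` does the opposite.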

Keywords

Speech emotion privacy; voice assistants; attribute recognition; generative adversarial networks

