Differentially private synthetic mixed-type data generation for unsupervised learning

Abstract

We introduce the DP-auto-GAN framework for synthetic data generation, which combines the low-dimensional representation of autoencoders with the flexibility of Generative Adversarial Networks (GANs). The framework takes in raw sensitive data and privately trains a model for generating synthetic data that satisfies statistical properties similar to those of the original data. The learned model can generate an arbitrary amount of synthetic data, which can then be freely shared due to the post-processing guarantee of differential privacy. Our framework is applicable to unlabeled mixed-type data, which may include binary, categorical, and real-valued data. We implement this framework on both binary data (MIMIC-III) and mixed-type data (ADULT), and compare its performance with existing private algorithms on metrics in unsupervised settings. We also introduce a new quantitative metric that detects the diversity, or lack thereof, of synthetic data.

Keywords

Differential privacy, synthetic data generation, generative adversarial networks, mixed-type data

