Classification of credit scoring data with privacy constraints

Abstract

Modern data collections create vast opportunities for detecting useful hidden relationships. Increasingly, they also fuel data privacy concerns. A trade-off between privacy protection and data usefulness is by now widely acknowledged. Real-world data classification tasks, such as credit scoring applications, have to deal with such data security limitations by finding a way to incorporate privacy-preserving procedures effectively. To this end, we propose, as a first stage, a microaggregation procedure that anonymizes the personal feature information of credit clients. In a second stage, we examine the performance of support vector machines (SVM) on the anonymized data. SVM are powerful and robust machine learning methods that show superior credit scoring classification performance when applied to original, non-anonymized data. We first partition the original credit scoring data set and construct anonymized data representatives, which are then used to build credit client behavior forecasting models with SVM and other comparable learning methods. The validation procedure for these models is adapted to the two-stage modeling approach. To assess the loss owing to data anonymization, the classification models are evaluated against models trained on the original data.
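
To make the first stage concrete, the following is a minimal sketch of one simple microaggregation heuristic. It is an illustration under stated assumptions, not necessarily the procedure used in the paper: the abstract does not name a specific algorithm, so the fixed-size grouping along a one-dimensional projection (here, the sum of the features), the function name `microaggregate`, and the toy client records are all hypothetical. Each record is replaced by the centroid of a group of at least k records, which plays the role of the "anonymized data representative"; class labels and the second-stage classifier are omitted.

```python
from statistics import mean


def microaggregate(records, k):
    """Replace each record by the centroid of a group of at least k records.

    Hypothetical heuristic: records are ordered by the sum of their features
    and cut into consecutive groups of size k (the last group takes the rest).
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")

    n_features = len(records[0])
    # Indices of the records, ordered by a simple one-dimensional projection.
    order = sorted(range(len(records)), key=lambda i: sum(records[i]))
    n_groups = len(records) // k

    anonymized = [None] * len(records)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs the remainder, so it holds k to 2k-1 records.
        end = len(records) if g == n_groups - 1 else start + k
        members = order[start:end]
        centroid = tuple(
            mean(records[i][j] for i in members) for j in range(n_features)
        )
        for i in members:
            anonymized[i] = centroid
    return anonymized


# Made-up client records: (age in years, income in thousands).
clients = [(25, 30.0), (47, 52.0), (31, 35.0), (52, 61.0), (38, 40.0)]
anonymized = microaggregate(clients, k=2)
for original, representative in zip(clients, anonymized):
    print(original, "->", representative)
```

In a two-stage setup of the kind described above, a classifier such as an SVM would then be trained on the representatives rather than on the original records, and its performance compared with a model trained on the original data.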

Keywords

Data privacy, microaggregation, credit scoring, support vector machines
