Abstract
Microaggregation is an effective data-driven protection method that permits us to achieve a good trade-off between disclosure risk and information loss. In this work we propose a method for microaggregation based on fuzzy c-means, that is appropriate when there are constraints (linear constraints) on the variables that describe the data. Our method leads to results that satisfy these constraints even when the data to be masked do not satisfy them.
Keywords
Introduction
Microaggregation is an effective masking method for statistical disclosure control. Given a dataset X it permits to build a masked data set X′ = ρ (X) with a good compromise between disclosure risk and information loss. Roughly speaking, microaggregation builds small clusters and replaces the original values by the centers of the clusters. In order to avoid disclosure, all clusters must have at least a predefined number of records. Then, in order to have an acceptable information loss it is usual to define a partition of the variables and microaggregate each variable independently.
The literature on microaggregation is vast. Microaggregation was proposed in [4]. Then, [5] defined an heuristic method, and [30] defined an approach for categorical data. Hansen and Mukherjee [9] defined a polynomial algorithm for univariate microaggregation. [21] proved that the problem is NP for multivariate microaggregation. Since then, different methods have been proposed to improve the effectiveness. See e.g. [13, 26]. The problem of selecting a good partition of the variables is discussed in [19]. Microaggregation to achieving k-anonymity [27] is discussed in [7].
The transparency principle (see e.g., [35]) establishes that when data is published we need to include information on all processes applied to the data. This naturally includes how data has been protected. When users know this information, they can use it to compute statistics from the data with better accuracy. Nevertheless, intruders can use this information to attack the data. Transparency attacks for microaggregation have been studied in [20, 37]. It has been proven that very effective attacks can be built for multivariate microaggregation.
In order to define a microaggregation method resistant to transparency attacks, fuzzy microaggregation was introduced in [6]. The method builds first a fuzzy partition where all clusters had at least k-elements and then data is replaced by cluster centers. As in fuzzy clustering elements can belong to different clusters, assignment to a cluster center was done at random. In this way, given a record intruders do not know for sure which cluster has been used in the replacement. This increases intruders uncertainty and decreases disclosure risk. Another fuzzy clustering based approach was proposed in [12] in terms of an optimization problem in line with [8]. In recent papers [33, 34] a simpler algorithm for fuzzy microaggregation was introduced.
Edit constraints correspond to restrictions established on the metadata, or the schema, of the database. They establish how some variables relate to each other. For example, we may have three variables in the database net, tax, gross and a constraint that specifies net + tax = gross. It is usual that data is edited before publication so that it satisfies the edit constraints. Nevertheless, when data is protected by means of a masking method it is possible that the resulting masked file violates the constraints. In this case the file has to be edited again but it may be not so easy this time. Note that if the masked file is such that means are the same as in the original file, and we have added random noise that has caused some ages to be negative, replacing these negatives values by zero would change the mean. The combination of edit constraints and masking methods has been studied by Shlomo and De Waal [28, 29], by Torra et al. [2, 31], and later by Kim et al. [10, 11].
In this paper we study the problem of defining a method for microaggregation that, following [34] is resitant to transparency attacks and at the same time is able to deal with edit constraints. The approach is based on fuzzy c-means. We introduce here two variations of this algorithm to deal with linear constraints. The algorithms are able to build a masked data set that satisfies the linear constraints even in the case that the original data does not satisfy them. Note that this is an important property, as we can then combine in a single step effectively data edition and data masking.
The structure of the paper is as follows. In Section 2 we review clustering and fuzzy c-means. In Section 3, we introduce two methods for achieving constrained fuzzy c-means and use them to define constrained microaggregation. We study the properties of the approaches. In Section 4 we give an example. The paper finishes with some conclusions and lines for future research.
Preliminaries
This section is divided in three parts and describes three topics that are needed later. We begin with a review of fuzzy clustering, which is used in our approach. Then we review how fuzzy clustering is used in microaggregation. Finally, we make a short summary of edit constraints and, more particularly, on linear constraints.
Fuzzy clustering
Clustering is an approach in statistics and machine learning (more specifically, in unsupervised machine learning) to extract relevant structures and patterns from data. There is a large number of methods for clustering that depend on different assumptions on the underlying model of the data and the type of structure built from the data (i.e., dendrograms, partitions, fuzzy partitions,...). In this paper, following [34] we will use a fuzzy clustering algorithm.
Fuzzy clustering algorithms [1, 15] typically build a fuzzy partition from the data. Recall that in a fuzzy set, membership to the set is partial instead of Boolean. This is usually represented by a membership function over the reference set u : X → [0, 1] where u (x) =0 means no membership and u (x) =1 means total membership. Then, fuzzy clustering builds partitions in which elements may belong to more than one cluster with non-null membership. We will use fuzzy partitions in the sense of Ruspini [1, 25] which can be interpreted in terms of probability distributions because memberships of an element to all clusters add to one (this is not necessarily the case in other types of fuzzy partitions).
Fuzzy c-means [1] is one of the most well known algorithms of this type. This method is defined in terms of an objective function to be minimized. It is a generalization of k-means resulting into a fuzzy partition instead of a crisp (standard) one. This is achieved by means of replacing the objective function of k-means by another one where memberships are involved, and including a parameter m that controls the fuzziness of the solution. Entropy-based fuzzy c-means is another fuzzy clustering method. Fuzziness is achieved by means of requiring membership functions to optimize the entropy. In this case, a parameter λ is used to control the degree of fuzziness.
The formulation of fuzzy c-means [1] follows. In this formulation x
k
for k = 1, …, N represent the records to be clustered, u
ki
the membership degree of record k to the ith cluster, and v
i
represents the cluster center of the ith cluster. Fuzzy c-means finds u
ki
and v
i
given x
k
and the parameter m solving the following optimization problem.
In this definition, m is such that m ≥ 1. When m = 1 the problem is equivalent to the one of k-means and solutions are crisp (i.e., elements x k are only assigned to one cluster and u ki ∈ {0, 1} instead of being in [0,1]). In contrast when m is large (e.g. m > 2.5) solutions tend to be extremely fuzzy and membership values u ik tend to be equal to 1/c.
Fuzzy c-means is usually solved using an iterative algorithm that interleaves two optimization problems. One that optimizes u given cluster centers v
i
, and another that optimizes v given memberships u
ki
. That is, the following algorithm is used to find the membership functions u
ki
and the cluster centers v: Generate c initial values for cluster centers v = (v1, …, v
c
) Calculate
Calculate
Iterate the last two steps until convergence.
The expressions for computing
The solution of
The solution of
This iterative approach converges to a local optima. It can be proven that the local optima obtained when m is large satisfies the following properties.
memberships u
ij
= 1/c for all objects x
j
∈ X and clusters j, and
cluster centers
There have been several extensions of this method as well as alternative approaches for defining fuzzy clustering. One of them is entropy-based fuzzy c-means (EFCM) [16], which is defined in a way similar to the one of fuzzy c-means. The function to be minimized is a regularized version of ∑∑u
ki
||x
k
- v
i
||2. Instead of adding m as in fuzzy c-means, a term based on the entropy is introduced in the optimization problem. This term is combined using a parameter λ. This parameter λ ≥ 0 is used to control the fuzziness of the solution. Formally, the entropy-based fuzzy c-means is defined in terms of the following optimization problem.
This objective function is subject to the constraints u
ki
∈ [0, 1] and
EFCM is also solved using an iterative algorithm. The expressions for u and v are as follows (see [16] for details):
The parameter λ plays a role similar to m in fuzzy c-means. Here, the smaller the λ, the fuzzier the solution.
memberships u
ij
= 1/c for all objects x
j
∈ X and clusters j, and
cluster centers
When λ tends to infinity, the second term becomes negligible and the algorithm yields to a crisp solution.
FCM and EFCM lead to different clusters for the same data. An important difference is about the membership values. For example, the centroids have a membership equal to one in the FCM while in the EFCM a lower membership is possible.
Microaggregation is about building small clusters (with at least k records) and replace each record in the set by the cluster center. When the whole set is microaggregated at once (considering all the variables at the same time), the resulting file is such that there are sets of at least k indistinguishable records. Note that all records that belong to the same cluster are replaced by the same cluster center.
As processing the file in this way causes a high information loss, it is usual to consider a partition of the variables, and apply microaggregation to each subset of variables. Proceeding in this way, records that are indistinguishable with respect to one set of variables (because they belong to the same cluster for these variables) may be distinguishable with respect to another set (because they belong to different clusters for these other variables). This causes that the protected file has no longer sets of k indistinguishable records.
Nevertheless, this approach can be attacked effectively (see e.g. [20]. Any intruder can exploit the fact that records are usually masked replacing values by cluster centers that are near. Attacks are specially effective for optimal univariate microaggregation as in this case we know with certainty which clusters can be used for replacement.
To avoid this type of (transparency) attacks, we proposed in [34] an approach for microaggregation based on fuzzy microaggregation. Algorithm 1 reproduces this method. The approach is based on computing fuzzy clusters and then replacing values by cluster centers using the membership values as a probability distribution. In this way, intruders cannot know which clusters have been used. The algorithm uses two parameters m1 and m2 in relation to fuzzy c-means. The first one is to compute the fuzzy clusters, and the second one to build the probability distribution.
It can be proven that the larger the m1, the larger the information loss. Similarly, the larger the m2, the larger the information loss. In addition, we can also prove that for large values of m2, and with c = |X|/k, the expected size of all clusters is k. Therefore, for large values of m2 when microaggregation is applied to the whole file, we have probabilistic k-anonymity.
Algorithm 1. Fuzzy Microaggregation with parameters c, m1, and m2
Constraints and linear constraints
A problem so-far not much considered in the literature in data privacy is the one of data with constraints. Up to our knowledge, only the three groups mentioned in the introduction (Shlomo and De Waal, Kim et al., and ourselves) have considered this problem.
With data with constraints we refer to the fact that some databases include metadata expressing that some of the variables in the data sets should satisfy some constraints. Some of these constraints are enforced when data is introduced, and others are checked once all the data are available.
When data are processed for ensuring privacy, data is modified, and if such constraints are not taken into consideration, incompatible data are generated. The usual approach is to proceed in this way: (i) protect the dataset ignoring the constraints, then (ii) check the constraints, and finally (iii) if constraints are violated, data is modified again so that it is compliant with the constraints.
Such approach can lead to inconsistences or larger information loss than necessary. Recall the example in the introduction. Protection with random noise is known to keep means. However, if variables are required to be positive, just transforming to zero all negative values can cause a rellevant positive bias to the data. In addition, when data is first edited, and then protected we modify the original data twice. This adds extra noise to the data.
The results introduced in this paper are to edit and protect in a single step. In this way, the noise introduced into the data for data protection is used at the same time to solve the inconsistencies of the data with respect to the constraints.
Constraints on the variables are known by edit constraints in the field of data privacy (in particular, in the subfield of statistical disclosure control). There are different types of constraints. E.g., constraints on the possible values, linear constraints, one variable that governs the values of another (e.g., sex=male, implies number of pregnancies=0). See [23, 31] for details on the classification of the constraints. In this paper, we will consider linear constraints. Note that some other types of equality constraints can be transformed into linear ones. Naturally, we have a linear constraint when a variable can be expressed as a linear combination of a set of other variables. For example, the following relation between family income, person income, and other persons income should hold:
EC-LC: person income + other person income = family income
In general, linear constraints can be expressed in its more general form as V = ∑ s α i V i + A, for some values α i , variables V i , the dependent variable V and a constant A. In this paper, we will rewrite them, equivalently, as ∑ s α i V i = A, or, in a vector form with v = (V1, …, V t ) and α = (α1, …, α t ), as α · v = A. Here · represents the inner product.
Constrained fuzzy clustering for data masking
As our goal is that the masked data satisfies the linear constraints (independently of whether the original data satisfies them or not), we investigate in this section clustering algorithms that lead to cluster centers that satisfy linear constraints.
We use again x
k
for k = 1, …, N to represent the set of elements to be clustered. Each element x
k
belongs to a t dimensional space (i.e.,
As stated above, linear constraints on the cluster centers are represented in terms of a vector α = (α1, …, α
t
) and the inner product ·, by means of the equation α · v
i
= A for all clusters i = 1, …, c. Of course, this means that,
Constrained FCM
We start considering the formalization of fuzzy c-means when we require the cluster centers to satisfy linear constraints. Therefore, using the notation above, the constrained fuzzy c-means with linear constraints corresponds to the following problem.
Note that this problem is the same we saw before for the fuzzy c-means adding linear constraints for each cluster center. To solve this optimization problem we can apply the alternative algorithm problem using new expressions for u and v. The following proposition gives these expressions.
The solution of
The solution of
We now present the proof of this proposition. We considered a simpler version of this problem in [32].
Proof. The solution of this problem is based on the Lagrange multipliers. We consider two sets of Lagrange multipliers. One set are for the constraints
In order to compute an expression for u
ki
, let us now consider the derivative of L with respect to u
ki
. We obtain
Making this expression equal to zero, and assuming that there is no x k = v i , we can obtain the following expression for u kj (for convenience we use j instead of i).
In order to eliminate λ
k
in this expression, let us consider the constraint
Now, if we compute the quotient between Equation 6 and 6.sumen1 we obtain the following:
This solution is a valid solution because it satisfies u ki ≥ 0. As the objective function is convex, this is the unique solution of the problem.
We now consider the derivative of L with respect to v k in order to compute an expression for the later. As v k is a vector, we consider one of its components: v is . We decompose L above in L1 + L2 + L3 and make explicit v is in these components:
It is easy to see that the derivative of L1 with respect to v
is
is
If we assign this expression to zero we obtain,
Therefore, we can say that v
is
should be as follows.
We study the case of α
s
≠ 0 (otherwise the variable s is not in the linear constraint). In this case, let us consider the equation A = α · v
i
, which corresponds to
If the ith cluster contains at least one element with some non null membership, we can obtain by algebraic transformation the following expression for ν
i
:
We use this expression for ν i in Expression 9, and then with further algebraic manipulation we obtain Equation 5 in the proposition.□
Note that for variables v
s
not involved in the linear constraint (i.e., variables v
s
such that α
s
= 0), v
is
will be computed in the same way as for the standard fuzzy c-means. This is so because Equation 5 reduces to
This algorithm will produce a set of clusters that satisfy the linear constraints. We can also prove that when the data satisfies these constraints, the expressions for v is above reduce to the ones of standard fuzzy c-means (Equation 1).
These two properties are established in the following proposition.
When α
s
= 0 (i.e., variable v
s
is not involved in the linear constraint), Equation 5 reduces to Equation 1 for all i = 1, …, c.
When data satisfy linear constraints, Equation 5 reduces to Equation 1 for all i = 1, …, c and all s = 1, …, t.
Proof. To prove 1 it is enough to replace α
s
by zero in Equation 5. To prove 2 observe that when data satisfy the linear constraints, then for all k = 1, …, N it holds
We focus now on entropy-based fuzzy c-means. We will obtain similar results. Considering the linear constraints for all cluster centers, we define the following optimization problem.
For this problem, the following result is obtained.
Proof. The proof of this proposition follows the same schema as the case of the fuzzy c-means. That is, we use the Lagrange multipliers to define an objective function that includes subexpressions for each constraint.
Derivating L with respect to u
ki
and making it equal to zero, we can later obtain the following expression for u
ki
:
The proof of the expression for v
is
starts derivating L with respect to v
is
. Then, when this derivative is zero we can obtain the following expression for v
is
:
Note that the expression for u ki is the same we had before when no linear constraints were considered. In contrast, the expression for v is is different as it includes terms corresponding to the constraints.
The next proposition establishes that the solution given in Proposition 6 corresponds to the standard solution in EFCM when data already satisfies the constraints.
The application of these algorithms to microaggregation consists in replacing in Algorithm 1 the fuzzy c-means in Step 1 by one of the two clustering methods discussed in this paper (constrained FCM and constrained EFCM). They will lead to solutions that satisfy the linear constraints. Algorithm 1 will have two parameters m1 and m2 for the constrained FCM, and two parameters λ1 and λ2 for the constrained EFCM.
The larger the m1, the larger the information loss. This follows from Proposition 2. Clusters will collide, cluster centers will be The larger the m2, the larger the information loss. Proposition 2 shows that in this case all memberships tend to be 1/c. Therefore, any cluster center can be used to replace a given value. The smaller the number of clusters c, the larger the information loss. Naturally, when c = 1, all data are replaced by a single cluster center. Minimum loss can be achieved when c = |X| and each data has its own cluster (and with m1 and m2 near to one). For large m2, and c = |X|/k for a given k, the expected size of all clusters is k. I.e., we have probabilistic k-anonymity. This follows from the uniform distribution with membership/probability 1/c, and the number of clusters.
These properties show that with an appropriate selection of m1 and m2 information loss ranges from no information loss (i.e., c = |X|, m1 = m2 = 1 with the masked data being equal to the original one) to maximum loss (i.e., c = 1). We can also obtain data that probabilistically satisfies k-anonymity (i.e., c = |X|/k and m2 large). In addition to these properties, the resulting file satisfies given linear constrains (edit linear constraints) even in the case that the original file does not satisfy them.
We have detailed the properties for the case of the constrained FCM. These properties can also be inferred for constrained EFCM. In that case, values of λ1 and λ2 near to zero play the role of large m1 and m2.
Original data set
Original data set
Original data set with noise following N (0, 1.5)
For illustration, we have applied our approach to a small data set. We have considered the example in [31] where a linear constraint involving three variables was considered. The data is reproduced in Tables 1 and 2. Variables v1, v2 and v3 stand for Expenditure at 16% , Expenditure at 7% , and Total Expenditure. Then, variables v1, v2 and v3 define the following linear constraint: V3 = α1V1 + α2V2. That is, v3 = 1.16v1 + 1.07v2.
Table 1 contains the original data. In addition, we have also considered the same data set with gaussian noise (Table 2). We have considered three cases with noise following N (0, 0.5), N (0, 1.0) and N (0, 1.5) and we have microaggregated the files and computed the cluster centers. We used c = 4 that corresponds to a parameter k, as understood in microaggregation, equal to k = |X|/4 = 12/4 =3. In the table we include the data with noise N (0, 1.5).
Centroids obtained for the original and noisy data
Centroids obtained for the original and noisy data
Distance between original centroids and centroids of noisy data. For each noisy data, first row corresponds to distances with cluster centers using standard fuzzy c-means and second row to distances with cluster centers obtained using constrained fuzzy c-means
The centroids using standard fuzzy c-means applied to the original data set satisfy the constraints, but this is not so when the data set with noise is considered. In this case, however, the alternative expressions developed in this paper lead to appropriate cluster centers (i.e., cluster centers satisfying the constraints).
Table 3 display the cluster centers we have obtained for the original data set without noise, and the ones obtained with noise. For these three last data sets we include both the cluster centers obtained by the standard fuzzy c-means and by the new method.
Comparison between the cluster centers of the noisy data with the ones of the original data show that the cluster centers are more similar to the original ones when constraints are considered. This is shown in Table 4. This table shows the differences, measured using the Euclidean distance, between the 4 cluster centers obtained using a particular data set and the clustering algorithm with respect to the cluster center using fuzzy c-means with the original (no-noise) data. It can be seen that for each of the noisy data, the nearest cluster is the one where constraints are considered (second row for each of the noisy data).
These results permits us to show on the one hand the suitability of the method for combining in the same step the edit constraints and the data protection method, and on the other hand, that our approach leads to protected data with less information loss. This is so because the cluster centers with constrained FCM are more similar to the original cluster centers than the cluster centers using standard FCM.
We have presented two variations of fuzzy c-means to deal with linear constraints and shown their use for data masking. We have given the solution of the optimization problem when linear constraints are considered on the data. The solution leads to cluster centers that satisfy the constraints even when the data does not. This permits to combine in the same step data editing and data masking. This approach is one of the first to combine these two steps. We have also given an example that shows how the method can be applied and that for noisy data we obtain cluster centers that are nearer to the ones without noise.
We have also discussed the effects of the parameters in information loss. Further work is about the appropriate selection of the parameters of the method, and comparison with other approaches as e.g. [10, 11]. That is, m1 and m2 for constrained FCM and λ1 and λ2 in constrained EFCM. We have seen however, that with an appropriate selection we can range from no information loss (and, thus, maximum disclosure risk) to maximum loss (and, thus, no risk).
Acknowledgments
Partial support of the project Swedish Research Council (Vetenskapsrådet) (grant number VR 2016-03346) is acknowledged.
