c
s
reasonably accurate data mining results when compared to those
attained when mining the original dataset.
The protection of sensitive data from disclosure has been exten-ively studied in the context of microdata release, where methodologies have been proposed for the protection of sensitive information
regarding individuals recorded in some dataset. In microdata, we consider each record in the dataset to represent an individual for whom
the values of a number of attributes are being recorded, e.g., name, date
of birth, residence, occupation, salary, etc. Among the complete set of
attributes, there exists some attributes that explicitly identify the individual, e.g., name, social security number, etc., as well as attributes that,
once combined together or with publicly available external resources,
may lead to the identification of the individual, e.g., address, gender,
age, etc. The first type of attributes, also known as identifiers, must be
removed from the data prior to its publishing. On the other hand, the
second type of attributes, also known as quasi-identifiers, have to be
handled by the privacy preservation algorithm in such a way that in the
sanitized dataset, the knowledge of their values regarding an individual
no longer poses a threat to the identification of his/her identity.
Existing methodologies for the protection of sensitive microdata
an be partitioned in two directions: (a) data modification approaches
and (b) synthetic data generation approaches. Willenborg & De Waal
[ 9] further partition the data modification approaches into
perturbative and non-perturbative approaches, depending on whether they
introduce false information in the attribute-values of the data, e.g., by
the addition of noise based on some data distribution, or they operate by altering the precision of the existing attribute-values, e.g., by
changing a value to an interval that contains it.
s
Secure Multiparty Computation
The approaches discussed so far aim at generating a sanitized dataset
from the original one, which can be safely shared with untrusted third
parties, as it contains only non-sensitive data. Secure Multiparty
Computation (SMC) provides an alternative family of approaches that
can effectively protect the sensitive data. SMC considers a set of collaborators who wish to collectively mine their data but are unwilling to
disclose their own datasets to each other. As it turns out, this distributed privacy preserving data mining problem can be reduced to the
secure computation of a function based on distributed inputs and is
thus solved by using cryptographic approaches. Pinkas [ 8] elaborates
on this close relation that exists between privacy-aware data mining
and cryptography. In SMC, each party contributes to the computation
of the secure function by providing its private input. A secure cryptographic protocol that is executed among the collaborating parties
ensures that the private input that is contributed by each party is not
disclosed to the others. Most of the applied cryptographic protocols
for multi-party computation result to some primitive operations that
have to be securely performed: secure sum, secure set union, and
secure scalar product. Clifton et al. [ 4] discuss these operations.
As a final remark, we should point out that the operation of the
ecure protocols in the course of distributed privacy preserving data
mining depends highly on the existing distribution of the data in the
sites of the collaborators. Two types of data distribution have been
investigated: In a horizontal data distribution, each collaborator
holds a number of records, and for each record, he/she has knowledge of the same set of attributes as his/her peers. On the other hand,
in a vertical partitioning of the data, each collaborator is aware of
different attributes referring to the same set of records.
Protecting the Sensitive Knowledge
In this section, we focus our attention on privacy preserving methodologies that protect the sensitive knowledge patterns that would otherwise be revealed after the course of the mining process. Akin to the
methodologies that we have presented for protecting sensitive data
prior to its mining, the approaches in this category also modify the
original dataset, but thy do so in such a way that certain sensitive
knowledge patterns are suppressed when mining the data. In what
follows, we briefly discuss some categories of methodologies that
have been proposed for hiding the sensitive knowledge in the context
of association and classification rule mining.
a
Association Rule Hiding
The association rule mining framework, along with some computationally efficient heuristics for the generation of association rules,
have been proposed in the work of Agrawal et al. [ 1] and Agrawal &
Srikant [ 2]. Briefly stated, the goal of association rule mining is to
produce a set of interesting and potentially useful rules, i.e., implications that hold in a dataset. The presence of a rule in a dataset is
judged on its statistical significance, quantified with the aid of two
measures: confidence and support. All the association rules whose
confidence and support are above some user-specified thresholds are
then mined. However, some of these rules may be sensitive from the
owner’s perspective. The association rule hiding methodologies aim
to sanitize the original dataset in a way that:
• All the sensitive rules (as indicated by the data holder) that appear
when mining the original dataset for association rules do not appear when mining the sanitized dataset for association rules at the
same (or higher) levels of support and confidence.
• All the non-sensitive (remainder) rules can be successfully mined
from the sanitized dataset at the same (or higher) levels of support
and confidence.
• No rule that was not found when mining the original dataset can
be found in its sanitized counterpart when mining the latter at the
same (or higher) levels of support and confidence.
The first goal simply states that all the sensitive association rules
re properly hidden in the sanitized dataset. The hiding of the sensitive knowledge comes at a cost of the utility of the sanitized outcome.
The second and the third goals aim at minimizing this cost.
Specifically, the second goal requires that only the sensitive knowledge is hidden in the sanitized dataset. Thus, no other non-sensitive
rules are lost due to side-effects of the sanitization process. The third
rule requires that no artifacts, i.e., false association rules, are generated by the sanitization process. To recapitulate, in association rule
hiding, the sanitization process has to be accomplished in a way that
minimally affects the original dataset, preserves the general patterns
and trends of the dataset, and achieves concealment of all the sensitive knowledge, as indicated by the data holder.
Association rule hiding has been studied along three principal
directions: (a) heuristic approaches, (b) border-based approaches,
and (c) exact approaches. The first direction collects time and mem-