This is the html version of the file https://www.eng.auburn.edu/~xqin/courses/comp7370/k-anonymity-pods98.pdf.
Google automatically generates html versions of documents as we crawl the web.
These search terms have been highlighted: generalizing data provide anonymity disclosing information abstract
Generalizing Data to Provide Anonymity when Disclosing Information
www.fgks.org   »   [go: up one dir, main page]

Page 1
Generalizing Data to Provide Anonymity when Disclosing Information
Extended Abstract
Pierangela Samarati
Computer Science Laboratory
SRI International
Menlo Park, CA 94025, USA
samarati@csl.sri.com
Latanya Sweeney
Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
sweeney@ai.mit.edu
Contact Author:
Latanya Sweeney, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139,
USA Telephone: 617{253{3539, Fax: 617{253{3539, Internet: sweeney@ai.mit.edu
The work of Pierangela Samarati was supported in part by National Science Foundation and by DARPA. The work of
Latanya Sweeney was supported in part by the Medical Informatics Training Grant IT15 LM07092 from the National Library
of Medicine.

Page 2
Abstract
The proliferation of information on the Internet and access to fast computers with large storage capaci-
ties has increased the volume of information collected and disseminated about individuals. The existence
os these other data sources makes it much easier to re-identify individuals whose private information is
released in data believed to be anonymous. At the same time, increasing demands are made on organi-
zations to release individualized data rather than aggregate statistical information. Even when explicit
identi ers, such as name and phone number, are removed or encrypted when releasing individualized
data, other characteristic data, which we term quasi-identi ers, can exist which allow the data recipient
to re-identify individuals to whom the data refer.
In this paper, we provide a computational disclosure technique for releasing information from a private
table such that the identity of any individual to whom the released data refer cannot be de nitively
recognized. Our approach protects against linking to other data. It is based on the concepts of gen-
eralization, by which stored values can be replaced with semantically consistent and truthful but less
precise alternatives, and of k-anonymity. A table is said to provide k-anonymity when the contained
data do not allow the recipient to associate the released information to a set of individuals smaller than
k. We introduce the notions of generalized table and of minimal generalization of a table with respect to
a k-anonymity requirement. As an optimization problem, the objective is to minimally distort the data
while providing adequate protection. We describe an algorithm that, given a table, e ciently computes
a preferred minimal generalization to provide anonymity.

Page 3
1 Introduction
In a globally-network society, there is greater demand by society for individual-speci c data, yet the widespread
availability of information makes it extremely di cult to release any information about individuals without breach-
ing privacy. The existence of extensive registers of business establishments in the hands of government agencies,
trade associations and rms like Dunn and Bradstreet has virtually ruled out the possibility of releasing database
information about businesses [6]. Even when released information has no explicit identi ers, such as name and phone
number, other characteristic data, such as birth date and ZIP code, often combine uniquely and can be linked to
publicly available information to re-identify individuals [11].
Denning showed that no release of data can be proven secure from any arbitrary inference attack [5]. In this paper
we provide an e ective guard against the speci c problem of linking to other data and show that as an optimization
problem, we must provide as much information as possible while still protecting privacy and con dentiality by
thwarting linking attempts. Recent computational disclosure systems, namely -Argus [7] and Data y [10], use
techniques that suppress information and generalize data to prevent linking, but Data y often distorts the data more
than is needed and -Argus often fails to provide adequate protection.
In this paper we focus our attention on generalizing information to provide anonymity in released data. This paper
has three main contributions: the formal de nition of anonymity in the context of protecting disclosed information
against linking; the formal de nition of minimally generalizing tables to satisfy given anonymity constraints; and an
e cient algorithm for determining minimal generalizations that retain preferred speci city in the released data.
We introduce the de nition of quasi-identi ers, which are attributes that can be exploited for linking, and of
k-anonymity, which characterizes the degree of protection of data with respect to inference by linking. We show how
k-anonymity can be ensured in information release by generalizing data to be disclosed and we introduce the con-
cepts of generalized table, minimal generalization, and preferred minimal generalization. Intuitively, a generalization
is minimal if data are not generalized more than necessary to provide k-anonymity. The de nition of preferred gen-
eralization allows the user to select among possible minimal generalizations those that satisfy particular conditions,
such as favoring certain attributes in the generalization process. We provide an algorithm that produces a preferred
minimal generalization of a given table. Although the number of possible generalizations of a table is exponential
in the number N of tuples to be generalized, the algorithm proves to operate linearly in its best and general cases,
degrading to N2 only in the worst case.
Related work in the area of information protection deals with access control systems [8] and statistical databases [1].
Our work, however, conceptually di ers from these proposals by providing a di erent, yet complementary, approaches
to protection. Access control systems address the problem of controlling speci c access to data with respect to rules
that state whether a piece of data can or cannot be released. This point of view di ers from that under consideration
in this work because it is not the disclosure of a speci c piece of data that must be protected, but rather the ability
to link that data to a particular entity (e.g., a patient in a hospital database, a tax-payer in the IRS database,
a customer organization in a generic business database). The closest work can be found in the area of statistical
databases. However, statistical database techniques [1, 9] address the problem of producing tabular data based on
queried information, ensuring that it is not possible for users to infer private data from the produced summary. In our
approach instead, we release generalized individual-speci c data on which users can produce summaries according to
their own needs. Emerging applications and technology such as data mining and the Internet are making tremendous
demands for the release of detailed information about individuals such that the identity of the individuals cannot be
recognized.
The remainder of this paper is organized as follows. We begin in Section 2 by illustrating the linking problem and
providing de nitions of quasi-identi er and k-anonymity. Section 3 introduces generalization to provide anonymity
and the concept of generalized tables and minimal generalization. Section 4 discusses the problem of e ciently
computing a minimal generalization for a table and presents an algorithm for that. Section 5 concludes the paper.
The paper also contains an Appendix reporting two supplementary algorithms discussed in the paper, and an example
illustrating an application of our approach.
2 The anonymity problem
Data holders often release personal information with all explicit identi ers, such as name, address, phone number,
and Social Security number, removed or encrypted in the incorrect belief that privacy is maintained because the
1

Page 4
Medical Data released as anonymous
SSN
Name
Ethnicity
Date Of Birth
Sex
ZIP
Marital Status Problem
1
black
09/27/64
male
02139
divorced
obesity
2
black
09/30/64
male
02139
divorced
hypertension
3
black
04/18/64
male
02139
married
chest pain
4
black
04/15/64
male
02139
married
chest pain
5
black
09/15/64
male
02138
married
shortness of breath
6
caucasian
03/13/63
male
02141
married
hypertension
7
caucasian
03/18/63
male
02141
married
shortness of breath
8
caucasian
09/13/64
female
02138
married
shortness of breath
9
caucasian
09/07/64
female
02138
married
obesity
10
caucasian
05/14/61
female
02138
single
chest pain
11
caucasian
05/08/61
female
02138
single
obesity
Voter List
Name
Address
City
ZIP
DOB
Sex
Race
Party
................
................
................
................ ........
........
........ ........
........
................
................
................
................ ........
........
........ ........
........
................
Jim A. Cosby
570, Laurel St. Cambridge
02138
9/15/64
male
black
democrat
................
................
................
................ ........
........
........ ........
........
................
Figure 1: Re-identifying anonymous data by linking to outside data
resulting data look anonymous. Simultaneously, municipalities sell and distribute population registers that include
information speci c to individuals, such as local census data, voter registration lists, city directories, as well as
information from motor vehicle agencies, tax assessors and real estate agencies. In most of these cases, population
registers can be used to re-identify individuals by linking or matching it to speci c attributes in released \anonymous"
data. As an example, the 1997 voting list for Cambridge, Massachusetts contains demographics on 54,805 voters.
Birth date alone can uniquely identify the name and address of 12% of the voters; birth date and gender, 29%; birth
date and a 5-digit ZIP code, 69%; and 97% (53,033 voters) can be identi ed when the full postal code and birth date
are used [11].
A data holder can often identify attributes in their data that also appear in outside sources, and these attributes
are candidates for linking. We term them, quasi-identi ers, and it is essentially the combinations of these quasi-
identi ers that must be protected.
De nition 2.1 (Quasi-identi er) Let T(A1;:::;An) be a table. A quasi-identi er of T is a set of attributes
fAi;:::;Ajg fA1;:::;Ang whose release must be controlled.
Figure 1 exempli es a table of released medical data that has been de-identi ed by suppressing names and SSNs
so as not to disclose the identities of individuals to whom the data refer. An example of a quasi-identi er for this
table is represented by attributes fZIP, date of birth, ethnicity, sex, marital statusg. As illustrated in the gure,
the rst four attributes can be retrieved by accessing a list of voters, which is publically available. other externally
available, but are not included in the quasi-identi er since they are assumed to be suppressed (or encrypted) in the
release. Quasi-identi ers combine with external knowledge to retrieve the identities of individuals. For instance, in
Figure 1, the voter list my have only one black male born on 9/15/64 and leaving in the 02138 area. This identi es
the corresponding bulleted tuple in the released data as pertaining to Jim Cosby. As stated earlier, birthdate alone
can re-identify 12% of the voters in Cambridge, MA.
Given a table T(A1;:::;An), a subset fAi;:::;Ajgof the attributes in T, and a tuple t 2T, we use t[Ai;:::;Aj]
to denote the sequence of the values of the Ai;:::;Aj in t. We use T[Ai;:::;Aj] to denote the projection, maintaining
duplicate tuples, of attributes Ai;:::;Aj in T. In this paper we assume the existence of a private table PT, already
de-identi ed and we assume sets of attributes are speci ed which represent quasi-identi ers. We denote with QIPT
the set of quasi-identi ers in PT.
One way to state an anonymity constraint involves making sure characteristics and combinations of characteristics
found in the data combine to match at least k individuals. To determine how many individuals each released tuple
matches requires combining the released data with externally available data. This is obviously an impossible task
for the data holder who releases information. Although we can assume the data holder knows what data are part of
external knowledge, and therefore what constitutes quasi-identi ers, the speci c values of data in external knowledge
cannot be assumed. The way to satisfy the k-anonymity requirement is therefore to force such a constraint on the
released data.
2

Page 5
Z2 = f02100g
Z1 = f02130;02140g
Z0 = f02138;20239; 02140; 02141g
6
6
DGHZ0
02100
02130
02140
02138
02139
02141
02142
*
HHHHY
,, @@I
@@I,,
VGHZ0
E1 = fpersong
E0 = fasian; black; caucasiang
DGHE0
6
person
asian
black
caucasian
VGHE0
*
HHHHY6
Figure 2: Examples of domain and value generalization hierarchies
De nition 2.2 (k-anonymity) Let T(A1;:::;An) be a table and QIT be the quasi-identi ers associated with it. T
is said to satisfy k-anonymity i for each quasi-identi er QI 2 QIT ,: each sequence of values in T[QI] appears at
least with k occurrences in T[QI].
It can be proved that if each quasi-identi er in the released data satis es k-anonymity, then the combination
of released data to external sources cannot match lower than k individuals. This property holds provided that all
attributes in the released table PT which are externally available in combination (i.e., appearing together in an
external table or in a possible join of external tables) are de ned in a quasi-identi er associated with PT.
3 Generalization to combat linking inference
There are many techniques to achieve k-anonymity, such as changing unusual information to typical values, inserting
complementary records, swapping entries, scrambling records and suppressing information [1, 3, 9]. We propose an
approach that produces a version of PT such that the k-anonymity requirement is satis ed by re-coding values to
make them more general.
In a classical relational database system, domains are used to describe the set of values that attributes assume.
For example, there might be a ZIP code domain, a number domain and a string domain. We extend this notion of
a domain to make it easier to describe how to generalize the values of an attribute. In the original database, where
every value is as speci c as possible, every attribute is in the ground domain. For example, 02139 is in the ground
ZIP code domain, Z0. In order to achieve k-anonymity we can make the ZIP code less informative. We do this by
saying there is a more general, less speci c, domain that can be used to describe ZIP codes, Z1, in which the last
digit has been replaced by a 0. There is also a mapping from Z0 to Z1, such as 02139 ! 02130. This mapping
between domains is stated by means of a generalization relationship, which represents a partial order on the set Dom
of domains, and which is required to satisfy the following conditions: 1) each domain Di has at most one direct
generalized domain; and 2) all maximal elements of Dom are singleton.1 The de nition of this generalization implies
the existence, for each domain D 2 Dom, of a hierarchy, which we term the domain generalization hierarchy
DGHD. Since generalized values can be used in place of more speci c ones, it is important that all domains in the
hierarchy be compatible. Compatibility can be ensured by using the same storage representation form for all domains
in a generalization hierarchy. A value generalization relationship is also de ned which associates with each value vi
in a domain Di a unique value in domain Dj direct generalization of Di. Such a relationship implies the existence, for
each domain D of a value generalization hierarchy VGHD. Figure 2 illustrates an example of domain and value
generalization hierarchies for domain Z0, representing ZIP codes for Cambridge, MA, and E0 representing ethnicity.
In the remainder of this paper we will often need to refer to a domain or value generalization hierarchy in terms
of the graph representing all and only the direct generalization relationships between the elements in it (i.e., implied
generalization relationships do not appear as arcs in the graph). We will use the term hierarchy indistinguishably to
denote either a partially ordered set or the graph representing the set and all the direct generalization relationships
between its elements. We will explicitly refer to the ordered set or to the graph when it is not otherwise clear from
context.
1The motivation behind condition 2 is to ensure that all values in each domain can be eventually generalized to a single
value.
3

Page 6
Eth:E0
ZIP:Z0
a
38
a
39
a
41
a
42
b
38
b
39
b
41
b
42
c
38
c
39
c
41
c
42
PT
Eth:E1
ZIP:Z0
p
38
p
39
p
41
p
42
p
38
p
39
p
41
p
42
p
38
p
39
p
41
p
42
GT[1;0]
Eth:E1
ZIP:Z1
p
30
p
30
p
40
p
40
p
30
p
30
p
40
p
40
p
30
p
30
p
40
p
40
GT[1;1]
Eth:E0
ZIP:Z2
a
00
a
00
a
00
a
00
b
00
b
00
b
00
b
00
c
00
c
00
c
00
c
00
GT[0;2]
Eth:E0
ZIP:Z1
a
30
a
30
a
40
a
40
b
30
b
30
b
40
b
40
c
30
c
30
c
40
c
40
GT[0;1]
Figure 3: Examples of generalized tables for PT
3.1 Generalized table and minimal generalization
Given a private table PT, our approach is to provide a table which respects k-anonymity by generalizing the values
stored in it. Generalization is e ective because substituting attribute values with generalized values maps more values
to the same generalized result. This typically decreases the number of distinct tuples and thus increases the number
of tuples having the same values. We enforce generalization at the attribute level. Generalizing an attribute means
substituting its values with corresponding values from a more general domain. Since attributes can change domain in
the generalization process, the simple classical notation Di for the domain of attribute Ai does not su ce. Instead,
we need to explicitly refer to the domain an attribute assumes in a speci c table. In the following, dom(Az;T)
denotes the domain of attribute Az in table T. Di = dom(Az;PT) denotes the domain associated with attribute Ai
in the de nition of the private table PT.
A table Tj(A1;:::;An) is said to be a generalization of a table Ti(A1;:::;An), written Ti Tj i : (1) Ti
and Tj have the same number of tuples; (2) the domain of each attribute Az in Tj is equal to or a generalization of
the domain of Az in Ti; and (3) each tuple ti in Ti has a corresponding tuple tj in Tj (and vice-versa) such that for
each attribute Az, tj[Az] is equal to or a generalization of ti[Az].
Example 3.1 Consider the table PT illustrated in Figure 3 and the domain and value generalization hierarchies
for E0 and Z0 illustrated in Figure 2. The remaining four tables in the gure are examples of generalized tables for
PT. For the clarity of the example, every table reports, together with each attribute, the domain for the attribute
in the table. With respect to k-anonymity: GT[1;0] and GT[0;1] satis es k-anonymity for k = 1;2; GT[0;2] satis es
k-anonymity for k = 1;:::;4, and GT[1;1] satis es k-anonymity for k = 1;:::;6:
It is easy to see that the number of di erent generalizations of a table is equal to the number of di erent
combinations of domains that the attributes in the tables can assume. Clearly, not all generalizations are equally
satisfactory. A trivial possible generalization, for instance, is the one that generalizes each attribute to the highest
possible level of generalization, thus collapsing all tuples in the table to the same list of values. This provides k-
anonymity at the price of a strong generalization of the data. Such an extreme generalization is not needed if a less
generalized table (i.e., containing more speci c values) exists which satis es k-anonymity. This concept is captured
by the following de nitions of k-minimal generalization.
De nition 3.1 (k-minimal generalization { wrt a quasi-identi er) Let Ti and Tj be two tables such that Ti
Tj. Tj is said to be a k-minimal generalization of a table Ti wrt to a quasi-identi er QI i :
1. Tj satis es k-anonymity wrt QI
2. 8Tz : Ti Tz;Tz Tj, Tz satis es k-anonymity wrt QI )Tz[QI] = Tj[QI]:
Intuitively, a table Tj generalization of Ti is k-minimal if it satis es k-anonymity and there does not exist any
generalization of Ti which satis es k-anonymity and of which Tj is a generalization.
Example 3.2 Consider table PT and its generalizations illustrated in Figure 3. Assume QI=(Eth,ZIP) to be a
quasi-identi er. It is easy to see that for k = 2 there exist two k-minimal generalizations, which are GT[1;0] and
4

Page 7
hE1;Z2i
hE1;Z1i hE0;Z2i
hE1;Z0i hE0;Z1i
hE0;Z0i
,, @@I
@@I,,
6 6HHHHY
DGHDT
hE1;Z2i
hE1;Z1i
hE1;Z0i
hE0;Z0i
6
6
6
GS1
hE1;Z2i
hE1;Z1i
hE0;Z1i
hE0;Z0i
6
6
6
GS2
hE1;Z2i
hE0;Z2i
hE0;Z1i
hE0;Z0i
6
6
6
GS3
Figure 4: Domain generalization hierarchy DGHDT and strategies for DT = hE0;Z0i
GT[0;1]. Table GT[0;2], which satis es the anonymity requirement is not minimal since it is a generalization of
GT[0;1]. Analogously, GT[1;1] cannot be minimal, being a generalization of both GT[1;0] and GT[0;1]. There are also
only two k-minimal generalized tables for k=3, which are GT[1;0] and GT[0;2].
De nition 3.2 (k-minimal generalization) Let Ti(A1;:::;An) and Tj(A1;:::;An) be two tables such that Ti
Tj. Tj is said to be a k-minimal generalization of Ti i :
1. Tj is k-minimal wrt every quasi-identi er QI in QITi.
2. 8k = 1;:::;n :69QI 2QITi;Az 2QI )dom(Az;Tj) = dom(Az;Ti):
According to De nition 3.2 a table Tj is a k-minimal generalization of table Ti i it is k-minimal with respect
to every quasi-identi er of Ti and attributes not belonging to any quasi-identi er are not generalized. It is trivial
to see that a table that satis es k-anonimity has a unique k-minimal generalization, which is itself. It is also easy
to see that the necessary and su cient condition for a table T to satisfy k-anonimity is that the cardinality of the
table must be at least k, as stated by the following theorem. The requirement that the maximal elements of Dom be
singleton ensures the su ciency of the condition.
Theorem 3.1 Let T be a table and k be a natural number. If jTj k, then there exists at least a k-minimal
generalization for T. If jTj< k there are no k-minimal generalization for T.
When di erent minimal generalizations exists preference criteria can be applied to choose a minimal preferred
generalization. For instance, tables which generalize or do not generalize speci c attributes or return the highest
number of distinct tuples can be preferred.For instance, for a k-anonymity requirement with k=2, GT[1;0] and GT[0;1]
are both minimal but GT[0;1] maybe preferred because it contains a larger number of distinct tuples.
3.2 Generalization hierarchies and strategies
Since quasi-identi ers may be composed of di erent attributes, it is useful to talk about generalization relationship
and hierarchies in terms of tuples composed of elements of Dom or of their values. Given a tuple DT = hD1;:::;Dni
such that Di 2Dom; i = 1;:::;n. we de ne the domain generalization hierarchy of DT as DGHDT = DGHD1
::: DGHDn, assuming the cartesian product is ordered by imposing coordinatewise order [4]. DGHDT de nes a
lattice whose minimal element is DT. For instance, Figure 4 illustrates the domain generalization hierarchy DGHE0;Z0
where the domain generalization hierarchies of E0 and Z0 are as illustrated in Figure 2.
The generalization hierarchy of a domain tuple DT de nes di erent ways in which DT can be generalized. In
particular each path from DT to the unique maximal element of DGHDT in the graph describing DGHDT de nes a
possible alternative path that can be followed in the generalization process. We refer to the set of nodes in each of
such paths together with the generalization relationships between them as a generalization strategy for DGHDT .
The di erent domain generalization strategies for DGHE0;Z0 above are illustrated in Figure 4. The number of di erent
possible strategies for a domain hierarchy is stated by the following theorem.
5

Page 8
Theorem 3.2 Let DT = hD1;:::;Dni be a tuple such that Di 2 Dom;i = 1;:::;n. The number of di erent
strategies for DT is:
hDT!
h1!:::hn!
, where each hi is the length of the path from Di to the top domain in DGHDi and
hDT =
Pn
i=1 hi.
2
For each strategy a minimal local generalization can be de ned as the table satisfying k-anonymity, whose
sequence of domains DT0 belongs to the strategy such that there are no other tables satisfying k-anonymity whose
sequence of domains DT00 is in the strategy and such that DT00 DT0. Since a strategy is a total order, the
minimal local generalization is always unique. The following theorem states the correspondence between k-minimal
generalization and the local minimal generalization with respect to a strategy.
Theorem 3.3 Let T(A1;:::;An) and let DT = hD1;:::;Dni be the tuple where Dz = dom(Az;T), z = 1;:::;n, be
a table to be generalized. Every k-minimal generalization of T is a local minimal generalization for some strategy of
DGHDT .
The converse is not true, i.e., a local minimal generalization with respect to a strategy may not correspond
to a k-minimal generalization. For instance, consider table PT and its generalized tables illustrated in Figure 3,
whose minimality has been discussed in Example 3.1. For k = 3 the minimal local generalizations are: GT[1;0] for
strategy 1, GT[1;1] for strategy 2, and GT[0;2] for strategy 3. However, as we have shown in Example 3.1, GT[1;1]
is not k-minimal for k=3. For k = 2 the minimal local generalizations are: GT[1;0] for strategy 1, and GT[0;1] for
strategies 2 and 3. From Directly from Theorem 3.3, a table has at most as many generalizations as the number
of generalization strategies of its domain sequence. The number of k-minimal generalizations can be smaller if the
generalized table, locally minimal with respect to a strategy, is a generalization of a table locally minimal to another
strategy (GT[1;1] for k = 3 in the example above), or if di erent strategies have the same local minimal generalization
(GT[0;1] for k=2 in the example above).
4 An optimal algorithm for determining a minimal generalization
In this section we present an e cient way to determine a minimal generalization of a table. Since the k-anonymity
property requires the existence of k occurrences for each sequence of values present in the table, though only for
those of quasi-identi ers, we can concentrate on the generalization of speci c quasi-identi ers within table PT. The
generalized table PT is obtained by enforcing generalization on each quasi identi er QI 2QIPT. The correctness of
the combination of the generalizations independently produced for each quasi-identi er is ensured by the fact that
the de nition of generalized table requires the correspondence of values across whole tuples and by the fact that
the quasi-identi ers of a table are disjoint. (This last constraint can be removed if generalization of non disjoint
quasi-identi ers be executed serially.) Our generalization algorithm is based on the de nition and use of distance
vectors, de ned in the following subsection.
4.1 Distance vectors
We introduce a distance vector metric with Euclidean properties that measures distances between tuples and tables
based on the number of generalization levels required to have the tuples or tables share the same generalized values.
We further show that these distance vectors represent generalization strategies.
De nition 4.1 (Distance vector) Let Ti(A1;:::;An) and Tj(A1;:::;An) be two tables such that Ti Tj. The
distance vector of Ti to Tj is the vector DVi;j = [d1;:::;dn] where each dz is the length of the unique path between
D =dom(Az;Ti) and dom(Az;Tj) in the domain generalization hierarchy DGHD.
Intuitively the distance vector captures how many generalization steps table Tj is from table Ti for each attribute.
To illustrate, consider table PT and its generalized tables illustrated in Figure 3. The distance vectors between PT
and its di erent generalizations are the vectors appearing as subscripts of each table.
The relationship between distance vectors and minimal generalizations, which is the basis of the correctness of
our approach, is stated by the following theorem.
Theorem 4.1 Let Ti and Tj be two tables such that Ti Tj and Tj satis es k-anonymity. Tj is k-minimal ,
69Tz;Tz 6= Tj, such that Ti Tz, Tz satis es k-anonymity, and DVi;z DVi;j.
6

Page 9
VectorCover(Outliers, All) /* It returns the set of sorted vectors representing all possible distances between tuples in Outliers
and tuples in All. Uses priority queue queue with elements hV,tpl-seti. Priority is based on dist(v) ? jtpl-setj:*/
1. queue
h[d1;:::;dn];Alli, with d1 = :::;dn = 0
2. vects = h[d1;:::;dn]i, with d1 = :::;dn = 0 /* sorted list of distance vectors, based on < */
3. while queue6= ; do
3.1 Select and remove hv;Neighborsi from queue
3.2 if Neighbors \ Outliers 6= ; then
3.2.1 select a tuple s from Neighbors \ Outliers
3.2.2 New-pairs
; /* keep tracks of new pairs to be added to queue */
3.2.3 for each tuple t 2 Neighbors do
3.2.3.1 Vs;t
compute distance between s and t
3.2.3.2 if 9hV;tuplesi2 New-pairs such that Vs;t V;dist(Vs;t) > 1 then tuples
tuples [ftg.
3.2.3.3 else if 9hV;tuplesi2 New-pairs such that V <Vs;t;dist(Vs;t) > 1 then V
Vs;t, tuples
tuples [ftg.
3.2.3.4 else if dist(Vs;t) > 1 then New-pairs
New-pairs [ fhVs;t;ftgig
3.2.4 queue
queue [fhV;tuplesijhV;tuplesi2 New-pairs and tuples not singletong
4. return vects
Figure 5: Procedure VectorCover
Intuitively, the the minimal generalizations of table Ti are exactly those tables Tj satisfying k-anonymity with
minimal distance vectors DVi;j. For instance, with reference to the generalized tables illustrated in Figure 3 we
have already noticed how, for k = 3, GT[1;1] cannot be minimal because GT[0;1] and GT[1;0] also satisfy k-anonimity.
(Remember that the subscript indicates the distance vector of the generalized table from PT).
Our minimal generalization algorithm also makes use of another form of the distance vector that captures how far
apart two tuples belonging to the same table are. Let T be a table and x; y 2T two tuples such that x = hv0
1;:::;v0
ni
and y = hv00
1 ;:::;v00
ni be two tuples where each vi;v0
i is a value in domain Di. The distance vector between x
and y is the vector Vx;y = [d1;:::;dn] where di is the length of the paths from v0
i and v00
i to their closest common
ancestor in the value generalization hierarchy VGHDi. For instance, with reference to the PT illustrated in Figure 3,
the distance between ha,39iand hb,39iis [1,0].
The relationship between distance vectors for tables and distance vectors for tuples is important. Intuitively, the
distance between two tuples x and y in table T is the distance vector between T and the table T0, with T T0 where
the domains of the attribute in T0 are the most speci c domains for which x and y generalize to the same tuple t.
This observation brings us to the following theorem.
Theorem 4.2 Let Ti and Tj be two tables such that Ti Tj. If Tj is k-minimal then DVj;j V , where V is
the least upper bound of all the distance vectors Vx;y where x; y are tuples in Ti and either x or y has a number of
occurrences smaller than k.
According to Theorem 4.2 the distance vector of a minimal generalization of a table cannot be greater than the
smallest vector dominating all possible distance vectors between the tuples in Ti. This property is exploited by the
minimal generalization algorithm to reduce the number of possible strategies to be evaluated, as illustrated in the
following subsection.
4.2 Reducing the number of strategies: VectorCover
As discussed in Section 3 each k-minimal generalization is locally minimal to a speci c generalization strategy.
Performing all possible generalizations in the domain generalization hierarchy would eventually reveal a minimal
generalization strategy in which an instance of data achieves a k-anonymity requirement. However, the large number
of possible strategies (see Theorem 3.2) makes this approach largely impractical. Theorem 4.2 above helps reduce the
number of generalizations (strategies) to be considered. Though reduced, this approach bears the cost of determining
all the possible vectors between outliers (tuples which have less than k occurrences) and every other tuple in the
table. In the worst case, where all tuples are outliers, the cost of this operation is O(N2) where N is the total
number of tuples in the table. Given this observation, instead of constructing all possible vectors, we present an
e cient algorithm named, VectorCover, that given the set of outliers and the set of all the tuples in the table, without
computing all pairwise combinations, determines the di erent distance vectors between the outliers and the other
7

Page 10
Preferred Minimal Generalization MinGen Algorithm
Input:
Private table PT; quasi-identi er QI = (A1;:::;An) and k-anonymity constraint; preference speci cations.
Output: A minimal generalization MGT of PT[QI] wrt k-anonymity chosen according to the preference speci cations
Method
1. MGT
PT[QI] /* create a copy of PT[QI] */
2. history
[d1;:::;dn], where di = 0;i = 1;:::;n.
3. while there exists an attribute Ai 2QI and value v 2MGT[Ai] such that freq(v;MGT[Ai]) < k do
3.1 let Vi = [d1;:::;dn] be the vector such that dj = 1 if j = i; dj = 0 otherwise
3.2 MGT
generalize(MGT;Vi;history)
3.3 history
history + Vi /* where [x1;:::;xn]+[y1;:::;yn]=[x1 + y1;:::;xn + yn]*/
4. All
ft j t is a distinct tuple in MGTg
5. Outliers
ft j t 2All;freq(t;MGT[QI]) < kg
6. if Outliers 6= ; do
6.1 Distances
VectorCover(Outliers,All) /* sorted list of vectors to be considered */
6.2 Strategies
Levelsets(Distances)
/* Orders sets of vectors in levels */
6.3 lmin
binsearch(Strategies,MGT,history,k)
6.4 Allmins
fV j hlmin;Vsetlmini2 Strategies;V 2Vsetlmin, and 8t 2GV ;freq(t; GV ) k where GV = generalize(MGT;V;history)g
6.5 pref-vec
preferred(MGT,AllMins,history) /* select a vector corresponding to preference speci caton */
6.6 MGT
generalize(MGT;pref-vec;history).
7. return MGT.
Figure 6: The Preferred Minimal Generalization (MinGen) Algorithm
tuples. Techniques that interpret distance vectors use a distance function dist based on the sum of the elements in
the vector. More formally, given a vector V = [d1;:::;dn], dist(V ) =
Pn
i=1 di. It is easy to see that the distance
function satis es the properties of Euclidean geometry, i.e., for all possible tuples x, y, and z: 1) dist(Vx;y) 0; 2)
dist(Vx;y)=0i x = y; 3) dist(Vx;y) = dist(Vy;x); and 4) dist(Vx;y) dist(Vx;z) + dist(Vz;y).
VectorCover analogizes the problem of nding generalization strategies stored as vectors, to that of nding a
minimum spanning tree on a clique. The nodes are the tuples and the weights on the edges are the distance vectors
between the tuples. A technique for nding an optimum cover was presented by Agrawal et al. in[2]. However,
the algorithm proposed in [2] does not apply e ciently to cliques (the performance is O(N2) where N is the total
number of tuples). Moreover, with respect to [2] we have an additional requirement for the spanning tree, which is
that the set of edges contained in the spanning tree must include at least one occurrence of each distinct distance
vector found in the clique. Despite that, we can do better than O(N2) by exploiting the Euclidean properties of the
distance function de ned on vectors. In particular, VectorCover does not automatically visit all incident unvisited
neighbors, but instead uses the distances to cluster tuples together so that tuples that are far away from a given
tuple may be close to each other. Like in [2], VectorCover removes old edges and adds new ones to reduce the overall
sum of the weights on all the edges. An algorithm implementing VectorCover is illustrated in Figure 5. For space
requirements, we refer the reader to Figure 5 for more details on the algorithm. Here we only note that the key to
implementing VectorCover e ciently is to make it easy to select an unvisited edge that is likely to have a distance
vector less than known distances. To this purpose we exploit the Euclidean nature of the distance function which
ensures that whenever two tuples, x and y, incident to a third tuple, z, the vector representing the least upper bound
between Vz;x and Vz;y places an upper bound on Vx;y. When known distances place an upper bound on an unvisited
edge, the edge is visited and distances to neighboring tuples that have similar distances from a common tuple are
computed. The correctness of the procedure is stated by the following theorem.
Theorem 4.3 Let Outliers and All be two sets of tuples. VectorCover(Outliers,All) nds all and only the distinct
vectors Vx;y such that x 2 Outliers and y 2 All.
The performance of VectorCover depends on the nature of the data. Let N be the number of tuples in PT. It is
N in the best case, O(N) in the general case, deteriorating to O(N2) only in the worst case where the only minimal
generalization is the one to the most general domains.
8

Page 11
4.3 The Preferred Minimal Generalization Algorithm MinGen
Figure 6 illustrated an algorithm, called MinGen, which, given a table PT, a quasi-identi er QI 2 PTQI, and a k-
anonymity constraint, produces a table MGT which is a k-minimal generalization of PT[QI]. It assumes that k < jPTj,
which is a necessary and su cient condition for the existence of a minimal generalized table (see Theorem 3.1).
The algorithm makes use of two basic functions: preferred, which selects a vector that correspond to a preferred
generalization according to the given speci cations; and generalize, which enforces generalization. The generalize
function requires as parameters a table Ti and two distance vectors v and h, and returns the table Tj, generalization
of Ti, with DVi;j = s. The third parameter h, constructed by the algorithm, represent the distance of Ti from the
private table PT provided as input, and is used to ensure that the generalization is absolute with respect to the
domains in PT. The algorithm uses also function VectorCover discussed in the previous subsection, and functions
Levelsets and binsearch, reported in the appendix, which we explain here. Given a set of vectors ordered in increasing
distance, Levelsets de nes an assignment of levels to vectors as follows. Vectors are considered in the order in which
they appear in the list. Each new vector V considered is assigned to the lowest level at which there does not exists
any vector V 0 < V . Once all the vectors have been considered, vector Vu representing the lowest appear bound of
the vectors in the considered set is determined. If Vu does not belong to the set a new higher level is created to
which Vu is assigned. Moreover, the levels are examined and possibly updated as follows. Each vector V appearing
at a level l is replicated at all the levels between the lowest the highest level (this excluded) at which there exists a
vector dominated by V and the lowest level (this excluded) at which there is a vector dominating V . This \padding"
process ensures that, if we represent the order between vectors as a graph, all paths between the zero vector and all
the vectors at a given level have the same distance.
For the sake of space we omit the complete explanation of the algorithm and refer the reader to Figure 6 for
details. Here, we outline only the major steps. Step 3 generalizes each single attribute to satisfy k-anonymity
since k-anonymity of each attribute is a necessary condition for the k-anonymity of the quasi-identi er. Step 6 is
the core of the algorithm. Substep 6.1 calls procedure VectorCover to retrieve the distance vectors corresponding
to generalization strategies to be evaluated. Distance vectors are returned in a list sorted according to increasing
distance. Step 6.2 calls Levelsets to retrieve set Strategies representing the assignment of levels to vectors returned
by VectorCover. Step 6.3 calls binsearch which performs a binary search on the levels and returns the lowest level
at which there exists a vector corresponding to a minimal generalization.
The correctness of the algorithm relies on the correctness of function VectorCover and on the particular way levels
are constructed in Levelsets. In particular, it can be proved that, for how levels are de ned, if a vector corresponding
to a minimal generalization appears at level l than no other vectors corresponding to a minimal generalization can
exists at level lower than l. This property allows us to e ciently retrieve the interested level by performing the
binary search.
With respect to the complexity we make the following observations. Let N be the number of tuples in PT. The
loop to achieve k-anonymity at each attribute (step 3) is at most O(N jQIj). VectorCover has a best case of N
and a worst case of O(N2). The construction of strategies can be done in logarithmic time. The binary search is
performed in O(log(
Pn
i=1 hi)) where hi is the height of the hierarchy of the domain of attribute Ai. The complexity
of the test in binsearch to determine whether a level achieves k-anonymity is O(jdistancesj jMGTj). In the worst case
jdistancesj=
Qn
i=1 hi; however VectorCover guarantees jdistancesjto be minimal. These observations imply that the
number n of attributes in QI is an important factor a ecting the computational cost. To consider this computational
cost with respect to general cases of application, we surveyed real-word data [11]. The number of attributes in quasi-
identi ers found in many releases of census data, voter lists, public health data, and medical records ranged from
3 to 5 attributes and the heights of the hierarchies ranged from 1 to 8. Because MinGen searches only over the
generalizations speci c to the data, in the general case the number of candidate generalizations dramatically reduced
to a small fraction of all possible generalizations. In 300 pediatric medical records with 7617 visits and 285 elds
stored in over 12 relational database tables. We found that generalizations in trygen were 3-7% of the total possible
number of generalizations when we considered 4 independent sets of quasi-identi ers known to link to public and
semi-public information.
5 Conclusions
The practical signi cance of releasing individualized-data such that linking the data to other sources to re-identify
individuals cannot be done, o ers many bene ts to our electronic society. This work provides an e ective and optimal
9

Page 12
solution to this problem. Open questions remain. The size of and conditions for k necessary to ensure k-anonymity
must be further investigated. The quality of generalized data is best when the attributes most important to the
recipient do not belong to any quasi-identi er. For public-use les this is acceptable, but determining the quality
and usefulness in other settings must be further researched. Finally, data are dynamic. Tuples are added, changed,
and removed constantly. As a result, releases of generalized data over time can be subject to a temporal inference
attack, which need to be carefully investigated.
Acknowledgements
The authors thank Steve Dawson, at SRI, for discussions and support; Sylvia Barrett, at Harvard, for support; and,
Eric Jordan, Patrick Thompson and Mojdeh Mohtashemi, at MIT, for editorial suggestions.
References
[1] N.R. Adam and J.C. Wortman. Security-control methods for statistical databases: A comparative study. ACM
Computing Surveys, 21:515{556, 1989.
[2] Rakesh Agrawal, Alex Borgida, and H.V. Jagadish. E cient management of transitive relationships in large
data and knowledge bases. In Bruce Linsday James Cli ord and David Maier, editors, Proc. of the 1989 ACM
SIGMOD Int'l Conf. on Management of Data, pages 253{262, Portland, Oregon, June 1989.
[3] P.C. Chu. Cell suppression methodology: The importance of suppressing marginal totals. IEEE Trans. on
Knowledge Data Systems, 4(9):513{523, July/August 1997.
[4] B.A. Davey and H.A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 1990.
[5] D. Denning, P. Denning, and M.D. Schwartz. The tracker: A threat to statistical database security. ACM
Trans. on Database Systems, 4(1):76{96, March 1979.
[6] N. Kirkendall et. al. Report on statistical disclosure limitation methodology. Statistical policy working paper,
no. 22, Washington: O ce of Management and Budget, 1994.
[7] A. Hundepool and L. Willenborg. - and -Argus: Software for statistical disclosure control. In Third Interna-
tional Seminar on Statistical Con dentiality, Bled., 1996.
[8] Sushil Jajodia, Pierangela Samarati, V.S. Subrahmanian, and Elisa Bertino. A uni ed framework for enforcing
multiple access control policies. In Proc. of the 1997 ACM SIGMOD Int'l Conf. on Management of Data,
Tucson, AZ, U.S.A, May 1997.
[9] Malvestuto, F. Moscarini, and M. Rafanelli. Suppressing marginal cells to protect sensitive information in a two-
dimensional statistical table. In Proceedings of the Tenth ACM Symposium on Principles of Database Systems,
Denver, CO, May 1991.
[10] Latanya Sweeney. Guaranteeing anonymity when sharing medical data, the Data y system. In Bureau of the
Census, Record Linkage Bulletin., Washington,DC: Bureau of the Census,, 1997.
[11] Latanya Sweeney. Weaving technology and policy together to maintain con dentiality. Journal of Law, Medicine,
& Ethics, 25(2{3):98{110, 1997.
[12] Tian Zhang, Raghu Ramakrishan, and Miron Livny. BIRCH: An e cient data clustering method for very
large databases. In H.V. Jagadish and I.S. Mumick, editors, Proc. of the 1996 ACM SIGMOD Int'l Conf. on
Management of Data, pages 103{114, Montreal, Canada, June 1996.
10

Page 13
A Additional Algorithms
In this section we report the functions Levelsets and binsearch used by algorithm MinGen, whose behavior has been
illustaretd in the paper. The semantics of Levelsets has been illustrated in the paper. The algorithm below enforces
an optimized version of the level construction based on the use of a graph.
LevelSets(Distances)
2. D
f(dists, E)g /* build DAG where dists = distances, and ex;y element of E i dx;dy
element of distances and dx < dy g
/* make all path lengths same by padding nodes as follows */
3. for all d1;d2;d3 element of dists and e13;e23 2E
3.1 l1
length of path from [0; :::;0] to d1 in D
3.2 l2
length of path from [0; :::;0] to d2 in D
3.3 if l1 > l2 then dists
dists [ d20; E
E [fe2;20g
3.4 else if l2 > l1 then dists
dists [ fd0
1g; E
E [fe1;10g /* d0 and d have the same distance vectors but d 6= d0 */
4. lsets
;
5. maxpath
max path length in D
6. for i=1,...,maxpath do
6.1 lsets
lsets [fhi; Siijd 2Si i d 2 dists and length of path from [0,...,0] to d in D is i g
7. lsets
lsets [fhlmx;Smxijlmx = maxpath+1; and vm = di dj 8di;dj 2Sm where hmaxpath;Smi2 lsetsg
/* [x1;:::;xn] [y1;:::;yn] = [max(x1;y1);:::;max(xn;yn)] */
8. return lsets
binsearch(Strategies,MGT,history,k) /* determines the lowest level with a minimal generalization satisfying k-anonymity*/
1. min
1; max
maxfijhi; Gii2Strategiesg
2. while min < max do
2.1 try
bmin+max
2
c
2.2 trygens
ftryvecsjh try,tryvecsi2Strategiesg
2.3 reach k
false
2.4 while trygens 6= ;^reach k 6= true do
2.4.1 Select a distance vector V from trygens
2.4.2 trygens
trygens ,fV g
2.4.3 Gen try
generalize(MGT;V;history)
2.4.4 if all tuples t 2 Gen try have freq(t;Gen try) k then reach k
true
2.5 if reach k = true then max
try else min
try + 1
3. return min
11

Page 14
B An Example
In this section we illustrate the behavior of the algorithm in the generalization of the private table illustrated
in Figure 1. Let QI=fEthnicity, Date Of Birth, Sex, ZIP, Marital Statusg. The resulting table PT[QI] to be k-
anonymized is reported in Figure 7. For the sake of space, in Figure 7 names and values of attributes are abbreviated.
We assume the hierarchies for ZIP and Ethnicity to be as illustrated in Figure 2. Domain hierarchies for Date of Birth
(assuming values in the sixties), Sex, and Marital Status are illustrated in Figure 8. The hierarchy for dates maps
speci c dates the form mm/dd/yy in domain D0, into months mm/yy in domain D1, years (yy) in domain D2, 5-year
intervals in D3 and the complete 10-year interval in D4. (Note that although the hierarchy can seem to change
the format of the data, compatibility of speci c and generalized values can be ensured by using the same storage
representation form. For instance, generalization of the month can be represented by mm/1/yy, generalization to the
year by 1/1/yy and so on.) The value hierarchy for Sex maps each possible value of S0 into value not released of
S1. The hierarchy for Marital Status M0 is illustrated in Figure 8.
The ordered set of vectors returned by VectorCover and the levels constructed by Levelsets are reported below.
Vectors in italic represent vectors replicated in the \padding" process. The vectors in bold face represents the least
upper bound of all the vectors in Distances. Note that these generalization vectors are relative with respect to vector
history [0;1;0;0;0] representing the generalization performed on attribute DOB in step 3 of the algorithm.
Distances
00000
00011
01001
01010
02002
02120
02122
10100
10111
11110
12020
12021
12102
12112
Level 1
Level 2
Level 3
Level 4
Level 5
Level 6
00000
00011
02002
12102
12112
12122
01001
12020
02122
12122
01010
02120
12021
12021
01010
10111
10111
10100
11110
11110
Figure 9 illustrates the k-minimal generalizations which can be returned by the algorithm for k=2 in dependence
of the preference speci ed.
12

Page 15
Eth
DOB
Sex
ZIP
Mar
1
b
09/27/64
m
02139
div
2
b
09/30/64
m
02139
div
3
b
04/18/64
m
02139
mar
4
b
04/15/64
m
02139
mar
5
b
09/15/64
m
02138
mar
6
c
03/23/63
m
02141
mar
7
c
03/18/63
m
02141
mar
8
c
09/13/64
f
02138
mar
9
c
09/07/64
f
02138
mar
10
c
05/14/61
f
02138
sin
11
c
05/08/61
f
02138
sin
Figure 7: Stored Values for the quasi-identi er of PT of Figure 1
D4 = f[60 ,70]g
D3 = f[60 ,64];[65 ,70]g
D2 = f61;:::;70g
D1 = f1=61;:::;12=70g
D0 = f1=1=60;:::;12=31=70g
6
6
6
6
DGHD0
S1 = fnot releasedg
S0 = fmale;femaleg
DGHS0
6
M2 = fnot releasedg
M1 = fonce married; never marriedg
M0 = fmarried; divorced; widow; singleg
6
6
DGHM0
not released
once married
never married
married divorced widow
single
*
HHHHY
36QQQk
6
VGHM0
Figure 8: Examples of Domain and Value Generalization Hierarchies
Eth
DOB
Sex
ZIP
Mar
1
b
64
m
02130
div
2
b
64
m
02130
div
3
b
64
m
02130
mar
4
b
64
m
02130
mar
5
b
64
m
02130
mar
6
c
63
m
02140
mar
7
c
63
m
02140
mar
8
c
64
f
02130
mar
9
c
64
f
02130
mar
10
c
61
f
02130
sin
11
c
61
f
02130
sin
GT[0;2;0;1;0]
Eth
DOB
Sex
ZIP
Mar
1
b
09/64
m
02130
once
2
b
09/64
m
02130
once
3
b
04/64
m
02130
once
4
b
04/64
m
02130
once
5
b
09/64
m
02130
once
6
c
03/63
m
02140
once
7
c
03/63
m
02140
once
8
c
09/64
f
02130
once
9
c
09/64
f
02130
once
10
c
05/61
f
02130
never
11
c
05/61
f
02130
never
GT[0;1;0;1;1]
Eth
DOB
Sex
ZIP
Mar
1
p
09/64
n r
02139
div
2
p
09/64
n r
02139
div
3
p
04/64
n r
02139
mar
4
p
04/64
n r
02139
mar
5
p
09/64
n r
02138
mar
6
p
03/63
n r
02141
mar
7
p
03/63
n r
02141
mar
8
p
09/64
n r
02138
mar
9
p
09/64
n r
02138
mar
10
p
05/61
n r
02138
sin
11
p
05/61
n r
02138
sin
GT[1;1;1;0;0]
Figure 9: Example of minimal generalized tables for the table of Figure 7
13