Generalizing Data to Provide Anonymity when Disclosing Information

Page 1

Generalizing Data to Provide Anonymity when Disclosing Information

Extended Abstract

Pierangela Samarati

Computer Science Laboratory

SRI International

Menlo Park, CA 94025, USA

samarati@csl.sri.com

Latanya Sweeney

Laboratory for Computer Science

Massachusetts Institute of Technology

Cambridge, MA 02139, USA

sweeney@ai.mit.edu

Contact Author:

Latanya Sweeney, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139,

USA Telephone: 617{253{3539, Fax: 617{253{3539, Internet: sweeney@ai.mit.edu

The work of Pierangela Samarati was supported in part by National Science Foundation and by DARPA. The work of

Latanya Sweeney was supported in part by the Medical Informatics Training Grant IT15 LM07092 from the National Library

of Medicine.

Page 2

Abstract

The proliferation of information on the Internet and access to fast computers with large storage capaci-

ties has increased the volume of information collected and disseminated about individuals. The existence

os these other data sources makes it much easier to re-identify individuals whose private information is

released in data believed to be anonymous. At the same time, increasing demands are made on organi-

zations to release individualized data rather than aggregate statistical information. Even when explicit

identiers, such as name and phone number, are removed or encrypted when releasing individualized

data, other characteristic data, which we term quasi-identiers, can exist which allow the data recipient

to re-identify individuals to whom the data refer.

In this paper, we provide a computational disclosure technique for releasing information from a private

table such that the identity of any individual to whom the released data refer cannot be denitively

recognized. Our approach protects against linking to other data. It is based on the concepts of gen-

eralization, by which stored values can be replaced with semantically consistent and truthful but less

precise alternatives, and of k-anonymity. A table is said to provide k-anonymity when the contained

data do not allow the recipient to associate the released information to a set of individuals smaller than

k. We introduce the notions of generalized table and of minimal generalization of a table with respect to

a k-anonymity requirement. As an optimization problem, the objective is to minimally distort the data

while providing adequate protection. We describe an algorithm that, given a table, e ciently computes

a preferred minimal generalization to provide anonymity.

Page 3

1 Introduction

In a globally-network society, there is greater demand by society for individual-specic data, yet the widespread

availability of information makes it extremely di cult to release any information about individuals without breach-

ing privacy. The existence of extensive registers of business establishments in the hands of government agencies,

trade associations and rms like Dunn and Bradstreet has virtually ruled out the possibility of releasing database

information about businesses [6]. Even when released information has no explicit identiers, such as name and phone

number, other characteristic data, such as birth date and ZIP code, often combine uniquely and can be linked to

publicly available information to re-identify individuals [11].

Denning showed that no release of data can be proven secure from any arbitrary inference attack [5]. In this paper

we provide an e ective guard against the specic problem of linking to other data and show that as an optimization

problem, we must provide as much information as possible while still protecting privacy and condentiality by

thwarting linking attempts. Recent computational disclosure systems, namely -Argus [7] and Data y [10], use

techniques that suppress information and generalize data to prevent linking, but Data y often distorts the data more

than is needed and -Argus often fails to provide adequate protection.

In this paper we focus our attention on generalizing information to provide anonymity in released data. This paper

has three main contributions: the formal denition of anonymity in the context of protecting disclosed information

against linking; the formal denition of minimally generalizing tables to satisfy given anonymity constraints; and an

e cient algorithm for determining minimal generalizations that retain preferred specicity in the released data.

We introduce the denition of quasi-identiers, which are attributes that can be exploited for linking, and of

k-anonymity, which characterizes the degree of protection of data with respect to inference by linking. We show how

k-anonymity can be ensured in information release by generalizing data to be disclosed and we introduce the con-

cepts of generalized table, minimal generalization, and preferred minimal generalization. Intuitively, a generalization

is minimal if data are not generalized more than necessary to provide k-anonymity. The denition of preferred gen-

eralization allows the user to select among possible minimal generalizations those that satisfy particular conditions,

such as favoring certain attributes in the generalization process. We provide an algorithm that produces a preferred

minimal generalization of a given table. Although the number of possible generalizations of a table is exponential

in the number N of tuples to be generalized, the algorithm proves to operate linearly in its best and general cases,

degrading to N2 only in the worst case.

Related work in the area of information protection deals with access control systems [8] and statistical databases [1].

Our work, however, conceptually di ers from these proposals by providing a di erent, yet complementary, approaches

to protection. Access control systems address the problem of controlling specic access to data with respect to rules

that state whether a piece of data can or cannot be released. This point of view di ers from that under consideration

in this work because it is not the disclosure of a specic piece of data that must be protected, but rather the ability

to link that data to a particular entity (e.g., a patient in a hospital database, a tax-payer in the IRS database,

a customer organization in a generic business database). The closest work can be found in the area of statistical

databases. However, statistical database techniques [1, 9] address the problem of producing tabular data based on

queried information, ensuring that it is not possible for users to infer private data from the produced summary. In our

approach instead, we release generalized individual-specic data on which users can produce summaries according to

their own needs. Emerging applications and technology such as data mining and the Internet are making tremendous

demands for the release of detailed information about individuals such that the identity of the individuals cannot be

recognized.

The remainder of this paper is organized as follows. We begin in Section 2 by illustrating the linking problem and

providing denitions of quasi-identier and k-anonymity. Section 3 introduces generalization to provide anonymity

and the concept of generalized tables and minimal generalization. Section 4 discusses the problem of e ciently

computing a minimal generalization for a table and presents an algorithm for that. Section 5 concludes the paper.

The paper also contains an Appendix reporting two supplementary algorithms discussed in the paper, and an example

illustrating an application of our approach.

2 The anonymity problem

Data holders often release personal information with all explicit identiers, such as name, address, phone number,

and Social Security number, removed or encrypted in the incorrect belief that privacy is maintained because the

Page 4

Medical Data released as anonymous

SSN

Name

Ethnicity

Date Of Birth

Sex

ZIP

Marital Status Problem

black

09/27/64

male

02139

divorced

obesity

black

09/30/64

male

02139

divorced

hypertension

black

04/18/64

male

02139

married

chest pain

black

04/15/64

male

02139

married

chest pain

black

09/15/64

male

02138

married

shortness of breath

caucasian

03/13/63

male

02141

married

hypertension

caucasian

03/18/63

male

02141

married

shortness of breath

caucasian

09/13/64

female

02138

married

shortness of breath

caucasian

09/07/64

female

02138

married

obesity

caucasian

05/14/61

female

02138

single

chest pain

caucasian

05/08/61

female

02138

single

obesity

Voter List

Name

Address

City

ZIP

DOB

Sex

Race

Party

................

................ ........

........

........ ........

........

................

................ ........

........

........ ........

........

................

Jim A. Cosby

570, Laurel St. Cambridge

02138

9/15/64

male

black

democrat

................

................ ........

........

........ ........

........

................

Figure 1: Re-identifying anonymous data by linking to outside data

resulting data look anonymous. Simultaneously, municipalities sell and distribute population registers that include

information specic to individuals, such as local census data, voter registration lists, city directories, as well as

information from motor vehicle agencies, tax assessors and real estate agencies. In most of these cases, population

registers can be used to re-identify individuals by linking or matching it to specic attributes in released \anonymous"

data. As an example, the 1997 voting list for Cambridge, Massachusetts contains demographics on 54,805 voters.

Birth date alone can uniquely identify the name and address of 12% of the voters; birth date and gender, 29%; birth

date and a 5-digit ZIP code, 69%; and 97% (53,033 voters) can be identied when the full postal code and birth date

are used [11].

A data holder can often identify attributes in their data that also appear in outside sources, and these attributes

are candidates for linking. We term them, quasi-identiers, and it is essentially the combinations of these quasi-

identiers that must be protected.

Denition 2.1 (Quasi-identier) Let T(A1;:::;An) be a table. A quasi-identier of T is a set of attributes

fAi;:::;Ajg fA1;:::;Ang whose release must be controlled.

Figure 1 exemplies a table of released medical data that has been de-identied by suppressing names and SSNs

so as not to disclose the identities of individuals to whom the data refer. An example of a quasi-identier for this

table is represented by attributes fZIP, date of birth, ethnicity, sex, marital statusg. As illustrated in the gure,

the rst four attributes can be retrieved by accessing a list of voters, which is publically available. other externally

available, but are not included in the quasi-identier since they are assumed to be suppressed (or encrypted) in the

release. Quasi-identiers combine with external knowledge to retrieve the identities of individuals. For instance, in

Figure 1, the voter list my have only one black male born on 9/15/64 and leaving in the 02138 area. This identies

the corresponding bulleted tuple in the released data as pertaining to Jim Cosby. As stated earlier, birthdate alone

can re-identify 12% of the voters in Cambridge, MA.

Given a table T(A1;:::;An), a subset fAi;:::;Ajgof the attributes in T, and a tuple t 2T, we use t[Ai;:::;Aj]

to denote the sequence of the values of the Ai;:::;Aj in t. We use T[Ai;:::;Aj] to denote the projection, maintaining

duplicate tuples, of attributes Ai;:::;Aj in T. In this paper we assume the existence of a private table PT, already

de-identied and we assume sets of attributes are specied which represent quasi-identiers. We denote with QIPT

the set of quasi-identiers in PT.

One way to state an anonymity constraint involves making sure characteristics and combinations of characteristics

found in the data combine to match at least k individuals. To determine how many individuals each released tuple

matches requires combining the released data with externally available data. This is obviously an impossible task

for the data holder who releases information. Although we can assume the data holder knows what data are part of

external knowledge, and therefore what constitutes quasi-identiers, the specic values of data in external knowledge

cannot be assumed. The way to satisfy the k-anonymity requirement is therefore to force such a constraint on the

released data.

Page 5

Z2 = f02100g

Z1 = f02130;02140g

Z0 = f02138;20239; 02140; 02141g

DGHZ0

02100

02130

02140

02138

02139

02141

02142

HHHHY

,, @@I

@@I,,

VGHZ0

E1 = fpersong

E0 = fasian; black; caucasiang

DGHE0

person

asian

black

caucasian

VGHE0

HHHHY6

Figure 2: Examples of domain and value generalization hierarchies

Denition 2.2 (k-anonymity) Let T(A1;:::;An) be a table and QIT be the quasi-identiers associated with it. T

is said to satisfy k-anonymity i for each quasi-identier QI 2 QIT ,: each sequence of values in T[QI] appears at

least with k occurrences in T[QI].

It can be proved that if each quasi-identier in the released data satises k-anonymity, then the combination

of released data to external sources cannot match lower than k individuals. This property holds provided that all

attributes in the released table PT which are externally available in combination (i.e., appearing together in an

external table or in a possible join of external tables) are dened in a quasi-identier associated with PT.

3 Generalization to combat linking inference

There are many techniques to achieve k-anonymity, such as changing unusual information to typical values, inserting

complementary records, swapping entries, scrambling records and suppressing information [1, 3, 9]. We propose an

approach that produces a version of PT such that the k-anonymity requirement is satised by re-coding values to

make them more general.

In a classical relational database system, domains are used to describe the set of values that attributes assume.

For example, there might be a ZIP code domain, a number domain and a string domain. We extend this notion of

a domain to make it easier to describe how to generalize the values of an attribute. In the original database, where

every value is as specic as possible, every attribute is in the ground domain. For example, 02139 is in the ground

ZIP code domain, Z0. In order to achieve k-anonymity we can make the ZIP code less informative. We do this by

saying there is a more general, less specic, domain that can be used to describe ZIP codes, Z1, in which the last

digit has been replaced by a 0. There is also a mapping from Z0 to Z1, such as 02139 ! 02130. This mapping

between domains is stated by means of a generalization relationship, which represents a partial order on the set Dom

of domains, and which is required to satisfy the following conditions: 1) each domain Di has at most one direct

generalized domain; and 2) all maximal elements of Dom are singleton.1 The denition of this generalization implies

the existence, for each domain D 2 Dom, of a hierarchy, which we term the domain generalization hierarchy

DGHD. Since generalized values can be used in place of more specic ones, it is important that all domains in the

hierarchy be compatible. Compatibility can be ensured by using the same storage representation form for all domains

in a generalization hierarchy. A value generalization relationship is also dened which associates with each value vi

in a domain Di a unique value in domain Dj direct generalization of Di. Such a relationship implies the existence, for

each domain D of a value generalization hierarchy VGHD. Figure 2 illustrates an example of domain and value

generalization hierarchies for domain Z0, representing ZIP codes for Cambridge, MA, and E0 representing ethnicity.

In the remainder of this paper we will often need to refer to a domain or value generalization hierarchy in terms

of the graph representing all and only the direct generalization relationships between the elements in it (i.e., implied

generalization relationships do not appear as arcs in the graph). We will use the term hierarchy indistinguishably to

denote either a partially ordered set or the graph representing the set and all the direct generalization relationships

between its elements. We will explicitly refer to the ordered set or to the graph when it is not otherwise clear from

context.

1The motivation behind condition 2 is to ensure that all values in each domain can be eventually generalized to a single

value.

Page 6

Eth:E0

ZIP:Z0

Eth:E1

ZIP:Z0

GT[1;0]

Eth:E1

ZIP:Z1

GT[1;1]

Eth:E0

ZIP:Z2

GT[0;2]

Eth:E0

ZIP:Z1

GT[0;1]

Figure 3: Examples of generalized tables for PT

3.1 Generalized table and minimal generalization

Given a private table PT, our approach is to provide a table which respects k-anonymity by generalizing the values

stored in it. Generalization is e ective because substituting attribute values with generalized values maps more values

to the same generalized result. This typically decreases the number of distinct tuples and thus increases the number

of tuples having the same values. We enforce generalization at the attribute level. Generalizing an attribute means

substituting its values with corresponding values from a more general domain. Since attributes can change domain in

the generalization process, the simple classical notation Di for the domain of attribute Ai does not su ce. Instead,

we need to explicitly refer to the domain an attribute assumes in a specic table. In the following, dom(Az;T)

denotes the domain of attribute Az in table T. Di = dom(Az;PT) denotes the domain associated with attribute Ai

in the denition of the private table PT.

A table Tj(A1;:::;An) is said to be a generalization of a table Ti(A1;:::;An), written Ti Tj i : (1) Ti

and Tj have the same number of tuples; (2) the domain of each attribute Az in Tj is equal to or a generalization of

the domain of Az in Ti; and (3) each tuple ti in Ti has a corresponding tuple tj in Tj (and vice-versa) such that for

each attribute Az, tj[Az] is equal to or a generalization of ti[Az].

Example 3.1 Consider the table PT illustrated in Figure 3 and the domain and value generalization hierarchies

for E0 and Z0 illustrated in Figure 2. The remaining four tables in the gure are examples of generalized tables for

PT. For the clarity of the example, every table reports, together with each attribute, the domain for the attribute

in the table. With respect to k-anonymity: GT[1;0] and GT[0;1] satises k-anonymity for k = 1;2; GT[0;2] satises

k-anonymity for k = 1;:::;4, and GT[1;1] satises k-anonymity for k = 1;:::;6:

It is easy to see that the number of di erent generalizations of a table is equal to the number of di erent

combinations of domains that the attributes in the tables can assume. Clearly, not all generalizations are equally

satisfactory. A trivial possible generalization, for instance, is the one that generalizes each attribute to the highest

possible level of generalization, thus collapsing all tuples in the table to the same list of values. This provides k-

anonymity at the price of a strong generalization of the data. Such an extreme generalization is not needed if a less

generalized table (i.e., containing more specic values) exists which satises k-anonymity. This concept is captured

by the following denitions of k-minimal generalization.

Denition 3.1 (k-minimal generalization { wrt a quasi-identier) Let Ti and Tj be two tables such that Ti

Tj. Tj is said to be a k-minimal generalization of a table Ti wrt to a quasi-identier QI i :

1. Tj satises k-anonymity wrt QI

2. 8Tz : Ti Tz;Tz Tj, Tz satises k-anonymity wrt QI )Tz[QI] = Tj[QI]:

Intuitively, a table Tj generalization of Ti is k-minimal if it satises k-anonymity and there does not exist any

generalization of Ti which satises k-anonymity and of which Tj is a generalization.

Example 3.2 Consider table PT and its generalizations illustrated in Figure 3. Assume QI=(Eth,ZIP) to be a

quasi-identier. It is easy to see that for k = 2 there exist two k-minimal generalizations, which are GT[1;0] and

Page 7

hE1;Z2i

hE1;Z1i hE0;Z2i

hE1;Z0i hE0;Z1i

hE0;Z0i

,, @@I

@@I,,

6 6HHHHY

DGHDT

hE1;Z2i

hE1;Z1i

hE1;Z0i

hE0;Z0i

GS1

hE1;Z2i

hE1;Z1i

hE0;Z1i

hE0;Z0i

GS2

hE1;Z2i

hE0;Z2i

hE0;Z1i

hE0;Z0i

GS3

Figure 4: Domain generalization hierarchy DGHDT and strategies for DT = hE0;Z0i

GT[0;1]. Table GT[0;2], which satises the anonymity requirement is not minimal since it is a generalization of

GT[0;1]. Analogously, GT[1;1] cannot be minimal, being a generalization of both GT[1;0] and GT[0;1]. There are also

only two k-minimal generalized tables for k=3, which are GT[1;0] and GT[0;2].

Denition 3.2 (k-minimal generalization) Let Ti(A1;:::;An) and Tj(A1;:::;An) be two tables such that Ti

Tj. Tj is said to be a k-minimal generalization of Ti i :

1. Tj is k-minimal wrt every quasi-identier QI in QITi.

2. 8k = 1;:::;n :69QI 2QITi;Az 2QI )dom(Az;Tj) = dom(Az;Ti):

According to Denition 3.2 a table Tj is a k-minimal generalization of table Ti i it is k-minimal with respect

to every quasi-identier of Ti and attributes not belonging to any quasi-identier are not generalized. It is trivial

to see that a table that satises k-anonimity has a unique k-minimal generalization, which is itself. It is also easy

to see that the necessary and su cient condition for a table T to satisfy k-anonimity is that the cardinality of the

table must be at least k, as stated by the following theorem. The requirement that the maximal elements of Dom be

singleton ensures the su ciency of the condition.

Theorem 3.1 Let T be a table and k be a natural number. If jTj k, then there exists at least a k-minimal

generalization for T. If jTj< k there are no k-minimal generalization for T.

When di erent minimal generalizations exists preference criteria can be applied to choose a minimal preferred

generalization. For instance, tables which generalize or do not generalize specic attributes or return the highest

number of distinct tuples can be preferred.For instance, for a k-anonymity requirement with k=2, GT[1;0] and GT[0;1]

are both minimal but GT[0;1] maybe preferred because it contains a larger number of distinct tuples.

3.2 Generalization hierarchies and strategies

Since quasi-identiers may be composed of di erent attributes, it is useful to talk about generalization relationship

and hierarchies in terms of tuples composed of elements of Dom or of their values. Given a tuple DT = hD1;:::;Dni

such that Di 2Dom; i = 1;:::;n. we dene the domain generalization hierarchy of DT as DGHDT = DGHD1

::: DGHDn, assuming the cartesian product is ordered by imposing coordinatewise order [4]. DGHDT denes a

lattice whose minimal element is DT. For instance, Figure 4 illustrates the domain generalization hierarchy DGHE0;Z0

where the domain generalization hierarchies of E0 and Z0 are as illustrated in Figure 2.

The generalization hierarchy of a domain tuple DT denes di erent ways in which DT can be generalized. In

particular each path from DT to the unique maximal element of DGHDT in the graph describing DGHDT denes a

possible alternative path that can be followed in the generalization process. We refer to the set of nodes in each of

such paths together with the generalization relationships between them as a generalization strategy for DGHDT .

The di erent domain generalization strategies for DGHE0;Z0 above are illustrated in Figure 4. The number of di erent

possible strategies for a domain hierarchy is stated by the following theorem.

Page 8

Theorem 3.2 Let DT = hD1;:::;Dni be a tuple such that Di 2 Dom;i = 1;:::;n. The number of di erent

strategies for DT is:

hDT!

h1!:::hn!

, where each hi is the length of the path from Di to the top domain in DGHDi and

hDT =

i=1 hi.

For each strategy a minimal local generalization can be dened as the table satisfying k-anonymity, whose

sequence of domains DT0 belongs to the strategy such that there are no other tables satisfying k-anonymity whose

sequence of domains DT00 is in the strategy and such that DT00 DT0. Since a strategy is a total order, the

minimal local generalization is always unique. The following theorem states the correspondence between k-minimal

generalization and the local minimal generalization with respect to a strategy.

Theorem 3.3 Let T(A1;:::;An) and let DT = hD1;:::;Dni be the tuple where Dz = dom(Az;T), z = 1;:::;n, be

a table to be generalized. Every k-minimal generalization of T is a local minimal generalization for some strategy of

DGHDT .

The converse is not true, i.e., a local minimal generalization with respect to a strategy may not correspond

to a k-minimal generalization. For instance, consider table PT and its generalized tables illustrated in Figure 3,

whose minimality has been discussed in Example 3.1. For k = 3 the minimal local generalizations are: GT[1;0] for

strategy 1, GT[1;1] for strategy 2, and GT[0;2] for strategy 3. However, as we have shown in Example 3.1, GT[1;1]

is not k-minimal for k=3. For k = 2 the minimal local generalizations are: GT[1;0] for strategy 1, and GT[0;1] for

strategies 2 and 3. From Directly from Theorem 3.3, a table has at most as many generalizations as the number

of generalization strategies of its domain sequence. The number of k-minimal generalizations can be smaller if the

generalized table, locally minimal with respect to a strategy, is a generalization of a table locally minimal to another

strategy (GT[1;1] for k = 3 in the example above), or if di erent strategies have the same local minimal generalization

(GT[0;1] for k=2 in the example above).

4 An optimal algorithm for determining a minimal generalization

In this section we present an e cient way to determine a minimal generalization of a table. Since the k-anonymity

property requires the existence of k occurrences for each sequence of values present in the table, though only for

those of quasi-identiers, we can concentrate on the generalization of specic quasi-identiers within table PT. The

generalized table PT is obtained by enforcing generalization on each quasi identier QI 2QIPT. The correctness of

the combination of the generalizations independently produced for each quasi-identier is ensured by the fact that

the denition of generalized table requires the correspondence of values across whole tuples and by the fact that

the quasi-identiers of a table are disjoint. (This last constraint can be removed if generalization of non disjoint

quasi-identiers be executed serially.) Our generalization algorithm is based on the denition and use of distance

vectors, dened in the following subsection.

4.1 Distance vectors

We introduce a distance vector metric with Euclidean properties that measures distances between tuples and tables

based on the number of generalization levels required to have the tuples or tables share the same generalized values.

We further show that these distance vectors represent generalization strategies.

Denition 4.1 (Distance vector) Let Ti(A1;:::;An) and Tj(A1;:::;An) be two tables such that Ti Tj. The

distance vector of Ti to Tj is the vector DVi;j = [d1;:::;dn] where each dz is the length of the unique path between

D =dom(Az;Ti) and dom(Az;Tj) in the domain generalization hierarchy DGHD.

Intuitively the distance vector captures how many generalization steps table Tj is from table Ti for each attribute.

To illustrate, consider table PT and its generalized tables illustrated in Figure 3. The distance vectors between PT

and its di erent generalizations are the vectors appearing as subscripts of each table.

The relationship between distance vectors and minimal generalizations, which is the basis of the correctness of

our approach, is stated by the following theorem.

Theorem 4.1 Let Ti and Tj be two tables such that Ti Tj and Tj satises k-anonymity. Tj is k-minimal ,

69Tz;Tz 6= Tj, such that Ti Tz, Tz satises k-anonymity, and DVi;z DVi;j.

Page 9

VectorCover(Outliers, All) /* It returns the set of sorted vectors representing all possible distances between tuples in Outliers

and tuples in All. Uses priority queue queue with elements hV,tpl-seti. Priority is based on dist(v) ? jtpl-setj:*/

1. queue

h[d1;:::;dn];Alli, with d1 = :::;dn = 0

2. vects = h[d1;:::;dn]i, with d1 = :::;dn = 0 /* sorted list of distance vectors, based on < */

3. while queue6= ; do

3.1 Select and remove hv;Neighborsi from queue

3.2 if Neighbors \ Outliers 6= ; then

3.2.1 select a tuple s from Neighbors \ Outliers

3.2.2 New-pairs

; /* keep tracks of new pairs to be added to queue */

3.2.3 for each tuple t 2 Neighbors do

3.2.3.1 Vs;t

compute distance between s and t

3.2.3.2 if 9hV;tuplesi2 New-pairs such that Vs;t V;dist(Vs;t) > 1 then tuples

tuples [ftg.

3.2.3.3 else if 9hV;tuplesi2 New-pairs such that V <Vs;t;dist(Vs;t) > 1 then V

Vs;t, tuples

tuples [ftg.

3.2.3.4 else if dist(Vs;t) > 1 then New-pairs

New-pairs [ fhVs;t;ftgig

3.2.4 queue

queue [fhV;tuplesijhV;tuplesi2 New-pairs and tuples not singletong

4. return vects

Figure 5: Procedure VectorCover

Intuitively, the the minimal generalizations of table Ti are exactly those tables Tj satisfying k-anonymity with

minimal distance vectors DVi;j. For instance, with reference to the generalized tables illustrated in Figure 3 we

have already noticed how, for k = 3, GT[1;1] cannot be minimal because GT[0;1] and GT[1;0] also satisfy k-anonimity.

(Remember that the subscript indicates the distance vector of the generalized table from PT).

Our minimal generalization algorithm also makes use of another form of the distance vector that captures how far

apart two tuples belonging to the same table are. Let T be a table and x; y 2T two tuples such that x = hv0

1;:::;v0

and y = hv00

1 ;:::;v00

ni be two tuples where each vi;v0

i is a value in domain Di. The distance vector between x

and y is the vector Vx;y = [d1;:::;dn] where di is the length of the paths from v0

i and v00

i to their closest common

ancestor in the value generalization hierarchy VGHDi. For instance, with reference to the PT illustrated in Figure 3,

the distance between ha,39iand hb,39iis [1,0].

The relationship between distance vectors for tables and distance vectors for tuples is important. Intuitively, the

distance between two tuples x and y in table T is the distance vector between T and the table T0, with T T0 where

the domains of the attribute in T0 are the most specic domains for which x and y generalize to the same tuple t.

This observation brings us to the following theorem.

Theorem 4.2 Let Ti and Tj be two tables such that Ti Tj. If Tj is k-minimal then DVj;j V , where V is

the least upper bound of all the distance vectors Vx;y where x; y are tuples in Ti and either x or y has a number of

occurrences smaller than k.

According to Theorem 4.2 the distance vector of a minimal generalization of a table cannot be greater than the

smallest vector dominating all possible distance vectors between the tuples in Ti. This property is exploited by the

minimal generalization algorithm to reduce the number of possible strategies to be evaluated, as illustrated in the

following subsection.

4.2 Reducing the number of strategies: VectorCover

As discussed in Section 3 each k-minimal generalization is locally minimal to a specic generalization strategy.

Performing all possible generalizations in the domain generalization hierarchy would eventually reveal a minimal

generalization strategy in which an instance of data achieves a k-anonymity requirement. However, the large number

of possible strategies (see Theorem 3.2) makes this approach largely impractical. Theorem 4.2 above helps reduce the

number of generalizations (strategies) to be considered. Though reduced, this approach bears the cost of determining

all the possible vectors between outliers (tuples which have less than k occurrences) and every other tuple in the

table. In the worst case, where all tuples are outliers, the cost of this operation is O(N2) where N is the total

number of tuples in the table. Given this observation, instead of constructing all possible vectors, we present an

e cient algorithm named, VectorCover, that given the set of outliers and the set of all the tuples in the table, without

computing all pairwise combinations, determines the di erent distance vectors between the outliers and the other

Page 10

Preferred Minimal Generalization MinGen Algorithm

Input:

Private table PT; quasi-identier QI = (A1;:::;An) and k-anonymity constraint; preference specications.

Output: A minimal generalization MGT of PT[QI] wrt k-anonymity chosen according to the preference specications

Method

1. MGT

PT[QI] /* create a copy of PT[QI] */

2. history

[d1;:::;dn], where di = 0;i = 1;:::;n.

3. while there exists an attribute Ai 2QI and value v 2MGT[Ai] such that freq(v;MGT[Ai]) < k do

3.1 let Vi = [d1;:::;dn] be the vector such that dj = 1 if j = i; dj = 0 otherwise

3.2 MGT

generalize(MGT;Vi;history)

3.3 history

history + Vi /* where [x1;:::;xn]+[y1;:::;yn]=[x1 + y1;:::;xn + yn]*/

4. All

ft j t is a distinct tuple in MGTg

5. Outliers

ft j t 2All;freq(t;MGT[QI]) < kg

6. if Outliers 6= ; do

6.1 Distances

VectorCover(Outliers,All) /* sorted list of vectors to be considered */

6.2 Strategies

Levelsets(Distances)

/* Orders sets of vectors in levels */

6.3 lmin

binsearch(Strategies,MGT,history,k)

6.4 Allmins

fV j hlmin;Vsetlmini2 Strategies;V 2Vsetlmin, and 8t 2GV ;freq(t; GV ) k where GV = generalize(MGT;V;history)g

6.5 pref-vec

preferred(MGT,AllMins,history) /* select a vector corresponding to preference specicaton */

6.6 MGT

generalize(MGT;pref-vec;history).

7. return MGT.

Figure 6: The Preferred Minimal Generalization (MinGen) Algorithm

tuples. Techniques that interpret distance vectors use a distance function dist based on the sum of the elements in

the vector. More formally, given a vector V = [d1;:::;dn], dist(V ) =

i=1 di. It is easy to see that the distance

function satises the properties of Euclidean geometry, i.e., for all possible tuples x, y, and z: 1) dist(Vx;y) 0; 2)

dist(Vx;y)=0i x = y; 3) dist(Vx;y) = dist(Vy;x); and 4) dist(Vx;y) dist(Vx;z) + dist(Vz;y).

VectorCover analogizes the problem of nding generalization strategies stored as vectors, to that of nding a

minimum spanning tree on a clique. The nodes are the tuples and the weights on the edges are the distance vectors

between the tuples. A technique for nding an optimum cover was presented by Agrawal et al. in[2]. However,

the algorithm proposed in [2] does not apply e ciently to cliques (the performance is O(N2) where N is the total

number of tuples). Moreover, with respect to [2] we have an additional requirement for the spanning tree, which is

that the set of edges contained in the spanning tree must include at least one occurrence of each distinct distance

vector found in the clique. Despite that, we can do better than O(N2) by exploiting the Euclidean properties of the

distance function dened on vectors. In particular, VectorCover does not automatically visit all incident unvisited

neighbors, but instead uses the distances to cluster tuples together so that tuples that are far away from a given

tuple may be close to each other. Like in [2], VectorCover removes old edges and adds new ones to reduce the overall

sum of the weights on all the edges. An algorithm implementing VectorCover is illustrated in Figure 5. For space

requirements, we refer the reader to Figure 5 for more details on the algorithm. Here we only note that the key to

implementing VectorCover e ciently is to make it easy to select an unvisited edge that is likely to have a distance

vector less than known distances. To this purpose we exploit the Euclidean nature of the distance function which

ensures that whenever two tuples, x and y, incident to a third tuple, z, the vector representing the least upper bound

between Vz;x and Vz;y places an upper bound on Vx;y. When known distances place an upper bound on an unvisited

edge, the edge is visited and distances to neighboring tuples that have similar distances from a common tuple are

computed. The correctness of the procedure is stated by the following theorem.

Theorem 4.3 Let Outliers and All be two sets of tuples. VectorCover(Outliers,All) nds all and only the distinct

vectors Vx;y such that x 2 Outliers and y 2 All.

The performance of VectorCover depends on the nature of the data. Let N be the number of tuples in PT. It is

N in the best case, O(N) in the general case, deteriorating to O(N2) only in the worst case where the only minimal

generalization is the one to the most general domains.

Page 11

4.3 The Preferred Minimal Generalization Algorithm MinGen

Figure 6 illustrated an algorithm, called MinGen, which, given a table PT, a quasi-identier QI 2 PTQI, and a k-

anonymity constraint, produces a table MGT which is a k-minimal generalization of PT[QI]. It assumes that k < jPTj,

which is a necessary and su cient condition for the existence of a minimal generalized table (see Theorem 3.1).

The algorithm makes use of two basic functions: preferred, which selects a vector that correspond to a preferred

generalization according to the given specications; and generalize, which enforces generalization. The generalize

function requires as parameters a table Ti and two distance vectors v and h, and returns the table Tj, generalization

of Ti, with DVi;j = s. The third parameter h, constructed by the algorithm, represent the distance of Ti from the

private table PT provided as input, and is used to ensure that the generalization is absolute with respect to the

domains in PT. The algorithm uses also function VectorCover discussed in the previous subsection, and functions

Levelsets and binsearch, reported in the appendix, which we explain here. Given a set of vectors ordered in increasing

distance, Levelsets denes an assignment of levels to vectors as follows. Vectors are considered in the order in which

they appear in the list. Each new vector V considered is assigned to the lowest level at which there does not exists

any vector V 0 < V . Once all the vectors have been considered, vector Vu representing the lowest appear bound of

the vectors in the considered set is determined. If Vu does not belong to the set a new higher level is created to

which Vu is assigned. Moreover, the levels are examined and possibly updated as follows. Each vector V appearing

at a level l is replicated at all the levels between the lowest the highest level (this excluded) at which there exists a

vector dominated by V and the lowest level (this excluded) at which there is a vector dominating V . This \padding"

process ensures that, if we represent the order between vectors as a graph, all paths between the zero vector and all

the vectors at a given level have the same distance.

For the sake of space we omit the complete explanation of the algorithm and refer the reader to Figure 6 for

details. Here, we outline only the major steps. Step 3 generalizes each single attribute to satisfy k-anonymity

since k-anonymity of each attribute is a necessary condition for the k-anonymity of the quasi-identier. Step 6 is

the core of the algorithm. Substep 6.1 calls procedure VectorCover to retrieve the distance vectors corresponding

to generalization strategies to be evaluated. Distance vectors are returned in a list sorted according to increasing

distance. Step 6.2 calls Levelsets to retrieve set Strategies representing the assignment of levels to vectors returned

by VectorCover. Step 6.3 calls binsearch which performs a binary search on the levels and returns the lowest level

at which there exists a vector corresponding to a minimal generalization.

The correctness of the algorithm relies on the correctness of function VectorCover and on the particular way levels

are constructed in Levelsets. In particular, it can be proved that, for how levels are dened, if a vector corresponding

to a minimal generalization appears at level l than no other vectors corresponding to a minimal generalization can

exists at level lower than l. This property allows us to e ciently retrieve the interested level by performing the

binary search.

With respect to the complexity we make the following observations. Let N be the number of tuples in PT. The

loop to achieve k-anonymity at each attribute (step 3) is at most O(N jQIj). VectorCover has a best case of N

and a worst case of O(N2). The construction of strategies can be done in logarithmic time. The binary search is

performed in O(log(

i=1 hi)) where hi is the height of the hierarchy of the domain of attribute Ai. The complexity

of the test in binsearch to determine whether a level achieves k-anonymity is O(jdistancesj jMGTj). In the worst case

jdistancesj=

i=1 hi; however VectorCover guarantees jdistancesjto be minimal. These observations imply that the

number n of attributes in QI is an important factor a ecting the computational cost. To consider this computational

cost with respect to general cases of application, we surveyed real-word data [11]. The number of attributes in quasi-

identiers found in many releases of census data, voter lists, public health data, and medical records ranged from

3 to 5 attributes and the heights of the hierarchies ranged from 1 to 8. Because MinGen searches only over the

generalizations specic to the data, in the general case the number of candidate generalizations dramatically reduced

to a small fraction of all possible generalizations. In 300 pediatric medical records with 7617 visits and 285 elds

stored in over 12 relational database tables. We found that generalizations in trygen were 3-7% of the total possible

number of generalizations when we considered 4 independent sets of quasi-identiers known to link to public and

semi-public information.

5 Conclusions

The practical signicance of releasing individualized-data such that linking the data to other sources to re-identify

individuals cannot be done, o ers many benets to our electronic society. This work provides an e ective and optimal

Page 12

solution to this problem. Open questions remain. The size of and conditions for k necessary to ensure k-anonymity

must be further investigated. The quality of generalized data is best when the attributes most important to the

recipient do not belong to any quasi-identier. For public-use les this is acceptable, but determining the quality

and usefulness in other settings must be further researched. Finally, data are dynamic. Tuples are added, changed,

and removed constantly. As a result, releases of generalized data over time can be subject to a temporal inference

attack, which need to be carefully investigated.

Acknowledgements

The authors thank Steve Dawson, at SRI, for discussions and support; Sylvia Barrett, at Harvard, for support; and,

Eric Jordan, Patrick Thompson and Mojdeh Mohtashemi, at MIT, for editorial suggestions.

References

[1] N.R. Adam and J.C. Wortman. Security-control methods for statistical databases: A comparative study. ACM

Computing Surveys, 21:515{556, 1989.

[2] Rakesh Agrawal, Alex Borgida, and H.V. Jagadish. E cient management of transitive relationships in large

data and knowledge bases. In Bruce Linsday James Cli ord and David Maier, editors, Proc. of the 1989 ACM

SIGMOD Int'l Conf. on Management of Data, pages 253{262, Portland, Oregon, June 1989.

[3] P.C. Chu. Cell suppression methodology: The importance of suppressing marginal totals. IEEE Trans. on

Knowledge Data Systems, 4(9):513{523, July/August 1997.

[4] B.A. Davey and H.A. Priestley. Introduction to Lattices and Order. Cambridge University Press, 1990.

[5] D. Denning, P. Denning, and M.D. Schwartz. The tracker: A threat to statistical database security. ACM

Trans. on Database Systems, 4(1):76{96, March 1979.

[6] N. Kirkendall et. al. Report on statistical disclosure limitation methodology. Statistical policy working paper,

no. 22, Washington: O ce of Management and Budget, 1994.

[7] A. Hundepool and L. Willenborg. - and -Argus: Software for statistical disclosure control. In Third Interna-

tional Seminar on Statistical Condentiality, Bled., 1996.

[8] Sushil Jajodia, Pierangela Samarati, V.S. Subrahmanian, and Elisa Bertino. A unied framework for enforcing

multiple access control policies. In Proc. of the 1997 ACM SIGMOD Int'l Conf. on Management of Data,

Tucson, AZ, U.S.A, May 1997.

[9] Malvestuto, F. Moscarini, and M. Rafanelli. Suppressing marginal cells to protect sensitive information in a two-

dimensional statistical table. In Proceedings of the Tenth ACM Symposium on Principles of Database Systems,

Denver, CO, May 1991.

[10] Latanya Sweeney. Guaranteeing anonymity when sharing medical data, the Data y system. In Bureau of the

Census, Record Linkage Bulletin., Washington,DC: Bureau of the Census,, 1997.

[11] Latanya Sweeney. Weaving technology and policy together to maintain condentiality. Journal of Law, Medicine,

& Ethics, 25(2{3):98{110, 1997.

[12] Tian Zhang, Raghu Ramakrishan, and Miron Livny. BIRCH: An e cient data clustering method for very

large databases. In H.V. Jagadish and I.S. Mumick, editors, Proc. of the 1996 ACM SIGMOD Int'l Conf. on

Management of Data, pages 103{114, Montreal, Canada, June 1996.

Page 13

A Additional Algorithms

In this section we report the functions Levelsets and binsearch used by algorithm MinGen, whose behavior has been

illustaretd in the paper. The semantics of Levelsets has been illustrated in the paper. The algorithm below enforces

an optimized version of the level construction based on the use of a graph.

LevelSets(Distances)

2. D

f(dists, E)g /* build DAG where dists = distances, and ex;y element of E i dx;dy

element of distances and dx < dy g

/* make all path lengths same by padding nodes as follows */

3. for all d1;d2;d3 element of dists and e13;e23 2E

3.1 l1

length of path from [0; :::;0] to d1 in D

3.2 l2

length of path from [0; :::;0] to d2 in D

3.3 if l1 > l2 then dists

dists [ d20; E

E [fe2;20g

3.4 else if l2 > l1 then dists

dists [ fd0

1g; E

E [fe1;10g /* d0 and d have the same distance vectors but d 6= d0 */

4. lsets

;

5. maxpath

max path length in D

6. for i=1,...,maxpath do

6.1 lsets

lsets [fhi; Siijd 2Si i d 2 dists and length of path from [0,...,0] to d in D is i g

7. lsets

lsets [fhlmx;Smxijlmx = maxpath+1; and vm = di dj 8di;dj 2Sm where hmaxpath;Smi2 lsetsg

/* [x1;:::;xn] [y1;:::;yn] = [max(x1;y1);:::;max(xn;yn)] */

8. return lsets

binsearch(Strategies,MGT,history,k) /* determines the lowest level with a minimal generalization satisfying k-anonymity*/

1. min

1; max

maxfijhi; Gii2Strategiesg

2. while min < max do

2.1 try

bmin+max

2.2 trygens

ftryvecsjh try,tryvecsi2Strategiesg

2.3 reach k

false

2.4 while trygens 6= ;^reach k 6= true do

2.4.1 Select a distance vector V from trygens

2.4.2 trygens

trygens ,fV g

2.4.3 Gen try

generalize(MGT;V;history)

2.4.4 if all tuples t 2 Gen try have freq(t;Gen try) k then reach k

true

2.5 if reach k = true then max

try else min

try + 1

3. return min

Page 14

B An Example

In this section we illustrate the behavior of the algorithm in the generalization of the private table illustrated

in Figure 1. Let QI=fEthnicity, Date Of Birth, Sex, ZIP, Marital Statusg. The resulting table PT[QI] to be k-

anonymized is reported in Figure 7. For the sake of space, in Figure 7 names and values of attributes are abbreviated.

We assume the hierarchies for ZIP and Ethnicity to be as illustrated in Figure 2. Domain hierarchies for Date of Birth

(assuming values in the sixties), Sex, and Marital Status are illustrated in Figure 8. The hierarchy for dates maps

specic dates the form mm/dd/yy in domain D0, into months mm/yy in domain D1, years (yy) in domain D2, 5-year

intervals in D3 and the complete 10-year interval in D4. (Note that although the hierarchy can seem to change

the format of the data, compatibility of specic and generalized values can be ensured by using the same storage

representation form. For instance, generalization of the month can be represented by mm/1/yy, generalization to the

year by 1/1/yy and so on.) The value hierarchy for Sex maps each possible value of S0 into value not released of

S1. The hierarchy for Marital Status M0 is illustrated in Figure 8.

The ordered set of vectors returned by VectorCover and the levels constructed by Levelsets are reported below.

Vectors in italic represent vectors replicated in the \padding" process. The vectors in bold face represents the least

upper bound of all the vectors in Distances. Note that these generalization vectors are relative with respect to vector

history [0;1;0;0;0] representing the generalization performed on attribute DOB in step 3 of the algorithm.

Distances

00000

00011

01001

01010

02002

02120

02122

10100

10111

11110

12020

12021

12102

12112

Level 1

Level 2

Level 3

Level 4

Level 5

Level 6

00000

00011

02002

12102

12112

12122

01001

12020

02122

12122

01010

02120

12021

01010

10111

10100

11110

Figure 9 illustrates the k-minimal generalizations which can be returned by the algorithm for k=2 in dependence

of the preference specied.

Page 15

Eth

DOB

Sex

ZIP

Mar

09/27/64

02139

div

09/30/64

02139

div

04/18/64

02139

mar

04/15/64

02139

mar

09/15/64

02138

mar

03/23/63

02141

mar

03/18/63

02141

mar

09/13/64

02138

mar

09/07/64

02138

mar

05/14/61

02138

sin

05/08/61

02138

sin

Figure 7: Stored Values for the quasi-identier of PT of Figure 1

D4 = f[60 ,70]g

D3 = f[60 ,64];[65 ,70]g

D2 = f61;:::;70g

D1 = f1=61;:::;12=70g

D0 = f1=1=60;:::;12=31=70g

DGHD0

S1 = fnot releasedg

S0 = fmale;femaleg

DGHS0

M2 = fnot releasedg

M1 = fonce married; never marriedg

M0 = fmarried; divorced; widow; singleg

DGHM0

not released

once married

never married

married divorced widow

single

HHHHY

36QQQk

VGHM0

Figure 8: Examples of Domain and Value Generalization Hierarchies

Eth

DOB

Sex

ZIP

Mar

02130

div

02130

div

02130

mar

02130

mar

02130

mar

02140

mar

02140

mar

02130

mar

02130

mar

02130

sin

02130

sin

GT[0;2;0;1;0]

Eth

DOB

Sex

ZIP

Mar

09/64

02130

once

09/64

02130

once

04/64

02130

once

04/64

02130

once

09/64

02130

once

03/63

02140

once

03/63

02140

once

09/64

02130

once

09/64

02130

once

05/61

02130

never

05/61

02130

never

GT[0;1;0;1;1]

Eth

DOB

Sex

ZIP

Mar

09/64

n r

02139

div

09/64

n r

02139

div

04/64

n r

02139

mar

04/64

n r

02139

mar

09/64

n r

02138

mar

03/63

n r

02141

mar

03/63

n r

02141

mar

09/64

n r

02138

mar

09/64

n r

02138

mar

05/61

n r

02138

sin

05/61

n r

02138

sin

GT[1;1;1;0;0]

Figure 9: Example of minimal generalized tables for the table of Figure 7