sourced from fewer than three people
are suppressed. In addition, complementary suppression has been applied to prevent subtraction attacks
on the small cells.
Encoding the constraints. The database can be reconstructed by treating the attributes of the persons living on the block as a collection of
variables. A set of constraints is then
extracted from the published table.
The database reconstruction finds a
set of attributes that are consistent
with the constraints. If the statistics
are highly constraining, then there
will be a single possible reconstruction, and the reconstructed microdata will necessarily be the same as
the underlying microdata used to
create the original statistical publication. Note that there must be at
least one solution because the table
is known to be formulated from a
real database.
For example, statistic 2B states
that three males live in the geography.
This fictional statistical agency has
previously published technical speci-
fications that its computers internally
represent each person’s age as an in-
teger. The oldest verified age of any
human being was 122.14 If we allow
for unreported supercentenarians
and consider 125 to the oldest possi-
ble age of a human being, there are
only a finite number of possible age
combinations, specifically:
Within the 317,750 possible age com-
binations, however, there are only 30
combinations that satisfy the constraints
of having a median of 30 and a mean of
44, as reported in Table 1. (Notice that
the table does not depend on the oldest
possible age, so long as it is 101 or over.)
Applying the constraints imposed by the
published statistical tables, the possible
combinations of ages for the three males
can be reduced from 317,750 to 30. Table
2 shows the 30 possible ages for which
the median is 30 and the mean is 44.
To mount a full reconstruction attack, an attacker extracts all of these
constraints and then creates a single
mathematical model embodying all
constraints. An automated solver can
then find an assignment of the variables that satisfies these constraints.
To continue with the example, statistic 1A establishes the universe of the
constraint system. Because the block
contains seven people, and each has
four attributes (age, sex, race, and marital status), that creates 28 variables, representing those four attributes for each
person. These variables are A1… A7
(age), S1… S7 (sex), R1… R7 (race), and
M1… M7 (marital status), as shown in
Table 3. The table shows the variables
associated with the DRA. The coding of
the categorical attributes is presented
in the key.
Because the mean age is 38, we
know that:
A1 + A2 + A3 + A4 + A5 + A6 + A7 =
7 ×
38
The language Sugar13 is used to encode the constraints in a form that can
be processed by a SAT (satisfiability)
solver. Sugar represents constraints as
s-expressions.
11 For example, the age
combination equation can be represented as:
; First define the integer
variables, with the range
0..125
(int A1 0 125)
(int A2 0 125)
(int A3 0 125)
(int A4 0 125)
(int A5 0 125)
(int A6 0 125)
(int A7 0 125)
; Statistic 1A: Mean age is 38
(= (+ A1 A2 A3 A4 A5 A6 A7)
(* 7 38)
)
Once the constraints in the statistical table are turned into s-expressions,
the SAT solver solves them with a brute-force algorithm. Essentially, the solver
explores every possible combination of
the variables, until a combination is
found that satisfies the constraints. Using a variety of heuristics, SAT solvers
are able to rapidly eliminate many
combinations of variable assignments.
Despite their heuristic complexity,
The Boolean SAT problem was the first to be proven NP-complete.
9 This problem asks,
for a given Boolean formula, whether replacing each variable with either true or false
can make the formula evaluate to true. Modern SAT solvers work well and reasonably
quickly in a variety of SAT problem instances and up to reasonably large instance sizes.
Many modern SAT solvers use a heuristic technique called CDCL (conflict-driven
clause learning).
10 Briefly, a CDCL algorithm:
1. Assigns a value to a variable arbitrarily.
2. Uses this assignment to determine values for the other variables in the formula
(a process known as unit propagation).
3. If a conflict is found, backtracks to the clause that made the conflict occur and
undoes variable assignments made after that point.
4. Adds the negation of the conflict-causing clause as a new clause to the master
formula and resumes from step 1.
This process is fast at solving SAT problems because adding conflicts as new clauses
has the potential to avoid wasteful “repeated backtracks.” Additionally, CDCL and its
predecessor algorithm, DPLL (Davis–Putnam–Logemann–Loveland), are both provably
complete algorithms: they will always return either a solution or “Unsatisfiable” if given
enough time and memory. Another advantage is that CDCL solvers reuse past work
when producing the universe of all possible solutions.
A wide variety of SAT solvers are available to the public for minimal or no cost.
Although a SAT solver requires the user to translate the problem into Boolean formulae
before use, programs such as Naoyuki Tamura’s Sugar facilitate this process by translating
user-input mathematical and English constraints into Boolean formulae automatically.
SAT and SAT Solvers
Sugar input is given in a standard constraint satisfaction problem (CSP) file format.
A constraint must be given on a single line of the file, but here we separate most
constraints into multiple lines for readability. Constraint equations are separated by
comments describing the statistics they encode.
Input for the model in this article is available at https://queue.acm.org/appendices/
Garfinkel_SugarInput.txt.
Sugar Input