3. 4. Computing joins
There are two kinds of joins supported by CryptDB:
equi-joins, in which the join predicate is based on equality, and
range joins, which involve order checks. To perform an
equi-join of two encrypted columns, the columns should
be encrypted with the same key so that the server can see
matching values between the two columns. At the same
time, to provide better privacy, the DBMS server should not
be able to join columns for which the application did not
request a join, so columns that are never joined should not
be encrypted with the same keys.
If the queries that can be issued, or the pairs of columns
that can be joined, are known a priori, equi-join is easy to
support: CryptDB can use the DET encryption scheme with
the same key for each group of columns that are joined
together. However, the challenging case is when the proxy
does not know the set of columns to be joined a priori, and
hence does not know which columns should be encrypted
with matching keys.
To solve this problem, we introduce a new cryptographic
primitive, JOIN-ADJ (adjustable join), which allows the
DBMS server to adjust the key of each column at runtime.
Intuitively, JOIN-ADJ can be thought of as a “keyed random hash” with the additional property that hashes can be
adjusted to change their key without access to the plaintext.
JOIN-ADJ is a deterministic function of its input, which
means that if two plaintexts are equal, the corresponding
JOIN-ADJ values are also equal. With JOIN-ADJ, initially,
each column uses a different key for the JOIN layer, thus preventing any joins between columns. When a query requests
a join, the proxy gives the DBMS server an “adjustment”
key to adjust the JOIN-ADJ values in one of the two columns
(the first column in lexicographic order), so that it matches
the JOIN-ADJ key of the other column. After the adjustment,
the columns share the same JOIN-ADJ key, allowing the
DBMS server to join them for equality (for this or future queries). Our previous publications18, 19 describe the JOIN-ADJ
scheme formally and prove its security guarantees.
For range joins, a similar dynamic readjustment scheme
is difficult to construct due to the lack of structure in OPE
schemes. Instead, CryptDB requires that pairs of columns
that will be involved in such joins be declared by the application ahead of time, so that matching keys are used for layer
OPE-JOIN of those columns; otherwise, the same key will be
used for all columns at layer OPE-JOIN. Fortunately, range
joins are rare; they are not used in any of our example applications, and are used in only 50 out of 128,840 columns in a
large SQL query trace we describe in Section 5.
3. 5. other queries and limitations
CryptDB supports most relational queries and aggregates
on standard data types, such as integers and text/varchar
types. Additional operations can be added to CryptDB by
extending its existing onions, or adding new onions for specific data types (e.g., spatial and multidimensional range
queries21). Alternatively, in some cases, it may be possible to
map complex unsupported operations to simpler ones (e.g.,
extracting the month out of an encrypted date is easier if the
date’s day, month, and year fields are encrypted separately).
There are certain computations CryptDB cannot support on
encrypted data. For example, it does not support order comparison with a summation, such as WHERE salary > age + 10. One
could support such a query by splitting it into different queries
and having the proxy re-encrypt intermediate results.
Most other DBMS mechanisms, such as transactions and
indexing, work the same way over encrypted data as they do
over plaintext, with no modifications.
The CryptDB proxy is built on top of mysql-proxy, and
consists of a C++ library and a Lua module. The C++ library
consists of a query parser; a query encryptor/rewriter, which
encrypts fields or includes UDFs in the query; and a result
decryption module. The query rewriter operates on the
abstract syntax tree (AST) of the SQL query. Given an expression, the rewriter produces replacement expressions for
the value of the original expression encrypted with different encryption types (e.g., RND, DET, or just “plaintext”).
We use NTL and OpenSSL for cryptographic operations.
Our prototype consists of 18,000 lines of C++ code and
~150 lines of Lua code, with another 10,000 lines of test
code. CryptDB is portable; we have implemented versions
for both Postgres 9.018 and MySQL 5. CryptDB requires
only UDF support from the DBMS and does not change the
DBMS server software.
5. eXPeRimentaL eVaLuation
In this section, we evaluate three aspects of CryptDB: what
types of queries and applications does CryptDB support,
what is the level of security that CryptDB provides, and what
is the performance impact of using CryptDB?
We analyze the functionality and security of CryptDB
on five applications and one large trace: phpBB (an open-source Web forum application), HotCRP (a conference
management system), grad-apply (the MIT EECS graduate
admission application), Open-EMR (an electronic medical
records application storing patient medical data), TPC-C
(an industry-standard database benchmark), and a trace
of SQL queries from a popular MySQL server at MIT, sql.
mit.edu. This server is used primarily by Web applications
running on scripts.mit.edu, a shared Web application
hosting service operated by MIT’s Student Information
Processing Board (SIPB). In addition, this SQL server is used
by a number of applications that run on other machines
and use sql.mit.edu only to store their data. Our query trace
spans about ten days, and includes approximately 126 million queries over 1193 databases and 18,162 queries. Each
database is likely to be a separate instance of an application.
All these applications and the large SQL trace contain sensitive information that should be protected (e.g., medical
records, student grades, and private messages).
In the first four applications (not counting TPC-C and
the large trace), we manually identify which columns are
likely to be sensitive and encrypt only those. Some fields
were clearly sensitive (e.g., grades, private messages, and
medical information), but others were only marginally so
(e.g., the time at which a message was posted). There was
no clear threshold between sensitive or not, but it was clear