soundness of the compiler (Theorem 2) guarantees that the
resulting data structures are correct by construction.
7. Discussion and Related Work
7.1. Databases and relational programming
Literature on lightweight databases also advocates a programming model based on relations, which are implemented in the backend using container data structures [2, 3, 7, 22].
A novel aspect of our approach is that our relations can
have specified restrictions (specifically, functional
dependencies). These restrictions, together with the
fact that in our compilation context the set of possible
queries is known in advance, enable a wider range of
possible implementations than a traditional database,
particularly representations using complex patterns of
sharing. Also new in our work is the notion of adequate
decompositions and a proof that operations on adequate
decompositions are sound with respect to their relational
specifications. Our definition of adequacy is informed in
particular by the classic Boyce-Codd Normal Form in relational schema design.
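As an illustration, consider a hand-written sketch (not output of our compiler; the field names are illustrative) of a scheduler-like relation with columns ns, pid, and state and the functional dependency pid → ns, state. Because each pid determines at most one tuple, a hash index keyed on pid and per-state linked lists can safely thread through the very same nodes:

#include <unordered_map>

// Sketch of a shared representation for a relation with columns
// (ns, pid, state) and functional dependency pid -> ns, state.
// Two index structures share the same nodes.
struct Node {
    int ns, pid, state;
    Node *prev = nullptr, *next = nullptr;  // links for the per-state list
};

struct SharedRep {
    std::unordered_map<int, Node*> by_pid;   // lookups keyed by pid
    Node* by_state[2] = {nullptr, nullptr};  // list heads: 0 = sleeping, 1 = running

    // Since pid determines ns and state, every pid has at most one node,
    // so inserting each node into both structures cannot create
    // inconsistent duplicates. A full implementation would first remove
    // any existing tuple with this pid to maintain the dependency.
    void insert(int ns, int pid, int state) {
        Node* n = new Node{ns, pid, state};
        n->next = by_state[state];
        if (n->next) n->next->prev = n;
        by_state[state] = n;
        by_pid[pid] = n;
    }
};

The functional dependency is precisely what makes this cross-structure sharing sound: without it, the two indices could disagree about the tuples they represent.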
Unlike previous lightweight database work, we describe
a dynamic auto-tuner that can automatically synthesize the
best decomposition for a particular relation, and we present our experience with a full implementation of these
techniques in practice. The auto-tuner framework has a
similar goal to AutoAdmin [6]. AutoAdmin takes a set of tables,
together with a distribution of input queries, and identifies
a set of indices that are predicted to produce the best overall performance under the query optimizer’s cost model.
The details differ because our decomposition and query languages are unlike those of a conventional database.
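Schematically, a dynamic auto-tuner of this kind reduces to timing a recorded workload against each candidate decomposition and keeping the fastest; the sketch below uses illustrative names, not our actual implementation:

#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

struct Candidate {
    // Stand-in for a candidate decomposition; run_benchmark replays a
    // recorded mix of queries against it.
    std::function<void()> run_benchmark;
};

size_t pick_best(const std::vector<Candidate>& candidates) {
    size_t best = 0;
    auto best_time = std::chrono::nanoseconds::max();
    for (size_t i = 0; i < candidates.size(); ++i) {
        auto start = std::chrono::steady_clock::now();
        candidates[i].run_benchmark();
        auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start);
        if (elapsed < best_time) { best_time = elapsed; best = i; }
    }
    return best;  // index of the fastest candidate under this workload
}

Measuring actual running time, rather than predicting it with a cost model as AutoAdmin does, is feasible here because the set of queries is fixed at compile time.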
Many authors propose adding relations to both general- and special-purpose programming languages [5, 17, 19, 23]. We focus on the orthogonal problem of specifying and
implementing the underlying representations for relational data. Data models such as E/R diagrams and UML
also rely heavily on relations. One potential application of
our technique is to close the gap between modeling languages and implementations.
7.2. Synthesizing data representations
The problem of automatic data structure selection was
explored in SETL [20] and has also been pursued for Java collection implementations [21]. The SETL representation sublanguage [8]
maps abstract SETL set and map objects to implementations,
although the details are quite different from our work. Unlike
SETL, we handle relations of arbitrary arity, using functional
dependencies to enforce complex sharing invariants. In SETL,
set representations are dynamically embedded into carrier
sets under the control of the runtime system, whereas our compiler synthesizes low-level representations for a specific decomposition with no runtime overhead.
Synthesizing specialized data representations has previously been considered in other domains. Ahmed et al. [1]
proposed transforming dense matrix computations into
implementations tailored to specific sparse representations
as a technique for handling the proliferation of complicated
sparse representations.
7.3. Synthesis versus verification approaches
A key advantage of data representation synthesis over
handwritten implementations is that the synthesized operations are correct by construction, subject to the correctness of the compiler. We assume the existence of a library
of data structures; the data structures themselves can be
proved correct using existing techniques [24]. Our system provides a modular way to assemble individually correct data
structures into a complete and correct representation of a
program’s data.
The Hob system uses abstract sets of objects to specify and verify end-to-end properties of systems using multiple data structures that share objects [14]. Monotonic
typestates enable aliased objects to monotonically
change their typestates in the presence of sharing without violating type safety [10]. Researchers have developed
systems that have mechanically verified data structures
(e.g., hash tables) that implement a binary relational interface [24]. The relation implementation presented in
this paper is more general (it can implement relations of
arbitrary arity) and solves problems orthogonal to those
addressed in previous research.
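The distinction can be made concrete with a sketch (a naive reference store with hypothetical names, not one of the verified structures cited above): a binary interface maps keys to values, whereas an n-ary interface operates on tuples of arbitrary width.

#include <set>
#include <tuple>

// Hypothetical n-ary relational interface. A binary key/value
// interface is just the special case Relation<K, V>.
template <typename... Columns>
class Relation {
    std::set<std::tuple<Columns...>> tuples;  // naive backing store
public:
    void insert(Columns... cs) { tuples.emplace(cs...); }
    void remove(Columns... cs) { tuples.erase(std::make_tuple(cs...)); }
    bool contains(Columns... cs) const {
        return tuples.count(std::make_tuple(cs...)) != 0;
    }
};

// For example, a ternary relation over <ns, pid, state>:
// Relation<int, int, int> procs;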
7.4. Specifying and inferring shared representations
The decomposition language provides a “functional”
description of the heap that separates the problem of modeling individual data structures from the problem of modeling
the heap as a whole. Unlike previous work, decompositions
allow us to state and reason about complex sharing invariants that are difficult to state and impossible to verify using
previous techniques. Separation logic allows elegant specifications of disjoint data structures [18], and mechanisms have
been added to separation logic to express some types of
sharing [4, 9]. Some static analysis algorithms infer sharing between data structures in low-level code [12, 13, 15, 16]; however, verifying overlapping shared data structures in general
remains an open problem for such approaches. The combination of relations and functional dependencies allows us
to reason about sharing that is beyond current static analysis techniques.
8. Conclusion
We have presented a system for specifying and operating
on data at a high level as relations while correctly compiling those relations into a composition of low-level data
structures. Most unusual is our ability to express, and prove
correct, the use of complex sharing in the low-level representation. We show, using three real-world systems, that data representation synthesis leads to code that is simpler, correct by construction, and comparable in performance to existing handwritten code.
Acknowledgments
This work was supported by NSF grants CCF-0702681 and
CNS-050955. The fourth author thanks Viktor Kuncak,
Patrick Lam, Darko Marinov, Alex Sălcianu, and Karen Zee
for discussions in 2001–2002 on programming with relations, with the relations implemented by automatically
generated linked data structures. The second author thanks