An Introduction to Data
Representation Synthesis
abstract
We consider the problem of specifying combinations of
data structures with complex sharing in a manner that
is declarative and results in provably correct code. In our
approach, abstract data types are specified using relational algebra and functional dependencies. We describe
a language of decompositions that permits the user to
specify different concrete representations for relations,
and show that operations on concrete representations
soundly implement their relational specification. We also
describe an auto-tuner that automatically identifies the
best decomposition for a particular workload. It is easy to
incorporate data representations synthesized by our compiler into existing systems, leading to code that is simpler,
correct by construction, and comparable in performance
to the code it replaces.
1. intRoDuction
One of the first things a programmer must do when
implementing a system is to commit to particular data
structure choices. For example, consider a simple operating system process scheduler. Each process has an ID
pid, a state (running or sleeping), and a variety of statistics such as the cpu time consumed. Since we need to
find and update processes by ID, we store processes in
a hash table indexed by pid; as we also need to enumerate processes in each state, we simultaneously maintain
a linked list of running processes and a separate list of
sleeping processes.
Whatever our choice of data structures, it has a pervasive
influence on the subsequent code. Moreover, as requirements evolve it is difficult and tedious to change the data
structures to match. For example, suppose we add virtu-alization support by allowing processes with the same
pid number to exist in different namespaces ns, together
with the ability to enumerate processes in a namespace.
Extending the existing data structures to support the
new requirement may require many changes distributed
throughout the code.
Furthermore, invariants on multiple, overlapping data
structures that represent different views of the same data
are hard to state, difficult to enforce, and easy to get wrong.
For the scheduler, we require that each process appears
in both the hash table indexed by process ID and exactly
one of the running or sleeping lists. Such invariants must
be enforced by every piece of code that manipulates the
scheduler’s data structures. It is easy to forget a case, say
by failing to add a process to the appropriate list when it
changes state or by failing to delete a hash table entry when
a process terminates. Invariants of this nature require
deep knowledge about the heap’s structure and are diffi-
cult to enforce through existing static analysis or verifica-
tion techniques.
figure 1. Data representation synthesis.
áns, pid, state, cpuñ
ns, pid ® state, cpu
Relational
Specification
x
yz
w
cpu
ns state
pid ns, pid
RELC
Synthesis
class relation {
void insert(...);
bool query(...);
void update(...);
void remove(...);
};
Low-Level
Implementation
The original version of this paper was published in
the Proceedings of Programming Language Design and
Implementation (PLDI), 2011, ACM.