high-Level Data structures
By Yannis Smaragdakis
it is a defining moment for many a CS
student when they start thinking about
data structures relationally; when
maps, n-tuples, sets, and bags become
more natural terms of discourse than
lists, vectors, hash tables, and binary
trees. I often see a student’s eyes light
up when realizing that what they need
is, say, a mapping from some Xs to an
unordered set of Ys, with the correctness of the rest of their program unaffected by whether this mapping is
implemented by hash tables and lists,
trees and vectors, or something else.
This difference in thinking is crucial
for programming projects with any
amount of data structure complexity, from the real world all the way
down to university coursework. In a
compiler course project, for instance,
the difference is striking. Students
who manage to grasp the separation
of relational and low-level thinking
have no trouble adapting to complex
requirements. Students who cannot
escape their preoccupation with low-level data structures often spend all
their time changing the intertwined
implementations of a mess of lexically
scoped symbol tables, member tables,
and global type structures.
This lifting of data structure thinking to the relational level has long inspired computer scientists. Relational
databases have given us a vocabulary
for data management, but have also
hidden the internals of the implementation from the end programmer.
General-purpose programming has
largely remained low-level. Even when a
library offers an abstraction with different implementations—for example, a
map that can be implemented via either
a hash table or a binary tree—standard
relational reasoning is not supported
automatically. A map from a pair of X
and Y values to Z values is equivalent to
a map from X values to maps from Ys to
Zs, for instance, but the programmer
cannot represent both forms equivalently and has to structure the code for
one or the other. As a result, data structure code is almost universally too low-
hawkins et al.
make a contribution
to a long line of
work in elevating
level to be amenable to compiler analysis, for performance or correctness.
In the following paper, the authors
aim at elevating data structure programming to the relational realm. The
separation of the relational view from
the low-level implementation view of
the data structure is performed in a
disciplined way, and even automated. Given a relational specification
of data to be stored, the system derives and verifies a “decomposition:”
a combination of data structures that
together implement the relational
specification. Insertions, deletions,
and lookups on the data structure are
performed via actions that remain unchanged for different decompositions.
Eventually the system automatically
produces code that correctly implements the desired query. Key novelties
of the approach are the use of functional dependencies over data and a
type system that verifies the adequacy
of a decomposition for representing
the relations and dependencies.
In a greater context, Hawkins et al.
make a contribution to a long line of
work in elevating data manipulation.
The SETL language4 pioneered high-
level data structures by offering a pro-
gramming model based on set theory,
together with automatic selection of
concrete data structures at compile
time. Even closer to the authors’ work
in terms of programming model, Ba-
tory introduced a sequence of special-
ized languages1, 2, 5 exploring the refine-
ment of a relational specification to
concrete, intertwined data structures.
The program is written in terms of que-
ries oblivious to the specific data struc-
tures used. Once the data structures are
selected (manually, in a single configu-
ration statement2, 5 or automatically via
search in the space of data structures1),
a query is classified as a scan, an ordered
range lookup, or a direct map lookup,
and efficient code is generated for the
given structures. Additionally, other in-
teresting angles on the relational view
of data structures have been examined,
such as the Collection Programming
Language (CPL) by Buneman et al. 3 CPL
offers declarative whole-structure op-
erations—an approach of great power,
but probably more difficult to integrate
in general-purpose code with impera-
tive updates and a traversal state.
1. Batory, d., Chen, g., Robertson, E. and Wang, T. design
wizards and visual programming environments for
genVoca generators. IEEE Transactions on Software
Engineering 26, 5 (2000), 441–452.
2. Batory, d. and Thomas, J. P2: A lightweight dBmS
generator. Journal of Intelligent Information Systems
9 (1995), 107–123.
3. Buneman, P., Naqvi, S., Tannen, V. and Wong, L.
Principles of programming with complex objects and
collection types. Theoretical Computer Science 149, 1
4. Schwartz, J., dewar, R., Schonberg, E. and dubinsky,
E. Programming With Sets: An Introduction to SE TL.
5. Smaragdakis, Y. and Batory, d. diSTiL: A transformation
library for data structures. In Proceedings of the
Conference on Domain-Specific Languages. USENIX, 1997.
Yannis Smaragdakis ( firstname.lastname@example.org) is an
associate professor in the department of Informatics at
the University of Athens, greece.