practice
DOI: 10.1145/2160718.2160735
Article development led by
queue.acm.org
Web and mobile applications are increasingly
composed of asynchronous and real-time
streaming services and push notifications.
By Erik Meijer
Your Mouse Is a Database
Among the hottest buzzwords in the IT industry
these days is “big data,” but the “big” is something of a
misnomer: big data is not just about volume, but also
about velocity and variety:⁴
• The volume of data ranges from a small number
of items stored in the closed world of a conventional
RDBMS (relational database management system) to
a large number of items spread out over a large cluster
of machines or across the entire World Wide Web.
• The velocity of data ranges from the consumer
synchronously pulling data from the source to the
source asynchronously pushing data to its clients, at
different speeds, ranging from millisecond-latency
push-based streams of stock quotes to reference data
pulled by an application from a central repository
once a month.
• The variety of data ranges from SQL-style
relational tuples with foreign-/primary-key
relationships to coSQL⁶-style objects or graphs with
key-value pointers, or even binary data such as videos
and music.
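The pull and push ends of the velocity dimension can be made concrete with a small sketch. The Python below is illustrative only and does not come from the article: `quotes_pull`, `QuoteSource`, `subscribe`, and `publish` are hypothetical names, the prices are made up, and the push side is simulated with plain synchronous callbacks rather than a real asynchronous transport.

```python
from itertools import count, islice
from typing import Callable, Iterator, List


def quotes_pull() -> Iterator[float]:
    """Pull: an unbounded stream of prices; nothing is computed until asked for."""
    for tick in count():
        yield 100.0 + tick * 0.5  # made-up price series


class QuoteSource:
    """Push: the source decides when data flows and calls its subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[float], None]] = []

    def subscribe(self, on_next: Callable[[float], None]) -> None:
        # Register a callback to be invoked for every published item.
        self._subscribers.append(on_next)

    def publish(self, price: float) -> None:
        # Deliver the item to every registered subscriber.
        for on_next in self._subscribers:
            on_next(price)


# Pull: the consumer drives the pace by requesting exactly three items.
pulled = list(islice(quotes_pull(), 3))

# Push: the consumer registers a callback and the source drives the pace.
pushed: List[float] = []
source = QuoteSource()
source.subscribe(pushed.append)
for price in (100.0, 100.5, 101.0):
    source.publish(price)

print(pulled)
print(pushed)
```

In the pull half the consumer decides how many items to take from a potentially endless source; in the push half the consumer only reacts to whatever the source chooses to send.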
If we draw a picture of the design
space for big data along these three
dimensions of volume, velocity, and
variety, then we get the big-data cube
shown in Figure 1. Each of the eight
corners of the cube corresponds to a
(well-known) database technology.
For example, the traditional RDBMS
is at the top-back corner with coordinates (small, pull, fk/pk), meaning that
the data sets are small and the system assumes a closed world under the full
control of the database; clients synchronously pull rows out of the database
after they have issued a query; and the data model is based on Codd’s
relational model. Hadoop-based systems such as HBase are on
the front-left corner with coordinates
(big, pull, fk/pk). The data model is
still fundamentally rectangular with
rows, columns, and primary keys, and
results are pulled by the client out of
the store, but the data is stored on a
cluster of machines using some partitioning scheme.
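One way to read the coordinate notation is as a three-field record with two possible values per field. The following Python sketch is an assumption-laden illustration, not the article's own code: `Corner`, `ALL_CORNERS`, `TECHNOLOGY`, and `describe` are hypothetical names, and only the two corners named in the paragraph above are filled in.

```python
from itertools import product
from typing import Dict, NamedTuple


class Corner(NamedTuple):
    volume: str    # "small" or "big"
    velocity: str  # "pull" or "push"
    variety: str   # "fk/pk" or "k/v"


# Two values per dimension give 2 * 2 * 2 = 8 corners.
ALL_CORNERS = [
    Corner(*coords)
    for coords in product(("small", "big"), ("pull", "push"), ("fk/pk", "k/v"))
]

# Only the corners named so far in the text are mapped to a technology.
TECHNOLOGY: Dict[Corner, str] = {
    Corner("small", "pull", "fk/pk"): "traditional RDBMS",
    Corner("big", "pull", "fk/pk"): "Hadoop-based systems such as HBase",
}


def describe(corner: Corner) -> str:
    """Look up the technology at a corner, if this sketch covers it."""
    return TECHNOLOGY.get(corner, "not covered in this sketch")


for corner in ALL_CORNERS:
    print(corner, "->", describe(corner))
```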
When moving from the top plane
to the bottom plane, the data model
changes from rows with primary and
foreign keys to objects and pointers.
On the bottom-left corner at coordinates (small, pull, k/v) are traditional
O/R (object/relational) mapping solutions such as LINQ to SQL, Entity
Framework, and Hibernate, which put
an OO (object-oriented) veneer on top
of relational databases. In the front
of the cube is LINQ to Objects with
coordinates (big, pull, k/v). It virtualizes the actual data source using the
IEnumerable<T> interface, which allows for an infinite collection of items
to be generated on the fly. To the right,
the cube changes from batch processing to streaming data where the
data source asynchronously pushes a
stream of items to its clients. Streaming database systems with a rows-and-columns data model such as Percolator, StreamBase, and StreamInsight
occupy the top-right axis.
Finally, on the bottom right at coordinates (big, push, k/v), is Rx (Reactive
Extensions), or as it is sometimes