Strictly speaking, stream processing
is not a data management task; it is a
data-filtering task. That is, data is produced at some source and sent streaming to recipients that filter the stream
for “interesting” events. For example,
financial institutions watch stock tickers looking for hotly traded items and/or
stocks that aren’t being traded as
heavily as expected.
The reason that these stream-processing applications are included
here is a linguistic one: the filters that
are typically desired in these environments look like SQL; however, while
SQL was designed to operate on persistently stored tables, these queries act
upon a real-time stream of data values.
Stonebraker explains in some depth
how poorly equipped databases are for
this task. Perhaps the bigger surprise
is not that database systems are poorly
equipped to address this task, but that
because SQL appears to be the “right”
query language, developers use relational database systems for applications that have no persistent storage!
Stream processing represents a
class of applications that could benefit
from a SQL-like query language atop a
data management system with properties that are radically different from
an RDBMS. Since streaming queries
frequently operate on data observed
during a time window, some transient
local storage is necessary, but this storage needn’t be persistent or transactional, nor must it support complex query processing. Instead, it must be blindingly fast.
Although relational databases are well-equipped to handle dynamic queries
over relatively static or slowly changing
data, this application class is characterized by a fairly static query set over
highly dynamic data.
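The "static query over dynamic data" pattern can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular stream engine: the class name `SlidingWindowVolume`, the `(timestamp, symbol, volume)` tick layout, and the threshold query are all hypothetical, and timestamps are assumed to arrive in nondecreasing order. The only state is a transient in-memory window, with no persistence and no transactions.

```python
from collections import defaultdict, deque


class SlidingWindowVolume:
    """Tracks traded volume per symbol over the last `window_seconds`.

    Hypothetical illustration: state lives only in memory and is
    discarded as soon as it falls out of the time window.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()           # (timestamp, symbol, volume), oldest first
        self.volume = defaultdict(int)  # symbol -> total volume inside the window

    def add(self, timestamp, symbol, volume):
        """Ingest one tick; assumes timestamps never decrease."""
        self.events.append((timestamp, symbol, volume))
        self.volume[symbol] += volume
        self._expire(timestamp)

    def _expire(self, now):
        """Drop ticks whose timestamp is at or before now - window_seconds."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, symbol, volume = self.events.popleft()
            self.volume[symbol] -= volume
            if self.volume[symbol] == 0:
                del self.volume[symbol]

    def hot(self, threshold):
        """The fixed 'query': symbols whose windowed volume is >= threshold."""
        return sorted(s for s, v in self.volume.items() if v >= threshold)


if __name__ == "__main__":
    window = SlidingWindowVolume(window_seconds=10)
    for tick in [(0, "AAA", 500), (1, "BBB", 100), (2, "AAA", 700)]:
        window.add(*tick)
    print(window.hot(threshold=1000))
```

The query (`hot`) never changes while the data underneath it turns over continuously, which is the reverse of the workload a relational engine is tuned for. Each tick costs only a few in-memory operations, and nothing is ever written to disk.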
Flexible Solutions
Relational systems have been designed
to satisfy online transaction processing (OLTP) workloads characterized by
ad hoc queries, significant write traffic,
and the need for strong transactional
and integrity guarantees. In contrast,
the applications described here are almost all read-dominated, and streaming applications don’t even take advantage of persistent data; they use only a SQL-like
query language. Few of these applications require transactional guarantees,
and there is little inherently relational
about the data being accessed. Thus,
the data management question becomes how best to satisfy the needs of
these different types of applications.
We claim (like Stonebraker) that there
really is no single right answer. Instead, we must focus on flexible solutions that can be tailored to the needs
of a particular application.
There are several ways to deliver flexibility in today’s changing data environment. The back-to-basics approach is
to require that every single application
build its own data storage service. This
option, while seemingly simple, is impractical in all but the simplest of applications. Some data-intensive applications running today, however, are built
upon simple, homegrown solutions.
The second way to address the need
for flexibility is to provide a smorgasbord of data management options,
each of which addresses a particular
application class. We see this approach
emerging in the traditional relational
market, where the SQL veneer is used to
hide the different capabilities required
for OLTP and data warehousing.
The third approach to flexibility is to
produce a storage engine that is more
configurable so that it can be tuned to
the requirements of individual applications. This solution has the advantage
of allowing concentrated investment
in a single storage system, improving quality. Configurability, however,
makes new demands of developers
who use the database, since they must
understand the configuration options
and then integrate the data management component properly into their
product designs.
In fact, the solution emerging in the
marketplace is to have a handful of reasonably configurable storage systems,
each of which is useful across a broad
application class.
There are fundamentally two properties that a solution must possess to
address the wide range of application
needs emerging today: modularity
and configurability. Few applications
require all the functionality possible
in a data management system. If an
application doesn’t need a given piece of functionality, it should not have to “pay” for
that functionality in size (footprint,
memory consumption, disk utilization, and so on), complexity, or cost.
Therefore, a flexible engine must allow