practice
DOI: 10.1145/1629175.1629195
Article development led by queue.acm.org
Data in Flight
How streaming SQL technology can help solve the Web 2.0 data crunch.
By Julian Hyde
Web applications produce data at colossal rates, and
those rates compound every year as the Web becomes
more central to our lives. Other data sources such
as environmental monitoring and location-based
services are a rapidly expanding part of our day-to-day
experience. Even as throughput is increasing, users
and business owners expect to see their data with ever-decreasing latency. Advances in computer hardware
(cheaper memory, cheaper disk, and more processing
cores) are helping somewhat, but not enough to keep
pace with the twin demands of rising throughput and
decreasing latency.
The technologies for powering Web applications
must be fairly straightforward for two reasons:
first, because it must be possible to evolve a Web
application rapidly and then to deploy it at scale with
a minimum of hassle; second, because the people
writing Web applications are generalists and are not
prepared to learn the kind of complex,
hard-to-tune technologies used by systems programmers.
The streaming query engine is a
new technology that excels in processing rapidly flowing data and producing
results with low latency. It arose out of
the database research community and
therefore shares some of the characteristics that make relational databases
popular, but it is most definitely not a
database. In a database, the data arrives first and is stored on disk; then users apply queries to the stored data. In
a streaming query engine, the queries
arrive before the data. The data flows
through a number of continuously executing queries, and the transformed
data flows out to applications. One
might say that a relational database processes data at rest, whereas a streaming
query engine processes data in flight.
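
The contrast between data at rest and data in flight can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular streaming query engine is built: the names `StreamingEngine`, `register`, `push`, and `query_at_rest` are hypothetical, and records are modeled as plain dictionaries.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Predicate = Callable[[Record], bool]
Transform = Callable[[Record], object]
Sink = Callable[[object], None]


def query_at_rest(stored: List[Record], predicate: Predicate,
                  transform: Transform) -> List[object]:
    """Database style: the data is already stored; the query arrives later."""
    return [transform(r) for r in stored if predicate(r)]


class StreamingEngine:
    """Toy streaming style (hypothetical): queries are registered first,
    then each arriving record flows through every standing query."""

    def __init__(self) -> None:
        self._queries: List[Tuple[Predicate, Transform, Sink]] = []

    def register(self, predicate: Predicate, transform: Transform,
                 sink: Sink) -> None:
        # The query exists before any data has been seen.
        self._queries.append((predicate, transform, sink))

    def push(self, record: Record) -> None:
        # The record is transformed and handed on; the engine does not store it.
        for predicate, transform, sink in self._queries:
            if predicate(record):
                sink(transform(record))


# Usage: the standing query is in place before the first record arrives.
results: List[object] = []
engine = StreamingEngine()
engine.register(lambda r: r["status"] == 500, lambda r: r["url"], results.append)
engine.push({"url": "/a", "status": 200})
engine.push({"url": "/b", "status": 500})
print(results)  # ['/b']
```

In the sketch, the application sees `/b` as soon as that record is pushed, without the record ever being written to storage and queried afterward.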
Tables are the key primitive in a relational database. A table is populated
with records, each of which has the
same record type, defined by a number
of named, strongly typed columns. Records have no inherent ordering. Queries, generally expressed in SQL, retrieve records from one or more tables,
transforming them using a small set of
powerful relational operators.
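
As a small illustration of a table and a SQL query over data at rest, the following sketch uses Python's standard `sqlite3` module with an in-memory database. The table name `page_requests` and its columns are assumptions made up for this example, and SQLite's typing is looser than the strongly typed columns described above, so treat it as an approximation.

```python
import sqlite3

# An in-memory database stands in for data stored on disk.
conn = sqlite3.connect(":memory:")

# Every record in the table shares one record type: named, typed columns.
conn.execute(
    "CREATE TABLE page_requests (url TEXT NOT NULL, status INTEGER NOT NULL)"
)

# The data arrives first and is stored...
conn.executemany(
    "INSERT INTO page_requests (url, status) VALUES (?, ?)",
    [("/home", 200), ("/missing", 404), ("/home", 200)],
)

# ...and only then is a query applied. Because records have no inherent
# ordering, any ordering of the result must be requested with ORDER BY.
rows = conn.execute(
    "SELECT url, COUNT(*) AS hits "
    "FROM page_requests "
    "WHERE status = 200 "
    "GROUP BY url "
    "ORDER BY hits DESC"
).fetchall()
conn.close()

print(rows)  # [('/home', 2)]
```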
Streams are the corresponding
primitive in a streaming query engine. A stream has a record type, just
like a table, but records flow through
a stream rather than being stored. Records in a streaming system are inherently ordered; in fact, each record has a
time stamp that indicates when it was
created. The relational operations supported by a relational database have
analogues in a streaming system and
are sufficiently similar that SQL can be
used to write streaming queries.
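
To make the analogy concrete, here is a minimal sketch of a stream of time-stamped records with streaming counterparts of two relational operators, selection and projection, written as Python generators. The record type `PageRequest`, its field names, and the helper names `ordered`, `where`, and `select` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class PageRequest:
    """One record type shared by every record in the stream."""
    timestamp: float  # when the record was created
    url: str
    status: int


def ordered(stream: Iterable[PageRequest]) -> Iterator[PageRequest]:
    """Pass records through, checking that time stamps never go backward."""
    last = float("-inf")
    for record in stream:
        if record.timestamp < last:
            raise ValueError("stream records must arrive in time-stamp order")
        last = record.timestamp
        yield record


def where(stream: Iterable[PageRequest],
          predicate: Callable[[PageRequest], bool]) -> Iterator[PageRequest]:
    """Streaming analogue of relational selection (SQL WHERE)."""
    for record in stream:
        if predicate(record):
            yield record


def select(stream: Iterable[PageRequest],
           projection: Callable[[PageRequest], Tuple]) -> Iterator[Tuple]:
    """Streaming analogue of relational projection (SQL SELECT list)."""
    for record in stream:
        yield projection(record)


# Usage: records flow through the operators one at a time and are not stored.
requests = [
    PageRequest(1.0, "/home", 200),
    PageRequest(2.0, "/missing", 404),
    PageRequest(3.0, "/home", 200),
]
errors = select(
    where(ordered(requests), lambda r: r.status >= 400),
    lambda r: (r.timestamp, r.url),
)
print(list(errors))  # [(2.0, '/missing')]
```

Because each operator is a generator, a record is handed from one operator to the next as soon as it arrives, which loosely mirrors how a record passes through a chain of continuously executing queries.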
To illustrate how a streaming query
engine can solve problems involving data
in flight, consider the following example.
Streaming Queries for Click-Stream Processing
Suppose we want to monitor the most
popular pages on a Web site. Each Web
server request generates a line to the