Web server’s log file describing the
time, the URI of the page, and the IP
address of the requester; and an adapter can continuously parse the log file
and populate a stream with records.
This query computes the number of
requests for each page each minute, as
shown in the accompanying table.
The example here is expressed in
SQLstream’s query language, as are
others in this article. The language is
standard SQL plus streaming extensions.
4 Other streaming query engines
have similar capabilities.
SELECT STREAM ROWTIME,
uri,
COUNT(*)
FROM PageRequests
GROUP BY
FLOOR(ROWTIME TO MINUTE),
uri;
The only SQL extensions used in
this particular query are the STREAM
keyword and the ROWTIME system column. If you removed the STREAM keyword and converted PageRequests
to a table with a column called ROWTIME, you could execute the query in a
conventional database such as Oracle
or MySQL. That query would analyze
all requests that have ever occurred up
until the current moment. If PageRequests is a stream, however, the
STREAM keyword tells SQLstream to
attach to the PageRequests stream
and to apply the operation to all future records. Streaming queries run
forever.
Every minute this query emits a set
of rows, summarizing the traffic for
each page during that minute. The
output rows time-stamped 10:00:00
summarize all requests between 10:00
and 10:01 (including the 10:00 end
point but not including 10:01). Rows in
the PageRequests stream are sorted
by their ROWTIME system column, so
the 10:00:00 output rows are literally
pushed out by the arrival of the first
row time-stamped 10:01:00 or later.
A streaming query engine tends to
process data and deliver results only
when new data arrives, so it is said to
use push-based processing. This is in
contrast to a relational database’s pull-based approach where the application
must poll repeatedly for new results.
The example in Figure 1 computes
URIs for which the number of requests
is much higher than normal. First, the
PageRequests WithCount view computes the number of requests per hour
for each URI over the past hour and
averaged over the past 24 hours. Then
a query selects URIs for which the rate
over the past hour was more than three
times the hourly rate over the past 24
hours.
Unlike the previous query that used
a GROUP BY clause to aggregate many
records into one total per time period,
this query uses windowed aggregate
expressions (aggregate-function OVER
window) to add analytic values to each
row. Because each row is annotated
with its trailing hour’s and day’s statistics, you need not wait for a batch of
rows to be complete. You can use such
a query to continuously populate a
“Most popular pages” list on your Web
site, or an e-commerce site could use
it to detect products selling in higher
than normal volumes.
figure 1. streaming query to find Web pages with higher than normal volume.
CREATE VIEW PageRequestsWithCount AS
SELECT STREAM ROWTIME,
uri,
COUNT(*) OVER lastHour AS hourlyRate,
COUNT(*) OVER lastDay / 24 AS hourlyRateLastDay
FROM PageRequests
WINDOW lastHour AS (
PARTITION BY uri
RANGE INTERVAL ‘ 1’ HOUR PRECEDING)
lastDay AS (
PARTITION BY uri
RANGE INTERVAL ‘ 1’ DAY PRECEDING);
SELECT STREAM
FROM PageRequestsWithCount
WHERE rate > hourlyRateLastDay 3;
comparing Databases and
streaming Query engines
A database and a streaming query engine have similar SQL semantics, but if
you use the two systems for problems
involving data in flight, they behave
very differently. Why is a streaming
query engine more efficient for such
problems? To answer that question, it
helps to look at its pedigree.
Some use the term streaming database, which misleadingly implies
that the system is storing data. That
said, streaming query engines have
very strong family connections with
databases. Streaming query engines
have roots in database research, in
particular the Stanford STREAMS project,
1 the Aurora project at MIT/Brown/
Brandeis,
2 and the Telegraph project
at Berkeley.
3 Streaming query engines
are based on the relational model that
underlies relational databases and, as
we shall see, those underpinnings give
them power, flexibility, and industry
acceptance.
The relational model, first described by E.F. Codd in 1970, is a
simple and uniform way of describing
the structure of databases. It consists
of relations (named collections of records) and a set of simple operators
for combining those relations: select,
project, join, aggregate, and union. A
relational database naturally enforces
data independence, the separation between the logical structure of data and
the physical representation. Because
the query writer does not know how
data is physically organized, a query
optimizer is an essential component
of a relational database, to choose
among the many possible algorithms
for a query.
SQL was first brought to market in
the late 1970s. Some say it is not theoretically pure (and it has since been
extended to encompass nonrelational
concepts such as objects and nested
tables), but SQL nevertheless embodies the key principles of the relational
model. It is declarative, which enables
the query to be optimized, so you (or
the system) can tune an application
without rewriting it. You can therefore
defer tuning a new database schema