until the application is mostly written,
and you can safely refactor an existing
database schema. SQL is simple, reliable, and forgiving, and many developers understand it.
Streams introduce a time dimension into the relational model. You can
still apply the basic operators (select,
project, join, and so forth), but you can
also ask, “If I executed that join query
a second ago, and I execute it again
now, what would be the difference in
the results?”
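One way to picture that question is to evaluate the same join over two snapshots of the data and compare the results. The sketch below is illustrative only: the orders and shipments tables, their columns, and the function names are hypothetical and do not come from any particular product. A streaming engine would be expected to produce the difference incrementally rather than by running the join twice.

```python
from collections import Counter

def join(orders, shipments):
    """Equi-join on order_id, returned as a multiset (Counter) of result rows."""
    shipments_by_id = {}
    for s in shipments:
        shipments_by_id.setdefault(s["order_id"], []).append(s)
    return Counter(
        (o["order_id"], o["item"], s["carrier"])
        for o in orders
        for s in shipments_by_id.get(o["order_id"], [])
    )

def result_delta(before, after):
    """Rows that appeared and rows that disappeared between two evaluations."""
    return after - before, before - after

# Hypothetical snapshots taken one second apart
orders = [{"order_id": 1, "item": "book"}, {"order_id": 2, "item": "lamp"}]
shipments_then = [{"order_id": 1, "carrier": "rail"}]
shipments_now = shipments_then + [{"order_id": 2, "carrier": "air"}]

added, removed = result_delta(join(orders, shipments_then),
                              join(orders, shipments_now))
```

Here `added` holds the single new row for order 2 and `removed` is empty; that pair of multisets is the "difference in the results" the question asks about.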
This allows us to approach problems in a very different way. As an
analogy, consider how you would measure the speed of a car traveling along
the freeway. You might look out the
window for a mile marker, write down
the time, and when you reach the next
mile marker, divide the distance between the mile markers by the elapsed
time. Alternatively, you might use a
speedometer, a device where a needle
is moved based on a generated current that is proportional to the angular velocity of the car’s wheels, which
in turn is proportional to the speed of
the car. The mile-marker method converts position and time into speed,
whereas the speedometer measures
speed directly using a set of quantities
proportional to speed.
Position and speed are connected
quantities; in the language of calculus, speed is the derivative of position with respect to time. Similarly,
a stream is the time derivative of
a table. Just as the speedometer is
the more appropriate solution to the
problem of measuring a car’s speed, a
streaming query engine is often much
more efficient than a relational database for data-processing applications
involving rapidly arriving time-dependent data.
Output from query.
ROWTIME   uri               COUNT(*)
10:00:00  /index.html       15
10:00:00  /images/logo.png  19
10:00:00  /orders.html      6
10:01:00  /index.html       20
10:01:00  /images/logo.png  18
10:01:00  /sitemap.html     2
...
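The output above has the shape of a per-minute request count grouped by URI. As a rough sketch of how such rows could be produced from a stream, assuming requests arrive in timestamp order as (timestamp, uri) pairs, the function below emits one row per URI each time a one-minute window closes. The function name and input format are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

def count_per_minute(requests):
    """Yield (minute, uri, count) rows as each one-minute window closes.

    `requests` is an iterable of (timestamp, uri) pairs in timestamp order.
    """
    current_minute = None
    counts = Counter()
    for ts, uri in requests:
        minute = ts.replace(second=0, microsecond=0)
        if current_minute is not None and minute != current_minute:
            # The previous window is complete: emit its rows and start over
            for u, n in counts.items():
                yield current_minute, u, n
            counts.clear()
        current_minute = minute
        counts[uri] += 1
    # Flush the last window; only relevant for finite inputs such as tests
    if current_minute is not None:
        for u, n in counts.items():
            yield current_minute, u, n

# Hypothetical sample input
sample = [
    (datetime(2010, 1, 1, 10, 0, 5), "/index.html"),
    (datetime(2010, 1, 1, 10, 0, 40), "/index.html"),
    (datetime(2010, 1, 1, 10, 0, 59), "/orders.html"),
    (datetime(2010, 1, 1, 10, 1, 2), "/index.html"),
]
rows = list(count_per_minute(sample))
```

Only the counters for the current minute are kept in memory; earlier requests are discarded as soon as their window has been emitted.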
Streaming Advantage
Why is a streaming query engine more
efficient than a relational database for
data-in-flight problems?
First, the systems express the problems in very different ways. A database
stores data and applications fire queries
(and transactions) at the data. A streaming query engine stores queries, and the
outside world fires data at the queries.
There are no transactions as such, just
data flowing through the system.
The database needs to load and index the data, run the query on the whole
dataset, and subtract previous results. A
streaming query system processes only
new data. It holds only the data that it
needs (for example, the latest minute),
and since that usually fits into memory
easily, no disk I/O is necessary.
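To make the "holds only the data that it needs" point concrete, here is a minimal sketch of a sliding one-minute count. It assumes timestamps are plain seconds and arrive in nondecreasing order; the class name and interface are hypothetical. Each arrival touches only the new event and any events that have just aged out, never the full history.

```python
from collections import deque

class SlidingCount:
    """Count of events in the most recent `window_seconds`, kept in memory."""

    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()  # timestamps still inside the window

    def add(self, ts):
        """Record an event at time `ts` and return the current window count."""
        self.events.append(ts)
        cutoff = ts - self.window
        # Drop events that are a full window old or older
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)

counter = SlidingCount(window_seconds=60)
counts = [counter.add(ts) for ts in (0, 30, 61, 200)]
```

The deque never grows beyond the number of events in one window, which is why the working set usually fits in memory.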
A relational database operates under the assumption that all data is
equally important, but in a business
application, what happened a minute
ago is often more important than what
happened yesterday, and much more
important than what happened a year
ago. As the database grows, it needs
to spread the large dataset across disk
and create indexes so that all of the
data can be accessed in constant time.
A streaming query engine’s working sets are smaller and can be held
in memory; and because the queries
contain window specifications and
are created before the data arrives, the
streaming query engine does not have
to guess which data to store.
A streaming query engine has other
inherent advantages for data in flight:
reduced concurrency control overhead
and efficiencies from processing data
asynchronously. Since a database is
writing to data structures that other applications can read and write, it needs
mechanisms for concurrency control;
in a streaming query engine there is no
contention for locks, because incoming data from all applications is placed
on a queue and processed when the
streaming query engine is ready for it.
In other words, the streaming query
engine processes data asynchronously.
Asynchronous processing is a feature of
many high-performance server applications, from transaction processing to
email processing, as well as Web crawling and indexing. It allows a system to
vary its unit of work—from a record at a
time when the system is lightly loaded
to batches of many rows when the load
is heavier—to achieve efficiency benefits such as locality-of-reference. One
might think an asynchronous system
has a slower response time, because it
processes the data “when it feels like
it,” but an asynchronous system can
achieve a given throughput at much
lower system load, and therefore have
a better response time than a synchronous system. Not only is a relational database synchronous, but it also tends
to force the rest of the application into
a record-at-a-time mode.
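The variable unit of work described above can be sketched with a queue that is drained in whatever batch size happens to be waiting. This is an illustrative sketch using Python's standard `queue` module, not a description of how any particular engine is built; the function name and the `max_batch` parameter are assumptions.

```python
import queue

def drain_batch(q, max_batch=1000):
    """Return the rows currently waiting on `q`, up to `max_batch`.

    Blocks until at least one row is available, then takes the rest
    without waiting, so the batch size tracks the current load.
    """
    batch = [q.get()]
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch

q = queue.Queue()

# Light load: one row waiting, so the batch is a single row
q.put("row-1")
light = drain_batch(q)

# Heavier load: several rows waiting, so they are processed together
for i in range(5):
    q.put(f"row-{i}")
heavy = drain_batch(q, max_batch=3)
rest = drain_batch(q, max_batch=3)
```

Under light load the consumer behaves like a record-at-a-time system; as the queue backs up, each call picks up more rows and amortizes per-batch overhead across them.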
It should be clear by now that push-based processing is more efficient for
data in flight; however, a streaming query engine is not the only way to achieve it.
Streaming SQL does not make anything
possible that was previously impossible. For example, you could solve
many of these problems using a message bus,
messages encoded in XML, and a procedural language to take messages off the
bus, transform them, and put them back
onto the bus. You would, however, encounter problems of performance (parsing XML is expensive), scalability (how
to split a problem into sub-problems
that can be handled by separate threads
or machines), algorithms (how to combine two streams efficiently, correlate
two streams on a common key, or aggregate a stream), and configuration (how
to inform all of the components of the
system if one of the rules has changed).
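As an illustration of the algorithmic work involved, here is a hand-written sketch of one of the operations mentioned above: correlating two streams on a common key within a time window. It assumes both streams have already been merged into a single timestamp-ordered sequence of (ts, side, key, value) tuples; that format and all names are hypothetical.

```python
from collections import defaultdict, deque

def window_join(events, window=60):
    """Join 'L' and 'R' events that share a key and lie within `window` seconds.

    `events` is an iterable of (ts, side, key, value) in timestamp order.
    Yields (key, left_value, right_value).
    """
    buffers = {"L": defaultdict(deque), "R": defaultdict(deque)}
    for ts, side, key, value in events:
        other = "R" if side == "L" else "L"
        candidates = buffers[other][key]
        # Expire buffered events for this key that are too old to match.
        # Note: keys that are never probed again are not cleaned up here.
        while candidates and candidates[0][0] < ts - window:
            candidates.popleft()
        for _, other_value in candidates:
            if side == "L":
                yield key, value, other_value
            else:
                yield key, other_value, value
        buffers[side][key].append((ts, value))

# Hypothetical merged input
merged = [
    (0, "L", "a", "order-1"),
    (10, "R", "a", "ship-1"),
    (100, "R", "a", "ship-2"),
    (105, "L", "b", "order-2"),
]
matches = list(window_join(merged, window=60))
```

Even this small version has to handle buffering, expiry, and ordering by hand, and it leaves out cleanup of idle keys, out-of-order arrival, and distribution across threads or machines.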
Most modern applications choose to
use a relational database management
system to avoid dealing with data files
directly, and the reasons to use a streaming query system are very similar.
Other Applications of
Streaming Query Systems
Just as relational databases are a horizontal technology, used for everything
from serving Web pages to transaction processing and data warehousing,
streaming SQL systems are being applied to a variety of problems.
Application areas include complex event processing (CEP), monitoring, populating data warehouses, and
middleware. A CEP query looks for sequences of events on a single stream
or on multiple streams that, together,
match a pattern and create a “complex
event” of interest to the business. Applications of CEP include fraud detection and electronic trading.
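A very small sketch of the idea follows: detect one kind of event followed by another for the same key within a time limit. The event kinds, the (ts, key, kind) input format, and the fraud scenario in the sample data are hypothetical; real CEP queries express far richer patterns than this two-step sequence.

```python
def detect_sequences(events, first, second, within):
    """Yield (key, ts_first, ts_second) when `second` follows `first`.

    `events` is an iterable of (ts, key, kind) in timestamp order. A match
    requires the same key and a gap of at most `within` seconds.
    """
    last_first = {}  # key -> timestamp of the most recent `first` event
    for ts, key, kind in events:
        if kind == first:
            last_first[key] = ts
        elif kind == second:
            started = last_first.pop(key, None)
            if started is not None and ts - started <= within:
                yield key, started, ts

# Hypothetical account activity
activity = [
    (0, "acct-1", "password_change"),
    (20, "acct-2", "large_withdrawal"),
    (45, "acct-1", "large_withdrawal"),
    (50, "acct-3", "password_change"),
    (900, "acct-3", "large_withdrawal"),
]
alerts = list(detect_sequences(activity, "password_change",
                               "large_withdrawal", within=300))
```

Each yielded tuple plays the role of the "complex event": it is derived from several simpler events that are individually unremarkable.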