aries between messaging, continuous
ETL, and database technologies by applying SQL throughout.
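[Figure 2: Architecture of a real-time business intelligence system. Labeled components: operational database, streaming query system, data warehouse, OLAP, alerts, dashboard; the streaming query system maintains the OLAP cache.]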
CEP has been used within the industry as a blanket term to describe the
entire field of streaming query systems.
This is regrettable because it has resulted in a religious war between SQL-based and non-SQL-based vendors and,
in overly focusing on financial services
applications, has caused other application areas to be neglected.
The click-stream queries here are a
simple example of a monitoring application. Such an application looks for
trends in the transactions that represent
the running business and alerts the operations staff if things are not running
smoothly. A monitoring query finds insights by aggregating large numbers of
records and looking for trends, in contrast to a CEP query that looks for patterns among individual events. Monitoring applications may also populate
real-time dashboards, a business’s
equivalent of your car’s speedometer,
thermometer, and oil pressure gauge.
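
To make the monitoring idea concrete, here is a minimal Python sketch of a sliding-window aggregate that raises an alert when a trend goes wrong. It is an illustration only, not the syntax of any streaming query product: the `Click` record, its `ts` and `status` fields, and the `window_secs`, `threshold`, and `min_events` parameters are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Click:
    ts: float    # event time in seconds (hypothetical field)
    status: int  # HTTP status code (hypothetical field)


def error_rate_alerts(
    clicks: Iterable[Click],
    window_secs: float = 60.0,
    threshold: float = 0.5,
    min_events: int = 4,
) -> Iterator[Tuple[float, float]]:
    """Yield (ts, error_rate) when the trailing-window error rate exceeds threshold.

    Assumes clicks arrive in non-decreasing timestamp order.
    """
    window: Deque[Click] = deque()
    errors = 0
    for click in clicks:
        window.append(click)
        if click.status >= 500:
            errors += 1
        # Evict records that have aged out of the trailing window, so only
        # the working set needed for the aggregate is retained.
        while window[0].ts <= click.ts - window_secs:
            old = window.popleft()
            if old.status >= 500:
                errors -= 1
        rate = errors / len(window)
        # Alert on the aggregate, not on any single record.
        if len(window) >= min_events and rate > threshold:
            yield click.ts, rate
```

The alert depends on the aggregate over many records rather than on a pattern among specific events, which is the distinction drawn above between monitoring and CEP queries.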
Because of their common SQL language, streaming queries have a natural synergy with data warehouses. The
data warehouse holds the large amount
of historical data necessary for a “rear-view
mirror” analysis of the business,
while the streaming query system
continuously populates the data warehouse and provides forward-looking
insight to “steer the company.”
The streaming query system performs the same function as an ETL
(extract, transform, load) tool but operates continuously. A conventional
ETL process is a sequence of steps
invoked as a batch job. The cycle time
of the ETL process limits how current
the data warehouse is, and it is difficult to get that cycle time below a
few minutes. For example, the most
data-intensive steps are performed
by issuing queries on the data warehouse: looking up existing values in a
dimension table, such as customers
who have made a previous purchase,
and populating summary tables. A
streaming query system can cache the
information required to perform these
steps, offloading the data warehouse,
whereas the ETL process is too short-lived to benefit from caching.
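
The caching argument can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not how any particular engine is implemented: `DimensionCache`, `continuous_etl`, and the `lookup` callable (standing in for a query against the warehouse's customer dimension) are hypothetical names, and the cache is unbounded for simplicity.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple


class DimensionCache:
    """Memoizes dimension-table lookups so repeated keys skip the warehouse."""

    def __init__(self, lookup: Callable[[str], int]) -> None:
        self._lookup = lookup            # hypothetical warehouse query
        self._cache: Dict[str, int] = {}
        self.misses = 0                  # number of lookups sent to the warehouse

    def key_for(self, natural_key: str) -> int:
        if natural_key not in self._cache:
            self.misses += 1
            self._cache[natural_key] = self._lookup(natural_key)
        return self._cache[natural_key]


def continuous_etl(
    orders: Iterable[Tuple[str, float]],
    customers: DimensionCache,
) -> Iterator[Tuple[int, float]]:
    """Turn (customer_email, amount) records into (customer_key, amount) fact rows."""
    for email, amount in orders:
        # The cache outlives any single record, so returning customers
        # cost no warehouse round trip.
        yield customers.key_for(email), amount
```

Because the process runs continuously, the cache stays warm. A batch ETL job that starts from scratch on each cycle would repeat the same dimension queries every time.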
Figure 2 shows the architecture of a
real-time business intelligence system.
In addition to performing continuous
ETL, the streaming query system populates a dashboard of business metrics,
generates alerts if metrics fall outside
acceptable bounds, and proactively
maintains the cache of an OLAP (online
analytical processing) server that
is based upon the data warehouse.
Today, much “data in flight” is
transmitted by message-oriented middleware. Like middleware, streaming
query systems can deliver messages
reliably, and with high throughput
and low latency; further, they can apply SQL operations to route, combine,
and transform messages in flight. As
streaming query systems mature, we
may see them stepping into the role of
middleware and blurring the boundaries between messaging, continuous
ETL, and database technologies by applying SQL throughout.
Conclusion
Streaming query engines are based on
the same technology as relational databases but are designed to process data
in flight. Streaming query engines can
solve some common problems much
more efficiently than databases because they match the time-based nature of the problems, they retain only
the working set of data needed to solve
the problem, and they process data
asynchronously and continuously.
Because of their shared SQL language, streaming query engines and
relational databases can collaborate to
solve problems in monitoring and real-time business intelligence. SQL makes
them accessible to a large pool of people with SQL expertise.
Just as databases can be applied to a
wide range of problems, from transaction processing to data warehousing,
streaming query systems can support
patterns such as enterprise messaging,
complex event processing, continuous
data integration, and new application
areas that are still being discovered.
Related articles
on queue.acm.org
A Call to Arms
Jim Gray and Mark Compton
http://queue.acm.org/detail.cfm?id=1059805
Beyond Relational Databases
Margo Seltzer
http://queue.acm.org/detail.cfm?id=1059807
A Conversation with Michael Stonebraker
and Margo Seltzer
http://queue.acm.org/detail.cfm?id=1255430
Julian Hyde is chief architect of SQLstream, a streaming
query engine. He is also the lead developer of Mondrian, the
most popular open source relational OLAP engine and a part
of the Pentaho open source BI suite. An expert on relational
technology, including query optimization and streaming
execution, Hyde introduced bitmap indexes into Oracle and
led development of the Broadbase analytic DBMS.