the Communications Web site, http://cacm.acm.org,
features more than a dozen bloggers in the BLoG@CaCm
community. In each issue of Communications, we’ll publish
selected posts or excerpts.
follow us on twitter at http://twitter.com/blogCaCm
DOI:10.1145/1941487.1941491
http://cacm.acm.org/blogs/blog-cacm
stonebraker on
Data Warehouses
Data warehouses are not only increasing in size
and complexity, but also in their importance to business.
Michael Stonebraker shares 10 key ideas on the topic.
michael stonebraker from “my top 10 assertions about Data Warehouses” http://cacm.acm.org/ blogs/blog-cacm/98136 Data warehouses, business intelli- gence, business analytics, and complex analytics are the subject of increasingly
intense marketing wars, seemingly accelerated by Oracle’s introduction of
the Exadata appliance. Here is my spin
on the situation. Please note that I have
a financial interest in several database
companies, and may be biased in a
number of ways. The reader should always keep this in mind.
1. Star and snowflake schemas are a
good idea in the data warehouse world.
In short, data warehouses store
a large collection of facts. The overwhelming majority of these facts are
the “five Ws” (who, what, where, when,
and why) along with a collection of attributes about the fact. For example, a
typical retail organization stores facts
about historical transactions. These
facts include the customer (who), the
retail store (where), the time of the sale
(when), and the purchased product
(what), along with attributes of the sale
(e.g., price, sales tax, credit card, etc.).
One should organize such data as
shown in the figure here. Such a schema is called a star schema, with a central fact table and surrounding dimension tables. If stores are organized into
divisions, then the star schema has another table between store and fact, and
becomes a snowflake schema. Star and
snowflake schemas are clean, simple,
easy to parallelize, and usually result in
very high-performance database management system (DBMS) applications.
If you are a data warehouse designer and come up with something other
than a snowflake schema, you should
probably rethink your design.
However, you will often come up
with a design having a large number
of attributes in the fact table; 40 attributes are routine and 200 are not
uncommon. Current data warehouse
administrators usually stand on their
heads to make “fat” fact tables perform
on current relational database management systems (RDBMSs).
2. Column stores will dominate the
data warehouse market over time, re-
placing row stores.
The problem with row stores is they
store data in the fact table, row by row,
on disk. A typical business intelligence
query requires half-a-dozen attributes
or less (e.g., find me the average price of
widgets by store by month for the past
two years). A row store will read all 200
attributes, even though only six are required. In contrast, a DBMS that organized data by column will read only the
six required, a savings of a factor of 33.
Since fact tables are getting fatter
over time as business analysts want access to more and more information, this
architectural difference will become increasingly significant. Even when “
skinny” fact tables occur or where many attributes are read, a column store is still
likely to be advantageous because of its
superior compression ability.
For these reasons, over time, column stores will clearly win.
3. The vast majority of data warehous-
es are not candidates for main memory
or flash memory.
Data warehouses are increasing in
size faster than storage is getting cheaper. Business analysts have an appetite
for as much attribute data as they can
get their hands on, and want to keep
increasingly long periods of history.
Hence, data warehouse problems are
getting “net harder,” not “net easier.”
Put differently, most data warehouses
are measured in Gbytes today, Tbytes
tomorrow, and Pbytes the next day.
4. Massively parallel processor (MPP)
systems will be omnipresent in this
market.
Massively parallel processor systems are the only kind of computer ar-