a raw stock feed directly from the exchanges. The monthly uncompressed International Securities Exchange historical options ticker data runs to more than a terabyte, so even the daily deltas for a full financial stream can be multiple gigabytes.
Ticker information alone is not very
useful without symbol lookup tables
and other corporate data with which
to cross-reference it. These may come
from separate sources and may or may
not handle data changes for you—
someone has to recognize the change
from SUNW to JAVA to ORCL and ensure it is handled meaningfully. Different feeds also come in at different
rates. Data can change for technical,
business, and regulatory reasons, so
keeping it in sync with a usable data
model is a full-time job.
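To make the symbol problem concrete, here is a minimal sketch (our own, not from any particular vendor feed, and with the rename dates only approximate) of resolving a historical ticker through its chain of changes:

from datetime import date

# Hypothetical rename history: (old_symbol, new_symbol, effective_date).
# SUNW -> JAVA -> ORCL mirrors the example above; dates are approximate.
SYMBOL_CHANGES = [
    ("SUNW", "JAVA", date(2007, 8, 27)),
    ("JAVA", "ORCL", date(2010, 1, 27)),
]

def resolve_symbol(symbol, as_of):
    """Follow every rename that took effect on or before `as_of`."""
    current, changed = symbol, True
    while changed:
        changed = False
        for old, new, effective in SYMBOL_CHANGES:
            if old == current and effective <= as_of:
                current, changed = new, True
                break
    return current

# Ticks recorded as SUNW in 2008 roll up under JAVA; by 2011, under ORCL.
assert resolve_symbol("SUNW", date(2008, 6, 1)) == "JAVA"
assert resolve_symbol("SUNW", date(2011, 6, 1)) == "ORCL"

Someone still has to notice each rename and add the row, but at least the knowledge lives in one table rather than in every report that touches the feed.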
Quality is both a function of the
source of the data and the process by
which it flows through the organization. If a complex formula shows up in
hundreds of different reports authored
by dozens of different people, errors are almost certain to creep in. Adding up all of the invoices
will produce a revenue number, but
it may not take into account returns,
refunds, volume discounts, and tax
rebates. Calculations such as revenue
and profit typically rely upon specific
assumptions of the underlying data,
and any formulas based on them need
to know what these assumptions are:
˲ Do all of the formulas use the exact same algorithm?
˲ How do they deal with rounding errors?
˲ Have currency conversions been applied?
˲ Has the data been seasonally adjusted?
˲ Are nulls treated as zeros or a lack of data?
The more places a formula is managed, the more likely errors will be introduced.
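The usual remedy is to manage such a formula in exactly one place, with its assumptions written down next to it. The sketch below is illustrative only; the function name and the specific assumptions are ours, but it shows how the questions above can be answered once rather than in every report:

from decimal import Decimal, ROUND_HALF_EVEN

def net_revenue(invoices, returns, refunds, discounts, rebates,
                fx_rate=Decimal("1")):
    """A single, shared definition of revenue.

    Assumptions are written down exactly once:
      - None means missing data and is skipped, not treated as zero;
      - all inputs share one source currency, converted once via fx_rate;
      - rounding is banker's rounding to cents, applied only at the end.
    """
    def total(items):
        return sum((x for x in items if x is not None), Decimal("0"))

    net = (total(invoices) - total(returns) - total(refunds)
           - total(discounts) - total(rebates))
    return (net * fx_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

# Every report calls the one function instead of re-deriving the formula.
print(net_revenue([Decimal("100.00"), None, Decimal("250.50")],
                  [Decimal("20.00")], [], [Decimal("5.05")], []))  # 325.45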
Cost can be traded off against both
quality and flexibility. With enough
people hand-inspecting every report, it
doesn’t matter how many times things
are duplicated—quality can be maintained through manual quality assurance. However, this is not the most efficient way to run a data warehouse. Cost
and flexibility typically trade off based
on the effort necessary to take raw data
and turn it into useful data. Each level
of processing takes effort and loses flexibility unless you are willing to invest
even more effort to maintain both base
and processed data.
Of course, you can always keep all of
the base data around forever, but the
cost of maintaining this can be prohibitive. Having everything in the core
data warehouse at the lowest possible
level of detail represents the extreme of
maximizing flexibility and quality while
trading off cost.
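A toy sketch of that trade-off, with made-up data: the raw rows preserve flexibility, the derived summary is cheaper to use, and keeping both is what costs the extra effort.

from collections import defaultdict

# Base data: every tick, maximum flexibility, maximum cost to keep and scan.
raw_ticks = [
    ("ORCL", "2010-03-01", 24.95),   # toy rows for illustration
    ("ORCL", "2010-03-01", 25.10),
    ("ORCL", "2010-03-02", 25.40),
]

def summarize(ticks):
    """Processed data: a daily-average table derived from the base rows.

    Cheaper to query, but the averaging assumption is baked in; questions
    it wasn't designed for mean going back to the raw ticks -- or paying
    to maintain both tables.
    """
    buckets = defaultdict(list)
    for symbol, day, price in ticks:
        buckets[(symbol, day)].append(price)
    return {key: sum(prices) / len(prices) for key, prices in buckets.items()}

daily_avg = summarize(raw_ticks)
print(daily_avg[("ORCL", "2010-03-01")])  # 25.025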
External Web services typically trade
off flexibility in exchange for quality
and cost. Highly summarized, targeted
data services with built-in assumptions,
calculations, and implicit filters are not
as flexible, but they are often all that is
needed to solve a specific problem. It
doesn’t matter that the daily weather
average for a metropolitan area is actually sampled at the airport and the
method of averaging isn’t explicit. This
loss of flexibility is a fair trade when all
you care about is what the temperature
was that day ±3 degrees.
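A sketch of what consuming such a service looks like; the endpoint, its parameters, and the response field here are hypothetical, not any real provider's API.

import json
from urllib.request import urlopen

# The URL, parameters, and response shape are invented for illustration;
# any real weather service will differ.
WEATHER_URL = "https://api.example.com/v1/daily-average?city={city}&date={date}"

def daily_average_temp(city, day):
    """Fetch a pre-summarized daily average temperature.

    The provider has already picked the sampling site (say, the airport)
    and the averaging method; we give up that flexibility in exchange for
    one number that is accurate enough for the question being asked.
    """
    with urlopen(WEATHER_URL.format(city=city, date=day)) as resp:
        return float(json.load(resp)["average_temp_f"])

# The raw alternative -- pulling every hourly observation and defining our
# own average -- is more flexible but costs more to ingest and maintain.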
The following are questions to ask
when determining which trade-offs
make sense for a given data source:
˲ What is the business impact of incorrect data?
˲ What is the cost of maintaining the data feed?
˲ How large are the datasets?
˲ How often does the data change?
˲ How often does the data schema change?
˲ How complex is the data?
˲ How complex and varied are the consumption scenarios?
˲ What is the quality of the data (how many errors are expected, how often, and with what impact)?
˲ How critical is the data to decision making?
˲ What are the auditing and traceability requirements?
˲ Are there any regulatory concerns?
˲ Are there any privacy or confidentiality concerns?
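One lightweight way to act on these questions is to record the answers per source in a structured form and compare them. The sketch below is purely illustrative; every field name and sample value in it is our own invention.

from dataclasses import dataclass

@dataclass
class DataSourceAssessment:
    """One record per feed, capturing the answers to the checklist above."""
    name: str
    business_impact_of_errors: str   # "high" / "medium" / "low"
    monthly_feed_cost_usd: float
    dataset_size_gb: float
    data_change_frequency: str       # "real-time", "daily", "monthly", ...
    schema_change_frequency: str
    expected_error_rate: float       # fraction of records expected to be wrong
    decision_criticality: str
    audit_trail_required: bool
    regulatory_concerns: bool
    privacy_concerns: bool

options_feed = DataSourceAssessment(
    name="exchange options tick feed",
    business_impact_of_errors="high",
    monthly_feed_cost_usd=0.0,       # placeholder; fill in the real figure
    dataset_size_gb=1000.0,
    data_change_frequency="real-time",
    schema_change_frequency="rarely",
    expected_error_rate=0.001,       # placeholder estimate
    decision_criticality="high",
    audit_trail_required=True,
    regulatory_concerns=True,
    privacy_concerns=False,
)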
Enterprise Data Mashups
Traditional warehouse life cycle, topology, and modeling approaches are not
well suited for external data integration. The data warehouse is often considered a central repository: the single source of truth. In reality, it can rarely
keep up with the diversity of informa-