from a software development perspective, but their relevance to enterprise
data management is just emerging.
Traditional data service providers such
as Dun & Bradstreet and Thomson Reuters have started offering most of their
products via Web services. Hundreds of
smaller Web service companies are also
providing data in areas such as retail
sales, Web trends, securities, currency,
weather, government, medicine, current events, real estate, and competitive
intelligence.
With Web services comes the ability
to add value in the form of functional
services. Instead of allowing the retrieval of a flat file of data—effectively a
fetch() function—it is easy for a data
provider to add services on top of the
data: conversions, calculations, searching, and filtering. In fact, for most of
the smaller upstarts the emphasis has
been heavily on the functional services,
which typically present a more highly
processed subset of the overall data.
This can save time and effort when the
functions provided are what you need.
The data is precleansed, aggregated at
the right level, and you don’t have to
implement your own search.
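As a rough sketch of the difference (the URLs and field names here are illustrative, not any particular vendor's API), compare pulling a raw file and processing it yourself with asking a functional service for an already filtered, aggregated answer:

# Illustrative only: data.example.com and api.example.com are hypothetical endpoints.
import csv
import io
import urllib.request

# Flat-file style (effectively a fetch()): pull everything, then clean,
# filter, and aggregate locally.
raw = urllib.request.urlopen("https://data.example.com/sales/2009-08.csv").read().decode()
rows = list(csv.DictReader(io.StringIO(raw)))
west_total = sum(float(r["amount"]) for r in rows if r["region"] == "WEST")

# Functional-service style: the provider converts, filters, and aggregates,
# returning a small, highly processed answer.
summary = urllib.request.urlopen(
    "https://api.example.com/sales/summary?region=WEST&month=2009-08&currency=USD"
).read()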
This ability can also lead to challenges, however, when the functional interfaces don’t match your exact needs.
EBay provides marketplace research
data on the daily best-selling products
in various categories, top vendors, and
bid prices for products. This works well
if these specific queries are what you
want, but if you require a query that
eBay has not thought of, you don’t have
access to the base data to create that
custom query yourself.
Another important data source is the
Web itself. A great deal of unstructured
data exists in Web pages and behind
search engines. The area of competitive
intelligence has been driving the merging of unstructured and semistructured Web sources with the data warehouse. Competitive intelligence is also
an area that is driving the shift from
solely back-end data feeds to those all
the way through the stack.1 Structured
Web services from Amazon and eBay
are one source of specific market and
sales information, while technologies
from companies such as Kapow and
Dapper allow users to turn any external
Web page content into semistructured
data feeds by mapping some of the visual fields on the page to data fields in
the dynamic feed.
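A minimal sketch of that mapping idea, assuming a hypothetical catalog page whose product names and prices carry the CSS classes product-name and product-price; it uses Python's standard library rather than Kapow's or Dapper's actual tooling:

# Map visual fields on a page to fields in a data feed.
# The URL and the CSS classes are assumptions made for illustration.
from html.parser import HTMLParser
import urllib.request

class FieldMapper(HTMLParser):
    """Collects text from elements whose class is mapped to a feed field."""
    def __init__(self, mapping):
        super().__init__()
        self.mapping = mapping          # CSS class -> feed field name
        self.current = None
        self.records = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.mapping:
            self.current = self.mapping[cls]

    def handle_data(self, data):
        if self.current and data.strip():
            self.records.append({self.current: data.strip()})
            self.current = None

html = urllib.request.urlopen("https://competitor.example.com/catalog").read().decode()
mapper = FieldMapper({"product-name": "name", "product-price": "price"})
mapper.feed(html)
print(mapper.records)   # e.g., [{'name': 'Widget'}, {'price': '$19.99'}, ...]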
Although these tools are beginning
to make Web scraping easier, most end
users still resort to cutting and pasting from competitors’ Web sites into
spreadsheets in order to gain the insights they need to do their jobs. This
is a manually intensive and error-prone
approach, but the alternative—
sourcing market information, talking to IT,
integrating the datasets into the core
data-warehouse model, staging, testing, deploying—takes too long, particularly when sources may be changing on
a weekly or monthly basis.
Architectural considerations
External data should be considered and
planned for differently from internal
data. Much the way distributed computing architectures must account for
latency and data failures, a robust data-warehousing plan must take into account the fact that external sources are
not, by definition, in the sphere of control of the receiving organization. They
are more prone to unpredictable delays, data anomalies, schema changes,
and semantic data changes. That is not
to say they are lower quality; plenty of
internal sources have the same issues,
and data-service companies are paid to
provide top-quality data. However, the
communication channel and processing systems are inter- rather than intra-company, creating an additional source
of issues and delays.
Competitive intelligence data (legally)
scraped off of publicly available sites
not only must contend with cleanliness
issues, but the schema can also change
at any time and the publisher has no
obligation to notify consumers of that
change. If a scheduled report relies
on this information, then it will often
break, resulting in a blank or incomprehensible report. Even worse, it could
result in incorrect numbers, leading to
bad business decisions. Data reliability
and accuracy must be considered fundamental attributes throughout the
data’s flow in the organization.
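One common defensive measure is to validate an external feed against the schema and basic sanity rules you expect before it reaches the warehouse, so that a silent change fails loudly instead of quietly producing an incorrect report. A minimal sketch, in which the column names, thresholds, and file name are illustrative assumptions:

import csv

EXPECTED_COLUMNS = {"trade_date", "symbol", "close", "volume"}   # assumed schema

def validate_feed(path):
    """Reject an external feed whose schema or content has drifted."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"schema change: missing columns {missing}")
        for lineno, row in enumerate(reader, start=2):
            if not row["symbol"] or float(row["close"]) <= 0:
                raise ValueError(f"data anomaly at line {lineno}: {row}")

# Fail the load (and alert someone) rather than publish a blank or wrong report.
validate_feed("external_feed.csv")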
Flexibility, quality, and cost
Not all data needs to go through the
entire data-warehouse workflow. In
information-intensive organizations
the IT group can rarely accommodate
every user’s data needs in a timely manner. Trade-offs must be made. These
trade-offs can be considered along the
dimensions of flexibility, quality, and
cost. In most cases, you can pick two
and trade off the third.
Flexibility refers to how easily you
can repurpose the data for the end users’
needs. Getting base data from raw flat
files maximizes your ability to massage
the data further; however, the effort
involved is much higher than getting
a highly summarized feed from a Web
service vendor. For example, the International Securities Exchange historical options daily ticker data for a single
stock symbol has more than 80 fields2
(see the accompanying box).
Having all of the base data in CSV
(comma-separated values) format provides maximum flexibility; you can
derive any information you want from
it. However, if all you require is the
high-level summary information, you
would be better off giving up that flexibility in exchange for a simpler API
from a Web service provider such as
StrikeIron or Xignite. For example, the request

GET http://www.xignite.com/xquotes.asmx/GetSingleQuote?Symbol=AAPL

returns a brief quote such as:

<QuickQuote>
  <Symbol>AAPL</Symbol>
  <Last>188.50</Last>
  <Change>7.85</Change>
  <Volume>25,094,395</Volume>
  <Time>4:01pm ET</Time>
</QuickQuote>

International Securities Exchange historical options data ticker (accompanying box):

TRADE_DT,UNDLY,SEC_TYPE,SYM_ROOT, ... 80+ fields ... ,ISE_VOL,TOTAL_VOL
20090810,AAPL,1,QAA, ... ,2241,164.72,0.01, 1,  2
20090810,AAPL,1,QAA, ... ,2347,164.72,0.02, 1,  2
20090810,AAPL,1,QAA, ... ,3591,164.72,0.03, 7,130
20090810,AAPL,1,APV, ... ,2714,164.72,40.7, 10, 15
…
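To make the flexibility side of that trade-off concrete, here is a sketch of deriving one custom rollup (per-symbol ISE and total volume) from the raw ticker file yourself; the column names follow the box above but should be treated as illustrative, and the file name is assumed:

import csv
from collections import defaultdict

# Aggregate ISE-style base data into a per-symbol daily volume summary.
totals = defaultdict(lambda: {"ise": 0, "total": 0})
with open("ise_options_20090810.csv", newline="") as f:
    for row in csv.DictReader(f):
        totals[row["UNDLY"]]["ise"] += int(row["ISE_VOL"])
        totals[row["UNDLY"]]["total"] += int(row["TOTAL_VOL"])

for symbol, t in sorted(totals.items()):
    print(symbol, t["ise"], t["total"])

# With the quote service, by contrast, you simply parse the returned
# <QuickQuote> fields: no aggregation work, but no custom queries either.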
In the case of securities information,
very few IT shops can afford to manage