data that is relevant to that department
or user base. The marketing department may need an external address database for cleansing before doing mass
mailings, while the customer support
organization may be more interested in
geo-tagging to analyze case distribution patterns.
As integration moves closer to the
end user, the key issue to be aware of
is loss of data control. If each mart
integrates the same data in a slightly
different way, inconsistencies become
more likely. The sales group in North
America may be adjusting for refunds
to take into account high rotation in its
box stores, while the Asia Pacific group
is not. A report that aggregates
the results from these two independent data marts may naively sum the
numbers and produce an incorrect
total. There may be good reasons for
different rules at the data-mart level,
but when values are combined or compared downstream, these differences
can cause problems.
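
The refund example above can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the `MartSales` record, the `net_of_refunds` flag, and the figures are hypothetical and do not come from any real data mart. It only shows how a naive sum differs from a total computed after both marts are brought to the same rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MartSales:
    """Hypothetical summary row published by one regional data mart."""
    region: str
    reported_sales: float   # the figure the mart exposes to reports
    refunds: float          # refunds recorded for the same period
    net_of_refunds: bool    # True if reported_sales already excludes refunds


def naive_total(marts):
    """Sum the reported figures without checking how each was derived."""
    return sum(m.reported_sales for m in marts)


def net_sales(mart):
    """Bring one mart's figure to a common rule: sales net of refunds."""
    if mart.net_of_refunds:
        return mart.reported_sales
    return mart.reported_sales - mart.refunds


def consistent_total(marts):
    """Sum the figures after applying the same refund rule to every mart."""
    return sum(net_sales(m) for m in marts)


# Illustrative numbers only.
marts = [
    MartSales("North America", 900.0, 100.0, net_of_refunds=True),
    MartSales("Asia Pacific", 800.0, 50.0, net_of_refunds=False),
]

print(naive_total(marts))       # mixes net and gross figures
print(consistent_total(marts))  # both regions net of refunds
```

The useful part of the sketch is the explicit flag: once each mart declares which rule it applied, a downstream report can reconcile the figures instead of silently mixing them.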
BI Tools. Most BI tool vendors have
now incorporated data-mashup capabilities so end users can join external
and internal data. This approach tends
to be more end-user driven than integration at the data-mart layer and often
requires no administrative access to
the database or formal data-modeling
expertise. Vendors’ tools vary in their
approaches, but typically there are options to do lightweight business-user
modeling, as well as on-screen combining of datasets via formula languages.
This provides a great deal of flexibility
while maintaining the ability to audit
and trace usage.
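
As a rough illustration of what such a mashup does behind the scenes, the sketch below joins hypothetical internal support cases to a hypothetical external postal-code lookup and records which source supplied the geographic value. The field names, the `external_geo_v1` label, and the sample rows are all assumptions made for the example; real BI tools expose this through their own modeling and formula features rather than through code like this.

```python
# Hypothetical internal data: support cases keyed by postal code.
internal_cases = [
    {"case_id": 1, "postal_code": "94105"},
    {"case_id": 2, "postal_code": "10001"},
    {"case_id": 3, "postal_code": "00000"},  # no match in the external data
]

# Hypothetical external data: a postal-code lookup with a source label
# so that joined values can be traced back to where they came from.
external_geo = {
    "94105": {"city": "San Francisco", "source": "external_geo_v1"},
    "10001": {"city": "New York", "source": "external_geo_v1"},
}


def mash_up(cases, geo_lookup):
    """Left-join cases to the external lookup, keeping unmatched cases."""
    joined = []
    for case in cases:
        geo = geo_lookup.get(case["postal_code"])
        joined.append({
            "case_id": case["case_id"],
            "postal_code": case["postal_code"],
            "city": geo["city"] if geo else None,
            "geo_source": geo["source"] if geo else None,
        })
    return joined


def cases_per_city(joined):
    """Count cases by city, reporting unmatched rows explicitly."""
    counts = {}
    for row in joined:
        key = row["city"] or "(unmatched)"
        counts[key] = counts.get(key, 0) + 1
    return counts


joined = mash_up(internal_cases, external_geo)
print(cases_per_city(joined))
```

Keeping unmatched rows visible, rather than dropping them in the join, is a deliberate choice here: it makes gaps in the external data show up in the result instead of quietly shrinking the counts.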
The issues faced at this layer are similar to those of the data mart. Aspects
of the data management and integration are still decentralized, which can
result in redundant definitions of the
same business concepts, increasing
the risk of incorrect interpretations.
Facilities for auditing and tracing, as
well as common business-user-level
semantic layers, help overcome some
of the issues.
Desktop. This end of the spectrum
represents the most common mashup
scenario. Excel, flat files, macros, and
Web application integration are easy
to throw together and occasionally are
even accurate. While database administrators
and warehouse architects may
cringe, the bottom line is that when
people need information they will find
a way to get it. From an information architecture point of view, planning for
this eventuality allows you to decide
which data belongs where and what
the potential impacts will be in an informed way. For some data sources,
where accuracy and traceability are not
critical, this is a perfectly acceptable
choice. As these data sources become
more heavily used, though, it may make
sense to push them further down the
stack to ensure the data-integration
work is done by only one person rather
than many.
Conclusion
There is no single correct layer at which to integrate
external data into the enterprise information flow; rather, there is a set of trade-offs
to consider. The characteristics of the
data, its consumption scenarios, and
the business context all must be considered. These factors also need to be reevaluated periodically. As a data source
becomes more widely used, economics
may dictate centralizing and formalizing data acquisition. Choosing the right
integration approach for an external
data source can balance the variables
of flexibility, quality, and cost while providing end users with timely answers to
their business questions.
Related articles
on queue.acm.org
Why Your Data Won’t Mix
Alon Halevy
http://queue.acm.org/detail.cfm?id=1103836
The Pathologies of Big Data
Adam Jacobs
http://queue.acm.org/detail.cfm?id=1563874
A Conversation with Michael Stonebraker
and Margo Seltzer
http://queue.acm.org/detail.cfm?id=1255430
Stephen Petschulat is a senior product architect in the
advanced analytics area of SAP Business Objects.