starting to store their data in the cloud,
it makes sense to think about data
service integration in the cloud when
there are many, many small data services available. For example, how can
Clark Cloudlover integrate his Google
calendar with his wife’s Apple iCal
calendar? At this point, there is no homogeneity in data representation and
querying, and the number of cloud
service providers is rapidly increasing.
Google Fusion Tables²⁷ is one example
of a system that follows this trend and
allows its users to upload tabular data
sets (spreadsheets), to store them in
the cloud, and subsequently to integrate and query them. Users are able
to integrate their data with other publicly available datasets by performing
left outer joins on primary keys, called
table merges. Fusion Tables also visualizes users’ data using maps, graphs,
and other techniques. Work is needed
on many aspects of cloud data sharing
and integration.
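To make the "table merge" idea concrete, the following is a minimal Python sketch of a left outer join on a key column. It is an illustration only and does not reflect the Fusion Tables API; the function name `table_merge` and the sample tables `my_trips` and `public_cities` are hypothetical, and the key is assumed to be unique in the right-hand table.

```python
def table_merge(left_rows, right_rows, key):
    """Left outer join of two tables (lists of dict rows) on a key column."""
    # Index the right table by key. Assumes `key` is unique in right_rows,
    # as a primary key would be.
    right_by_key = {row[key]: row for row in right_rows}

    # Collect the right table's non-key columns in first-seen order.
    right_columns = []
    for row in right_rows:
        for column in row:
            if column != key and column not in right_columns:
                right_columns.append(column)

    merged = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        out = dict(row)  # every left row is kept, matched or not
        for column in right_columns:
            # Unmatched rows get None; a same-named right column
            # overwrites the left one.
            out[column] = match.get(column)
        merged.append(out)
    return merged


# Hypothetical example data: a user's table and a public dataset.
my_trips = [
    {"city": "Irvine", "visits": 3},
    {"city": "Athens", "visits": 1},
]
public_cities = [
    {"city": "Irvine", "state": "CA"},
    {"city": "San Diego", "state": "CA"},
]

merged = table_merge(my_trips, public_cities, key="city")
for row in merged:
    print(row)
```

Rows from the user's table always survive the merge; public rows with no counterpart in the user's table (here, San Diego) are dropped, which is what distinguishes a left outer join from a full outer join.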
Data summaries. As the number
of data services increases to a
“consumer scale,” it will be difficult even
to find the data services of interest and
to differentiate among data services
whose output schemas are similar.
One approach to easing this problem
is to offer data summaries that can be
searched and that can give data service
consumers an idea of what lies behind
a given data service. Data sampling and
summarization techniques that have
been traditionally employed for query
optimization can serve as a basis for
work on large-scale data service characterization and discovery.
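As a rough illustration of what such a summary might contain, the sketch below computes a row count, per-column distinct counts and value ranges, and a small uniform sample of a service's output. It is not taken from any particular system; the function `summarize`, its parameters, and the `orders` data are hypothetical, and values are assumed to be non-null, hashable, and comparable within each column.

```python
import random


def summarize(rows, sample_size=3, seed=0):
    """Build a small searchable summary of a table given as dict rows."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    row_count = 0
    sample = []
    stats = {}  # column name -> running statistics

    for row in rows:
        row_count += 1
        # Reservoir sampling: every row has the same chance of being kept.
        if len(sample) < sample_size:
            sample.append(row)
        else:
            slot = rng.randrange(row_count)
            if slot < sample_size:
                sample[slot] = row
        for column, value in row.items():
            if column not in stats:
                stats[column] = {"distinct": set(), "min": value, "max": value}
            entry = stats[column]
            entry["distinct"].add(value)
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)

    return {
        "row_count": row_count,
        "columns": {
            column: {
                "distinct_count": len(entry["distinct"]),
                "min": entry["min"],
                "max": entry["max"],
            }
            for column, entry in stats.items()
        },
        "sample": sample,
    }


# Hypothetical output of a data service.
orders = [
    {"region": "west", "amount": 120},
    {"region": "east", "amount": 75},
    {"region": "west", "amount": 310},
    {"region": "north", "amount": 42},
]
summary = summarize(orders)
print(summary["row_count"], summary["columns"])
```

Exact distinct-value sets are used here only for brevity; at larger scale they would presumably be replaced by histograms or approximate counting structures of the kind used in query optimizers.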
Cloud data service security. Storing
proprietary or confidential data in
the cloud obviously creates new security
problems. Currently, there are two
broad choices. Data owners can
encrypt their data, but then
all but exact-match queries have to be
processed on the client, moving large
volumes of data across the cloud. Alternatively,
they must trust cloud providers with
their data, hoping there are enough
security mechanisms in the cloud to
guard against malicious applications
and services that might try to access
data that does not belong to them.
There is early ongoing work²⁴ that may
help to bridge this gap by enabling queries
and updates over encrypted data,
but much more work is needed to see
if practical (for example, efficient)
approaches and techniques can indeed
be developed.
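The trade-off behind the first choice can be sketched in a few lines of Python. In this illustration the client attaches a deterministic keyed token to each record so that the store can answer exact-match lookups without seeing plaintext, while a range predicate forces the client to fetch and decrypt every record. All names here (`match_token`, `toy_seal`, `toy_open`, `UntrustedStore`, the sample records) are hypothetical, and the keystream cipher is a stand-in written only to keep the example self-contained: it is not authenticated and should not be used to protect real data.

```python
import hashlib
import hmac
import json
import os


def match_token(key, value):
    """Deterministic keyed token: equal plaintexts yield equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def _keystream(key, nonce, length):
    # Toy keystream from HMAC-SHA256 in counter mode (illustration only).
    stream = b""
    counter = 0
    while len(stream) < length:
        block = nonce + counter.to_bytes(8, "big")
        stream += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return stream[:length]


def toy_seal(key, record):
    """Stand-in for real encryption; a vetted library should be used instead."""
    data = json.dumps(record).encode("utf-8")
    nonce = os.urandom(16)
    body = bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))
    return nonce + body


def toy_open(key, blob):
    nonce, body = blob[:16], blob[16:]
    data = bytes(a ^ b for a, b in zip(body, _keystream(key, nonce, len(body))))
    return json.loads(data.decode("utf-8"))


class UntrustedStore:
    """Models the cloud side: it sees only tokens and opaque blobs."""

    def __init__(self):
        self.rows = []  # list of (token, blob) pairs

    def put(self, token, blob):
        self.rows.append((token, blob))

    def exact_match(self, token):
        return [blob for stored, blob in self.rows
                if hmac.compare_digest(stored, token)]

    def scan(self):
        return [blob for _, blob in self.rows]


# Client side: the keys never leave the data owner.
index_key = os.urandom(32)
data_key = os.urandom(32)
store = UntrustedStore()
people = [
    {"name": "Clark", "age": 41},
    {"name": "Lois", "age": 38},
    {"name": "Jimmy", "age": 24},
]
for person in people:
    store.put(match_token(index_key, person["name"]), toy_seal(data_key, person))

# Exact match: evaluated by the store, one blob returned.
by_name = [toy_open(data_key, blob)
           for blob in store.exact_match(match_token(index_key, "Lois"))]

# Range predicate: the client must download and decrypt everything.
over_30 = [record for record in (toy_open(data_key, b) for b in store.scan())
           if record["age"] > 30]
print(by_name, over_30)
```

Deterministic tokens also reveal which records share a value, so even this limited form of server-side querying leaks some information to the provider.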
Acknowledgments
We would like to thank Divyakant Agrawal
(UC Santa Barbara), Pablo Castro (Microsoft),
Alon Halevy (Google), James
Hamilton (Amazon), and Joshua Spiegel (Oracle) for their detailed comments on an earlier version of this
article. We also thank the associate
editor and anonymous reviewers for
feedback that improved the quality of
this article. This work was supported
in part by NSF IIS awards 0910989,
0713672, and 1018961.
References
1. Amazon Web Services, 2010. http://aws.amazon.com/.
2. Adya, A., Blakeley, J.A., Melnik, S., and Muralidhar,
S. Anatomy of the ADO.NET entity framework. In
Proceedings of SIGMOD Conference (2007), 877–888.
3. Agrawal, D., Abbadi, A.E., Antony, S. and Das, S.
Data management challenges in cloud computing
infrastructures. In DNIS (2010), 1–10.
4. Allen A. Friedman and Darrell M. West. Privacy and
security in cloud computing. Issues in Technology
Innovation 3 (Oct. 2010), Center for Technology
Innovation at Brookings.
5. Baker, J., Bond, C., Corbett, J.C., Furman, J., Khorlin,
A., Larson, J., Léon, J.M., Li, Y., Lloyd, A. and Yushprakh,
V. Megastore: Providing scalable, highly available
storage for interactive services. In Proceedings of
CIDR Conference, 2011.
6. Bancilhon, F. and Spyratos, N. Update semantics of
relational views. ACM Trans. Database Syst. 6 (Dec.
1981), 557–575.
7. Bernstein, P.A., Cseri, I., Dani, N., Ellis, N., Kalhan,
A., Kakivaya, G., Lomet, D.B., Manne, R., Novik, L. and
Talius, T. Adapting Microsoft SQL Server for cloud
computing. In Proceedings of IEEE International
Conference on Data Engineering (2011), 1255–1263.
8. Blow, M., Borkar, V., Carey, M., Hillery, C., Kotopoulis,
A., Lychagin, D., Preotiuc-Pietro, R., Reveliotis,
P., Spiegel, J. and Westmann, T. Updates in the
AquaLogic Data Services Platform. In Proceedings of
IEEE International Conference on Data Engineering
(2009), 1431–1442.
9. Boag, S., Chamberlin, D., Fernandez, M.F., Florescu, D.,
Robie, J. and Siméon, J. XQuery 1.0: An XML query
language. W3C Recommendation (Jan. 23, 2007);
http://www.w3.org/tr/xquery/.
10. Borkar, V.R., Carey, M.J., Engovatov, D., Lychagin, D.,
Reveliotis, P., Spiegel, J., Thatte, S. and Westmann,
T. Access Control in the AquaLogic Data Services
Platform. In Proceedings of SIGMOD Conference
(2009), 939–946.
11. Brantner, M., Florescu, D., Graf, D.A., Kossmann, D. and
Kraska, T. Building a database on S3. In Proceedings
of SIGMOD Conference (2008), 251–264.
12. Britton-Lee Inc. IDM 500 Software Reference Manual
Version 1.3, 1981.
13. Carey, M.J. Data delivery in a service-oriented world:
the BEA AquaLogic data services platform. In
Proceedings of SIGMOD Conference (2006), 695–705.
14. Carey, M.J. SOA what? IEEE Computer 41, 3 (2008),
92–94.
15. The Cassandra Apache Project, 2009;
http://cassandra.apache.org/.
16. Cattell, R. Scalable SQL and NoSQL data stores.
SIGMOD Record 39, 4 (2010), 12–27.
17. Cautis, B., Deutsch, A., Onose, N. and Vassalos, V.
Efficient rewriting of XPath queries using query set
specifications. PVLDB 2, 1 (2009), 301–312.
18. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C.,
Wallach, D.A., Burrows, M., Chandra, T., Fikes, A. and
Gruber, R.E. Bigtable: A distributed storage system for
structured data. In Proceedings of the 7th USENIX
Symposium on Operating Systems Design and
Implementation (2006).
19. Christensen, E., Curbera, F., Meredith, G. and
Weerawarana, S. Web Services Description Language
(WSDL) 1.1. W3C Note (Mar. 15, 2001);
http://www.w3.org/tr/wsdl.
Michael J. Carey (mjcarey@ics.uci.edu) is a professor in
the Bren School of Information and Computer Sciences at
the University of California, Irvine.
Nicola Onose (onose@google.com) is a software engineer
at Google. He conducted this work as a postdoctoral
research fellow in the Bren School of Information and
Computer Sciences at the University of California, Irvine.
Michalis Petropoulos (mpetropo@gmail.com) is an
architect at Greenplum/EMC. He conducted this work
while a research scientist in the Computer Science and
Engineering Department at the University of California,
San Diego.
© 2012 ACM 0001-0782/12/06 $10.00