in different rows, hence the “sparsity.”
The ability to address data based
on (potentially many) columns differentiates sparse tables from key/value
stores and makes it possible to index
and query data more meaningfully.
Compared to traditional RDBMSs that
require static (fixed) schemas, sparse
tables have a more flexible data model, since the set of column identifiers
may change based on data updates.
Bigtable has inspired the creation of
similar open source systems such as
HBase29 and Cassandra.
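The sparse-table model described above can be illustrated with a minimal sketch in Python. It is not how Bigtable, HBase, or Cassandra are implemented; the names `SparseTable`, `put`, `select`, and `column_identifiers`, as well as the sample rows, are assumptions made up for illustration. The point is only that each row stores just the columns it has, that rows can be addressed by several columns at once, and that the set of column identifiers grows with the data rather than with a schema change.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for a sparse table: each row key maps to
# only the columns that row actually has, so rows need not share a schema.
SparseTable = Dict[str, Dict[str, Any]]


def put(table: SparseTable, row_key: str, **columns: Any) -> None:
    """Add or update columns on a row; new column names need no schema change."""
    table.setdefault(row_key, {}).update(columns)


def select(table: SparseTable, **equals: Any) -> List[str]:
    """Return the keys of rows matching every column=value pair given.

    A row that lacks one of the columns simply does not match.
    """
    return sorted(
        key
        for key, cols in table.items()
        if all(name in cols and cols[name] == value for name, value in equals.items())
    )


def column_identifiers(table: SparseTable) -> List[str]:
    """Return every column name that appears in at least one row."""
    return sorted({name for cols in table.values() for name in cols})


# Made-up sample data: no two rows have the same set of columns.
users: SparseTable = {}
put(users, "u1", name="Ana", city="Lisbon")
put(users, "u2", name="Bo", phone="555-0100")
put(users, "u3", city="Lisbon", phone="555-0101")
```

A plain key/value store would only support lookups by `u1`, `u2`, or `u3`; here `select(users, city="Lisbon")` addresses rows through a column instead.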
SimpleDB1 is Amazon’s version
of a sparse table and exposes a Web
service interface for basic indexing
and querying in the cloud. A column
value in a SimpleDB table may be
atomic, as in the relational model, or
a list of atomic values (limited in size
to 1KB). SimpleDB’s tables are called
domains. SimpleDB queries have
a SQL-like syntax and can perform
selections, projections and sorting
over domains. There is no support for
joins or nested subqueries.
Suppose a SimpleDB application stores its
customer information in a domain
called Customers and its order information in an Orders domain. Using
SimpleDB’s RESTb interface, the application can insert records (id=‘C043’,
state=‘NY’) into Customers and
(id=‘O012’, cid=‘C043’, status=‘open’)
into Orders. Further inserts do not necessarily need to conform to these schemas, but for the sake of our example we
will assume they do.
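Because SimpleDB offers no joins, an application holding these two domains has to combine them itself. The following Python sketch shows the shape of such a client-side join. It assumes that an order's cid holds the customer id, and it replaces the SimpleDB service with in-memory lists; the helper names `select` and `orders_for_state` and the extra records are hypothetical and are not part of SimpleDB's API.

```python
from typing import Dict, List

Item = Dict[str, str]

# Hypothetical in-memory stand-ins for the two domains. A real application
# would issue each lookup through SimpleDB's Web service interface instead.
customers: List[Item] = [
    {"id": "C043", "state": "NY"},
    {"id": "C044", "state": "CA"},  # made-up extra record for illustration
]
orders: List[Item] = [
    {"id": "O012", "cid": "C043", "status": "open"},
    {"id": "O013", "cid": "C044", "status": "open"},  # made-up extra record
]


def select(domain: List[Item], **equals: str) -> List[Item]:
    """Single-domain selection: items whose attributes equal the given values."""
    return [
        item
        for item in domain
        if all(item.get(name) == value for name, value in equals.items())
    ]


def orders_for_state(state: str) -> List[Item]:
    """Client-side join: one query on customers, then one query per customer id."""
    result: List[Item] = []
    for customer in select(customers, state=state):
        result.extend(select(orders, cid=customer["id"]))
    return result
```

The cost of this pattern is one round trip for the customers plus one per matching customer, which is the work a join operator would otherwise do on the server.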
Since SimpleDB does not implement
joins, joins must be coded at the
client application level. For example, to
retrieve the orders for all NY clients, an
application would first fetch the client
info via the query:
select id from Customers where
the result of which would include C043, and would then retrieve the corresponding orders as follows:
select * from Orders where cid = ‘C043’
A major limitation of SimpleDB is that
the size of a table instance is bounded.
An application that manipulates a large
volume of data needs to manually partition
(“shard”) it and issue separate queries
against each of the partitions.
RDBMSs: In cloud computing systems that provide a virtual machine interface, such as EC2,1
users can install
an entire database system in the cloud.
However, there is also a push toward
providing a database management
system itself as a service. In that case,
administrative tasks such as installing
and updating DBMS software and performing backups are delegated to the
cloud service provider.
Amazon RDS1 is a cloud data service
that provides access to the full capabilities of a MySQL39 database installed on
a machine in the cloud, with the possibility of setting up several “read replicas”
for read-intensive workloads. Users
can create new databases from scratch
or migrate their existing MySQL data
into the Amazon cloud. Microsoft has
a similar offering with SQL Azure, but
chooses a different strategy that supports scaling by physically partitioning and replicating logical database
instances on several machines. A SQL
Azure source can be service-enabled by
publishing an OData service on top of
it, as in the section “Service-Enabling
Data Stores.” Google’s Megastore5 is
also designed to provide scalable and
reliable storage for cloud applications,
while allowing users to model their
data in a SQL-like schema language.
Data types can be strings, numeric
types, or Protocol Buffers,26 and they
can be required, optional or repeated.
Amazon RDS users manage and interact with their databases either via
shell scripts or a SOAPc-based Web services API. In both cases, in order to connect to a MySQL instance, users need
to know its DNS name, which is a sub-domain of rds.amazonaws.com. They
can then either open a MySQL console
using an Amazon-provided shell script,
or they can access the database like any
MySQL instance identified by a DNS
name and port.
Advanced Technical Issues
So far we have mostly covered the basics
of data services, touching on a
range of use cases (single source, integrated
source, and cloud sources)
along with their associated data service
technologies. Here, we will briefly
highlight a few more advanced topics
and issues, including updates and
transactions, data consistency for scalable services, and issues related to security for data services.
b Representational State Transfer.
c Simple Object Access Protocol.
Data service updates and transactions. As with other applications, applications built over data services require
transactional properties in order to operate correctly in the presence of concurrent operations, exceptions, and
service failures. Data services based on
single sources, for the most part, can
inherit their answer to this requirement from the source that they serve
their data from. Data services that integrate data from multiple sources,
however, face additional challenges—
especially since many interesting data
sources, such as enterprise Web services and cloud data services, are either
unable or “unwilling” to participate in
traditional (two-phase commit-based)
distributed transactions due to issues
related to high latencies and/or temporary loss of autonomy. Data service
update operations that involve non-transactional sources can potentially
be supported using a compensation-based transaction model8 based on Sagas.23
The classic compensating transaction example is travel-related, where
a booking transaction might need to
perform updates against multiple autonomous ticketing services (to obtain
airline, hotel, rental car, and concert
reservations) and roll them all back
via compensation in the event that reservations cannot be obtained from all
of them. Unfortunately, such support
is underdeveloped in current data service offerings, so this is an area where
all current systems fall short and further refinement is required. The current state of the art leaves too much
to the application developer in terms
of hand-coding compensation logic as
well as picking up the pieces after nonatomic failures.
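To make the compensation idea concrete, here is a minimal Python sketch of the travel-booking case. It is an illustration under stated assumptions, not a mechanism offered by any of the data service products discussed: `run_saga`, `reserve`, `cancel`, `BookingFailed`, and the event log are hypothetical names, and the ticketing services are simulated in memory. Each step pairs an action with the compensation that undoes it; when a step fails, the compensations of the steps already completed run in reverse order.

```python
from typing import Callable, List, Tuple


class BookingFailed(Exception):
    """Raised by a step that cannot obtain its reservation."""


# A step is (name, action, compensation that undoes the action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except BookingFailed:
            for _, _, undo in reversed(completed):
                undo()
            return False
        completed.append(step)
    return True


# Hypothetical autonomous ticketing services, simulated as a log of events.
log: List[str] = []


def reserve(service: str, available: bool = True) -> Callable[[], None]:
    def action() -> None:
        if not available:
            raise BookingFailed(service)
        log.append(f"reserve {service}")
    return action


def cancel(service: str) -> Callable[[], None]:
    def undo() -> None:
        log.append(f"cancel {service}")
    return undo


trip: List[Step] = [
    ("airline", reserve("airline"), cancel("airline")),
    ("hotel", reserve("hotel"), cancel("hotel")),
    ("concert", reserve("concert", available=False), cancel("concert")),
]
ok = run_saga(trip)
```

In this run the concert reservation fails, so the hotel and then the airline reservations are cancelled. The sketch deliberately ignores the hard parts: a compensation can itself fail, and other clients can observe the intermediate state before it is undone. Those are the pieces an application developer currently has to pick up by hand.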
Another challenge, faced by both
single-source and multisource data
services, is the mapping of updates
made to the external model onto the corresponding updates required in the
underlying data source(s). This challenge arises because data services
that involve non-trivial mappings—as
might be built using the tools provided
by WCF or ODSI—present the service