fault tolerance on commodity hardware. However, the latest generation
of big data systems is rediscovering the
value of these principles and is adopting concepts and methods that have
been long-standing assets of the database community. Building on these
principles and assets, the database
community is well positioned to drive
transformative improvements to big data technology.
But big data also brings enormous
challenges, whose solutions will require massive disruptions to the design, implementation, and deployment of data management solutions.
The main characteristics of big data
are volume, velocity, and variety. The
database community has worked on
volume and velocity for decades, and
has developed solutions that are mission critical to virtually every commercial enterprise on the planet. The
unprecedented scale of big data, however, will require a radical rethinking
of existing solutions.
Variety arises from several sources.
First, there is the problem of integrating and analyzing data that comes
from diverse sources, with varying
formats and quality. This is another long-standing topic of database
work, yet it is still an extremely labor-intensive journey from raw data to
actionable knowledge. This problem
is exacerbated by big data, causing a
major bottleneck in the data processing pipeline. Second, there is the variety of computing platforms needed to
process big data: hardware infrastructures; processing frameworks, languages, and systems; and programming abstractions. Finally, there is a
range of user sophistication and preferences. Designing data management
solutions that can cope with such extreme variety is a difficult challenge.
Moving beyond the three Vs, many
big data applications will be deployed
in the cloud, both public and private, on
a massive scale. This requires new techniques
to offer predictable performance
and flexible interoperation. Many applications
will also require people to
solve semantic problems that still bedevil
current automatic solutions. This
can range from a single domain expert
to a crowd of workers, a user community,
or the entire connected world (for
example, Wikipedia). This will require
new techniques to help people be more
productive and to reduce the skill level
needed to solve these problems.
Finally, big data brings important
community challenges. We must rethink the approach to teaching data
management, reexamine our research
culture, and adapt to the emergence of
data science as a discipline.
The meeting identified five big data
challenges: scalable big/fast data infrastructures; coping with diversity in
data management; end-to-end processing of data; cloud services; and the
roles of people in the data life cycle.
The first three challenges deal with the
volume, velocity, and variety aspects of
big data. The last two deal with deploying big data applications in the cloud
and managing the involvement of people in these applications.
These big data challenges are not
an exclusive agenda to be pursued at
the expense of existing work. In recent
years the database community has
strengthened core competencies in relational DBMSs and branched out into
many new directions. Some important
issues raised repeatedly during the
meeting are security, privacy, data pricing, data attribution, social and mobile
data, spatiotemporal data, personalization and contextualization, energy-constrained processing, and scientific
data management. Many of these issues cut across the identified big data
challenges and are captured in the discussion here.
It is important to note that some
of this work is being done in collaboration with other computer science
fields, including distributed systems,
artificial intelligence, knowledge discovery and data mining, human-computer interaction, and e-science. In
many cases, these fields provided the
inspiration for the topic and the data
management community has joined
in, applying its expertise to produce
robust solutions. These collaborations
have been very productive and should
continue to grow.
Scalable big/fast data infrastructures. Parallel and distributed processing. In the database world, parallel
processing of large structured datasets has been a major success, leading
to several generations of SQL-based