stances; for example, in health care,
controlled experiments have helped
identify many causes of disease but
may not reflect the actual complexities of health. 3, 18 Indeed, some estimates claim clinical trials exclude
as much as 80% of the situations in
which a drug might be prescribed, as
when a patient is on multiple medications. 3 In situations where we are able
to design randomized trials, big data
makes it feasible to uncover the causal models generating the data.
As shown earlier in the diabetes-related health-care example, big data
makes it feasible for a machine to ask
and validate interesting questions
humans might not consider. This capability is indeed the foundation for
predictive modeling, which
is key to actionable business decision
making. For many data-starved areas
of inquiry, especially health care and
the social, ecological, and earth sciences, data provides an unprecedented opportunity for knowledge discovery and theory development. Never
before have these areas had data of the
variety and scale available today.
This emerging landscape calls for
the integrative skill set identified here
as essential for emerging data scientists. Academic programs in computer science, engineering, and business
management teach a subset of these
skills but have yet to teach the integration of skills needed to function
as a data scientist or to manage data
scientists productively. Universities
are scrambling to address the lacunae
and provide a more integrated skill
set covering basic skills in computer
science, statistics, causal modeling,
problem isomorphs and formulation,
and computational thinking.
Predictive modeling and machine
learning are increasingly central to
the business models of Internet-based
data-driven businesses. An early success,
PayPal, was able to capture and
payments due to its ability to predict
the distribution of losses for each
transaction and act accordingly. This
data-driven ability was in sharp contrast
to the prevailing practice of treating
transactions identically from a
risk standpoint. Predictive modeling
is also at the heart of Google’s search
engine and several other products. But
the first machine that could arguably
be considered to pass the Turing test
and create new insights in the course
of problem solving is IBM’s Watson,
which makes extensive use of learning
and prediction in its problem-solving
process. In a game like “Jeopardy!,”
where understanding the question itself
is often nontrivial and the domain
open-ended and nonstationary, it is
not practical to be successful through
an extensive enumeration of possibilities
or top-down theory building.
The solution is to endow a computer
with the ability to train itself automatically
based on large numbers of
examples. Watson also demonstrated
that the power of machine learning is
greatly amplified through the availability
of high-quality human-curated
data, as in Wikipedia. This trend—
combining human knowledge with
machine learning—also appears to be
on the rise. Google’s recent foray into
the Knowledge Graph16 is intended to
enable the system to understand the
entities corresponding to the torrent
of strings it processes continuously.
Google wants to understand “things,”
not just “strings.” 26
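The per-transaction risk idea described above for PayPal can be illustrated with a minimal sketch. It is not PayPal's actual method: the `Transaction` type, the `p_loss` field (standing in for the output of some predictive model), and the fee parameters are all hypothetical, and the full loss distribution mentioned in the text is reduced here to a single expected value for brevity.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float  # transaction value in dollars
    p_loss: float  # hypothetical model-predicted probability of a loss


def expected_loss(tx: Transaction) -> float:
    """Expected loss: predicted probability of loss times the amount at risk."""
    return tx.p_loss * tx.amount


def decide(tx: Transaction, fixed_fee: float = 0.30, fee_rate: float = 0.029) -> str:
    """Approve when expected fee revenue exceeds expected loss, else send to review.

    The fee schedule is an assumed illustration, not a real pricing rule.
    """
    expected_revenue = fixed_fee + fee_rate * tx.amount
    return "approve" if expected_loss(tx) < expected_revenue else "review"


# Three hypothetical transactions with different predicted risk.
transactions = [
    Transaction(amount=40.0, p_loss=0.001),
    Transaction(amount=900.0, p_loss=0.08),
    Transaction(amount=15.0, p_loss=0.02),
]

decisions = [decide(tx) for tx in transactions]
print(decisions)
```

A policy that treats all transactions identically would apply one rule to all three; here the large, high-risk transaction is routed differently from the two small, low-risk ones.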
Organizations and managers face
significant challenges in adapting to
the new world of data. It is suddenly
possible to test many of their established intuitions, experiment cheaply
and accurately, and base decisions on
data. This opportunity requires a fundamental shift in organizational culture, one seen in organizations that
have embraced the emerging world of
data for decision making.
1. Anderson, C. The end of theory: The data deluge
makes the scientific method obsolete. Wired 16, 7
(June 23, 2008).
2. Aral, S. and Walker, D. Identifying influential and
susceptible members of social networks. Science 337,
6092 (June 21, 2012).
3. Buchan, I., Winn, J., and Bishop, C. A Unified Modeling
Approach to Data-Intensive Healthcare. The Fourth
Paradigm: Data-Intensive Scientific Discovery.
Microsoft Research, Redmond, WA, 2009.
4. Dhar, V. Prediction in financial markets: The case for
small disjuncts. ACM Transactions on Intelligent
Systems and Technologies 2, 3 (Apr. 2011).
5. Dhar, V. and Chou, D. A comparison of nonlinear
models for financial prediction. IEEE Transactions on
Neural Networks 12, 4 (June 2001), 907–921.
6. Dhar, V. and Stein, R. Seven Methods for Transforming
Corporate Data Into Business Intelligence.
Prentice-Hall, Englewood Cliffs, NJ, 1997.
7. Frawley, W. and Piatetsky-Shapiro, G., Eds. Knowledge
Discovery in Databases. AAAI/MIT Press, Cambridge,
8. Gladwell, M. The Tipping Point: How Little Things Can
Make a Big Difference. Little Brown, New York, 2000.
9. Goel, S., Watts, D., and Goldstein, D. The structure of
online diffusion networks. In Proceedings of the 13th
ACM Conference on Electronic Commerce (2012),
10. Hastie, T., Tibshirani, R., and Friedman, J. The
Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer, New York, 2009.
11. Heilbron, J.L., Ed. The Oxford Companion to the
History of Modern Science. Oxford University Press,
New York, 2003.
12. Hey, T., Tansley, S., and Tolle, K., Eds. The
Fourth Paradigm: Data-Intensive Scientific Discovery.
Microsoft Research, Redmond, WA, 2009.
13. Hunt, J., Baldocchi, D., and van Ingen, C. Redefining
Ecological Science Using Data. The Fourth Paradigm:
Data-Intensive Scientific Discovery. Microsoft
Research, Redmond, WA, 2009.
14. Issenberg, S. A more perfect union: How President
Obama’s campaign used big data to rally individual
voters. MIT Technology Review (Dec. 2012).
15. Kohavi, R., Longbotham, R., Sommerfield, D., and
Henne, R. Controlled experiments on the Web: Survey
and practical guide. Data Mining and Knowledge
Discovery 18 (2009), 140–181.
16. Lin, T., Patrick, P., Gamon, M., Kannan, A., and Fuxman,
A. Active objects: Actions for entity-centric search. In
Proceedings of the 21st International Conference on
the World Wide Web (Lyon, France). ACM Press, New
17. Linoff, G. and Berry, M. Data Mining Techniques: For
Marketing, Sales, and Customer Support. John Wiley
& Sons, Inc., New York, 1997.
18. Maguire, J. and Dhar, V. Comparative effectiveness for
oral anti-diabetic treatments among newly diagnosed
type 2 diabetics: Data-driven predictive analytics in
healthcare. Health Systems 2 (2013), 73–92.
19. McKinsey Global Institute. Big Data: The Next
Frontier for Innovation, Competition, and Productivity.
Technical Report, June 2011.
20. Meinshausen, N. Relaxed lasso. Computational
Statistics & Data Analysis 52, 1 (Sept. 15, 2007),
21. Papert, S. An exploration in the space of mathematics
educations. International Journal of Computers for
Mathematical Learning 1, 1 (1996), 95–123.
22. Pearl, J. Causality: Models, Reasoning, and Inference.
Cambridge University Press, Cambridge, U.K., 2000.
23. Perlich, C., Provost, F., and Simonoff, J. Tree induction
vs. logistic regression: A learning-curve analysis.
Journal of Machine Learning Research 4, 12 (2003),
24. Popper, K. Conjectures and Refutations. Routledge,
25. Provost, F. and Fawcett, T. Data Science for Business.
O’Reilly Media, New York, 2013.
26. Roush, W. Google gets a second brain, changing
everything about search. Xconomy (Dec. 12, 2012);
27. Shmueli, G. To explain or to predict? Statistical
Science 25, 3 (Aug. 2010), 289–310.
28. Simon, H.A. and Hayes, J.R. The understanding
process: Problem isomorphs. Cognitive Psychology 8,
2 (Apr. 1976), 165–190.
29. Sloman, S. Causal Models. Oxford University Press,
Oxford, U.K. 2005.
30. Spirtes, P., Scheines, R., and Glymour, C. Causation,
Prediction and Search. Springer, New York, 1993.
31. Tukey, J.W. Exploratory Data Analysis.
Addison-Wesley, Boston, 1977.
32. Wing, J. Computational thinking. Commun. ACM 49, 3
(Mar. 2006), 33–35.
Vasant Dhar (email@example.com) is a professor and co-director of the Center for Business Analytics at the Stern
School of Business at New York University, New York.
Copyright held by owner/author(s). Publication rights
licensed to ACM. $15.00