balls” approach to massive datasets.
However, even massive datasets are
sometimes not complete enough to
deliver definitive results. Recent discoveries in biomedical research have
revealed that even a complete index
of the human genome’s three billion
pairs of chemical bases has not greatly
accelerated breakthroughs in health
care, because other crucial medical
data is missing. A study of 19,000 women, led by researchers at Brigham and
Women’s Hospital in Boston, used data
constructed from the National Human
Genome Research Institute’s catalog
of genome-wide association study results published between 2005 and June
2009—only to find that the single biggest predictor of heart disease among
the study’s cohort is self-reported family history. Correlating such personal
data with genetic indexes on a wide demographic scale is nearly impossible today, because an estimated 80% of U.S.-based
primary-care physicians do not record
patient data in electronic medical records (EMRs). Recent government financial incentives are meant to spur
EMR adoption, but for the immediate
future, crucial data in biomedical research will not exist in digital form.
Another issue in biomedical research
is the reluctance of traditionally trained
scientists to accept datasets that were
not created under the strict parameters
required by, for example, epidemiologists and pharmaceutical companies.
CMU’s Mitchell says this arena of
public health research could be in the
vanguard of what may be the true crux
of the new data flood—the idea that the
provenance of a given dataset should
matter less than the provenance of a
“The right question is, Do I have a
scientific question and a method for
answering it that is scientific, no matter
what the dataset is?” Mitchell asks. Increasingly, he says, computational scientists will need to frame their questions
and provide data for an audience that extends far beyond their traditional peers.
“We’re at the beginning of the curve
of a decades-long trend of increasingly evidence-based decision-making
across society, that’s been noticed by
people in all walks of life,” he says.
“For example, the people at the public
policy school at CMU came to the machine learning department and said,
‘We want to start a joint Ph.D. program
in public policy and machine learning,
because we think the future of policy
analysis will be increasingly evidence-based. And we want to train people
who understand the algorithms for
analyzing and collecting that evidence
as well as they understand the policy
side.’” As a result, the joint Ph.D. program was created at CMU.
Gregory Goth is an Oakville, CT-based writer who
specializes in science and technology.
© 2010 ACM 0001-0782/10/1100 $10.00
Due to enormous governmental
investments in research and
development, scientists in
many Asian countries are
steadily increasing their
number of papers published in
The Asia-Pacific region
increased its share of published
science articles from 13% in
the early 1980s to slightly more
than 30% in 2009, according
to the Thomson Reuters
National Science Indicators,
an annual database of the
number of articles published
in about 12,000 internationally
recognized journals. China
leads the pack with 11% in
2009, up from 0.4% in the early
1980s, followed by Japan with
6.7% and India with 3.4%. In
contrast, the ratio of articles
from scientists in the U.S.
decreased to 28% in 2009,
down from 40% in the early 1980s.
In all, 25 nations have
increased their research, but
none more so than Singapore.
With a population of just five
million, the nation published
8,500 articles in 2009,
compared with only 200 in
1981. Singapore now allocates
3% of its gross domestic
product to research and
development, a figure expected
to rise to 3.5% by 2015.
The increase in scientific
publications, especially in
East Asian countries, reflects
a “phenomenal” increase in
funding, Simon Marginson, a
professor of higher education
at the University of Melbourne,
told The New York Times.
Marginson attributed the
increase in research output to
establishing knowledge-intensive economies. “It’s
very much not simply about
knowledge itself—it’s about
its usefulness throughout the
economy. I think that that
economic vision is really the
principal driver,” Marginson said.
Another reason for
increased publications is that
many Asian universities now
receive additional funding to
have their papers translated
into English, the language used
by the majority of academic