ing opportunities, and the success of
Google and other technologies that
successfully analyze unstructured texts
are examples and evidence that the
old-fashioned idea that only structure
is the way to go forth is wrong.”
A Lack of Flexibility
The medical community has long
agreed on the basic data fields for the
description of diseases. This schema,
the International Classification of Diseases (ICD), uses a hierarchical system
to pinpoint specific medical conditions. Two versions of ICD are concurrently in use: ICD-9, which is reaching the end of its useful life, and ICD-10, its successor. ICD-9's 13,000 available diagnostic codes use three to five alphanumeric characters per disease or medical condition but do not distinguish between injuries to the left and right sides, whereas the more granular ICD-10 comprises 68,000 codes and includes laterality.
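To make that difference in granularity concrete, here is a minimal sketch in Python contrasting a single ICD-9-style entry with the left/right variants ICD-10 allows. The specific codes and descriptions shown are illustrative assumptions and should be checked against the official ICD-9-CM and ICD-10-CM code sets.

```python
# Illustrative only: the codes and descriptions below are assumptions,
# not an authoritative excerpt of the ICD-9-CM or ICD-10-CM code sets.
icd9 = {
    "813.42": "Fracture of distal radius (no laterality recorded)",
}

icd10 = {
    "S52.501A": "Fracture of lower end of right radius, initial encounter",
    "S52.502A": "Fracture of lower end of left radius, initial encounter",
}

def describe(code_set, code):
    """Look up a diagnosis code in the given code set."""
    return code_set.get(code, "unknown code")

# One coarse ICD-9 entry corresponds to several more specific ICD-10 codes.
print(describe(icd9, "813.42"))
print(describe(icd10, "S52.502A"))
```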
ICD is currently the lingua franca of
disease classification upon which many
researchers rely to supply a structured
data element when analyzing comorbidities. However, leading bioinformatics researchers, including Brunak and
Christopher Chute, M.D., professor of
bioinformatics at the Mayo Clinic and
chairman of the World Health Organization’s steering group responsible for
the next ICD iteration, ICD-11, say the
terminology is not really suitable for
serving as a baseline.
“ICD is not intended to be an exhaustive catalog of clinical concepts
that may be encountered or enumer-
ated,” Chute says. “It is a high-level
aggregation of diseases, and that is
its purpose. It was originally a public
health response, and it is also used for
reimbursement, but to treat it as a cata-
log of disease is incorrect.”
Brunak concurs. "There is no doubt ICD-10 is not the best text-mining vocabulary for spotting things in records. It's very difficult to say something general about which ontology or which system is best because there's a different signal-to-noise ratio in different types of records, and one type does not fit all."
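As a rough illustration of what a text-mining vocabulary does when "spotting things in records," the sketch below matches terms from a tiny, made-up vocabulary against a free-text note. The terms, concept labels, and note are assumptions; a real system would draw on a full terminology such as ICD or SNOMED and handle negation, abbreviations, and context.

```python
import re

# Made-up vocabulary mapping surface terms to concept labels (assumptions).
VOCAB = {
    "myocardial infarction": "heart attack (concept)",
    "heart attack": "heart attack (concept)",
    "type 2 diabetes": "diabetes mellitus type 2 (concept)",
    "hypertension": "hypertension (concept)",
}

def spot_concepts(note):
    """Return (matched term, concept) pairs found in a free-text note."""
    lowered = note.lower()
    return [
        (term, concept)
        for term, concept in VOCAB.items()
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]

note = ("Pt with hypertension and type 2 diabetes, "
        "admitted after suspected heart attack.")
print(spot_concepts(note))
```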
Yet without some type of universally accepted thesaurus, Chute says the prospect for significant advances in the natural language processing of clinical data is uncertain. He says the ICD-11 working group has been operating in concert with the International Health Terminology Standards Organization, which oversees the development of the Systematized Nomenclature of Medicine (SNOMED), a hierarchical semantic network of more than 300,000 medical concepts and their relationships. The semantic nature of SNOMED allows for more than seven million relationships descending from the top three hierarchical classifications of "finding," "disease," and "procedure."
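The sketch below shows, with invented concepts, the kind of "is-a" hierarchy such a semantic network encodes; real SNOMED concepts carry numeric identifiers and many more relationship types than this stand-in.

```python
# Invented miniature "is-a" hierarchy in the spirit of SNOMED's network;
# the labels are illustrative stand-ins, not actual SNOMED content.
IS_A = {
    "bacterial pneumonia": "pneumonia",
    "viral pneumonia": "pneumonia",
    "pneumonia": "lung disease",
    "lung disease": "disease",
    "disease": None,  # top of this toy hierarchy
}

def ancestors(concept):
    """Walk the is-a links from a concept up to its top-level class."""
    chain = []
    parent = IS_A.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = IS_A.get(parent)
    return chain

# A specific diagnosis inherits every broader class above it, which is what
# lets analyses roll detailed codes up into coarser groupings.
print(ancestors("bacterial pneumonia"))  # ['pneumonia', 'lung disease', 'disease']
```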
Chute says the ICD-11 and SNOMED terms will be harmonized, yielding interoperability between the two dominant clinical data schemas. "That's what we ultimately need, and even SNOMED in its current incarnation doesn't fully capture the spectrum of clinical concepts you'd like to catalog for natural language processing."
S. Trent Rosenbloom, M.D., associate professor of biomedical informatics at Vanderbilt University, says the difficulty in aligning analytical capabilities stems from the lack of an overriding priority in developing health-care data formats: a patient record must satisfy clinicians' need to describe a course of treatment, provide documentation for legal safeguards, and serve as a billing document that satisfies insurance companies. In "Data From Clinical Notes: A Perspective on the Tension Between Structure and Flexible Documentation," recently published in the Journal of the American Medical Informatics Association, Rosenbloom and colleagues note that "the flexibility of a computer-based documentation method to allow healthcare providers freedom and ensure accuracy can directly conflict with a desire to produce structured data to support reuse of the information in [electronic health record] systems."
Rosenbloom's colleague, Joshua C. Denny, M.D., assistant professor of biomedical informatics at Vanderbilt, says the holy grail of analyzing extremely large volumes of health records to deliver "personalized medicine" to a single patient will depend not only on perfecting NLP capabilities confined within clinical walls, but also on expanding the concept of what belongs in a medical record, and who should be authorized to provide data. For example, a 50-year-old man who runs every day may paradoxically have high levels of both good high-density lipoprotein (HDL) cholesterol, which helps to clear the arteries (high amounts of exercise can elevate it), and bad low-density lipoprotein (LDL) cholesterol, which is a risk factor for coronary disease. Following conventional medical wisdom, the man's physician may want to prescribe medication to lower the LDL level without actually knowing whether it is necessary, because there is currently no capability to pull population-wide data on such a relatively small cohort. Patient-curated data may help discern what treatment, if any, would be appropriate for such patients.
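A minimal sketch of the kind of population-wide query Denny has in mind appears below; the patient records, field names, and thresholds are all assumptions invented for illustration, not clinical guidance, and real cohort selection would run against de-identified records at far larger scale.

```python
# Invented records; field names and cutoffs are assumptions for illustration.
patients = [
    {"id": 1, "age": 50, "exercises_daily": True,  "hdl": 75, "ldl": 165},
    {"id": 2, "age": 52, "exercises_daily": True,  "hdl": 80, "ldl": 170},
    {"id": 3, "age": 49, "exercises_daily": False, "hdl": 40, "ldl": 150},
]

def cohort(records, min_age=45, max_age=55, min_hdl=60, min_ldl=160):
    """Select active middle-aged patients with both high HDL and high LDL."""
    return [
        p for p in records
        if min_age <= p["age"] <= max_age
        and p["exercises_daily"]
        and p["hdl"] >= min_hdl
        and p["ldl"] >= min_ldl
    ]

# With population-scale data, outcomes within such a narrow cohort could be
# compared to judge whether LDL-lowering medication actually helps.
print(len(cohort(patients)), "matching patients")
```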
Perfecting NLP Capabilities
While standards groups and policy
committees address the issues surrounding format harmonization, as
well as the legal issues surrounding pa-