match between an expression and a
note that may contain the expression
(e.g., “hastroublegoingtosleep”
and “hasahistoryofalcohol,”
presented as examples in Figures 5
and 6, respectively). In traditional
machine-learning approaches for
text classification, a human expert
is required to label phrases or entire
notes, and then a supervised-learning
algorithm attempts to generalize the
associations and apply them to new
data. In contrast, using non-negated
distinct expressions eliminates the
need for an additional computational
method to achieve generalizability:
across more than 10 million clinical
narrative notes, these expressions
were consistently found to be highly
prevalent for multiple clinical
conditions. TN thus provides distinct
classifications and is thereby expected
to provide robust results.
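
The matching step described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes a note is normalized by lowercasing it and dropping every non-alphanumeric character, and that a match is a plain substring test. The function names and both sample notes are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits (assumed rule)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def matches_any(note: str, expressions: list[str]) -> bool:
    """Return True if the normalized note contains any normalized expression."""
    squeezed = normalize(note)
    return any(normalize(expr) in squeezed for expr in expressions)


# Expressions taken from the examples in the text.
expressions = ["hastroublegoingtosleep", "hasahistoryofalcohol"]

# Hypothetical notes, invented for illustration only.
note_a = "Patient reports she has trouble\ngoing to sleep most nights."
note_b = "Patient sleeps well and denies alcohol use."

print(matches_any(note_a, expressions))  # True
print(matches_any(note_b, expressions))  # False
```

Because spaces and line breaks are removed before the comparison, the expression is found even when the note wraps it across lines.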
TN performs well in determining
a variety of clinical conditions and
processes notes rapidly (approximately
1 millisecond per note on average);
however, no benchmark for comparing
processing performance has yet
been reported for similar
tasks. We also extended TN for uses
beyond extracting smoking status.
For instance, we used TN to extract
family history of coronary artery
disease, classify patients with sleep
disorders, improve the accuracy
of the Framingham risk score for
patients with nonalcoholic fatty liver
disease, and classify nonadherence
to T2DM (see past projects:
http://researcher.ibm.com/researcher/view_person_pubs.php?person=ibm-Uri.Kartoun).
Further, TN could be used to
enhance the standard regular
expression pattern language (in
which a sequence of characters
defines a search pattern). Applying
standard regular expressions relies
on knowing a priori the patterns
to search for, and this is exactly
what TN’s human-in-the-loop step
addresses (Figure 2). The regular
expression pattern language
can benefit from TN’s initial
identification of a collection of
phrases to match.
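
One way a phrase collection could feed a regular expression is sketched below. It assumes the phrases are available in their spaced form and uses only Python's standard `re` module. The helper `build_pattern`, the separator rule, and the sample note are illustrative assumptions, not part of TN as published.

```python
import re


def build_pattern(phrases: list[str]) -> re.Pattern:
    """Compile one regex matching any phrase, allowing any run of
    non-alphanumeric characters (spaces, line breaks, punctuation)
    between consecutive words."""
    alternatives = []
    for phrase in phrases:
        words = [re.escape(word) for word in phrase.split()]
        alternatives.append(r"[\W_]+".join(words))
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


# Spaced forms of the two expressions cited in the text.
phrases = ["has trouble going to sleep", "has a history of alcohol"]
pattern = build_pattern(phrases)

# Hypothetical note, invented for illustration only.
note = "Pt. has trouble going\nto sleep; otherwise well."
match = pattern.search(note)
print(match is not None)  # True
```

The human-identified phrases supply the patterns to search for, while the regular expression handles variation in spacing and letter case.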
An additional advantage of TN
is that it is not sensitive to negations
and is capable of ignoring them
efficiently (e.g., “no longer smokes,”
“not a cigarette smoker,” and “has not
smoked since”). Further, TN does not
require setting a priori, subjectively
selected, or data-dependent
configuration parameters such as those
required by support vector machines
(SVMs), the most popular approach
used at the i2b2 smoking status natural
language processing challenge [6]. Another
advantage unique to TN is its potential
applicability to languages other than
English because the human-in-the-
loop procedure is not tied to a specific
language. For instance, the phrase
“smokes two packages per day” would
be “smokestwopackagesperday”
in English, “ ”
in Hebrew, “ ” in Chinese,
and “fumadoispacotespordia”
in Portuguese.
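
The two points above, negated phrases and other languages, can be illustrated with a short sketch. It assumes the same normalization for every language: case-fold the text and keep only characters that Python's Unicode-aware `str.isalnum` accepts. The function name `squeeze` and the spaced Portuguese phrase are assumptions made for illustration.

```python
def squeeze(text: str) -> str:
    """Case-fold the text and keep only alphanumeric characters (assumed rule)."""
    return "".join(ch for ch in text.casefold() if ch.isalnum())


# English and Portuguese examples; the spaced Portuguese form is assumed.
print(squeeze("smokes two packages per day"))  # smokestwopackagesperday
print(squeeze("fuma dois pacotes por dia"))    # fumadoispacotespordia

# The negated phrases quoted in the text do not contain the expression.
expression = "smokestwopackagesperday"
negated = ["no longer smokes", "not a cigarette smoker", "has not smoked since"]
print([expression in squeeze(phrase) for phrase in negated])  # [False, False, False]
```

An expression that is specific enough simply never occurs inside the negated phrases, so no separate negation-detection step is needed in this sketch. Whether the same normalization is adequate for a given non-Latin script is not tested here.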
TN has several limitations. First,
TN requires that a human dedicate
time to identifying candidate
expressions. While TN requires
only a few human hours for the
task of classifying smoking status,
performing more complex tasks (e.g.,
identifying complications after a
surgery) would require additional
time. However, once this effort is
complete, the identified expressions
can be generalized, deployed on any
database, and used by the research
community. In addition, the language
describing an individual’s smoking
status might vary considerably from
place to place. Per
Zipf’s law, expression frequencies
follow a long-tailed distribution, and
English allows an effectively unbounded
number of possible expressions, so
one cannot enumerate all the ways to
describe smoking status. However,
across varied conditions, the results
demonstrate that the tail of the
distributions can be ignored (as seen
in Figures 4–6).
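
A simple way to check how much of a distribution its tail accounts for is to compute the share of all occurrences covered by the most frequent expressions. The sketch below uses invented counts and placeholder expression names. It does not reproduce the data behind Figures 4–6.

```python
from collections import Counter


def coverage_of_top_k(counts: Counter, k: int) -> float:
    """Fraction of all occurrences accounted for by the k most frequent items."""
    total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(k))
    return top / total if total else 0.0


# Hypothetical occurrence counts, invented for illustration only.
counts = Counter({
    "expr_a": 500, "expr_b": 300, "expr_c": 150,
    "expr_d": 30, "expr_e": 15, "expr_f": 5,
})

for k in (1, 3, 6):
    print(k, round(coverage_of_top_k(counts, k), 3))
# 1 0.5
# 3 0.95
# 6 1.0
```

In this made-up example the three most frequent expressions cover 95% of occurrences, which is the kind of pattern that would justify ignoring the tail.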
Humans are the ones who
created letters and languages, and
therefore we are capable of accurately
identifying highly descriptive
non-negated expressions. Similar
to my Ph.D. dissertation, in which
I described how a collaboration
between a human and a robot can
expedite a learning task [4], my
research on TN demonstrated that
an interactive human-in-the-loop
extension, applied here on large
collections of clinical narrative
notes, can produce high classification
accuracy across distinct medical
conditions. In conclusion, TN is a
rapid, accurate, and easily adaptable
method of identifying patients’
clinical descriptors by interacting
with clinical narrative notes. The
use of TN allows for accurate and
comprehensive identification of
many medical conditions, which will
improve precision and recall values
in studies that rely on textual data.
DOI: 10.1145/3139488 © 2017 ACM 1072-5520/17/11 $15.00
ACKNOWLEDGMENTS
I would like to thank Professor Peter
Szolovits of MIT’s Department of
Electrical Engineering and Computer
Science for his constructive feedback
regarding Text Nailing. I further
wish to express my deepest gratitude
to Drs. Stanley Shaw and Kathleen
Corey for their priceless guidance
during my fellowship at MGH.
Endnotes
1. Kartoun, U. The man who had them all.
ACM Interactions 24, 4 (July–Aug. 2017),
22–23.
2. Kartoun, U., Kumar, V., Cheng, S.C., Yu,
S., Liao, K., Karlson, E., Ananthakrishnan,
A., Xia, Z., Gainer, V., Cagan, A., Savova,
G., Chen, P., Murphy, S., Churchill, S.,
Kohane, I., Szolovits, P., Cai, T., and
Shaw, S. Demonstrating the advantages of
applying data mining techniques on
time-dependent electronic medical records.
Proc. of AMIA 2015 Annual Symposium.
3. Savova, G.K., Ogren, P.V., Duffy, P.H.,
Buntrock, J.D., and Chute, C.G. Mayo
Clinic NLP system for patient smoking
status identification. Journal of the
American Medical Informatics Association
15, 1 (2008), 25–28.
4. Kartoun, U. Human–robot collaborative
learning methods. Ph.D. dissertation.
Ben-Gurion University of the Negev, Israel, 2008.
5. Kartoun, U.*, Beam, A.*, Pai, J., Chatterjee,
A., Fitzgerald, T., Kohane, I.*, and Shaw,
S.* The spectrum of insomnia-associated
comorbidities in an electronic medical
records cohort. Proc. of AMIA 2016 Annual
Symposium. *Contributed equally.
6. Uzuner, O., Goldstein, I., Luo, Y., and
Kohane, I. Identifying patient smoking
status from medical discharge records.
Journal of the American Medical Informatics
Association 15, 1 (2008), 14–24.
Uri Kartoun is a research staff member at
IBM Research in Cambridge, MA. Previously
he was a research fellow at Harvard Medical
School/Massachusetts General Hospital. His
Ph.D. from Ben-Gurion University of the Negev,
Israel, focused on human-robot collaboration.
uri.kartoun@ibm.com
INTERACTIONS.ACM.ORG NOVEMBER–DECEMBER 2017 INTERACTIONS 49