At the end of 2014, cardiologist Dr.
Stanley Shaw introduced me to Dr.
Kathleen Corey, a hepatologist with
whom I started interrogating the
cohort to identify new biomarkers
associated with outcomes in
individuals suffering from liver diseases
and associated comorbidities. Dr.
Corey was interested in extracting
smoking-status information for use
as an important covariate in our
prediction models. The effects of
tobacco use are significant in the study
of patient outcomes and have been
studied extensively over the past several
decades. Tobacco use is linked to an
increased risk for and severity of a variety
of diseases, including cardiovascular
disease, respiratory illness, psychiatric
conditions, and cancers.
I pointed out that smoking status
is a data element that is not captured
sufficiently in a structured form
in this cohort. Smoking status is
typically documented in clinical
narrative notes as free text. The
available smoking-status extraction
methods (commonly based on supervised
learning) are, as reported in many
publications, only moderately accurate,
which can result in misclassifications.
Using support-vector machines (SVMs)
with several hundred documents, for
instance, yielded an accuracy of 85.57
percent, meaning that 14.43 percent of
the documents were misclassified [3].
The most significant scientific
moment during my training years
at MGH was when, inspired by
Dr. Corey’s request to extract
smoking statuses, I thought to
implement a new, highly accurate
text-classification method to extract
the statuses from notes. Motivated
by my Ph.D. dissertation in human-
robot collaboration to accomplish
learning tasks [ 4], I hypothesized that
following a simple human-in-the-loop
approach could achieve better results
than many widely used computational
approaches. I dedicated a few days
to implementing my method; in the
subsequent months, my colleagues
and I evaluated its accuracy.
We extensively tested my
method, which I call Text Nailing
(TN). I came up with the notion of Text
Nailing to allude to a metaphorical
hammer that uses metaphorical nails
to fasten characters in a fixed position.
In a note's alphabetical-only
representation, any alphabetical
letter must immediately precede or
follow another alphabetical letter.
Figure 1 illustrates the difference
between the widely adopted supervised-learning approach for text classification
and TN. I had the opportunity to briefly
present TN at the American Medical
Informatics Association’s 2016 Annual
Symposium [5]. In all use cases, nurses
and physicians manually validated
our performance results using clinical
chart reviews to guarantee high levels
of accuracy. Typically, micro and
macro F-measures (harmonic means
of precision and recall, frequently used
in information retrieval) were above
0.95 for the extracted descriptors. In
contrast, using other approaches in
the task of classifying smoking status
yielded lower performance, for example,
micro F-measures of up to 0.90 and
macro F-measures of up to 0.76 [6].
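For readers unfamiliar with the two averaging schemes, a minimal sketch of micro- and macro-averaged F-measures for a multi-class labeling task (such as current/past/never smoking status) might look like this; the labels below are illustrative, not data from the study:

```python
from collections import Counter

def micro_macro_f1(true_labels, predicted_labels):
    """Compute micro- and macro-averaged F1 for single-label,
    multi-class predictions.

    Micro-averaging pools true/false positives and false negatives
    across all classes before computing F1; macro-averaging computes
    F1 per class and takes the unweighted mean.
    """
    classes = sorted(set(true_labels) | set(predicted_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, predicted_labels):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro
```

The macro average weights rare classes (e.g., "past" smokers) as heavily as common ones, which is why it is usually the lower of the two numbers when minority classes are misclassified.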
Classification of whether a clinical
narrative note contains an indication
for smoking status (i.e., current, past,
or never) requires the identification
of smoking-related expressions,
which need to be manually assigned
into classes. To identify unique
expressions that distinctively define
smoking status, we implemented an
interactive human-in-the-loop method
(Figure 2).
Figure 1. Supervised learning versus Text Nailing. Supervised learning: (1) Use a small collection of documents to create a training set that contains features and labels. (2) Use a machine-learning algorithm to create a prediction model. (3) Extract features for a new document and use the prediction model to determine the document's expected label. Text Nailing: (1) Use a large collection of documents to extract non-negated descriptors and assign them into classes. (2) Convert the descriptors into alphabetical-only representations. (3) Convert a new document to an alphabetical-only representation and apply to the document all extracted non-negated descriptors to determine the document's expected label.
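The Text Nailing side of this comparison can be sketched in a few lines of Python. The descriptor lists below are invented for illustration, they are not the descriptors curated in the study; the sketch only shows the mechanics of alphabetical-only conversion and descriptor matching:

```python
import re

def to_alphabetical(text):
    """Reduce a note to lowercase letters only, so each descriptor can
    be 'nailed' as a fixed, uninterrupted character sequence."""
    return re.sub(r"[^a-z]", "", text.lower())

# Hypothetical non-negated descriptors, already converted to
# alphabetical-only form and manually assigned to classes.
DESCRIPTORS = {
    "current": ["currentsmoker", "smokesdaily"],
    "past": ["formersmoker", "quitsmoking"],
    "never": ["neversmoked", "deniestobaccouse"],
}

def classify_smoking_status(note):
    """Apply every descriptor to the alphabetical-only note; return the
    first class whose descriptor occurs, or None if there is no
    indication of smoking status."""
    condensed = to_alphabetical(note)
    for label, patterns in DESCRIPTORS.items():
        if any(p in condensed for p in patterns):
            return label
    return None
```

Because punctuation, digits, and whitespace are stripped before matching, a descriptor such as "formersmoker" matches "former smoker," "former-smoker," and "Former Smoker" alike, which is what makes simple substring matching sufficient here.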
Figure 2. An interactive human-in-the-loop method to identify smoking-related expressions: (1) Collect notes that contain "smok," "tob," or "cig." (2) Randomly select a subset. (3) For each note, observe the first occurrence of the string "smok," "tob," or "cig" and extract ±50 characters around it. (4) Copy the 100-character blobs (1,000 blobs) to a text editor for manual review. (5) Store expressions of interest and assign them to classes.
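The extraction step of this interactive method, pulling a ±50-character blob around the first keyword hit in each note for manual review, can be sketched as follows (a minimal illustration, not the code used in the study):

```python
def extract_blobs(notes, keywords=("smok", "tob", "cig"), window=50):
    """For each note containing a keyword, extract roughly 100
    characters (±50) around the first occurrence, for manual review."""
    blobs = []
    for note in notes:
        lowered = note.lower()
        hits = [lowered.find(k) for k in keywords if k in lowered]
        if not hits:
            continue  # note has no smoking-related keyword
        i = min(hits)  # earliest occurrence of any keyword
        blobs.append(note[max(0, i - window): i + window])
    return blobs
```

The reviewer then scans the resulting blobs in a text editor, stores expressions of interest, and assigns them to classes, closing the human-in-the-loop cycle.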
Figure 3. An example of an alphabetical-only converted note.