a human-in-the-loop procedure
(Figure 2). The procedure initially
considered all available 10,927,595
clinical narrative notes associated
with the 314,292 patients and
ignored notes that did not contain
any smoking-related keywords (e.g.,
“smok,” “tobac,” and “cig”). Once we
identified this smaller set of notes, we
reviewed a randomly selected sample
and examined the short text snippets
surrounding smoking-related keywords. This
allowed for the quick identification of
expressions associated with smoking
status and the classification of the
expressions into the smoking-status
classes. The manual review
continued until we judged,
subjectively, that we had identified
a substantial portion of the distinctive
smoking-related expressions.
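The filtering and snippet-review steps described above could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the keyword stems come from the text, while the function names, the case-insensitive matching, and the `window` size are assumptions made for the example.

```python
import re

# Keyword stems quoted in the text; case-insensitive matching is an assumption.
KEYWORDS = ("smok", "tobac", "cig")
KEYWORD_RE = re.compile("|".join(KEYWORDS), re.IGNORECASE)


def has_smoking_keyword(note: str) -> bool:
    """Return True if the note contains at least one keyword stem."""
    return KEYWORD_RE.search(note) is not None


def keyword_snippets(note: str, window: int = 30) -> list[str]:
    """Return the text around each keyword hit.

    `window` (characters kept on each side of the hit) is a hypothetical
    parameter; the source does not state how large the snippets were.
    """
    snippets = []
    for match in KEYWORD_RE.finditer(note):
        start = max(0, match.start() - window)
        end = min(len(note), match.end() + window)
        snippets.append(note[start:end])
    return snippets


# Made-up notes, for illustration only.
notes = [
    "Patient denies chest pain. Follow up in 2 weeks.",
    "Social history: smoked 2 packs per week, quit in 2010.",
    "No tobacco use reported.",
]

# Step 1: ignore notes without any smoking-related keyword.
candidates = [n for n in notes if has_smoking_keyword(n)]

# Step 2: show the snippets a human reviewer would read and classify.
for note in candidates:
    print(keyword_snippets(note, window=20))
```

A reviewer would then read the printed snippets and assign each distinctive expression to a smoking-status class by hand.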
HOMOGENEOUS SMOKING-RELATED EXPRESSIONS
We converted all identified expressions
and notes into alphabetical-only
representations (i.e., removed all
numbers, spaces, and characters
outside the 26 English letters) to
create homogeneous representations
of the expressions and to allow one-to-one matching when searching for
an expression in a note. In clinical
documentation, notes typically are
heterogeneous and may contain
dozens of similar expressions that
describe the same smoking status. For
instance, “smoked 2 packs per week”
(an indication of a past smoker) may be
similar to variations in other notes, such
as “smoked 5 packs per week”
(a different number of packs and a tab)
or “smoked: 4-6 packs per week”
(a range for the number of packs, a
hyphen, and multiple spaces). After
conversion, all three example
expressions are represented by
one homogeneous expression:
“smokedpacksperweek.” An
example for a note converted to an
alphabetical-only representation is
shown in Figure 3.
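A minimal sketch of the alphabetical-only conversion, assuming lowercase folding (the example "smokedpacksperweek" suggests it, but the text does not say so explicitly). The names `to_alpha_only` and `note_contains` are hypothetical.

```python
import re

# Anything that is not one of the 26 English letters (after lowercasing).
NON_LETTERS = re.compile(r"[^a-z]")


def to_alpha_only(text: str) -> str:
    """Drop digits, whitespace, punctuation, and any non-English letters."""
    return NON_LETTERS.sub("", text.lower())


def note_contains(note: str, expression: str) -> bool:
    """Match an expression against a note after normalizing both."""
    return to_alpha_only(expression) in to_alpha_only(note)


# The three variants discussed in the text (tab and extra spaces included).
variants = [
    "smoked 2 packs per week",
    "smoked\t5 packs per week",
    "smoked:  4-6 packs per week",
]

for variant in variants:
    print(to_alpha_only(variant))  # each prints: smokedpacksperweek
```

Because spaces are removed, a match can span word boundaries; for example, "hesmokes" is also a substring of "shesmokes", which is worth keeping in mind when interpreting matches.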
Because we identified hundreds of
smoking-related expressions, we
applied a cutoff threshold to highlight
the expressions that are useful for
identifying smoking status in a
variety of cohorts. We sampled 10,000
notes to present the prevalence of the
Figure 4. Most prevalent smoking-related expressions. The y-axis represents the percentage
of notes that contain the expression within a sample of 10,000 randomly selected notes.
PPD=packs per day. (A) Current smoker. (B) Past smoker. (C) Never smoked.
[Figure 4 bar charts not reproduced. Expression labels recovered from the x-axes, in extracted order:]
(A) Current smoker (y-axis 0.00% to 0.45%): currentsmoker, continuestosmoke, smokingppd, smokesppd, smokerppd, hesmokes, stillsmokes, shesmokes, activesmoker, ppdsmoker, currentlysmokes, tobaccopackday, doessmoke, patientsmokes.
(B) Past smoker (y-axis 0.00% to 1.10%): quitsmoking, pastsmoker, smokedppd, smokingquit, tobaccoquit, hesmoked, exsmoker, quittobacco, usedtosmoke, shesmoked, smokedpack, smokedcigarettes, hasnotsmokedsince, wasasmoker, smokedpacks, smokedinthepast.
(C) Never smoked (y-axis 0.00% to 1.75%): doesnotsmoke, nonsmoker, tobacconone, neverasmoker, smokerno, neversmoke, smokeno, nosmoke, notasmoker, doesntsmoke.
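The prevalence measure in the Figure 4 caption (the percentage of sampled notes that contain an expression) could be computed along these lines. This is a sketch under stated assumptions: the function names, the fixed random seed, and the toy notes are hypothetical, and lowercase folding during normalization is assumed.

```python
import random
import re

NON_LETTERS = re.compile(r"[^a-z]")


def to_alpha_only(text: str) -> str:
    """Alphabetical-only representation (lowercasing is an assumption)."""
    return NON_LETTERS.sub("", text.lower())


def expression_prevalence(notes, expressions, sample_size=10_000, seed=0):
    """Percentage of randomly sampled notes containing each expression.

    `expressions` are expected to be in alphabetical-only form already.
    `seed` is a hypothetical parameter added for reproducibility.
    """
    rng = random.Random(seed)
    sample = rng.sample(notes, min(sample_size, len(notes)))
    if not sample:
        return {expr: 0.0 for expr in expressions}
    normalized = [to_alpha_only(note) for note in sample]
    return {
        expr: 100.0 * sum(expr in note for note in normalized) / len(normalized)
        for expr in expressions
    }


# Made-up notes, for illustration only.
toy_notes = [
    "Current smoker, 1 PPD",
    "quit smoking in 2005",
    "Current  smoker",
    "no complaints",
]
print(expression_prevalence(toy_notes, ["currentsmoker", "quitsmoking"]))
```

Sorting the resulting percentages in descending order and keeping those above a chosen cutoff would give a ranking like the one plotted in each panel; the cutoff value itself is not stated in this section.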