helpful for identifying connections
that heretofore had not been identified and thus can be used as a predictive model to enable disease screening,
early detection, and intervention.
Demonstrating this ability to identify relationships, the cosine distances in the github.com link show
that the term “hypertension” is most
closely related to “hyperlipidemia,”
which, according to the American
Heart Association, means “high levels
of fat particles (lipids) in the blood.”
Considering that this is a very common condition, estimated to afflict 31
million Americans, it might be expected. The terms “benign” and
“essential” are also close, as is the diagnostic code “4011,” or “icd4011,” which,
in the ICD9 code, means “benign essential hypertension.” Hypertension
is also, as expected, closely related to
“obesity” and “mellitus” (diabetes),
hardening arteries (“atherosclerosis”), acid reflux (“gerd”), and high
cholesterol (“hypercholesterolemia”).
However, hypertension is also seman-
ed of suffering from congestive heart
failure in 2013 and 2014. IBX removed
associated patient identifiers, demographics, and cost data. The IBX Privacy
office and Drexel University medical
institutional review board (IRB) both
approved the research protocol in advance. The medical records consisted
of the text portions of the medical record in one file and the ICD medical
codes in another file. An artificial patient ID key replaced the actual patient
ID in each medical report and each ICD
list of codes. The medical reports for
each patient were combined using that artificial ID.
We analyzed the data as is. We did
not correct the data for alternative
equivalent spellings (such as “catheterization” and “catheterisation”). Nor did
we correct the data for obvious spelling
mistakes and OCR errors (such as correcting “cardioverterdefibrillator” to
“cardioverter-defibrillator”). This was
done deliberately so the power of LSA
could be shown even when run on untreated raw data. This was important
because manually correcting medical
reports is both costly and prone to introducing additional error. Manually
checking these words revealed them to
be mostly misspellings.
Analysis
We created the TDM after all the words
were cast as lowercase and punctuation and the standard set of stop words
were removed. Numbers were not removed
from the raw data because they could have represented ICD codes. We then subjected
the TDM to a TF-IDF transformation
before running an SVD on it, retaining
100 dimensions. There are no standard
rules of thumb for how many dimensions to retain because dimensionality
depends on context and corpora.13 Adding more SVD dimensions inevitably results in more nuance and variance, as
well as more noise.
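The preprocessing and weighting steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the stop-word list, the sample reports, the function names, and the particular IDF variant (natural log of documents over document frequency) are all hypothetical choices made here. The subsequent SVD is not shown; it would typically be delegated to a numerical library.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the "standard set" is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "and", "with", "was", "is", "in", "to"}

def tokenize(text):
    """Lowercase, delete punctuation, drop stop words; numbers are kept (possible ICD codes)."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [t for t in cleaned.split() if t not in STOP_WORDS]

def tfidf_tdm(docs):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF weight of term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    n_docs = len(docs)
    matrix = []
    for term in terms:
        df = sum(1 for c in counts if term in c)  # number of documents containing the term
        idf = math.log(n_docs / df)               # one common IDF variant (an assumption)
        matrix.append([c[term] * idf for c in counts])
    return terms, matrix

# Hypothetical report snippets, invented for illustration only.
docs = [
    "Patient with hypertension and hyperlipidemia, icd4011.",
    "Hypertension noted; the patient was obese.",
    "Cardiac catheterization performed.",
]
terms, tdm = tfidf_tdm(docs)
```

Because punctuation is deleted rather than replaced by a space, a hyphenated word collapses into a single token, and a code such as "icd4011" survives as a term because digits are retained.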
Knowing the data concerned congestive heart failure, we identified the
closest neighbor terms to “cardiac”
and “hypertension” after creating the
semantic space. The cosine distances
are listed in the github.com link mentioned earlier, omitting terms that appeared in fewer than four patient records, as well as strings with 10 or
more numeric digits (such as phone
numbers). Figure 1 and Figure 2 show
the heat-map clustering for the terms
“hypertension” and “cardiac,” respectively, and Figure 3 outlines the LSA
process. It is important to emphasize
that LSA also reveals indirect associations among terms (such as when one
term is related to another only through
a third term), a key advantage of LSA
over manual inspection.
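The neighbor lookup and the two filters described above might look like the following sketch. It assumes each term already has a vector in the reduced semantic space; the three-dimensional vectors, the document frequencies, and the function names are hypothetical, and "10 or more numeric digits" is interpreted here as a count of digit characters in the term. Closeness is scored by cosine similarity, so the closest neighbor has the highest score.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def keep_term(term, doc_freq, min_docs=4, digit_limit=10):
    """Keep terms found in at least min_docs records and with fewer than digit_limit digits."""
    digits = sum(ch.isdigit() for ch in term)
    return doc_freq.get(term, 0) >= min_docs and digits < digit_limit

def nearest_terms(target, vectors, doc_freq, top_n=40):
    """Rank the remaining terms by cosine similarity to the target, closest first."""
    scored = [
        (cosine(vectors[target], vec), term)
        for term, vec in vectors.items()
        if term != target and keep_term(term, doc_freq)
    ]
    return [term for _, term in sorted(scored, reverse=True)[:top_n]]

# Hypothetical term vectors, invented for illustration (the study retained 100 dimensions).
vectors = {
    "hypertension": [0.9, 0.1, 0.0],
    "hyperlipidemia": [0.8, 0.2, 0.0],
    "obesity": [0.6, 0.5, 0.1],
    "catheterization": [0.0, 0.1, 0.9],
    "2155551234567": [0.9, 0.1, 0.0],  # phone-like string, removed by the digit filter
    "rareterm": [0.9, 0.1, 0.0],       # appears in too few records
}
doc_freq = {
    "hypertension": 50, "hyperlipidemia": 30, "obesity": 20,
    "catheterization": 12, "2155551234567": 5, "rareterm": 2,
}
neighbors = nearest_terms("hypertension", vectors, doc_freq)
```

Filtering before ranking keeps rare or numeric strings from crowding out clinically meaningful neighbors, even when their vectors happen to lie close to the target.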
A researcher might correctly associate terms that appear together but
could miss those related only indirectly. That is in addition to the obvious advantage
of LSA: the analysis can be done
semi-automatically and quickly on very large
corpora, a task that would otherwise
require an unrealistic investment of
time if done manually. Moreover, and
crucially, LSA reveals not
only that terms are related but also
the degree of that relationship compared to other relationships as a set
of ordinal cosine distance measures.
These distances can be used for other
analyses (such as clustering terms to
determine structures within the text or
to compare documents). The ability to
run other analyses could be particularly
Figure 1. Clustering of the 40 terms and ICD-9-CM codes closest to “hypertension.”