options in R packages that are used to
create a semantic space; R is a free and
popular statistical language.
Once the semantic space is created, researchers can project terms and combinations of terms onto it; likewise, documents and parts of documents can be projected onto it to produce various measures of semantic closeness.2,12 That degree of semantic closeness is typically a cosine value that provides an ordinal scale of the relatedness of terms and documents. Terms and documents that are close to one another in meaning have higher cosine values, as revealed by LSA.6 This allows researchers to use LSA to identify synonyms.6,9,16 In fact, LSA has been used so successfully in identifying such closeness levels that it has been shown to answer introduction-to-psychology multiple-choice exam questions almost as well as students do12 and to score almost as high on the Test of English as a Foreign Language exam as nonnative speakers do.11 LSA can also classify articles into core research topics.3 The semantic space created by LSA can be so realistic that LSA has even been applied to identify how questionnaire items factor together.5
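To make those cosine measures concrete, the short R sketch below (using the lsa package) shows how closeness values can be read off a semantic space. It assumes a space has already been built with the lsa package; the object name and the medical terms are placeholders for illustration, not code or terms from the study.

    # Sketch: reading cosine closeness off an existing LSA space.
    # 'space' is assumed to be the output of lsa::lsa() on a weighted TDM;
    # the term and document names are placeholders.
    library(lsa)

    # Each term becomes a row vector in the reduced space, scaled by the
    # singular values.
    term_vecs <- space$tk %*% diag(space$sk)

    # Cosine between two terms; values nearer 1 indicate closer meaning.
    cosine(term_vecs["hypothyroidism", ], term_vecs["levothyroxine", ])

    # A combination of terms is projected as the sum of their vectors,
    # and documents can then be ranked by cosine against that combination.
    query    <- term_vecs["fatigue", ] + term_vecs["thyroid", ]
    doc_vecs <- space$dk %*% diag(space$sk)
    head(sort(apply(doc_vecs, 1, cosine, y = query), decreasing = TRUE))

Because the cosines form an ordinal scale, it is the ranking of these values against one another, rather than their absolute magnitudes, that carries the closeness information.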
Our study has sought to show that applying standard LSA packages in R is enough to produce associations among medical terms and ICD codes in electronic health records covering medical visits, including the relative strength of those associations compared with other associations, as corroborated by subject-matter experts, and to do so even when the packages are applied with only their default settings and without transforming the data beyond the automated transformation the packages introduce. Relying only on that standard transformation is important because it is not practical for researchers to manually correct typos, alternative spellings, shorthand, or optical character recognition (OCR) errors. Processing such “dirty” data without manual correction is necessary in real-world applications.
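A minimal sketch of such a default-settings run with the lsa package for R is shown below. The directory path is a placeholder, and nothing beyond the package's built-in preprocessing and default dimensionality choice is applied; it is an illustration of the approach, not the study's actual script.

    # Sketch: building a semantic space with the lsa package's built-in options.
    library(lsa)
    data(stopwords_en)   # the package's built-in English stop-word list

    # Term-document matrix built with the package's own preprocessing
    # (stemming, stop-word removal); no manual cleaning of the OCR text.
    tdm <- textmatrix("path/to/ocr_text_files",
                      stemming  = TRUE,
                      stopwords = stopwords_en)

    # Semantic space via SVD, keeping the package's default share of dimensions.
    space <- lsa(tdm, dims = dimcalc_share())

    # Cosine similarities among all terms in the reduced space.
    term_cos <- cosine(t(space$tk %*% diag(space$sk)))

Here dimcalc_share() is the package's default heuristic: it keeps just enough singular values to account for a set share of their total.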
Data
The data we used was provided by IBX
in a joint research project with Drexel
University. It consisted of 32,124 text
files obtained by running an OCR
on the medical transcripts of 1,009
scanned medical charts of 416 distinct
patients who suffered or were suspected of suffering from conditions more or less related to hypothyroidism. The results, reported later, also suggest the method could be applied to assist in the management of medical treatment by identifying unusual cases for special attention.
Introduction to LSA
LSA creates a semantic space from
which it is possible to derive lexical
closeness information; that is, how
close terms or documents are to one
another in a corpus. LSA creates that space by first creating a term-document matrix (TDM) from a relevant corpus of documents and then running a singular value decomposition (SVD) on that TDM. The
TDM is a frequency matrix that records
how often each term appears in each
document. Before the TDM is created,
the text in the corpus is often stemmed
and stop words are excluded. Stop
words are words that occur frequently
(such as “the” and “or”) and thus add
little or no semantic information to
the documents or to how terms relate
to one another. There are default lists
of stop words in English and other languages in R and other software packages. Additional words of interest can
be added to these lists so they, too, are
excluded from the semantic space. It is
also common in LSA practice to remove
accents, cast the text in lower case, and
remove punctuation marks.
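These preparatory steps, and the TDM itself, can be expressed directly with the tm package in R. The sketch below is illustrative only: the directory path and the added stop word are assumptions, and the number of SVD dimensions is an arbitrary choice.

    # Illustrative preprocessing and TDM construction (R, tm package).
    library(tm)

    corpus <- VCorpus(DirSource("path/to/text_files"))       # placeholder path

    corpus <- tm_map(corpus, content_transformer(function(x)
                       iconv(x, to = "ASCII//TRANSLIT")))     # remove accents
    corpus <- tm_map(corpus, content_transformer(tolower))    # cast to lower case
    corpus <- tm_map(corpus, removePunctuation)               # remove punctuation
    # Default English stop-word list plus any additional words of interest.
    corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "hospital"))
    corpus <- tm_map(corpus, stemDocument)                    # stemming
    corpus <- tm_map(corpus, stripWhitespace)

    # The TDM records how often each term appears in each document.
    tdm <- TermDocumentMatrix(corpus)

    # The semantic space comes from a truncated SVD of that matrix
    # (100 dimensions here is purely illustrative).
    dec <- svd(as.matrix(tdm))
    k   <- 100
    term_space <- dec$u[, 1:k] %*% diag(dec$d[1:k])

For a corpus of tens of thousands of documents, converting the TDM to a dense matrix and running a full svd() is impractical; a truncated-SVD routine, or the lsa package used in the earlier sketch, is the more realistic route.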
After the TDM is created, researchers often apply a process of weighting, whereby the frequency numbers
are replaced with a transformation
that considers the distribution of each
term in the document it appears in and
across the documents in the corpus.
Researchers typically apply both local
and global weighting. Local weighting
gives more weight to terms that appear
more often in a single document. Global weighting gives less weight to terms
that appear in many documents. One
of the most common weighting transformations is the term frequency-inverse document frequency (TF-IDF) transformation. Some research (such as by Beel et al.1) claims that TF-IDF is the most common text-mining transformation; it gives more weight to terms that appear often in a given document but less weight to terms that appear frequently in the corpus as a whole. It is also a recommended type of transformation.14 Stemming, stop-word removal, weighting transformation, and other preparatory steps are standard practice when preparing a corpus for LSA.
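Either form of weighting takes one line in R. The sketch below shows two common options, the tm package's TF-IDF weighting and the lsa package's separate local and global weights; it assumes the corpus and tdm objects from the tm sketch above.

    # Option 1 (tm): request TF-IDF weighting when building the TDM.
    library(tm)
    tdm_tfidf <- TermDocumentMatrix(corpus,
                                    control = list(weighting = weightTfIdf))

    # Option 2 (lsa): apply a local weight (log term frequency) and a
    # global weight (inverse document frequency) to a raw-count matrix.
    library(lsa)
    m <- as.matrix(tdm)
    m_weighted <- lw_logtf(m) * gw_idf(m)

Both options implement the same idea described above: a local weight that rewards a term's frequency within a document, combined with a global weight that penalizes terms spread across many documents.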
LSA reveals not only that terms are related but also the degree of that relationship compared to other relationships, as a set of ordinal cosine distance measures.