Pintea, C.-M., and Palade, V. A glass-box interactive
machine learning approach for solving NP-hard
problems with the human-in-the-loop; https://arxiv.
8. Holzinger, A., Schantl, J., Schroettner, M., Seifert, C.,
and Verspoor, K. Biomedical text mining: State-of-the-art, open problems and future challenges. Chapter
in Interactive Knowledge Discovery and Data Mining in
Biomedical Informatics, Lecture Notes in Computer
Science LNCS 8401, A. Holzinger and I. Jurisica, Eds.
Springer, Berlin, Heidelberg, Germany, 2014, 271–300.
9. Islam, A., Milios, E., and Keselj, V. Text similarity using
Google tri-grams. In Proceedings of the 25th Canadian
Conference on Artificial Intelligence (Toronto, Canada,
May 28–30). Springer, Toronto, Canada, 2012, 312–317.
10. Kintsch, W. Predication. Cognitive Science 25, 2 (2001),
11. Landauer, T. K. and Dumais, S. T. A solution to Plato’s
problem: The latent semantic analysis theory of
acquisition, induction, and representation of knowledge.
Psychological Review 104, 2 (1997), 211–240.
12. Landauer, T.K., Foltz, P. W., and Laham, D. An
introduction to latent semantic analysis. Discourse
Processes 25, 2 and 3 (1998), 259–284.
13. Landauer, T.K., Laham, D., and Derr, M. From paragraph
to graph: Latent semantic analysis for information
visualization. Proceedings of the National Academy of
Sciences 101, 1 (Apr. 6, 2004), 5214–5219.
14. Larsen, K. R. and Bong, C. H. A tool for addressing
construct identity in literature reviews and meta-analyses. MIS Quarterly 40, 3 (Sept. 2016), 529–551.
15. Larsen, K.R., Michie, S., Hekler, E.B., Gibson, B.,
Spruijt-Metz, D., Ahern, D., Cole-Lewis, H., Ellis,
R.J.B., Hesse, B., Moser, R.P., and Yi, J. Behavior
change interventions: The potential of ontologies for
advancing science and practice. Journal of Behavioral
Medicine 40, 1 (Feb. 2017), 6–22.
16. Valle-Lisboa, J. C. and Mizraji, E. The uncovering
of hidden structures by latent semantic analysis.
Information Sciences 177, 19 (Oct. 2007), 4122–4147.
17. Zhiwei, H., Meiping, C., Jimin, W., Qing, S., Chao,
Y., Xing, D., and Zhonggao, W. Improved control of
hypertension following laparoscopic fundoplication for
gastroesophageal reflux disease. Frontiers of Medicine
11, 1 (Mar. 2017), 68–73.
David Gefen (email@example.com) is a professor in the
Decision Sciences and MIS Department, Academic
Director of the Doctorate in Business Administration
Program, and Provost Distinguished Research Professor
in the LeBow College of Business at Drexel University,
Philadelphia, PA, USA.
Jake Miller (firstname.lastname@example.org) is an assistant clinical
professor in the Management Department in the LeBow
College of Business at Drexel University, Philadelphia,
PA, USA.
Johnathon Kyle Armstrong (Kyle.Armstrong@ibx.com)
is a research scientist at Independence Blue Cross,
Philadelphia, PA, USA.
Frances H. Cornelius (email@example.com) is a professor in
the College of Nursing and Health Professions and Chair
of the MSN Advanced Role Department, Complementary
and Integrative Health Department, and coordinator of
Clinical Nursing Informatics Education in the College
of Nursing and Health Professions at Drexel University,
Philadelphia, PA, USA.
Noreen Robertson (firstname.lastname@example.org) is the Associate
Vice Dean for research at Drexel University College
of Medicine and a research assistant professor in the
Department of Biochemistry & Molecular Biology at
Drexel University, Philadelphia, PA, USA.
Aaron Smith-McLallen (email@example.com)
is the Director of Data Science and Health Care Analytics
at Independence Blue Cross, Philadelphia, PA, USA.
Jennifer A. Taylor (firstname.lastname@example.org) is an associate
professor of environmental and occupational health in the
School of Public Health at Drexel University, Philadelphia,
PA, USA.
This study was supported by Drexel Grant #282847.
Copyright held by authors.
Publication rights licensed to ACM.
dicative of the potential in applying
LSA to such contexts. We created a semantic model that identified known
relationships among medical terms,
relating diagnoses and treatments.
We scrubbed the sample of most demographic data; the dataset was too
small anyway to allow cross-sectional
analysis. Constructing a model from
a larger, more detailed dataset could
yield substantial potential for medical discovery. Comparing reports
across patients could provide even
more information, for example by enabling
creation of a “typical” profile of the care/treatment
trajectory, as well as of diagnosis and prognosis, as they apply
to disease, condition names, or ICD
codes. Such a profile could conceivably lead to early detection and help
identify exceptional cases in need
of immediate medical attention.
A typical profile for a condition or
ICD code could help create a method
to at least partly support phase IV
testing of new drugs involving long-term monitoring of the effects of
drugs following approval by the U.S.
Food & Drug Administration. LSA
could improve this process not only
by automating it but also by identifying a drug’s possible indirect effects,
that is, effects associated with the drug
only through another diagnosis. So,
for example, if drug A is associated
with condition B and condition B is
associated with condition C, then
LSA will identify that A and C might
be related. A human examiner might
not notice it but could be aided by
LSA to identify possible connections
of interest for the expert to consider;
see Holzinger et al. 7 for more on interactive machine learning.
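The transitive inference described above can be illustrated with a minimal Python sketch using NumPy. The term names, the six toy records, and the choice of k = 2 retained dimensions are all invented for illustration and are not taken from the study's data. In the toy term-document matrix, drug_A and condition_C never appear in the same record, yet both co-occur with condition_B, so a truncated SVD places them close together.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are records.
# drug_A and condition_C never share a record; both co-occur with condition_B.
terms = ["drug_A", "condition_B", "condition_C", "fracture", "cast"]
tdm = np.array([
    [1, 1, 0, 0, 0, 0],  # drug_A
    [1, 1, 1, 1, 0, 0],  # condition_B
    [0, 0, 1, 1, 0, 0],  # condition_C
    [0, 0, 0, 0, 1, 1],  # fracture (unrelated topic)
    [0, 0, 0, 0, 1, 1],  # cast     (unrelated topic)
], dtype=float)


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def lsa_term_vectors(matrix, k):
    """Term coordinates in the k-dimensional space of a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


term_vecs = lsa_term_vectors(tdm, k=2)  # k=2 is an assumption for this toy case
a = terms.index("drug_A")
c = terms.index("condition_C")
f = terms.index("fracture")

raw_ac = cosine(tdm[a], tdm[c])              # similarity on raw co-occurrence
lsa_ac = cosine(term_vecs[a], term_vecs[c])  # similarity in the reduced space
lsa_af = cosine(term_vecs[a], term_vecs[f])  # drug_A versus an unrelated term

print(f"raw cosine(drug_A, condition_C) = {raw_ac:.2f}")
print(f"LSA cosine(drug_A, condition_C) = {lsa_ac:.2f}")
print(f"LSA cosine(drug_A, fracture)    = {lsa_af:.2f}")
```

On this toy matrix the raw cosine between drug_A and condition_C is zero, while in the reduced space it is close to one, and the unrelated fracture term stays orthogonal to drug_A. Real clinical text would give far less clean results, and the number of retained dimensions would need tuning.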
Analyzing medical records could
also allow comparison of diagnosis and
prognosis across populations (such
as differentiating between men and
women). Accounting for demographics could also indicate the prevalence of
diagnosis and prognosis by age and by
geographical area, possibly indicating
hazardous environmental conditions.
Moreover, given the diversity within society,
running LSA on medical records
could also allow quasi-experimental design
studies, as in, say, comparing clinics
in areas where unique treatments
are allowed against those where they
are not. Planning such an experiment
would be difficult and IRB approval
might not always be forthcoming, but
if the treatment conditions occur in the
population, then studying them would
not be so contentious and could become
routine. LSA could support such
an after-the-effect examination.
Combining LSA cosine distances
with TDM frequency values may also
allow identification of extraordinary
cases that are closely related to a condition but very rare. In the data we had,
we omitted terms that appeared in fewer than four records, but the relationships between rare diseases and more
common ones might suggest new avenues of research for a range of health issues. As a supplement to medical practice, a text-analytic approach might be
able to suggest alternative diagnoses
based on documented symptoms that
might otherwise be attributed to more
Above all, a key advantage of LSA is
that it allows rank ordering of related
terms. Being able to assign numbers,
and hence quantify how strongly a condition is related to other conditions,
could provide insight into which symptoms indicate
a problem (such as hypertension), and how much more
strongly than others.
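A rough sketch of such a rank ordering, again in Python with NumPy, follows. The terms, the twelve toy records, the helper name rank_related, and k = 3 are assumptions made up for the example; only the cutoff of four records echoes the filtering described earlier. The sketch ranks all other terms by cosine to a target term in the reduced space and reports how many records contain each term, so that a term that ranks high but appears in few records stands out.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are records.
terms = ["hypertension", "headache", "dizziness",
         "rare_syndrome", "fracture", "cast"]
tdm = np.array([
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # hypertension
    [1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],  # headache
    [0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0],  # dizziness
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # rare_syndrome (two records only)
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0],  # fracture
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1],  # cast
], dtype=float)

MIN_RECORDS = 4  # terms in fewer records than this are treated as rare


def rank_related(target, terms, tdm, k):
    """Rank every other term by cosine to `target` in a k-dim LSA space.

    Returns (term, cosine, record_count) tuples, highest cosine first.
    """
    u, s, _ = np.linalg.svd(tdm, full_matrices=False)
    vecs = u[:, :k] * s[:k]                # term coordinates
    record_count = (tdm > 0).sum(axis=1)   # number of records per term
    t = vecs[terms.index(target)]
    rows = []
    for i, term in enumerate(terms):
        if term == target:
            continue
        denom = np.linalg.norm(t) * np.linalg.norm(vecs[i])
        cos = float(t @ vecs[i] / denom) if denom > 0 else 0.0
        rows.append((term, cos, int(record_count[i])))
    return sorted(rows, key=lambda r: r[1], reverse=True)


ranking = rank_related("hypertension", terms, tdm, k=3)
for term, cos, count in ranking:
    flag = "  <- rare" if count < MIN_RECORDS else ""
    print(f"{term:15s} cosine={cos:+.2f} records={count}{flag}")
```

Which terms rank highest depends on the data and on k, so the printed order should be read as illustrative only.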
In a musical parody, lyricist David Lazar
wrote in a song called “Dr. Freud” that Sigmund Freud’s disciples said, “…by God,
there’s gold in them thar ills.” There certainly was. Maybe there is also gold in gleaning medical insight from medical records through LSA.
1. Beel, J., Gipp, B., Langer, S., and Breitinger, C.
Research-paper recommender systems: A literature
survey. International Journal on Digital Libraries 17, 4
(Nov. 2016), 305–338.
2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer,
T.K., and Harshman, R. Indexing by latent semantic
analysis. Journal of the American Society for
Information Science 41, 6 (1990), 391–407.
3. Evangelopoulos, N., Zhang, X., and Prybutok, V.R.
Latent semantic analysis: Five methodological
recommendations. European Journal of Information
Systems 21, 1 (Jan. 2012), 70–86.
4. Gefen, D., Endicott, J., Fresneda, J., Miller, J., and
Larsen, K.R. A guide to text analysis with latent
semantic analysis in R with annotated code studying
online reviews and the Stack Exchange community.
Communications of the Association for Information
Systems 41, 21 (Dec. 2017), 450–496.
5. Gefen, D. and Larsen, K. Controlling for lexical closeness
in survey research: A demonstration on the technology
acceptance model. Journal of the Association for
Information Systems 18, 10 (Oct. 2017), 727–757.
6. Gomez, J.C., Boiy, E., and Moens, M.-F. Highly discriminative
statistical features for email classification. Knowledge
and Information Systems 31, 1 (Apr. 2012), 23–53.
7. Holzinger, A., Plass, M., Holzinger, K., Crisan, G.C.,