n-grams (such as “how” and “the”)
would remain relatively stable from
year to year. Alternatively, the search space could be reduced in a
linguistically informed way through part-of-speech (POS)
tagging, such that only those n-grams
identified as noun phrases would
be added to the dataset. However,
state-of-the-art POS taggers are only
about 95% accurate, implying that
many interesting n-grams could have
been overlooked had we taken this
approach. Nevertheless, POS-based
n-gram identification remains an option, especially when the corpus to be
analyzed is extremely large.
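To make the POS-based alternative concrete, the following is a minimal sketch of noun-phrase filtering. It assumes the tokens have already been tagged (the Penn Treebank-style tags below are written by hand, not produced by a tagger) and it assumes a deliberately simple noun-phrase pattern of optional adjectives followed by one or more nouns. The pattern and the function names are illustrative assumptions, not details taken from the study.

```python
import re

# Assumed noun-phrase shape: zero or more adjectives, then one or more nouns.
NOUN_PHRASE = re.compile(r"J*N+")

def tag_class(tag):
    """Collapse a Penn Treebank-style tag into J (adjective), N (noun), or x (other)."""
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "x"

def noun_phrase_ngrams(tagged_tokens, max_n=3):
    """Return every n-gram (n <= max_n) whose tag sequence matches NOUN_PHRASE."""
    classes = "".join(tag_class(tag) for _, tag in tagged_tokens)
    kept = []
    for n in range(1, max_n + 1):
        for start in range(len(tagged_tokens) - n + 1):
            if NOUN_PHRASE.fullmatch(classes[start:start + n]):
                words = [word.lower() for word, _ in tagged_tokens[start:start + n]]
                kept.append(" ".join(words))
    return kept

# Hand-tagged example sentence (assumed tags, for illustration only).
sentence = [("The", "DT"), ("mobile", "JJ"), ("phone", "NN"),
            ("changed", "VBD"), ("computing", "NN")]
ngrams = noun_phrase_ngrams(sentence)

# The same sentence with one tagging error: "phone" mislabeled as a verb.
mistagged = [("The", "DT"), ("mobile", "JJ"), ("phone", "VB"),
             ("changed", "VBD"), ("computing", "NN")]
ngrams_with_error = noun_phrase_ngrams(mistagged)
```

In this sketch a single mislabeled token is enough to drop both “phone” and “mobile phone” from the result, which is the kind of loss the accuracy concern above refers to.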
Finally, we constructed a Web-based system to enable us to query,
graph, and explore our
Communications n-gram database, plot and analyze multiple n-grams simultaneously,
and combine related search terms
into a single result. For example, the
search phrase “cellphone+cellphones,
smartphone+smartphones” would
produce a graph containing two lines,
one representing the combined frequencies of the terms “cellphone” and
“cellphones” over time, the other representing the combined frequencies of
the terms “smartphone” and “smartphones” over time. To try out our
Communications n-gram tool, see http://www.invivo.co/ngrams/cacm.aspx.
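The query syntax described above (commas separate graph lines, plus signs combine terms within a line) can be sketched as follows. This is only an illustration of the idea, not the code behind the Web-based tool; the function names, the in-memory dictionary standing in for the n-gram database, and the counts themselves are all assumptions made for the example.

```python
from collections import defaultdict

def parse_query(query):
    """Split a query into series; each series is a list of terms to be summed."""
    series = []
    for part in query.split(","):
        terms = [term.strip().lower() for term in part.split("+") if term.strip()]
        if terms:
            series.append(terms)
    return series

def combine(series, frequencies):
    """Sum the per-year frequencies of all terms in each series."""
    lines = {}
    for terms in series:
        totals = defaultdict(int)
        for term in terms:
            for year, freq in frequencies.get(term, {}).items():
                totals[year] += freq
        lines["+".join(terms)] = dict(sorted(totals.items()))
    return lines

# Made-up counts for illustration only; these are not data from the article.
frequencies = {
    "cellphone": {2000: 3, 2010: 5},
    "cellphones": {2000: 1, 2010: 4},
    "smartphone": {2010: 6},
    "smartphones": {2000: 0, 2010: 2},
}
query = "cellphone+cellphones, smartphone+smartphones"
lines = combine(parse_query(query), frequencies)
```

Each key in `lines` corresponds to one line on the graph, and a year missing for one term simply contributes nothing to that year's combined total.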
Findings
Though we cannot expect a single article to identify all the ways the computing field has
evolved, we do aim
to provide a point of embarkation for
future research. Beginning with big-picture considerations, we are confident in saying that the structure and content
of Communications evolved significantly from 2000 to 2010. An analysis
of our metadata revealed several striking, large-scale structural changes.
Over that decade,
Communications published an average
of 306 articles per year, each containing an average of about 2,400 words.
However, these averages obscured underlying trends showing that both the
number of articles published per year
and the average length of each article
grew significantly, especially in more
recent years. These trends (see Figure 1) imply Communications was providing
more value to its readers than it had previously, since more recent
issues contained more articles and
words than earlier issues.

If, in the aggregate, Communications reflects what is happening in computing, then perhaps existing industry standards should be refined to more closely approximate real-world practice.
Changing focus
Continuing our investigation, we next
extracted the 15 terms that experienced the most growth or decline in
popularity in Communications from
2000 to 2010 (see Table 1). We hope
you find at least a few trends in the
table that are unexpected or interesting; indeed, finding such trends is a primary
goal of large-scale data mining. We
noticed that several of the terms
showing the most growth were related
to science and technology, while several of the declining terms were related to business and management.
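A ranking of this kind can be sketched as follows. The article does not spell out how growth was measured, so this example assumes percentage change in relative frequency (occurrences divided by total words published that year) between the first and last years. Normalizing by total words matters here because the volume of published text itself grew over the decade. The term names, counts, and function names are placeholders, not values from Table 1.

```python
def percent_change(start, end):
    """Percentage change from start to end; None when start is zero (undefined)."""
    if start == 0:
        return None
    return (end - start) / start * 100.0

def rank_terms(counts, totals, first_year, last_year, top=15):
    """Return the top growing and top declining terms by change in relative frequency."""
    changes = {}
    for term, by_year in counts.items():
        start = by_year.get(first_year, 0) / totals[first_year]
        end = by_year.get(last_year, 0) / totals[last_year]
        change = percent_change(start, end)
        if change is not None:
            changes[term] = change
    ordered = sorted(changes.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top], ordered[::-1][:top]

# Placeholder counts for illustration only.
totals = {2000: 1000, 2010: 2000}           # total words published per year
counts = {
    "alpha": {2000: 10, 2010: 50},          # raw occurrences per year
    "beta": {2000: 20, 2010: 20},
    "gamma": {2000: 5, 2010: 10},
}
growing, declining = rank_terms(counts, totals, 2000, 2010, top=2)
```

Note that “beta” appears the same number of times in both years yet ranks as declining, because its share of all words published was halved.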
But is this observation merely anecdotal, or does it reflect a
broader pattern in Communications?
To answer this question, and to show how n-gram
analyses can be integrated with more
traditional analytic techniques, we
conducted an interaction analysis
comparing the n-gram frequencies for
terms related to business and management with those related to science
and technology. We identified related
terms using Thinkmap’s Visual Thesaurus software (http://www.visualthesaurus.com), which is specifically
designed for this purpose. We then
extracted n-gram frequencies for the
resulting lists of related terms, using
these values to conduct our interaction analysis (see Figure 2). As shown
in the figure, the average frequency of
business- and management-related
terms declined steadily from 2000 to
2010, while science- and technology-related terms became more common.
Our interaction analysis indicated
that the observed disparity was highly
significant (t[5052] = 2.834, p < 0.01), providing statistical evidence of
Communications’ evolving identity.
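For readers unfamiliar with interaction analyses, the following sketch shows one common formulation; the article does not state the exact model used, so this is an assumption. Frequency is regressed on year, a group indicator, and their product, and the coefficient on the product is tested with a t statistic. With a single two-level group, that coefficient equals the difference between the slopes of two separately fitted lines, and its standard error uses the pooled residual variance. The data below are invented for illustration and are unrelated to the reported result.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line; returns slope, Sxx, and residual sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, sxx, rss

def interaction_t(group_a, group_b):
    """t statistic and degrees of freedom for the year-by-group interaction term."""
    slope_a, sxx_a, rss_a = fit_line(*group_a)
    slope_b, sxx_b, rss_b = fit_line(*group_b)
    df = len(group_a[0]) + len(group_b[0]) - 4   # two intercepts and two slopes
    pooled_var = (rss_a + rss_b) / df            # shared residual variance estimate
    se = math.sqrt(pooled_var * (1 / sxx_a + 1 / sxx_b))
    return (slope_b - slope_a) / se, df

# Invented average frequencies for illustration only.
years = [2000, 2002, 2004, 2006, 2008, 2010]
business = [9.0, 8.5, 7.0, 6.5, 5.0, 4.5]
science = [4.0, 5.5, 6.0, 7.5, 8.0, 9.5]
t_stat, df = interaction_t((years, business), (years, science))
```

A large positive `t_stat` indicates that the second group's trend rises relative to the first; comparing it against a t distribution with `df` degrees of freedom yields the p-value.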
Changes in style
The style of the writing in
Communications also evolved from 2000 to 2010.
Authors seemed to be transitioning
from the traditional academic style
of writing, adopting instead a less formal, more personal voice. Evidence
of this change can be seen in the increasing use of words that refer directly to an article’s author(s) (such as
“I” at +143% and “we” at +137%) and in the
increased frequency with which authors spoke