be headed. After all, “We are what we
write, we are what we read, and we are
what we make of what we read.”4
Here, we address the identity prob-
lem by prescribing the same medicine
for ourselves—technology and algo-
rithms—we often prescribe for others.
We present a culturomic analysis5 of
Communications showing how natural
language processing can be used to
quantitatively explore the identity and
culture of an institution over time, inspired
by the n-gram project released
in 2010 by Google Labs (http://ngrams.googlelabs.com).
In natural language
processing, an n-gram can be viewed
as a sequence of words of length n
extracted from a larger sequence of
words.12 Google’s project allows for
quantitative study of cultural trends
based on combinations of words and
phrases (n-grams) appearing in a cor-
pus of more than five million books
published as early as the 15th century.
The central theory behind the project
is that when taken together, the words
appearing in a collection of books re-
veal something about human culture
at the time the books were written.
Analyzing these words computation-
ally makes it possible to study cultural
evolution over time.
Figure 1. Structural changes in Communications from 2000 to 2010.
[Plot, 2000 to 2010: total words published per year (axis: Words Published) and articles published per year (axis: Articles Published).]
Table 1. Major trends (growth and decline) in terms published in Communications from
2000 to 2010.

Growing in Popularity
Change       Term
+18,151%     Google
+11,160%     queue
+10,833%     cloud
+6,439%      VM
+5,703%      IT professionals
+5,151%      parity
+5,151%      workload
+4,844%      venue
+4,599%      polynomial time
+4,292%      DRAM
+4,231%      test cases
+4,016%      theorem
+3,741%      OOP
+3,557%      science and engineering
+3,434%      emulator

Declining in Popularity
Change       Term
-9,459%      perceptual
-9,459%      wrapper
-9,295%      biometrics
-9,295%      CORBA
-8,969%      telemedicine
-7,991%      disintermediation
-7,665%      multimedia
-7,502%      transcription
-6,425%      personalization
-6,197%      user profile
-5,382%      e-commerce
-4,281%      e-business
-4,240%      satellites
-4,158%      AOL
-4,077%      OCR
Method
To appreciate how the identity of
Communications has evolved, we first constructed a corpus of the complete text
of every article it published from 2000
to 2010.a We also collected metadata
for all these articles, including title,
author(s), year published, volume, and
issue. In total, our corpus contained
3,367 articles comprising more than
8.1 million words. To put this in perspective, consider that if you were to
spend 40 hours per week reading
Communications, you would need more
than four months to read every article
published from 2000 to 2010.
With our corpus complete, we next
constructed a software system to tokenize, or split, the text of each article
into a series of n-grams. For example,
René Descartes’ famous phrase “cogito ergo sum”10 can be subdivided into
three 1-grams (cogito, ergo, and sum),
two 2-grams (cogito ergo and ergo sum),
and one 3-gram (cogito ergo sum).
As this illustrates, the number
of n-grams that could potentially be
extracted from a large corpus of text
greatly exceeds the number of words
in the corpus itself. This situation has
serious scaling and performance implications for a corpus with millions
of words, so to avoid them we limited
our analysis to include n-grams with a
maximum length of n = 4.
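The n-gram extraction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the system used for the analysis; the function names `ngrams` and `all_ngrams` are hypothetical, and splitting on whitespace is a simplifying assumption.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(tokens, max_n=4):
    """Collect the n-grams for every n from 1 up to max_n (4 in the article)."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(ngrams(tokens, n))
    return grams


# Whitespace splitting is an assumption made only for this example.
tokens = "cogito ergo sum".split()

for n in range(1, 4):
    print(n, [" ".join(g) for g in ngrams(tokens, n)])
# 1 ['cogito', 'ergo', 'sum']
# 2 ['cogito ergo', 'ergo sum']
# 3 ['cogito ergo sum']

# Three words already yield six n-grams, more than the number of words.
print(len(all_ngrams(tokens)))  # 6
```

A sequence of N tokens yields N - n + 1 n-grams of length n, so capping n at 4 keeps the total below four times the number of tokens.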
To address the challenges of punctuation, we adopted the same method
used by the developers of Google’s
n-gram project for digitized books,13
treating most punctuation marks as
separate words during the n-gram
construction process. The phrase “Elementary, my dear Watson” would,
for example, be tokenized as five words:
Elementary , my dear Watson
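One way to treat punctuation marks as separate words is a short regular expression, sketched below in Python. This is an assumption made for illustration, not the tokenizer used by Google or by the authors: the pattern and the name `tokenize` are hypothetical, and it splits off every punctuation mark, whereas the method described above does so only for most of them.

```python
import re

# A token is either a run of word characters or a single character that is
# neither a word character nor whitespace (that is, a punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into words, keeping each punctuation mark as its own token."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Elementary, my dear Watson"))
# ['Elementary', ',', 'my', 'dear', 'Watson']
```

Because every punctuation mark is split off, this sketch also breaks contractions and hyphenated words apart, which a production tokenizer would likely handle with extra rules.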
a Only the text of each article was included in
the database; excluded was trailing matter
(such as acknowledgements and references).