Notable exceptions to this rule include currency symbols, decimal components of numbers, and apostrophes
indicating possessive case. A term like
“$5.95” would be treated as a 1-gram,
while “Euler’s constant” would be
treated as a 2-gram. For a more general
rule for tokenization, developers might
consider splitting tokens that contain
a special character only if the character
is adjacent to whitespace or a linefeed.
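
As a rough illustration of this more general rule, the Python sketch below splits text on whitespace and then discards special characters only at the edges of each chunk, so characters inside a token (as in “5.95” or “Euler’s”) are left alone. The function names `tokenize` and `ngrams`, the definition of a special character as anything non-alphanumeric, the choice to discard the stripped characters, and the short list of currency symbols kept attached are all assumptions made for this example, not a description of the system used in the study.

```python
# Assumption: leading currency symbols stay attached to the token,
# mirroring the currency-symbol exception described in the text.
CURRENCY_SYMBOLS = "$€£¥"


def tokenize(text):
    """Split on whitespace, then drop special characters at token edges only."""
    tokens = []
    for chunk in text.split():  # split() treats linefeeds as whitespace too
        # Trim special (non-alphanumeric) characters from the right edge.
        end = len(chunk)
        while end > 0 and not chunk[end - 1].isalnum():
            end -= 1
        # Trim from the left edge, but keep a leading currency symbol.
        start = 0
        while (start < end and not chunk[start].isalnum()
               and chunk[start] not in CURRENCY_SYMBOLS):
            start += 1
        if start < end:
            tokens.append(chunk[start:end])
    return tokens


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    sample = "The book cost $5.95, according to “Euler’s constant.”"
    words = tokenize(sample)
    print(words)
    print(ngrams(words, 2))
```

Under these assumptions, “$5.95” survives as a single token and “Euler’s constant” yields one 2-gram, while the trailing comma, period, and quotation marks are dropped because they sit next to whitespace.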
Photograph by Alicia Kubista
We ignored case in the construction
of our n-gram corpus. Had we retained
case sensitivity, a term (such as “computer science”) would have been treated as distinct from the term
“Computer Science.” While ignoring case
vastly reduced the potential number of
n-grams the system might encounter,
it also reduced search specificity in some cases. Without
case sensitivity, the term “IT” (for information technology) would be considered identical to, say, the word “it.”
Despite this drawback, we concluded
that the overall benefit of ignoring case
outweighed its cost.
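
The trade-off can be seen in a small Python sketch. The helper name `count_terms` and its `ignore_case` flag are hypothetical, and `str.casefold` is used here only as one reasonable way to ignore case; the text does not say which method the study used.

```python
from collections import Counter


def count_terms(tokens, ignore_case=True):
    """Count token occurrences, optionally folding case first."""
    if ignore_case:
        # casefold() is a more aggressive, Unicode-aware form of lower().
        tokens = [token.casefold() for token in tokens]
    return Counter(tokens)


if __name__ == "__main__":
    tokens = ["IT", "it", "Computer", "computer"]
    print(count_terms(tokens))                     # two distinct keys
    print(count_terms(tokens, ignore_case=False))  # four distinct keys
```

Folding case halves the number of distinct terms in this toy input, but it also merges “IT” with “it,” which is the loss of specificity described above.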
Broadly speaking, our analysis of
how Communications evolved from
2000 to 2010 was predicated on the
idea that the level of importance or
relevance of a particular concept is
reflected in how often the concept is
mentioned over time. We therefore
had to compute the frequency with
which every n-gram in the corpus appeared in Communications during
each year of the analysis. For example,
if the n-gram “e-commerce” was mentioned 273 times in 2000 but only 23
times in 2010 (the actual frequencies observed for this n-gram in those two years), we might infer the
concept of e-commerce had become
less important in Communications
over time. However, direct frequency
comparisons can be deceiving because they do not account for growth or decline in the number
of words Communications published
over time. It was therefore necessary
for us to calculate relative frequencies for each n-gram. We thus divided
n-gram frequencies for each year by
the total number of words appearing
in the corpus during that year in order to produce a standardized measure of frequency that would allow
valid comparisons between n-grams
from year to year.13 The standardized
frequency values resulting from this
process indicated how often a particular n-gram appeared in
Communications during a particular year relative
to the total quantity of text published
in it that year. Standardized frequencies are not, of course, the only means
by which n-grams can be compared over time.
Indeed, other, more sophisticated
information-theoretic measures (such
as entropy and cross-entropy) can also
be used for this purpose.
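
The standardization step described above amounts to dividing each yearly n-gram count by the total number of words published that year. The Python sketch below shows the arithmetic. The function name `standardized_frequencies` and the nested-dictionary layout are assumptions for illustration; the counts for “e-commerce” come from the example above, but the yearly word totals are invented placeholders, not figures from the study.

```python
def standardized_frequencies(ngram_counts, total_words):
    """Divide each n-gram count by the total words published that year.

    ngram_counts: {year: {ngram: raw count}}
    total_words:  {year: total number of words in the corpus for that year}
    """
    result = {}
    for year, counts in ngram_counts.items():
        total = total_words[year]
        if total <= 0:
            raise ValueError(f"total word count for {year} must be positive")
        result[year] = {ngram: count / total for ngram, count in counts.items()}
    return result


# Raw counts taken from the e-commerce example in the text.
raw_counts = {2000: {"e-commerce": 273}, 2010: {"e-commerce": 23}}

# Hypothetical yearly word totals, chosen only to make the example concrete.
words_per_year = {2000: 1_000_000, 2010: 1_250_000}

frequencies = standardized_frequencies(raw_counts, words_per_year)

if __name__ == "__main__":
    for year in sorted(frequencies):
        print(year, frequencies[year]["e-commerce"])
```

Because each value is a share of that year's text, the 2000 and 2010 figures can be compared directly even if the volume of published text differs between the two years.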
The result was a vast database
containing more than 160 million n-grams
and their associated years and
standardized frequencies. From it we
then selected the one million unique
n-grams exhibiting the greatest absolute change over time, reasoning that
the frequencies of less-interesting