because it’s melting down in the face of
an unforeseen increase in traffic.
Load on a service is measured using different combinations of metrics
depending on the type of service, but a common unit for many services is queries per second (QPS).
Layered on top of QPS might be other
service-dependent metrics such as
storage size (gigabytes or terabytes),
memory usage, network bandwidth, or
I/O bandwidth (gigabits per second).
It’s useful to break demand growth
down into organic and inorganic. Organic growth is what you can forecast
by extrapolating historical trends in
traffic, and the forecasting problem
can often be addressed using statistical tools. Inorganic growth is what you
forecast for one-time events such as
product launches, changes in service
performance, or anticipated changes
in user behavior, among other factors,
and this growth cannot be extrapolated from historical data. Prediction
of inorganic growth is less amenable
to statistical tools and often relies on
rules of thumb and estimates derived
from similar events in the past. In the
time leading up to a service launch,
when there is not enough historical
data available to make an organic
growth forecast, teams estimate demand using techniques applicable to
inorganic growth.
Forecasting organic growth. For
mature products that have been in operation for a few years, you can forecast
organic growth using statistical methods. Note that linear regression is not
a useful tool in most cases, because it
does not capture seasonal traffic fluctuations; it also does not work if growth
is not linear. Many Web services see
significant drops in traffic (the
“summer slump”) because of the midyear
vacation season, and, conversely, see
big spikes in traffic during the year-end
shopping season, followed by a major
“holiday dip” in the last week of the
year, followed in turn by a
“back-to-work bounce” at the start of the new
year (see the accompanying figure). At
Google, we even account for predictable changes with a cycle time of several years, caused by events such as the
FIFA World Cup.
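To illustrate why a straight-line fit struggles with seasonal traffic, the following minimal sketch compares linear regression with a simple year-over-year seasonal forecast. The traffic series is synthetic, and the function names (`linear_fit`, `seasonal_yoy_forecast`, `synthetic_qps`, `mape`) are hypothetical; this is not one of the production models described in this article.

```python
import math

def linear_fit(ys):
    """Ordinary least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def seasonal_yoy_forecast(ys, period):
    """Forecast the next `period` points as the last season scaled by
    the growth observed between the two most recent seasons."""
    last = ys[-period:]
    prev = ys[-2 * period:-period]
    growth = sum(last) / sum(prev)
    return [value * growth for value in last]

def mape(forecast, actual):
    """Mean absolute percentage error, as a fraction (0.1 means 10%)."""
    return sum(abs(f - a) / a for f, a in zip(forecast, actual)) / len(actual)

def synthetic_qps(month):
    """Assumed traffic shape: 1% monthly growth with a 20% seasonal swing."""
    trend = 1000 * 1.01 ** month
    season = 1 + 0.2 * math.sin(2 * math.pi * month / 12)
    return trend * season

history = [synthetic_qps(m) for m in range(36)]     # three years of data
actual = [synthetic_qps(m) for m in range(36, 48)]  # the year to forecast

slope, intercept = linear_fit(history)
linear = [intercept + slope * m for m in range(36, 48)]
seasonal = seasonal_yoy_forecast(history, period=12)

print(f"linear regression MAPE: {mape(linear, actual):.1%}")
print(f"seasonal YoY MAPE:      {mape(seasonal, actual):.1%}")
```

On this idealized series the seasonal forecast is nearly exact because the data was generated with a fixed seasonal pattern and constant growth; real traffic is noisier, which is one reason to combine several models.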
Google uses a variety of forecasting
models that attempt to capture
seasonality on a monthly or annual
time scale. Forecasts are uncertain and
carry an implied confidence level,
so rather than forecasting a line,
we are forecasting a cone. Any given
statistical model has its strengths and
weaknesses, so many Google products
use outputs generated from a large
ensemble of models,6 which include
variants on many well-known approaches,
such as the Bass Diffusion
Model; Theta Model; logistic models;
Bayesian Structural Time Series; STL
(seasonal and trend decomposition
using Loess); Holt-Winters and other
exponential smoothing models; seasonal
and other ARIMA (autoregressive
integrated moving average)-based
models; year-over-year growth models;
custom models; and more.
Having generated independent
estimates from each model in the ensemble,
we then compute their mean
after applying a configurable “trimming”
parameter to eliminate outlier
estimates, and this adjusted mean is
used as the final prediction. Depending
on the scale and global reach of a
service and its different levels of adoption
in different parts of the world, it
might be more accurate to generate
continent-level or country-level forecasts
and aggregate them instead of
attempting to forecast at the global level.
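The trimming step and the regional aggregation can be sketched as follows. This is a minimal illustration, assuming that the trimming parameter is a fraction of estimates dropped from each end of the sorted list; the article does not specify how the parameter is defined, and the function name, region names, and numbers here are hypothetical.

```python
def trimmed_mean(estimates, trim_fraction=0.2):
    """Mean of the estimates after dropping the lowest and highest
    `trim_fraction` of them (assumed definition of the trimming parameter)."""
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(estimates)
    k = int(len(ordered) * trim_fraction)  # estimates dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical peak-QPS estimates from five models, per region.
# Each region contains one outlier estimate.
regional_estimates = {
    "americas": [1180, 1220, 1200, 1250, 2400],
    "europe": [900, 950, 920, 940, 300],
}

# Forecast each region separately, then aggregate to a global figure.
regional_forecast = {
    region: trimmed_mean(values, trim_fraction=0.2)
    for region, values in regional_estimates.items()
}
global_qps = sum(regional_forecast.values())

for region, qps in regional_forecast.items():
    print(f"{region}: {qps:.0f} QPS")
print(f"global: {global_qps:.0f} QPS")
```

With five estimates and a trim fraction of 0.2, one estimate is dropped from each end, so a single wildly high or low model output does not move the final prediction.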
It is important to compare forecasts
regularly with actual traffic in
order to tune the model parameters
over time and improve the accuracy
of the models. Experience shows that
the trimmed mean of the ensemble
of models delivers superior accuracy
compared with any individual model.
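One simple way to compare forecasts with actual traffic is to check how often the actual values fall inside the forecast cone. The sketch below is an assumption about how such a check might look, not a description of Google's tooling; `cone_coverage` and the sample numbers are hypothetical.

```python
def cone_coverage(actuals, lower, upper):
    """Fraction of actual observations that fall inside the forecast cone."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actuals, lower, upper))
    return inside / len(actuals)

# Hypothetical weekly peak QPS and the cone forecast for the same weeks.
actuals = [1010, 1040, 1100, 1180, 1300]
lower = [980, 990, 1000, 1010, 1020]
upper = [1040, 1070, 1100, 1130, 1160]

coverage = cone_coverage(actuals, lower, upper)
print(f"coverage: {coverage:.0%}")
```

If the cone is meant to represent, say, an 80% confidence level but observed coverage is much lower, the model is underestimating either growth or uncertainty, and its parameters are candidates for retuning.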
Forecasting inorganic growth. Inorganic
growth is generated by one-time
events that have no periodicity, such as
launches of new products or features,
marketing promotions, or changes
in user behavior triggered by
an external event whose
timing is predictable but whose resulting
peak traffic volume is highly
uncertain (like the FIFA World Cup
or the Royal Wedding).
Inorganic growth involves an abrupt
change in traffic and is intrinsically
unpredictable because it is triggered by an
event that hasn’t happened before or
otherwise cannot be directly extrapolated
from the past. When the product
owners and SREs have advance notice
of such growth, such as when planning
for a new feature launch, they need to
apply intuition and rules of thumb when
estimating post-launch traffic, and
understand that their predictions will have a
higher level of uncertainty.
General rules for forecasting inorganic
growth for product/feature
launches include the following:
• Examine historical traffic changes
from past launches of similar or analogous
features.
• For country- or market-specific
launches, consider past user behavior
in that market.
• Consider the level of publicity and
promotion around the launch.
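As a rough illustration of these rules of thumb, the sketch below turns traffic uplifts from analogous past launches into a low/expected/high range. The function name, the use of the median, the `publicity_factor` adjustment, and all numbers are assumptions made for this example, not a method described in this article.

```python
from statistics import median

def launch_estimate(baseline_qps, past_uplifts, publicity_factor=1.0):
    """Estimate post-launch peak QPS from uplift ratios seen in
    analogous past launches (post-launch peak / pre-launch peak).

    `publicity_factor` is a judgment call: above 1.0 if this launch
    will be promoted more heavily than the past ones, below 1.0 if less.
    """
    if not past_uplifts:
        raise ValueError("need at least one analogous launch")
    return {
        "low": baseline_qps * min(past_uplifts) * publicity_factor,
        "expected": baseline_qps * median(past_uplifts) * publicity_factor,
        "high": baseline_qps * max(past_uplifts) * publicity_factor,
    }

# Hypothetical inputs: 2,000 QPS today, three similar launches in the
# same market, and a somewhat larger marketing push this time.
estimate = launch_estimate(2000, [1.3, 1.5, 2.1], publicity_factor=1.2)
for name, qps in estimate.items():
    print(f"{name}: {qps:.0f} QPS")
```

Reporting a range rather than a single number keeps the higher uncertainty of inorganic forecasts visible to whoever provisions capacity.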
Google Analytics
Lesson Learned
An interesting case study of large inorganic growth that was not caused by any
engineering change involves the initial launch of Google Analytics, a
service for gathering and analyzing traffic to any website. Google had acquired Urchin
Software Corporation for its Web-analytics product, which provided traffic collection and
analytics dashboards to paying customers. The inorganic traffic growth event occurred
when the product was made available for free under the Google brand, permitting
any website owner to sign up for it at no charge. Google correctly anticipated a flood
of new users, based on prior experience launching the Keyhole (later called Google
Earth) subscription-based product for free. Therefore, we carefully load tested and
provisioned the product for the expected increase in traffic.
Our prediction for core product usage held up reasonably well, but we
had forgotten to account for traffic to the signup page! The page where new users
signed up was backed by a single-threaded SQL database with limited transaction
capacity, which placed a strict and previously unknown limit on the number of signups per
second and resulted in a stream of public complaints from users about site slowness and
unavailability. We learned this lesson well, and our product launch checklist afterward
contained the question, “Do new users have to sign up for your service, and if so, have you
estimated and tested the load on your signup page?”
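The checklist question can be turned into a back-of-the-envelope calculation. The sketch below assumes that a single-threaded database processes one transaction at a time, so its throughput is bounded by the reciprocal of the transaction time. All figures are invented for illustration; the article does not give the actual numbers from the Google Analytics launch.

```python
def max_signups_per_second(txn_seconds, txns_per_signup=1, workers=1):
    """Upper bound on signup throughput when each worker handles one
    transaction at a time (workers=1 models a single-threaded database)."""
    return workers / (txn_seconds * txns_per_signup)

# Hypothetical figures: 20 ms per transaction, two transactions per signup.
capacity = max_signups_per_second(txn_seconds=0.02, txns_per_signup=2)

# Hypothetical launch-day forecast for peak signups per second.
expected_peak = 120

print(f"signup capacity: {capacity:.0f}/s, expected peak: {expected_peak}/s")
if expected_peak > capacity:
    shortfall = expected_peak / capacity
    print(f"signup path is under-provisioned by a factor of {shortfall:.1f}")
```

A load test of the signup path would confirm or correct the assumed transaction time before launch.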