third-party merchant selling through
the large e-commerce site. There’s no
single identity before matching.
Normalizing cleans up the various
inputs to produce a consistent representation. If the color is Kelly green, forest green, olive, or chartreuse, should it
be normalized to green? Normalization
makes it easier to match various inputs
to each other. It also loses some of the
fidelity of the original input.
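A minimal sketch of this trade-off, assuming a hand-built table that maps merchant-supplied color names to a base color. The table, its contents, and the function name are hypothetical; whether chartreuse really belongs under green is exactly the kind of judgment call raised above. The function returns the original input alongside the normalized value so that the fidelity lost by normalizing can still be recovered.

```python
# Hypothetical mapping from merchant-supplied color names to a base color.
# The entries are illustrative assumptions, not a real catalog's rules.
BASE_COLORS = {
    "kelly green": "green",
    "forest green": "green",
    "olive": "green",
    "chartreuse": "green",
}


def normalize_color(raw):
    """Return (normalized, original) so the original input is not lost."""
    # Lowercase and collapse whitespace before looking up the base color.
    key = " ".join(raw.strip().lower().split())
    # Unknown colors pass through in their cleaned-up form.
    return BASE_COLORS.get(key, key), raw
```

Matching would then compare the first element of each pair, while the second remains available for display.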
Matching attempts to find stuff that
is the same. Is this product for sale
from Merchant A the same as another
product for sale from Merchant B?
Each merchant has their own SKU as
a personal unique identifier. How can
they be correlated?
Slippery and sliding identifiers.
Another challenge is that the merchants’
SKUs are assigned and bound by the
merchants. There’s nothing to stop
them from changing SKU 12345 from a
pair of ruby slippers to a can of chocolate sauce. When your partner business
uses identifiers in a non-immutable way,
you need to be on your toes. I’ve heard
tales of small merchants with 40 bins of
stuff in their basement. The contents of
SKU #23 correspond to whatever product is kept in bin #23 at the time.
UPCs: The same but … maybe different. Consider large retailers that consolidate many sellers’ goods through
the large retailer’s platform. It is helpful if the merchants have the UPCs in
the description of their item(s). UPCs
make it much easier to match items
from different merchants. Each of
these 12-digit identifiers is for a
particular manufactured product. The UPC
works along with the EAN-13 (European
Article Number 13) code, which is a
bar code supporting scanners mostly
for retail environments.
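Before using a merchant-supplied UPC as a matching key, it may be worth checking that it is well formed. The sketch below uses the standard check-digit rule shared by 12-digit UPC-A and EAN-13 codes (weights alternate 3 and 1, starting from the rightmost digit before the check digit) and writes a UPC-A as an EAN-13 by prefixing a zero. The function names are hypothetical, and a valid check digit says only that the number is plausible, not that the merchant attached it to the right product.

```python
def gtin_check_digit(body):
    """Check digit for a UPC-A (11-digit body) or EAN-13 (12-digit body)."""
    total = 0
    # Weights alternate 3, 1, 3, ... starting from the rightmost body digit.
    for i, ch in enumerate(reversed(body)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def is_valid_upc(upc):
    """True if upc is 12 ASCII digits whose last digit is the correct check digit."""
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        return False
    return gtin_check_digit(upc[:11]) == int(upc[11])


def upc_to_ean13(upc):
    """Write a UPC-A code in EAN-13 form by prefixing a zero."""
    if not is_valid_upc(upc):
        raise ValueError("not a valid UPC-A: %r" % (upc,))
    return "0" + upc
```

Normalizing every incoming code to the 13-digit form gives listings that carry a UPC and listings that carry an EAN-13 a common key to match on.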
UPCs are mostly correct. Achieving
consistency and equivalence of products with the same UPC is hard for both
manufacturing and retail. Not everything has a UPC. Handcrafted items,
for example, may not have UPCs. For a
number of years, shoes were notorious
for not having UPCs.
Books: ISBNs, paperback, used, and
digital. What about books? The International Standard Book Number
(ISBN) is a 13-digit (formerly 10-digit)
number that uniquely identifies a particular version and format of a book.
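Because listings may carry either the older 10-digit form or the 13-digit form, a matcher might normalize everything to 13 digits first. This sketch applies the standard conversion: prefix 978, drop the old check digit, and recompute the check digit with weights alternating 1 and 3 from the left. The function name is hypothetical, and the sketch assumes the input is a syntactically valid ISBN-10; it does not verify the old check digit.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens or spaces allowed) to a 13-digit ISBN."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected 10 characters, got %r" % (isbn10,))
    # Drop the ISBN-10 check digit (which may be 'X') and add the 978 prefix.
    body = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(body))
    return body + str((10 - total % 10) % 10)
```

The paperback, hardcover, and digital editions of one title still have different ISBNs after this step, so the conversion reconciles formats of the identifier, not formats of the book.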
What about reviews? Most reviews
different connected and disconnected
identities weave through the complex
Session-state and shopping carts.
Each shopper gets their own shopping
cart. This can be associated with an online account or with the Web session.
Shoppers do not get multiple shopping
carts during a single Web session. Furthermore, no one expects or wants the
shopping cart to share state or consistent updates with other shopping carts.
The uniqueness of the shopping
cart is provided by the shopping cart
ID. There is some logic in the system to
bind the session, either via user login
or online session state, to a shopping
cart ID. Based on that unique ID, the
shopping cart contents are located.
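The binding described here can be sketched as two lookups: one from the account or Web session to a shopping cart ID, and one from that ID to the cart contents. Everything in this example (the class name, the in-memory dictionaries, the use of a random UUID as the cart ID) is an assumption made for illustration, not a description of any real site.

```python
import uuid


class CartDirectory:
    """Hypothetical in-memory binding of accounts or sessions to cart IDs."""

    def __init__(self):
        self._cart_id_by_owner = {}  # ("account" | "session", id) -> cart ID
        self._carts = {}             # cart ID -> {product ID: quantity}

    def cart_id_for(self, session_id, account_id=None):
        # Bind to the account when the user is logged in, else to the session.
        if account_id is not None:
            owner = ("account", account_id)
        else:
            owner = ("session", session_id)
        if owner not in self._cart_id_by_owner:
            cart_id = uuid.uuid4().hex
            self._cart_id_by_owner[owner] = cart_id
            self._carts[cart_id] = {}
        return self._cart_id_by_owner[owner]

    def add_item(self, cart_id, product_id, quantity=1):
        cart = self._carts[cart_id]
        cart[product_id] = cart.get(product_id, 0) + quantity

    def contents(self, cart_id):
        # Return a copy; each cart's state is independent of every other cart.
        return dict(self._carts[cart_id])
```

Since no operation touches more than one cart, the carts could be spread across machines by cart ID without any coordination between them.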
The scalable key-value store. One common
pattern in scalable solutions is the
scalable key-value store. Take, for example,
an e-commerce retail product
catalog. The retailer has a whole bunch
of products, each with a product identifier.
The product description cache
is sharded by the product ID. This supports
scalable description data. Replicated
shards support scalable read
traffic. To add more product descriptions,
add more shards. To support
more read traffic, add more replicas
of the shards. See the scalable catalog
of product descriptions indexed by the
product ID in Figure 1. There is no requirement
that the product catalog can
update different products atomically.
In fact, the product catalog cannot update
all the cached entries for a single
Identifying cached jittery versions. Updates
to product descriptions distribute
new versions to replicas over time.
Hence, reads are jittery, and later reads
may show earlier values. Product ID is the
immutable glue that makes this work.
Even if the read of the cache returns an
old cached value, it is associated with the
desired product ID and meets the business
needs. In product catalogs and for
many other uses, old values are fine.
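A toy model of such a cache, sharded by product ID with several replicas per shard, shows where the jitter comes from. The class, the hash-based shard choice, and the explicit version numbers are all assumptions made for this sketch. Updates are delivered to one replica at a time, so a read that lands on a replica that has not yet seen the newest version returns an older value, still filed under the same product ID.

```python
import hashlib
import random


class ProductCache:
    """Hypothetical product description cache: sharded, with replicas."""

    def __init__(self, num_shards, replicas_per_shard):
        # shards[s][r] maps product ID -> (version, description)
        self.shards = [
            [{} for _ in range(replicas_per_shard)] for _ in range(num_shards)
        ]

    def _shard_for(self, product_id):
        # The immutable product ID alone decides which shard holds the entry.
        digest = hashlib.sha256(product_id.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def apply_update(self, product_id, version, description, replica_index):
        """Deliver one update to one replica; replicas catch up at different times."""
        replica = self._shard_for(product_id)[replica_index]
        current = replica.get(product_id)
        if current is None or version > current[0]:
            replica[product_id] = (version, description)

    def read(self, product_id, rng=random):
        """Read from a randomly chosen replica; the result may be an older version."""
        return rng.choice(self._shard_for(product_id)).get(product_id)
```

If version 2 of a description has reached replica 0 but not replica 1, successive reads may return version 2 and then version 1. For a product catalog that is usually acceptable, because either answer describes the product that was asked for.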
Matching and deriving descriptions.
In most large e-commerce sites, product
descriptions come from data submitted
by manufacturers, merchants,
and other sources. To correlate these,
it is necessary to normalize inputs,
match descriptions from different
sources, and then combine them to get
the best information available. Inputs
arrive with identifiers such as model
number, UPC, and SKU, defined by the
Figure 1. A scalable catalog of product descriptions. (Diagram labels: "Incoming Read Requests"; "Updates to Product".)