are about the contents of the book,
not the quality of the paperback’s
binding. Don’t you want to have
shared reviews for the e-book, paperback, and hardback editions? Typically, this is handled with yet another
unique identifier used to represent
all the different versions and formats. Similarly, online products often
share reviews even when
the color and unique identifier differ.
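One way to picture the shared identifier is a small lookup from each edition's ID to a common ID under which reviews are filed. The sketch below is only an illustration; the edition IDs, the "work" ID, and the function names are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical identifiers: each edition has its own ID, and all editions
# of the same book point at one shared "work" ID.
EDITION_TO_WORK = {
    "ebook-001": "work-42",
    "paperback-001": "work-42",
    "hardback-001": "work-42",
}

# Reviews are stored against the shared work ID, never the edition ID.
reviews_by_work = defaultdict(list)

def add_review(edition_id, text):
    """File the review under the shared work ID for this edition."""
    reviews_by_work[EDITION_TO_WORK[edition_id]].append(text)

def reviews_for(edition_id):
    """Return the reviews written against any edition of the same work."""
    return list(reviews_by_work[EDITION_TO_WORK[edition_id]])

add_review("paperback-001", "Gripping from start to finish.")
add_review("ebook-001", "Great plot.")
print(reviews_for("hardback-001"))
```

A shopper looking at the hardback sees both reviews, even though neither was written against the hardback's own identifier.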
Products, SKUs, offers, inventory, and
shippability. Online retail is an ocean
of unique IDs, all weaving across different systems, concepts, and cooperating companies. Merchants will
describe their perspective of goods for
sale as their SKUs. Matching and correlating these goods into products from
the perspective of the e-commerce site
is a major endeavor in data science and
machine learning. When done, the correlation is kept to facilitate working
across the merchant and e-commerce
site. Of course, the merchant is free to
label a completely different product
with the same SKU tomorrow; the e-commerce site must adapt.
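The kept correlation can be sketched as a mapping from a merchant's SKU to the site's product ID that tolerates rebinding. This is a minimal sketch under stated assumptions: the class, its method names, and the sample IDs are hypothetical, and the hard part (deciding which product a SKU matches) is left out entirely.

```python
class SkuCorrelation:
    """Remembers which site product ID each (merchant, SKU) pair refers to."""

    def __init__(self):
        self._product_for = {}  # (merchant_id, sku) -> product_id

    def bind(self, merchant_id, sku, product_id):
        """Record the correlation; return the product ID it replaced, if any."""
        key = (merchant_id, sku)
        previous = self._product_for.get(key)
        self._product_for[key] = product_id
        return previous

    def product_for(self, merchant_id, sku):
        """Look up the current product ID, or None if the SKU is unknown."""
        return self._product_for.get((merchant_id, sku))

correlation = SkuCorrelation()
correlation.bind("merchant-7", "SKU-123", "prod-red-mug")

# Tomorrow the merchant reuses the SKU for a different product;
# the site adapts by rebinding and learns what the SKU used to mean.
replaced = correlation.bind("merchant-7", "SKU-123", "prod-blue-teapot")
print(replaced, correlation.product_for("merchant-7", "SKU-123"))
```

Keying on the merchant as well as the SKU reflects that a SKU is only meaningful from one merchant's perspective.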
The identifiers for products will
reference the product catalog. The
contents of the product catalog will
evolve and be cached for efficient scalable reads. A read of the cache
may race with updates to the cache,
and later reads may return earlier versions of the product description. It
doesn’t matter because either version
is OK. The product catalog does not
need transactional consistency.
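To make the race concrete, here is a small sketch of two cache replicas where an update has reached only one of them. The replica class, the version numbers, and the product ID are assumptions for illustration, not a description of any particular catalog system.

```python
class CatalogReplica:
    """One cached copy of the product catalog."""

    def __init__(self):
        self.entries = {}  # product_id -> (version, description)

    def apply(self, product_id, version, description):
        """Install an update unless the replica already holds a newer one."""
        current = self.entries.get(product_id)
        if current is None or version > current[0]:
            self.entries[product_id] = (version, description)

    def read(self, product_id):
        """Return whatever version this replica currently holds."""
        return self.entries.get(product_id)

replica_a, replica_b = CatalogReplica(), CatalogReplica()
for replica in (replica_a, replica_b):
    replica.apply("prod-1", 1, "Blue mug")

# The update has reached replica A but not yet replica B.
replica_a.apply("prod-1", 2, "Blue mug, dishwasher safe")

first_read = replica_a.read("prod-1")   # sees version 2
later_read = replica_b.read("prod-1")   # a later read sees version 1
print(first_read, later_read)
```

The later read returns the earlier version, and the shopper still gets a usable description either way.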
Next, an offer to buy from a merchant is presented. Do you want a new
or used product? What condition is
it in, and what’s the reputation of the
seller? These offers are correlated to the
product, the shopping cart, the inventory for the specific offer, the price, the
shipping commitment, and the details
of how it will be shipped. Of course, this
needs to be tied to the payment.
Each of these relationships across
internal and external systems is knit
together using various related identities. Figure 2 shows a very small subset
of these interactions and how identifiers knit them together. Oh, yeah;
the e-commerce retailer hopes the
merchant has not recycled the SKU
when an order is placed. Attaching the
product description to the SKU usually
Using Identity to Search
Let’s consider Web search as we have
all seen it in Yahoo!, Google, and Bing.
Not surprisingly, searches are accomplished by assigning unique IDs to
each of the documents on the Web.
Document IDs, URLs, and search
terms. As these huge Web crawlers
traverse the URLs they find to locate
documents, they remember the URL
for each document. These URLs form
unique IDs. It’s common to bind the
URL to another unique document ID
As the document is crawled, the
word sequences are extracted for
indexing. These word sequences
(known as N-grams) correspond to
the search terms entered into the Web
N-grams are sharded into a large
number of partitions. As multiple
search terms enter a search, the shards
that may hold those terms are queried.
This returns sets of document IDs from
many shards. By comparing the results
and keeping the document IDs in common
across the search terms, the search produces a resulting
collection of document IDs.
While this is vastly and grossly
oversimplified, the main point is that
search is all about identities.
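The flow just described can be sketched in a few lines. This is a toy version under loud assumptions: single words stand in for N-grams, a CRC of the term picks the shard, document IDs are small integers bound to URLs, and every name and URL here is hypothetical.

```python
import zlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # each shard: term -> set of doc IDs
doc_id_for_url = {}                           # URL -> compact document ID

def shard_for(term):
    """Pick the partition that holds this term (deterministic hash)."""
    return shards[zlib.crc32(term.encode("utf-8")) % NUM_SHARDS]

def index_document(url, text):
    """Bind the URL to a document ID and record its terms in the shards."""
    doc_id = doc_id_for_url.setdefault(url, len(doc_id_for_url) + 1)
    for term in text.lower().split():
        shard_for(term).setdefault(term, set()).add(doc_id)
    return doc_id

def search(query):
    """Query the shard for each term and keep the document IDs in common."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [shard_for(term).get(term, set()) for term in terms]
    return set.intersection(*postings)

index_document("http://example.com/a", "unique ids everywhere")
index_document("http://example.com/b", "search finds documents")
index_document("http://example.com/c", "search uses unique ids")
print(search("unique ids"), search("search ids"))
```

Nothing but document IDs flows back from the shards; the documents themselves are never touched during the query.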
Searching an object-relational app.
Object-relational systems typically
have application objects layered on
top of underlying relational systems.
Some object-relational systems offer
search features that find the identities
of objects based on their contents and
the N-grams within them. This mechanism depends on the object identities captured by the search system and
correlated to the objects. While these
identities may not be explicitly understood by the underlying SQL database,
they are understood by the object-relational system and the search engine
layered on top.
Search means finding identities.
Search today typically means a system
that finds identities of documents,
objects, or other things. It is the correlation of the N-grams extracted
from these things to the identities
that provides search results. Which
document identities have the closest
match to the set of N-grams submitted with the search?
Naturally, the sorted N-grams are
not strongly consistent with the underlying things. There may be things
with identities that have not yet been
indexed. Sometimes, there are indices
that contain the identities for things
that no longer exist. While the things
and the indices may slide around, the
identities usually stay intact.
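One consequence is that a search system may need to tolerate index entries that point at nothing. The sketch below shows one possible way to cope, which is an assumption of this example rather than something the text prescribes: drop identities that no longer resolve when results are returned. The data and names are hypothetical.

```python
# Document ID -> content. Document 3 has been deleted; document 4 is new
# and has not been indexed yet.
documents = {
    1: "identity matters",
    2: "search finds identity",
    4: "identity again",
}

# Term -> document IDs, as of the last indexing pass.
index = {
    "identity": {1, 2, 3},
    "search": {2},
}

def lookup(term):
    """Return indexed document IDs for the term that still exist."""
    candidates = index.get(term, set())
    return {doc_id for doc_id in candidates if doc_id in documents}

print(lookup("identity"))
```

Document 3 is filtered out because it no longer exists, and document 4 is missed until the index catches up; the identities of documents 1 and 2 are unaffected by either lag.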
Figure 2. E-commerce: A tangled Web of identifiers. (Diagram labels include the prod-ID SKU mapping and the product catalog cache.)