These events can be queried, joined,
and connected based on their attributes. You can create new events by extracting attributes from a single event
or from a join across multiple events.
An event is an immutable set of attributes with an identity.
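To make the idea concrete, here is a minimal sketch of an event as an immutable set of attributes with an identity, along with a join on a shared attribute and a derived event. The names `Event`, `join`, and `derive`, and the identifier strings used in the usage note, are hypothetical and chosen only for illustration; they do not come from any particular IoT platform.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class Event:
    """An immutable set of attributes with an identity (illustrative only)."""
    event_id: str                      # hypothetical identifier format
    attributes: Mapping[str, Any]

    def __post_init__(self):
        # Keep a private copy behind a read-only view so the attributes
        # cannot be changed after the event is created.
        object.__setattr__(
            self, "attributes", MappingProxyType(dict(self.attributes))
        )


def join(left, right, on):
    """Pair up events whose attribute `on` holds the same value."""
    index = {}
    for ev in right:
        if on in ev.attributes:
            index.setdefault(ev.attributes[on], []).append(ev)
    return [
        (l, r)
        for l in left
        if on in l.attributes
        for r in index.get(l.attributes[on], [])
    ]


def derive(new_id, *sources, keep):
    """Create a new event by extracting the `keep` attributes from one or
    more source events. Later sources win on conflicting names (assumption)."""
    extracted = {}
    for ev in sources:
        for name in keep:
            if name in ev.attributes:
                extracted[name] = ev.attributes[name]
    return Event(new_id, extracted)
```

For example, joining `Event("fridge-7:evt-1", {"device": "fridge-7", "door": "open"})` with `Event("fridge-7:evt-2", {"device": "fridge-7", "temp_c": 9})` on `"device"` yields one pair, and `derive` can turn that pair into a new event with its own identity. The source events are never modified.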
The Quest for Identity
Some of today’s most challenging
problems come from the quest for
identity. Product matching, data science, fraud detection, homeland security, and more all struggle with figuring
out when one thing is the same as another thing so identity can be assigned.
Product matching is finding identity.
As discussed, providing an integrated
marketplace for stuff sold by wildly
disparate merchants is a big challenge.
The core of this challenge is matching
different SKUs from different merchants with different descriptions to
find the same product identity.
This is often made easier with UPC
or ISBN codes that actually do match.
This leaves the product-matching system with the easier job of comparing
attributes to verify identity. Product
matching does not always get the boost
of shared unique identifiers, however. Without them, the
problem becomes a task of data science.
Data science is finding identity. In
data science, there are many objects,
each with many attributes. Each object
has a unique identity.
˲ Attaching new attributes: By comparing many objects and their attributes, the data-science algorithm associates new attributes with existing
objects.
˲ Merging object identities: By examining the attributes bound to sets
of objects, the data-science algorithm
can realize two objects are one. That, in
turn, unites their attributes.
Repeating the attribute/merge pattern
causes a new understanding of identity.
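The attach/merge pattern described above can be sketched with a small union-find structure. This is only one plausible way to model it, not a description of any specific data-science system; the class name `IdentityStore`, its methods, and the conflict rule used when uniting attributes are all assumptions made for illustration.

```python
class IdentityStore:
    """Objects with identities and attributes; identities can be merged."""

    def __init__(self):
        self.parent = {}       # object id -> parent id (union-find forest)
        self.attributes = {}   # representative id -> attribute dict

    def add(self, obj_id, **attrs):
        self.parent[obj_id] = obj_id
        self.attributes[obj_id] = dict(attrs)

    def find(self, obj_id):
        # Follow parent links to the representative, halving the path as we go.
        while self.parent[obj_id] != obj_id:
            self.parent[obj_id] = self.parent[self.parent[obj_id]]
            obj_id = self.parent[obj_id]
        return obj_id

    def attach(self, obj_id, **attrs):
        """Attach new attributes to an existing object."""
        self.attributes[self.find(obj_id)].update(attrs)

    def merge(self, a, b):
        """Record that two objects are one, which unites their attributes."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        self.parent[rb] = ra
        united = self.attributes.pop(rb)
        # Assumption: on conflicting names, the surviving identity's value wins.
        united.update(self.attributes[ra])
        self.attributes[ra] = united
        return ra

    def attrs(self, obj_id):
        return self.attributes[self.find(obj_id)]
```

Calling `attach` and `merge` repeatedly is the "attribute/merge" loop: each merge makes more attributes visible on a single identity, which can in turn justify further merges.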
Fraud detection is finding identity.
Banks issuing credit cards invest heavily in fraud detection, as do retailers
and other institutions that accept
credit cards. Very large companies that
accept credit cards have a strong incentive to detect fraud since their banks
will charge them lower fees if their rate
of fraud is noticeably lower. Fraud detection is big business.
Fraud detection works by looking
at the transactions as objects with as-
Using Identity to Learn
Data science is based on identities, objects, and attributes. It has been used
to learn surprising new things. Identities are key to its work.
Data science and observations. Data
science revolves around identities. The
identities have attributes. It is the manipulation of these identities and attributes and comparison with other identities that share those attributes that leads
to new and deeper understanding.
Identities, objects, and attributes.
When observations are made, they are
stored as objects and given identities.
These objects have attributes. Analyzing the objects may lead to additional
attributes being added to them. Continued pattern matching on attributes
over large collections of objects can
lead to new attributes slapped onto the
sides of the objects.
Sometimes, looking at patterns on
the objects and their attributes leads
to new objects showing the connections between existing objects. This
will result in new identities for the new
objects. So, the pattern of attributes
becomes an identity in its own right,
which may lead to new attributes.
Attributes on identities—rinse and
repeat. It is the continuous cycle of
looking at lots and lots of attributes
on the objects and their identities
that leads to more attributes. These
new attributes are either attached to
existing objects or used to generate
new objects with their own independent identities.
Data science uses identities to achieve
serendipitous learning.
Big Data Is Lots of Identities
Big-data systems such as MapReduce,2 Apache Hadoop (http://hadoop.apache.org), and Apache Spark (https://spark.apache.org) take immutable inputs and apply functional transformations to produce immutable outputs.
Because of the immutable nature of
the inputs and outputs, it is easy to reason about fault tolerance when pieces
of the work fail and are restarted.
Each of these big-data systems leverages the identities of data items to connect work and storage spread across
many servers.
MapReduce and Apache Hadoop.
These big-data systems look at the datasets they process as a bunch of key/
value pairs. Consider MapReduce and
Hadoop:
˲ The map function of MapReduce
takes a series of key/value pairs and
makes a set of output key/value pairs.
These output pairs may be the same as
or different from the map function's input pairs.
˲ The reduce function is called once
for each unique key and can iterate
through the values associated with that
key. There may be multiple values for a
single key.
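The two functions above can be illustrated with a single-process, in-memory stand-in. This is a sketch of the programming model only, not the Hadoop or MapReduce API; `map_reduce`, `count_words_map`, and `count_words_reduce` are hypothetical names, and the grouping step here takes the place of the distributed shuffle.

```python
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map_fn over (key, value) pairs, group the output by key,
    then call reduce_fn once per unique key."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Keys are visited in sorted order; each reduce call sees every value
    # that was emitted for its key.
    return {
        key: reduce_fn(key, iter(values))
        for key, values in sorted(groups.items())
    }


def count_words_map(doc_id, text):
    # The output key (a word) differs from the input key (a document id).
    for word in text.split():
        yield word.lower(), 1


def count_words_reduce(word, counts):
    # Iterate through the values associated with this one key.
    return sum(counts)
```

Running `map_reduce([("d1", "the key is the identity"), ("d2", "The identity")], count_words_map, count_words_reduce)` groups three values under the key `"the"` and two under `"identity"`, so the reduce function is called once for each of the four distinct words.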
Queries, joins, and more with keys.
Queries and joins in these big-data
environments leverage the keys in the
key/value pairs. The pairs emitted by the map function are sorted by key and spread across
shards. The queries and joins are then applied by the reduce
function, which handles all key/value pairs
with the same key (or identity).
Because the map function can reshape an input key/value pair into a
differently shaped key/value pair, MapReduce and Hadoop can query, sort, and join on arbitrary fields in the data. Putting the join
fields into the key and sorting allows
for huge flexibility in function.
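As a rough illustration of putting the join field into the key, the sketch below re-keys two datasets by their join field and then pairs up the records that land under the same key. It runs in one process and only mimics the shape of a reduce-side join; the function name `reduce_side_join` and the record layout in the usage note are assumptions for the example.

```python
from collections import defaultdict


def reduce_side_join(left, right, left_field, right_field):
    """Join two lists of dict records on arbitrary fields by making
    the join field the key."""
    # "Map" step: re-key every record by its join field, remembering
    # which side it came from.
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[record[left_field]][0].append(record)
    for record in right:
        groups[record[right_field]][1].append(record)

    # "Reduce" step: for each key, in sorted order, pair every left
    # record with every right record that shares that key.
    joined = []
    for key in sorted(groups):
        lefts, rights = groups[key]
        for l in lefts:
            for r in rights:
                joined.append((key, l, r))
    return joined
```

With customers keyed by `"id"` and orders keyed by `"cust"`, calling `reduce_side_join(customers, orders, "id", "cust")` returns one tuple per matching customer/order pair. Keys that appear on only one side produce no output, which makes this an inner join.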
Big data means handling lots of keys.
Big-data systems require handling lots
of keys. The keys can be spread in a
scalable fashion across very large clusters of servers to accomplish massive
scale. The identity provided by the keys
hooks it all together.
The “Internet of Identities.” IoT, or
the Internet of Things, is the new trend
wherein massive numbers of events
from disparate devices are processed
at high rates.
Internet of Things: Identifying the
thing. In IoT, an extremely large number of devices that may barely qualify as
computers generate massive numbers
of events to be processed. Each of these
devices will have an identifier in some
form. As a device generates events, each
event will have a more detailed
identifier that usually specifies its device of origin.
Each of these events will, in turn,
have a bunch of attributes that are specific to the device. Events originating
from your refrigerator will have different attributes than events originating
from your car’s transmission or from a
security camera at a large stadium.
Querying, joining, and connecting
things. Similar to what is seen in big
data, each of these IoT events has an
identity and a bunch of attributes.