ing data shape specified by the prescriptive data, you must do something
When a shoe is too large, people will
shove padding into it. Similarly, when
the incoming data doesn’t have all the
information required for the outgoing
shape and form, you add stuff. This
may be a default value or a null value.
If a shoe is too small and the foot is
too large, sometimes you use a shoehorn to force the foot into the shoe,
comfort be damned. This is a real pain!
Similarly, when the incoming data
has too much information, the system
needs to discard data that doesn’t fit
the outgoing metadata.
The process of discarding or padding data is very common.
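As a minimal sketch of padding and discarding, assume the prescriptive metadata is nothing more than a dictionary of required field names and their defaults. The names `TARGET_SCHEMA` and `shoehorn`, and the fields themselves, are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from typing import Any

# Hypothetical prescriptive metadata: every field the target requires,
# mapped to the default used when the input lacks it (None acts as null).
TARGET_SCHEMA: dict[str, Any] = {
    "customer_id": None,
    "country": "unknown",
    "total": 0.0,
}


def shoehorn(record: dict[str, Any],
             schema: dict[str, Any] = TARGET_SCHEMA) -> tuple[dict[str, Any], dict[str, Any]]:
    """Force one input record into the target shape.

    Returns the shaped record plus whatever had to be discarded.
    """
    # Pad: every target field is present, falling back to its default.
    shaped = {field: record.get(field, default) for field, default in schema.items()}
    # Discard: anything the target metadata does not mention is dropped.
    discarded = {key: value for key, value in record.items() if key not in schema}
    return shaped, discarded


shaped, discarded = shoehorn({"customer_id": 42, "total": 19.5, "loyalty_tier": "gold"})
```

Here `country` is padded with its default and `loyalty_tier` is discarded. Returning the discarded fields is a design choice in this sketch, not a requirement; it only makes the loss visible to the caller.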
All too often, the descriptive metadata for the input is not a perfect match to the prescriptive metadata for the de-
Sometimes data is extracted from
many sources with either the same or
different input metadata describing the
stuff being loaded. It’s essential that the
data from the various sources be modified to fit into the target metadata.
Note that normalizing the data to
relational form may be difficult with
different input data from different systems. The needed information may be
missing from some input source.
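One way to picture several sources feeding a single target is a per-source mapping from source field names to target field names. The sketch below assumes two made-up sources, `web_store` and `legacy_pos`; all names and fields are hypothetical. The second source has no currency information, so that field comes out null.

```python
from __future__ import annotations

from typing import Any

# Hypothetical target fields prescribed by the destination.
TARGET_FIELDS = ("order_id", "amount", "currency")

# Hypothetical descriptive metadata per source: source field -> target field.
SOURCE_MAPPINGS: dict[str, dict[str, str]] = {
    "web_store": {"id": "order_id", "total": "amount", "cur": "currency"},
    "legacy_pos": {"ticket_no": "order_id", "price": "amount"},  # no currency at all
}


def to_target(source: str, row: dict[str, Any]) -> dict[str, Any]:
    """Rename a source row's fields to target names, then pad missing ones."""
    mapping = SOURCE_MAPPINGS[source]
    # Keep only the fields this source's mapping knows about; drop the rest.
    renamed = {mapping[key]: value for key, value in row.items() if key in mapping}
    # Fields the source cannot supply become None (null).
    return {field: renamed.get(field) for field in TARGET_FIELDS}


rows = [
    to_target("web_store", {"id": 1, "total": 9.99, "cur": "USD", "session": "abc"}),
    to_target("legacy_pos", {"ticket_no": 7, "price": 4.5}),
]
```

Both rows end up with the same three fields, which is the point: the web store's `session` is dropped, and the legacy row carries a null `currency` because that source never recorded one.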
ETL takes disparate sources and destinations and moves data from one to the other. Frequently there is only a partially useful mapping of the metadata. Sometimes data must be discarded to traverse the path from source to destination. Other times the source
data may need to be augmented with
null values or default values. It’s also
possible that the mapping is complex and loses much of the meaning kept in the original as the data is reshaped and re-formed in translation.
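A mapping can lose meaning even when nothing is padded or dropped outright. As a small illustration, assume a source with fine-grained order statuses and a target that only allows a coarse code; the status values and the `map_status` helper are hypothetical.

```python
# Hypothetical mapping: several distinct source statuses collapse
# into one coarse target code, so the mapping cannot be reversed.
STATUS_TO_TARGET = {
    "backordered": "OPEN",
    "awaiting_payment": "OPEN",
    "shipped": "CLOSED",
    "refunded": "CLOSED",
}


def map_status(source_status: str) -> str:
    """Translate a source status to the target code, defaulting when unknown."""
    return STATUS_TO_TARGET.get(source_status, "UNKNOWN")


targets = [map_status(s) for s in ("backordered", "awaiting_payment", "refunded", "on_hold")]
```

Once loaded, a `CLOSED` row no longer says whether the order shipped or was refunded. The data fits the target, but part of what the source knew is gone.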
Metadata for the loaded source data
is descriptive—it describes the data.
Metadata for the data loaded into the
target is prescriptive—it prescribes the
required target shape and form. The
challenge is that the output described by the source may be ill fitting to the input prescribed by the target.
It turns out the business value of ill-fitting data is extremely high. The process of taking the input data, discarding what doesn’t fit, adding default or null values for missing stuff, and generally shoehorning it to the prescribed shape is important. The prescribed shape is usually one that is amenable to analysis for deeper meaning.
It is the shoehorning that gives the
data the shape it needs to be understood consistently.
Pat Helland has been implementing transaction systems,
databases, application platforms, distributed systems,
fault-tolerant systems, and messaging systems since
1978. He currently works at Salesforce.
Copyright held by owner/author.
Publication rights licensed to ACM.