A LOT OF data is moved from system to system in
an important and increasing part of the computing
landscape. This is traditionally known as ETL
(extract, transform, and load). While many systems
are extremely good at this process, the source for the
extraction and the destination for the load frequently
have different representations for their data. It is
common for this transformation to squeeze, truncate,
or pad the data to make it fit into the target. This is
really like using a shoehorn to fit into a shoe that is
too small. Sometimes it’s a needed step. Frequently
it’s a real pain!
Two major parts of ETL are the extraction and the
load. These processes are where the rubber meets the
participating data stores.
Extraction pulls data out of a source system. This
may be relational data kept in a database. If so, it may
be converted to an object relational format where each
object transforms the join of multiple relational
rows into a cohesive thing. Data is fre-
quently organized as messages when
it is sucked out. It’s also common for
data to be extracted from key-value
stores where it is kept in a semi-struc-
Load happens when the data is
placed into the target system. The target will have its own metadata describing the shape and form of the data in
its belly. If the target is an analytics system, then its data will likely be loaded
into a relational form.
While it may be counterintuitive, it
is frequently useful to take relational
data out of a system as objects; convert,
massage, and shoehorn the data from
one object representation to another;
and load it into the target system in relational form.
The new, shoehorned data is then
used for analytic queries.
To make this work, you need metadata1—for both the source being extracted and the target being loaded.
The metadata for the extracted
source is descriptive. The data exists.
The metadata describes its shape,
form, and meaning. It is always the
case that extracted data is copied out
and the metadata describes what its
shape was and what its shape is as
The metadata for the loaded data is
prescriptive. The ETL system makes it
fit the output metadata’s shape and
form exactly as it is prescribed to do.
Only when the system knows what
the results should look like can it do
ETL systems always need to know
the current shape of the input, as well
as the shape the data should become.
Frequently the output data to be
loaded into the target system is relational in its shape. Many analytics
systems expect relational data to analyze and report. Relational is great
as the format for both planned and
ad hoc queries.
What If the Shoe Doesn’t Fit?
When incoming data and its descriptive metadata don’t match the outgo-
Article development led by
Data doesn’t always fit nicely into a new home.
BY PAT HELLAND