selves jury-rigging mechanisms to
capture descriptions of what the
data cleaner was trying to accomplish and what features were used.
Fairly often, cleaning was a process that occurred only after code
had been written and errors hit:
They would go back and look for
aberrations after a crash, or when
a model looked odd. As a result,
the analysts also wanted to capture the justifications for cutting
aberrant data points: How had they
detected the error? What model
had found the issue?
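One way to read this requirement is as a small provenance log kept alongside the cleaning code. The sketch below, in C#, is a hypothetical illustration (the types, threshold, and file layout are our own assumptions, not the interviewees' actual tooling): aberrant points are cut, and the reason and the detector are appended to a side file rather than silently discarded.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    record Measurement(DateTime Time, double Value);
    record DroppedPoint(Measurement Point, string Reason, string DetectedBy);

    static class CleaningLog
    {
        // Filter out implausible points, but append a justification for every cut
        // (what rule detected it, and which model or check flagged it) to a side file.
        public static List<Measurement> CutAberrant(
            IEnumerable<Measurement> raw, double maxPlausible,
            string detectedBy, string logPath)
        {
            var kept = new List<Measurement>();
            var dropped = new List<DroppedPoint>();
            foreach (var m in raw)
            {
                if (m.Value > maxPlausible)
                    dropped.Add(new DroppedPoint(
                        m, $"value {m.Value} exceeds plausible maximum {maxPlausible}", detectedBy));
                else
                    kept.Add(m);
            }
            File.AppendAllLines(logPath, dropped.Select(d =>
                $"{d.Point.Time:o}\t{d.Point.Value}\t{d.Reason}\t{d.DetectedBy}"));
            return kept;
        }
    }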
Most data tables carry the built-in assumption that data should be stored and edited in place. Interviewees stressed, however, that data storage is comparatively cheap: Rather than mutating data in place and losing its history, they would prefer to create additional clean versions of columns and new datasets.
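A minimal sketch of that non-destructive style, assuming a simple in-memory column store (the Table class and column names here are hypothetical): cleaning never overwrites the raw column; it adds a new named version alongside it.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Table
    {
        // Each named column is kept as-is once written; cleaning adds a new name
        // such as "reading_clean_v1" instead of editing "reading_raw" in place.
        public Dictionary<string, double[]> Columns { get; } = new();

        public void AddCleanedVersion(string source, string newName, Func<double, double> clean)
        {
            Columns[newName] = Columns[source].Select(clean).ToArray();
        }
    }

    // Example use: mark negative sensor readings as missing in a new column,
    // keeping the raw values intact.
    // var t = new Table();
    // t.Columns["reading_raw"] = rawValues;
    // t.AddCleanedVersion("reading_raw", "reading_clean_v1", v => v < 0 ? double.NaN : v);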
Write code. With an architecture
selected and the data in place,
the analyst begins to select their
analysis. In the examples we studied, the analyses were articulated
through code, written in C# and
Microsoft’s SCOPE; outside these
environments, analysts might work
in languages such as R, Python,
or PIG (a database-like language),
usually over Hadoop. High-level languages that make it easy for the compiler to support parallelism, such as DryadLINQ or Matlab’s parallel computing extensions, ultimately help users write cloud-based jobs.
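As a rough illustration of what such declarative parallelism looks like, the sketch below uses PLINQ as a stand-in for a cluster framework like DryadLINQ; the file name and record layout are assumptions. The query states what to compute, and the runtime decides how to partition the work.

    using System;
    using System.Linq;

    class TopUsers
    {
        static void Main()
        {
            // Hypothetical input: one tab-separated log record per line, user id first.
            var logLines = System.IO.File.ReadAllLines("sessions.tsv");

            var hitsPerUser = logLines
                .AsParallel()                         // runtime partitions the work
                .Select(line => line.Split('\t')[0])  // extract the user id
                .GroupBy(user => user)
                .Select(g => new { User = g.Key, Hits = g.Count() })
                .OrderByDescending(x => x.Hits)
                .Take(10);

            foreach (var row in hitsPerUser)
                Console.WriteLine($"{row.User}\t{row.Hits}");
        }
    }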
Users must design their code and
systems around the idea of separating their work into parallelizable
jobs. Algorithms need to be written
in new ways in order to do this,
and data must be stored differently. For example, some resources
might need to be duplicated, one
per node. In order to reduce costly
communication, an analyst might
store a copy of a reference lookup
table on each VM.
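A hedged sketch of that pattern, assuming each worker can see a local copy of the reference file (the paths and field layouts are hypothetical): every node loads its own dictionary once and joins only its partition of the data against it, so no lookup crosses the network.

    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class BroadcastJoin
    {
        // Runs once per VM: build the lookup table from the locally duplicated file.
        public static Dictionary<string, string> LoadLookup(string localPath) =>
            File.ReadLines(localPath)
                .Select(line => line.Split('\t'))
                .ToDictionary(f => f[0], f => f[1]);   // e.g., id -> descriptive name

        // Each node enriches only the records in its own partition.
        public static IEnumerable<string> EnrichPartition(
            IEnumerable<string> partition, Dictionary<string, string> lookup) =>
            partition.Select(line =>
            {
                var id = line.Split('\t')[0];
                var name = lookup.TryGetValue(id, out var n) ? n : "UNKNOWN";
                return line + "\t" + name;
            });
    }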