Spreadsheet Data Manipulation
Using Examples
Abstract
Millions of computer end users need to perform tasks over
large spreadsheet data, yet lack the programming knowledge to do such tasks automatically. We present a
programming by example methodology that allows end users to
automate such repetitive tasks. Our methodology involves
designing a domain-specific language and developing a
synthesis algorithm that can learn programs in that language from user-provided examples. We present instantia-tions of this methodology for particular domains of tasks:
(a) syntactic transformations of strings using restricted
forms of regular expressions, conditionals, and loops, (b)
semantic transformations of strings involving lookup in
relational tables, and (c) layout transformations on spreadsheet tables. We have implemented this technology as an
add-in for the Microsoft Excel Spreadsheet system and have
evaluated it successfully over several benchmarks picked
from various Excel help forums.
1. intRoDuCtion
The IT revolution over the past few decades has resulted
in two significant advances: the digitization of massive
amounts of data and widespread access to computational
devices. It is thus not surprising that more than 500 million
people worldwide use spreadsheets for storing and manipulating data. These business end users have myriad diverse
backgrounds and include commodity traders, graphic
designers, chemists, human resource managers, finance
professionals, marketing managers, underwriters, compliance officers, and even mailroom clerks—they are not professional programmers, but they need to create small, often
one-off, applications to perform business tasks. 4
Unfortunately, the state of the art of interfacing with
spreadsheets is far from satisfactory. Spreadsheet systems,
like Microsoft Excel, come with a maze of features, but
end users struggle to find the correct features to accomplish their tasks. 12 More significantly, programming is still
required to perform tedious and repetitive tasks such as
transforming names or phone numbers or dates from one
format to another, cleaning data, or extracting data from
several text files or Web pages into a single document. Excel
allows users to write macros using a rich inbuilt library of
string and numerical functions, or to write arbitrary scripts
in Visual Basic or .Net programming languages. However,
since end users are not proficient in programming, they find
it too difficult to write desired macros or scripts. Moreover,
Harris’s work was done during an internship at Microsoft Research.
Singh’s work was done during two internships at Microsoft Research.
even skilled programmers might hesitate to write a script for
a one-off repetitive task.
We performed an extensive case study of spreadsheet
help forums and observed that string and table processing is a very common class of programming problems
that end users struggle with. This is not surprising given
that various languages such as Perl, Awk, and Python were
designed to support string processing, and that new languages such as Java and C# provide rich support for string
processing. During our study, we also observed how novice
users specified their desired programs to expert users: most
specifications consisted solely of one or more input–output
examples. Since input–output examples may underspecify
a program, the interaction between a novice and an expert
often involved multiple rounds of communication over
multiple days. Inspired by this observation, we developed a
programming by example (PBE), or inductive synthesis, methodology15 that has produced synthesizers that can automatically generate a wide range of string/table manipulating
programs in spreadsheets from input–output examples.
Each synthesizer takes the role of the forum expert, removing a human from the interaction loop and enabling users
to solve their problems in a few seconds instead of few days.
This paper is organized as follows. We start with a brief
overview of our PBE methodology (Section 2). We then
describe an application of this methodology to perform syntactic string manipulation tasks (Section 3). 6 This is followed
by an extension that automates more sophisticated semantic string manipulations requiring background knowledge,
which can often be encoded as relational tables (Section 4). 18
We also describe an application of this methodology to
perform layout transformations on tables (Section 5). 8 In
Section 6, we discuss related work, and in Section 7, we conclude and discuss future work.
2. oVERViEW
In this section, we outline a general methodology that we
have used for developing inductive synthesizers for end-user
programming tasks, along with how a user can interact with
the synthesizers. In the first step of our methodology, we
identify a domain of useful tasks that end users struggle with
This paper is based on “Automating String Processing
in Spreadsheets using Input-Output,” S. Gulwani, pub-
lished in POPL (2011); “Learning Semantic String Trans-
formations from Examples,” R. Singh and S. Gulwani,
PVLDB 5 (2012), in press; and “Spreadsheet Table Trans-
formations from Examples,” W.R. Harris and S. Gulwani,
published in PLDI (2011).