and can clearly describe with examples, by studying help
forums or performing user studies (this paper presents two
domains: string manipulation and table manipulation). We
then develop the following.
domain-specific language: We design a domain-specific
language L that is expressive enough to capture several real-
world tasks in the domain, but also restricted enough to
enable efficient learning from examples.
data structure for representing consistent programs: The
number of programs in L that are consistent with a given set
of input–output examples can be huge. We define a data
structure D based on a version-space algebra14 to succinctly
represent a large set of such programs.
algorithm for synthesizing consistent programs: Our
synthesis algorithm for language L applies two key proce-
dures: (i) Generate learns the set of all programs, repre-
sented using data structure D, that are consistent with a
given single example. (ii) Intersect intersects these sets
(each corresponding to a different example).
ranking: We develop a scheme that ranks programs,
preferring programs that are more general. Each ranking
scheme is inspired by Occam’s razor, which states that a
smaller and simpler explanation is usually the correct one.
We define a partial order relationship between programs to
compare them. Any partial order can be used that efficiently
orders programs represented in the version-space algebra
used by the data structure D. Such an order can be applied to
efficiently select the top-ranked programs from among a set
represented using D. The ranking scheme can also take into
account any test inputs provided by the user (i.e., new additional inputs on which the user may execute a synthesized
program). A program that is undefined on any test input or
generates an output whose characteristics are different from
that of training outputs can be ranked lower.
2. 1. interaction models
A user provides to the synthesizer a small number of examples, and then can interact with the synthesizer according to
multiple models. In one model, the user runs the top-ranked
synthesized program on other inputs in the spreadsheet and
checks the outputs produced by the program. If any output
is incorrect, the user can fix it and reapply the synthesizer,
using the fix as an additional example. However, requiring
the user to check the results of the synthesized program,
especially on a large spreadsheet, can be cumbersome. To
enable easier interaction, the synthesizer can run all synthesized programs on each new input to generate a set of
corresponding outputs for that input. The synthesizer can
highlight for the user the inputs that cause multiple distinct
outputs. Our prototypes, implemented as Excel add-ins,
support this interaction model.
A second model accommodates a user who requires a
reusable program. In this model, the synthesizer presents
the set of consistent programs to the user. The synthesizer
can show the top k programs or walk the user through the
data structure that succinctly represents all consistent
programs and let the user select a program. The programs
can be shown using programming-language syntax, or they
can be described in a natural language. The differences
between different programs can be explained by synthesizing a distinguishing input on which the programs behave
differently. 10 The user can reapply the synthesizer with the
distinguishing input and its desired output as an additional example.
3. SyntACtiC tRAnSFoRMAtionS
Spreadsheet users often struggle with reformatting or cleaning data in spreadsheet columns. For example, consider the
following task.
Example 1 (Phone Numbers). An Excel user wants to uniformly format the phone numbers in the input column, adding a
default area code of “425” if the area code is missing.
input v1
323-708-7700
(425)-706-7709
510.220.5586
235 7654
745-8139
output
323-708-7700
425-706-7709
510-220-5586
425-235-7654
425-745-8139
Such tasks can be automated by applying a program that
performs syntactic string transformations. We now present
an expressive domain-specific language of string-processing
programs that supports limited conditionals and loops, syntactic string operations such as substring and concatenate,
and matching based on regular expressions. 6
3. 1. Domain-specific language
Our domain-specific programming language for performing syntactic string transformations is given in Figure 1(a). A
string program P is an expression that maps an input state s,
which holds values for m string variables v1, … , vm (denoting
the multiple input columns in a spreadsheet), to an output
string s. The top-level string expression P is a Switch constructor whose arguments are pairs of Boolean expressions
b and trace expressions e. The set of Boolean expressions in
a Switch construct must be disjoint, that is, for any input
state, at most one of the Boolean expressions can be true.
The value of P in a given input state s is the value of the trace
expression that corresponds to the Boolean expression satisfied by s. A Boolean expression b is a propositional formula
in Disjunctive Normal Form (DNF). A predicate Match(vi, r, k)
is satisfied if and only if vi contains at least k nonoverlapping
matches of regular expression r. (In general, any finite set of
predicates can be used.)
A trace expression Concatenate(f1, … , fn) is the concatenation of strings represented by atomic expressions f1,
… , fn. An atomic expression f is either a constant-string
expression ConstStr, a substring expression constructed from SubStr, or a loop expression constructed
from Loop.
The substring expression SubStr(vi, p1, p2) is defined
partly by two position expressions p1 and p2, each of which
implicitly refers to the (subject) string vi and must evaluate to a position within the string vi. (A string with
characters has + 1 positions, numbered from 0 to
starting from left.) SubStr(vi, p1, p2) is the substring of
string vi in between positions p1 and p2. For a nonnegative