3.2. Synthesis algorithm
The synthesis algorithm first computes, for each input–output
example (σ, s), the set of all trace expressions that map
input state σ to output s (using procedure Generate). It then intersects these sets for similar examples and learns conditionals
to handle different cases (using procedure Intersect). The
size of such sets can be huge; therefore, we must develop a
data structure that allows us to succinctly represent and efficiently manipulate huge sets of program expressions.
Data structure: Figure 1(b) describes our data structure for
succinctly representing sets of programs from our domain-specific language. P̃, ẽ, f̃, and p̃ denote representations of,
respectively, a set of string programs, a set of trace expressions, a set of atomic expressions, and a set of position expressions. r̃ and c̃ represent a set of regular expressions and a set
of integer expressions; these sets are represented explicitly.
The Concatenate constructor used in our string language
is generalized to the Dag constructor Dag(η̃, ηˢ, ηᵗ, ξ̃, W), where
η̃ is a set of nodes containing two distinctly marked source
and target nodes ηˢ and ηᵗ, ξ̃ is a set of edges over nodes in η̃ that
defines a Directed Acyclic Graph (DAG), and W maps each ξ ∈ ξ̃
to a set of atomic expressions. The set of all Concatenate
expressions represented by a Dag(η̃, ηˢ, ηᵗ, ξ̃, W) constructor
includes exactly those whose ordered arguments belong to the
corresponding edges on any path from ηˢ to ηᵗ. The Switch,
Loop, SubStr, and Pos constructors are all overloaded to
construct sets of the corresponding program expressions that
are shown in Figure 1(a). The ConstStr and CPos constructors can be regarded as producing singleton sets.
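The Dag constructor can be illustrated with a minimal Python sketch. It is not the representation used in our implementation: the class name Dag, the function enumerate_concats, and the choice to write atomic expressions as plain strings are all assumptions made for illustration. The sketch shows how a small graph stands for several Concatenate expressions at once, one per path from the source node to the target node.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Set, Tuple

Edge = Tuple[int, int]


@dataclass
class Dag:
    """Hypothetical stand-in for the Dag constructor (names are assumptions)."""
    nodes: Set[int]
    source: int
    target: int
    edges: Set[Edge]
    W: Dict[Edge, Set[str]]  # edge -> set of atomic expressions (as strings)


def enumerate_concats(dag: Dag) -> Iterator[Tuple[str, ...]]:
    """Yield the argument tuple of every Concatenate expression in the set."""
    def walk(node: int, prefix: Tuple[str, ...]) -> Iterator[Tuple[str, ...]]:
        if node == dag.target:
            yield prefix
            return
        for (a, b) in sorted(dag.edges):
            if a == node:
                # Any atomic expression on the edge may fill this argument slot.
                for atom in sorted(dag.W[(a, b)]):
                    yield from walk(b, prefix + (atom,))
    yield from walk(dag.source, ())


# Three nodes, three edges, four atomic expressions in total.
example_dag = Dag(
    nodes={0, 1, 2},
    source=0,
    target=2,
    edges={(0, 1), (1, 2), (0, 2)},
    W={
        (0, 1): {"ConstStr('a')", "SubStr(v1, CPos(0), CPos(1))"},
        (1, 2): {"ConstStr('b')"},
        (0, 2): {"ConstStr('ab')"},
    },
)
programs = list(enumerate_concats(example_dag))
```

Here the path through node 1 contributes two expressions (one per atomic expression on its first edge) and the direct edge contributes one, so the graph represents three Concatenate expressions while storing each atomic expression only once.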
The data structure supports efficient implementation of
various useful operations including intersection, enumeration of programs, and their simultaneous execution on a
given input. The most interesting of these is the intersection
operation, which is similar to regular automata intersection.
The additional challenge is to intersect edge labels—in the
case of automata, the labels are simply sets of characters,
while in our case, the labels are sets of string expressions.
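The analogy with automata intersection can be sketched as a product construction. The following Python fragment is a simplified illustration under stated assumptions: a DAG is written as a hypothetical triple (source, target, W), edge labels are explicit sets of expression strings, and label intersection is plain set intersection. In the actual data structure the labels are themselves succinct representations that must be intersected recursively, and nodes that no longer lie on a source-to-target path would be pruned; neither step is shown here.

```python
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]
DagT = Tuple[Hashable, Hashable, Dict[Edge, Set[str]]]


def intersect_dags(dag1: DagT, dag2: DagT) -> DagT:
    """Product construction: nodes are pairs, labels are intersected."""
    s1, t1, w1 = dag1
    s2, t2, w2 = dag2
    w: Dict[Edge, Set[str]] = {}
    for (a1, b1), atoms1 in w1.items():
        for (a2, b2), atoms2 in w2.items():
            common = atoms1 & atoms2  # assumption: labels are explicit sets
            if common:  # keep the product edge only if some expression survives
                w[((a1, a2), (b1, b2))] = common
    return (s1, s2), (t1, t2), w


# Hypothetical DAGs for two examples whose outputs are "7-" and "42-".
dag_a: DagT = (0, 2, {
    (0, 1): {"SubStr2(v1, NumTok, 1)", "ConstStr('7')"},
    (1, 2): {"ConstStr('-')"},
})
dag_b: DagT = (0, 3, {
    (0, 1): {"ConstStr('4')"},
    (1, 2): {"ConstStr('2')"},
    (0, 2): {"SubStr2(v1, NumTok, 1)", "ConstStr('42')"},
    (2, 3): {"ConstStr('-')"},
})
src, tgt, w_common = intersect_dags(dag_a, dag_b)
```

Only the expressions consistent with both examples survive: the product keeps the edge labeled with the number-token extraction followed by the edge labeled with the constant hyphen, and drops the example-specific constants.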
Procedure Generate: The number of trace expressions
that can generate a given output string from a given input
state can be huge. For example, consider the second input–
output pair in Example 1, where the input state consists
of one string “(425)-706-7709” and the output string is
“425-706-7709”. Figure 2 shows a small sampling of different ways of generating parts of the output string from the
input string using SubStr and ConstStr constructors.
Each substring extraction task itself can be expressed with
a huge number of expressions, as explained later. The following are three of the trace expressions represented in the
figure, of which only the second one, shown in the figure in
bold, expresses the program expected by the user:
1. Extract substring “425”. Extract substring “-706-7709”.
2. Extract substring “425”. Print constant “-”. Extract substring “706”. Print constant “-”. Extract substring “7709”.
3. Extract substring “425”. Extract substring “-706”. Print
constant “-”. Extract substring “7709”.
Figure 2. Small sampling of different ways of generating parts of an
output string from the input string.
We apply two crucial observations to succinctly generate
and represent all such trace expressions. First, the logic for
generating some substring of an output string is completely
decoupled from the logic for generating another disjoint
substring of the output string. Second, the total number of
different substrings/parts of a string is quadratic (and not
exponential) in the size of that string.
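The gap between the two counts can be checked directly. The sketch below, with hypothetical function names, counts the substrings of the output string by position range and, for comparison, the number of ways to cut the same string into consecutive non-empty pieces, which is the quantity a naive enumeration of concatenations would have to face.

```python
from typing import List, Tuple


def substring_spans(s: str) -> List[Tuple[int, int]]:
    """All (start, end) position ranges of non-empty substrings of s."""
    return [(i, j) for i in range(len(s)) for j in range(i + 1, len(s) + 1)]


def count_decompositions(s: str) -> int:
    """Number of ways to split s into consecutive non-empty pieces.

    Equivalently, the number of paths from position 0 to position len(s)
    when every earlier position has an edge to every later position.
    """
    n = len(s)
    paths = [0] * (n + 1)
    paths[0] = 1
    for j in range(1, n + 1):
        paths[j] = sum(paths[i] for i in range(j))
    return paths[n]


output = "425-706-7709"
num_spans = len(substring_spans(output))      # 12 * 13 / 2
num_splits = count_decompositions(output)     # 2 ** 11
```

For this 12-character output there are 78 substring ranges but 2048 decompositions, before even counting the many atomic expressions that can produce each piece.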
The Generate procedure creates a Directed Acyclic Graph
(DAG) Dag(η̃, ηˢ, ηᵗ, ξ̃, W) that represents the set of all trace
expressions that generate a given output string from a given
input state. Generate constructs a node corresponding to
each position within the output string and constructs an edge
from a node corresponding to any position to a node corresponding to any later position. Each edge corresponds to some
substring of the output and is annotated with the set of all
atomic expressions that generate that substring. We describe
below how to generate the set of all such SubStr expressions.
Any Loop expressions are generated by first generating candidate expressions (by unifying the sets of trace expressions associated with the substrings s[k1 : k2] and s[k2 : k3], where k1, k2, and k3
are the boundaries of the first two loop iterations, identified by
considering all possibilities), and then validating them.
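A much reduced version of this construction can be sketched in Python. The sketch makes several simplifying assumptions that depart from the procedure described above: the input state is a single string v1, the only atomic expressions generated are ConstStr and SubStr with constant positions (CPos), expressions are written as strings, and Loop expressions and regular-expression-based positions are omitted entirely. The function name generate and the returned triple are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]


def generate(inp: str, out: str) -> Tuple[int, int, Dict[Edge, Set[str]]]:
    """Build (source, target, W) for all ways to produce out from inp.

    Simplified: one node per position in out, one edge per pair of
    positions i < j, labeled with expressions that yield out[i:j].
    """
    n = len(out)
    W: Dict[Edge, Set[str]] = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            piece = out[i:j]
            # A constant string can always produce the piece.
            atoms = {f"ConstStr({piece!r})"}
            # Every occurrence of the piece in the input yields a SubStr.
            start = inp.find(piece)
            while start != -1:
                end = start + len(piece)
                atoms.add(f"SubStr(v1, CPos({start}), CPos({end}))")
                start = inp.find(piece, start + 1)
            W[(i, j)] = atoms
    return 0, n, W


source, target, W = generate("(425)-706-7709", "425-706-7709")
```

On the running example the graph has 78 edges. The edge for "425" carries both a constant and an extraction from the input, the edge for the first hyphen carries a constant and two extractions (one per hyphen in the input), and the edge for "425-" carries only a constant because that text never occurs contiguously in the input.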
The number of substring expressions that can extract a
given substring from a given string can be huge. For example, following is a small sample of various expressions that
extract “706” from the string “425-706-7709” (call it v1).
• Second number: SubStr2(v1, NumTok, 2).
• 2nd last alphanumeric token:
SubStr2(v1, AlphNumTok, −2).
• Substring between the first hyphen and the last hyphen:
SubStr(v1, Pos(HyphenTok, ε, 1), Pos(ε, HyphenTok, −1)).
• First number that occurs between hyphen on both ends.
SubStr(v1, Pos(HyphenTok,
TokenSeq(NumTok, HyphenTok), 1),
Pos(TokenSeq(HyphenTok, NumTok),
• First number preceded by a number–hyphen sequence.
SubStr(v1, Pos(TokenSeq(NumTok, HyphenTok),
NumTok, 1),
Pos(TokenSeq(NumTok, HyphenTok,