The substring-extraction problem can be decomposed into
two independent position-identification problems, each of
which can be solved independently. The solutions to the
substring-extraction problem can also be maintained succinctly by independently representing the solutions to the
two position-identification problems. Note the representation of the SubStr constructor in Eq. 1 in Figure 1(b).
Procedure Intersect: Given a trace set for each
input–output example, the Intersect procedure generates the top-level Switch constructor. Intersect first
partitions the examples, so that inputs in the same partition are handled by the same conditional in the top-level
Switch expression, and then intersects the trace sets for
inputs in the same partition. If a set of inputs are in the
same partition, then the intersection of trace sets is non-empty. Intersect uses a greedy heuristic to minimize the
number of partitions by starting with singleton partitions
and then iteratively merging partitions that have the highest compatibility score, which is a function of the size of the
resulting partition and its potential to be merged with other
partitions.
Intersect then constructs a classifier for each of the
resultant partitions, which is a Boolean expression that is
satisfied by exactly the inputs in the partition. The classifier
for each partition and the intersection of trace sets for the
inputs in the partition serve as the Boolean condition and
corresponding trace expression in the constructed Switch
expression, respectively.
ranking: We prefer Concatenate and TokenSeq
expressions that have fewer arguments. We prefer SubStr
expressions to both ConstStr expressions (it is less
likely for constant parts of an output string to also occur
in the input) and Concatenate expressions (if there is
a long substring match between the input and output, it
is more likely that the corresponding part of the output
was produced by a single substring extraction). We prefer
a Pos expression to CPos expression (giving less preference to extraction expressions based on constant offsets).
StartTok and EndTok are our most preferred tokens;
otherwise, we prefer tokens corresponding to a larger character class (favoring generality).
The implementation of the synthesis algorithm is less
than 5,000 lines of C# code, and takes less than 0.1 s on average for a benchmark suite of more than 100 tasks obtained
from online help forums and the Excel product team.
id
S33
B56
d32
W98
A46
. . .
that stores id, purchase date (in month/year format), and
purchase price of items. The selling price of an item is
computed by adding its purchase price (for the corresponding
month) to its markup charges, which in turn is calculated by
multiplying the markup percentage by the purchase price.
input v1
Stroller
Bib
diapers
Wipes
Aspirator
input v2
10/12/2010
23/12/2010
21/1/2011
2/4/2009
23/2/2010
output
$145.67 + 0.30*145.67
$3.56 + 0.45* 3. 56
$21.45 + 0.35* 21. 45
$5.12 + 0.40* 5. 12
$2.56 + 0.30* 2. 56
MarkupRec
name Markup
Stroller 30%
Bib 45%
diapers 35%
Wipes 40%
Aspirator 30%
... ...
id
S33
S33
B56
d32
W98
A46
. . .
CostRec
Date
12/2010
11/2010
12/2010
1/2011
4/2009
2/2010
. . .
Price
$145.67
$142.38
$3.56
$21.45
$5.12
$2.56
. . .
To perform the above task, the user must perform a join of
the two tables on the common item Id column to lookup the
item Price from its Name (v1) and selling Date (substring
of v2). We present an extension to the trace expression
(from Section 3. 1) that can also manipulate strings present
in such relational tables. 18
4. SEMAntiC tRAnSFoRMAtionS
Some string transformation tasks also involve manipulating
strings that need to be interpreted as more than a sequence
of characters, for example, as a column entry from some
relational table, or as some standard data type such as date,
time, currency, or phone number. For example, consider the
following task from an Excel help forum.
4. 1. Domain-specific language
We extend the trace expression (from Section 3. 1), as shown
in Figure 3(a), to obtain the semantic string transformation
language that can also perform table lookup operations.
The atomic expression f is modified to represent a constant
string, a select expression, or a substring of a select expression. A select expression et is either an input string variable
vi or a lookup expression denoted by Select(Col, Tab, g),
where Tab is a relational table identifier and Col is a column identifier of the table. The Boolean condition g is an
ordered conjunction of column predicates h1∧…∧hn, where
a column predicate h is an equality comparison between
the content of some column of the table and a constant or
a trace expression e. We require columns present in these
conditions to together constitute a primary key of the table
to ensure that the select queries produce a single string as
opposed to a set of strings.
The task in Example 3 can be represented in the lan-
guage as
Example 3. A shopkeeper wants to compute the selling
price of an item (Output) from its name (Input v1 ) and sell-
ing date (Input v2 ). The inventory database of the shop con-
sists of two tables: (i) MarkupRec table that stores id, name
and markup percentage of items, and (ii) CostRec table
Concatenate (f1, ConstStr(“+0.”), f2, ConstStr(“*”), f3)
where f1 ≡ Select(Price, CostRec, Id = f4 ∧ Date = f5),
f4 ≡ Select(Id, MarkupRec, Name = v1),
f5 ≡ SubStr(v2, Pos(Slash Tok, e, 1), Pos(e, End Tok, 1) ),
f2 ≡ SubStr2(f6, Num Tok, 1), f3 ≡ SubStr2(f1, DecNum Tok, 1),
f6 ≡ Select(Markup, MarkupRec, Name = v1).