string program P :=
boolean condition b :=
Conjunction d :=
predicate π :=
trace expr e :=
atomic expr f :=
position p :=
Integer expr c : =
regular expr r :=
Switch ((b1, e1), . . . , (bn, en)) | e
d1 ∨ . . . ∨ dn
π1∧ ... ∧ πn
Match(υi, r, k) | ¬ Match(υi, r, k)
Concatenate(f1, . . . , fn) | f
ConstStr(s) | SubStr(υi, p1, p2) | loop(λw : e)
CPos(k) | pos(r1, r2, c)
k | k1w + k2
TokenSeq(t1, ..., tn) | t | ε
(a)
(b)
constant k, CPos(k) denotes the kth position in the subject string. For a negative constant k, CPos(k) denotes
the ( + 1 + k)th position in the subject string, where
A regular expression r is e (which matches the empty
string, and therefore can match at any position of any
string), a token T, or a token sequence TokenSeq(T1, …, Tn).
This restricted choice of regular expressions enables
efficient enumeration of regular expressions that match
certain parts of a string. We use the following finite (but easily extended) set of tokens: (a) Start Tok, which matches
the beginning of a string, (b) End Tok, which matches the
end of a string, (c) a token for each special character, such
as hyphen, dot, semicolon, comma, slash, or left/right
parenthesis/bracket, and (d) two tokens for each character
class C, one that matches a sequence of one or more characters in C, and another that matches a sequence of one or
more characters that are not in C. Examples of a character
class C include numeric digits (0– 9), alphabetic characters
(a–zA–Z), lowercase alphabetic characters (a–z), uppercase alphabetic characters (A–Z), alphanumeric characters, and whitespace characters. Upper Tok, Num Tok, and
AlphNumTok match a nonempty sequence of uppercase
alphabetic characters, numeric digits, and alphanumeric
characters, respectively. DecNumTok matches a non-empty sequence of numeric digits and/or decimal point.
Hyphen Tok and Slash Tok match the hyphen character
and the slash character, respectively.
The task described in Example 1 can be expressed in our
domain-specific language as
Switch((b1, e1), (b2, e2)), where
b1 º Match(v1, Num Tok, 3)
b2 º ¬Match(v1, Num Tok, 3)
e1 º Concatenate(SubStr2(v1, Num Tok, 1), ConstStr(“−”),
SubStr2(v1, Num Tok, 2), ConstStr(“−”),
SubStr2(v1, Num Tok, 3) )
e2 º Concatenate(ConstStr(“425–”), SubStr2(v1, Num Tok, 1),
ConstStr(“−”), SubStr2(v1, Num Tok, 2) )
The atomic expression Loop(λw : e) is the concatenation
of e1, e2, …, en, where ei is obtained from e by replacing all
occurrences of integer w by i, and n is the smallest integer
such that evaluation of en+ 1 is undefined. (It is also possible
to define more interesting termination conditions, e.g.,
based on position expressions or predicates.) A trace expression e is undefined when (i) a constituent CPos(k) expression refers to a position not within its subject string, (ii) a
constituent Pos(r1, r2, c) expression is such that the subject
string does not contain c occurrences of a match bounded
by r1 and r2, or (iii) a constituent SubStr(vi, p1, p2) expression has position expressions that are both defined but
the first refers to a position that occurs later in the subject
string than the position indicated by the second. The following example illustrates the utility of the loop construct.
Example 2 (Generate Abbreviation). The following task
was presented originally as an Advanced Text Formula. 23
input v1 output
Association of Computing Machinery ACM
Principles Of Programming Languages PoPL
Foundations of Software engineering FSE
This task can be expressed in our language as
Loop(λw : Concatenate(SubStr2(v1, Upper Tok, w) ) ).
Our tool synthesizes this program from the first example
row and uses it to produce the entries in the second and
third rows (shown here in boldface for emphasis) of the
output column.