Under review as a conference paper at ICLR 2017
Input v Output
1 William Henry Charles Charles, W.
2 Michael Johnson Johnson, M.
3 Barack Rogers Rogers, B.
4 Martha D. Saunders Saunders, M.
5 Peter T Gates Gates, P.
String e := Concat(f
1
, · · · , f
n
)
Substring f := ConstStr(s)
| SubStr(v, p
l
, p
r
)
Position p := (r, k, Dir)
| ConstPos(k)
Direction Dir := Start | End
Regex r := s | T
1
· · · | T
n
(a) (b)
Figure 1: An example FlashFill task for transforming names to lastname with initials of first name,
and (b) The DSL for regular expression based string transformations.
The syntax and semantics of the DSL for string transformations is shown in Figure 1(b) and Figure 7
respectively. The DSL corresponds to a large subset of FlashFill DSL (except conditionals), and
allows for a richer class of substring operations than FlashFill. A DSL program takes as input a
string v and returns an output string o. The top-level string expression e is a concatenation of a
finite list of substring expressions f
1
, · · · , f
n
. A substring expression f can either be a constant
string s or a substring expression, which is defined using two position logics p
l
(left) and p
r
(right).
A position logic corresponds to a symbolic expression that evaluates to an index in the string. A
position logic p can either be a constant position k or a token match expression (r, k, Dir), which
denotes the Start or End of the k
th
match of token r in input string v. A regex token can either be a
constant string s or one of 8 regular expression tokens: p (ProperCase), C (CAPS), l (lowercase), d
(Digits), α (Alphabets), αn (Alphanumeric),
∧
(StartOfString), and $ (EndOfString). The semantics
of the DSL programs is described in the appendix.
A DSL program for the name transformation task shown in Figure 1(a) that is con-
sistent with the examples is: Concat(f
1
, ConstStr(“, ”), f
2
, ConstStr(“.”)), where f
1
≡
SubStr(v, (“ ”, −1, End), ConstPos(−1)) and f
2
≡ SubStr(v, ConstPos(0), ConstPos(1)). The
program concatenates the following 4 strings: i) substring between the end of last whitespace and
end of string, ii) constant string “, ”, iii) first character of input string, and iv) constant string “.”.
3 OVERVIEW OF OUR APPROACH
We now present an overview of our approach. Given a DSL L, we learn a generative model of pro-
grams in the DSL L that is conditioned on input-output examples to efficiently search for consistent
programs. The workflow of our system is shown in Figure 2, which is trained end-to-end using a
large training set of programs in the DSL together with their corresponding input-output examples.
To generate a large training set, we uniformly sample programs from the DSL and then use a rule-
based strategy to compute well-formed input strings that satisfy the pre-conditions of the programs.
The corresponding output strings are obtained by running the programs on the input strings.
A DSL can be considered a context-free grammar with a start symbol S and a set of non-terminals
with corresponding expansion rules. The (partial) grammar derivations or trees correspond to (par-
tial) programs. A na
¨
ıve way to perform a search over the programs in a DSL is to start from the start
symbol S and then randomly choose non-terminals to expand with randomly chosen expansion rules
until reaching a derivation with only terminals. We, instead, learn a generative model over partial
derivations in the DSL that assigns probabilities to different non-terminals in a partial derivation and
corresponding expansions to guide the search for complete derivations.
Our generative model uses a Recursive-Reverse-Recursive Neural Network (R3NN) to encode par-
tial trees (derivations) in L, where each node in the partial tree encodes global information about
every other node in the tree. The model assigns a vector representation for every symbol and every
expansion rule in the grammar. Given a partial tree, the model first assigns a vector representation
to each leaf node, and then performs a recursive pass going up in the tree to assign a global tree
representation to the root. It then performs a reverse-recursive pass starting from the root to assign
a global tree representation to each node in the tree.
3