Figure 4. Probabilistic inference of program properties on an example.
function chunkData(e, t) {
  var n = [];
  var r = e.length;
  var i = 0;
  for (; i < r; i += t) {
    if (i + t < r) {
      n.push(e.substring(i, i + t));
    } else {
      n.push(e.substring(i, r));
    }
  }
  return n;
}
/* str: string, step: number, return: Array */
function chunkData(str, step) {
  var colNames = []; /* colNames: Array */
  var len = str.length;
  var i = 0; /* i: number */
  for (; i < len; i += step) {
    if (i + step < len) {
      colNames.push(str.substring(i, i + step));
    } else {
      colNames.push(str.substring(i, len));
    }
  }
  return colNames;
}
(a) JavaScript program with minified identifier names.
(e) JavaScript program with new identifier names and types.
[Figure panels (b)-(d), shown graphically in the original: (b) lists the unknown properties (the minified variable names e, t, n, r, i) and the known properties (constants and APIs such as 0, [], length, push); (c) connects these elements with edges labeled by relationships such as L+=R, L<R, and L=_.R; (d) attaches to each edge a table scoring candidate name pairs, e.g., (i, step) scored 0.5 in the topmost table and the second row of the i, r table scored 0.6.]
(b) Known and unknown name properties. (c) Dependency network. (d) Result of MAP inference.
To achieve good performance for the MAP inference, we developed a new algorithmic variant which targets the domain of programs (existing inference algorithms cannot efficiently handle networks of this size and structure). The prediction leverages both (i) the known properties and (ii) the relationships between program elements (discussed next).
2.2. Step 2: Build dependency network
Next, using features we later define, we build a dependency
network capturing relationships between program elements. The dependency network is key to capturing structure when performing predictions and intuitively determines
how properties to be predicted influence each other. For example, the link between known and unknown properties allows us to leverage the fact that many programs use common anchors (e.g., the jQuery API), meaning that the unknown quantities we aim to predict are influenced by how the known elements are used. Dependencies are triplets of the form 〈n, m, rel〉, where n and m are program elements and rel is the particular relationship between the two. In our work all dependencies are triplets, but these could be extended to relationships involving more than two elements.
In Figure 4(c), we show three example dependencies between the program elements. For instance, the statement i += t generates the dependency 〈i, t, L+=R〉, because i and t are on the left and right sides of an += expression. Similarly, the statement var r = e.length generates several dependencies, including 〈r, length, L=_.R〉, which designates that the left part of the relationship, denoted by L, appears before the dereference of the right side, denoted by R (we elaborate on the different types of relationships in Section 4). For clarity, in Figure 4(c) we include only some of the relationships.
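To make the extraction step concrete, the following sketch walks a miniature hand-built AST for the two statements above and emits the corresponding 〈n, m, rel〉 triplets. The node shapes and the extractDependencies helper are hypothetical simplifications, not the system's actual implementation; a real pipeline would obtain the AST from a full JavaScript parser.

// Minimal, hypothetical AST nodes for the two example statements.
const statements = [
  // i += t
  { kind: 'AssignOp', op: '+=',
    left: { kind: 'Id', name: 'i' }, right: { kind: 'Id', name: 't' } },
  // var r = e.length
  { kind: 'VarDecl', name: 'r',
    init: { kind: 'Member', object: { kind: 'Id', name: 'e' }, property: 'length' } },
];

// Emit dependency triplets <n, m, rel> for a few statement shapes.
function extractDependencies(stmts) {
  const deps = [];
  for (const s of stmts) {
    if (s.kind === 'AssignOp' && s.left.kind === 'Id' && s.right.kind === 'Id') {
      // Left and right operands of a compound assignment, e.g., i += t.
      deps.push({ n: s.left.name, m: s.right.name, rel: 'L' + s.op + 'R' });
    }
    if (s.kind === 'VarDecl' && s.init && s.init.kind === 'Member') {
      // Declared variable initialized from a field dereference, e.g., r = e.length.
      deps.push({ n: s.name, m: s.init.property, rel: 'L=_.R' });
    }
  }
  return deps;
}

console.log(extractDependencies(statements));
// -> [ { n: 'i', m: 't', rel: 'L+=R' }, { n: 'r', m: 'length', rel: 'L=_.R' } ]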
2.3. Step 3: MAP inference
After obtaining the dependency network, we next infer the
most likely values for the nodes via a query known as MAP inference.14 Informally, in this type of query, a prediction
leverages the structure of the network (i.e., the connections
between the nodes) as opposed to predicting each node separately, one at a time. As illustrated in Figure 4(d), for the
network of Figure 4(c), our system infers the new names
step and len. It also predicts that the previous name i
was most likely. Let us consider how we predicted the
names step and len. The network in Figure 4(d) is the same as in Figure 4(c), but with additional tables produced as an output of the learning phase (i.e., the learning determines the concrete feature functions and the weights associated with them). Each table is a function that scores the assignment of properties to the two nodes connected by the corresponding edge (intuitively, how likely the particular pair
is). Consider the topmost table in Figure 4(d). The first row
says the assignment of i and step is scored with 0.5. The
MAP inference searches for an assignment of properties to
nodes such that the sum of the scores shown in the tables is
maximized. For the two nodes i and t, the inference ends up
selecting the highest score from that table (i.e., the values i
and step). Similarly for nodes i and r. However, for nodes
r and length, the inference does not select the topmost row
but the values from the second row. The reason is that if it
had selected the topmost row, then the only viable choice (in
order to match the value length) for the remaining relationship is the second row of the i, r table (with score 0.6). However, that assignment leads to a lower combined overall score. That is, MAP inference must take into account
the global structure and cannot simply select the maximal
score of each function.
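To see why the inference must be global, here is a minimal, self-contained sketch of exhaustive MAP inference over a two-node network. All names, tables, and scores below are hypothetical stand-ins for the learned tables in Figure 4(d), chosen so that the greedy per-edge choice loses to the globally optimal one; the exhaustive enumeration is only workable for toy inputs, since it is exponential in the number of nodes, which is why practical systems use approximate inference.

// Candidate properties for the two unknown nodes (hypothetical).
const candidates = { A: ['i', 'j'], B: ['len', 'length'] };

// Each edge scores a pair of assignments; 'known: true' marks an edge
// to a fixed, known property (here the field name 'length').
const edges = [
  { n: 'A', m: 'B', rel: 'L<R',
    table: { 'i|len': 0.4, 'i|length': 0.6, 'j|len': 0.2, 'j|length': 0.1 } },
  { n: 'B', m: 'length', rel: 'L=_.R', known: true,
    table: { 'len|length': 0.8, 'length|length': 0.5 } },
];

// Total score of a joint assignment: the sum of the per-edge table entries.
function scoreOf(assign) {
  let total = 0;
  for (const e of edges) {
    const left = assign[e.n];
    const right = e.known ? e.m : assign[e.m];
    total += e.table[left + '|' + right] || 0;
  }
  return total;
}

// Enumerate all joint assignments and keep the best-scoring one (MAP).
let best = null;
for (const a of candidates.A) {
  for (const b of candidates.B) {
    const assign = { A: a, B: b };
    const score = scoreOf(assign);
    if (best === null || score > best.score) best = { assign, score };
  }
}

console.log(best);
// -> { assign: { A: 'i', B: 'len' }, score: 1.2 }
// Greedily maximizing the first edge alone would pick (i, length) with 0.6,
// but that forces the weaker 0.5 entry on the second edge (total 1.1).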