Predicting Program Properties
from ‘Big Code’
By Veselin Raychev, Martin Vechev, and Andreas Krause
DOI: 10.1145/3306204
The original version of this paper was published in ACM POPL'2015.
Abstract
We present a new approach for predicting program properties from large codebases (aka “Big Code”). Our approach
learns a probabilistic model from “Big Code” and uses this
model to predict properties of new, unseen programs.
The key idea of our work is to transform the program into
a representation that allows us to formulate the problem of
inferring program properties as structured prediction in
machine learning. This enables us to leverage powerful
probabilistic models such as Conditional Random Fields
(CRFs) and perform joint prediction of program properties.
As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of tasks in
the context of JavaScript: predicting (syntactic) names of
identifiers and predicting (semantic) type annotations of
variables. Experimentally, JSNice predicts correct names
for 63% of identifiers, and its type annotation predictions are correct in 81% of cases. Since its public release at
http://jsnice.org, JSNice has become a popular system with
hundreds of thousands of uses.
By formulating the problem of inferring program
properties as structured prediction, our work opens up the
possibility for a range of new “Big Code” applications such
as de-obfuscators, decompilers, invariant generators, and
others.
1. INTRODUCTION
Recent years have seen significant progress in the area of
programming languages, driven by advances in type systems,
constraint solving, program analysis, and synthesis techniques. Fundamentally, however, these methods reason about each
program in isolation, and while powerful, the effectiveness of
programming tools based on them is approaching its inherent limits. A more disruptive change is
needed if a significant improvement is to take place.
At the same time, creating probabilistic models from
large datasets (also called "Big Data") has transformed a
number of areas such as natural language processing, computer vision, recommendation systems, and many others.
However, despite the overwhelming success of "Big Data" in
a variety of application domains, learning from large datasets of programs has previously not had tangible impact on
programming tools. Yet, with the tremendous growth of
publicly available source code in repositories such as
GitHub [4] and BitBucket [2] (referred to as "Big Code" by a recent
DARPA initiative [11]) comes the opportunity to create new
kinds of programming tools based on probabilistic models
of such data. The vision is that by leveraging the massive
effort already spent in developing millions of programs,
such tools will have the ability to solve tasks beyond the
reach of traditional techniques. However, effectively learning from programs is a challenge. One reason is that programs are data transformers with complex semantics that
should be captured and preserved in the learned probabilistic model.
1.1. Structured prediction for programs
The core technical insight of our work is transforming the
input program into a representation that enables us to formulate the problem of predicting program properties as
structured prediction with Conditional Random Fields
(CRFs) [15]. Indeed, CRFs are a powerful probabilistic model
successfully used in a wide variety of applications including
computer vision and natural language processing [12, 15, 16]. We
show how to instantiate this approach towards predicting
semantic information (e.g., type annotations) as well as syntactic facts (e.g., identifier names). To our knowledge, this is
the first work which shows how CRFs can be learned and
used in the context of programs. By connecting programs to
CRFs, a wide range of learning and inference algorithms [14]
can be used in the domain of programs.
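To make the formulation concrete, consider the following minimal sketch of a CRF-style scoring function over pairwise features. It is an illustration only, not the JSNice implementation; all names in it (Edge, score_assignment, weights) are hypothetical and simplified.

# Minimal illustrative sketch of a CRF-style scoring function over
# pairwise features. All names here (Edge, score_assignment, weights)
# are hypothetical and simplified; this is not the JSNice codebase.
from typing import Dict, List, Tuple

# An edge relates two program elements under a relationship label,
# e.g. ("a", "b", "arg-of") meaning element a is an argument of b.
Edge = Tuple[str, str, str]

def score_assignment(
    edges: List[Edge],
    assignment: Dict[str, str],                  # element -> property
    weights: Dict[Tuple[str, str, str], float],  # learned feature weights
) -> float:
    """Score of a joint assignment: the sum, over all edges, of the
    learned weight of the (property, property, relation) feature."""
    total = 0.0
    for a, b, rel in edges:
        total += weights.get((assignment[a], assignment[b], rel), 0.0)
    return total

In this view, learning estimates the feature weights from the training corpus, and prediction favors joint assignments whose activated features carry high total weight.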
Figure 1 illustrates the structured prediction approach.
In the prediction phase (upper part of the figure), we are given
an input program for which we are to infer properties of
interest. In the next step, we convert the program into a representation called a dependency network. The dependency
network captures relationships between program elements
whose properties are to be predicted and elements whose
properties are known. Once the network is obtained, we perform structured prediction and in particular, a query
referred to as Maximum a Posteriori (MAP) inference [14]. This
query makes a joint prediction for all elements together by
optimizing a scoring function based on the learned CRF
model (a sketch of such an inference step follows Figure 1). Making a joint prediction which takes into account
structure and dependence is particularly important as the properties of different program elements are often interdependent.
Figure 1. Structured prediction for programs. (Prediction phase: input program → dependency network relating unknown with known properties → predicted properties → output program. Learning phase: training data → learned CRF model.)
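Continuing the sketch above, one simple way to illustrate the shape of this query is a greedy approximation of MAP inference that repeatedly re-assigns each unknown element its best-scoring candidate property while all other labels stay fixed. This is a hypothetical simplification under the assumptions of the previous sketch: exact MAP inference is intractable in general, and it is not the exact algorithm JSNice uses.

# Continues the sketch above (reuses Edge and score_assignment).
# Greedy approximation of MAP inference over a dependency network:
# iteratively re-assign each unknown element the candidate property
# that maximizes the joint score, keeping all other labels fixed.
from typing import Dict, List, Tuple

def greedy_map_inference(
    edges: List[Edge],
    unknown: List[str],                # elements whose property we predict
    candidates: Dict[str, List[str]],  # candidate properties per element
    assignment: Dict[str, str],        # initial guesses + known properties
    weights: Dict[Tuple[str, str, str], float],
    passes: int = 3,
) -> Dict[str, str]:
    for _ in range(passes):
        for elem in unknown:
            assignment[elem] = max(
                candidates[elem],
                key=lambda prop: score_assignment(
                    edges, {**assignment, elem: prop}, weights
                ),
            )
    return assignment

# Hypothetical usage: jointly predict names for two unknown identifiers.
edges = [("a", "b", "arg-of"), ("b", "len", "calls")]
weights = {("step", "i", "arg-of"): 0.4, ("i", "len", "calls"): 0.9}
print(greedy_map_inference(
    edges,
    unknown=["a", "b"],
    candidates={"a": ["step", "x"], "b": ["i", "j"]},
    assignment={"a": "x", "b": "j", "len": "len"},  # "len" is known
    weights=weights,
))  # -> {'a': 'step', 'b': 'i', 'len': 'len'}

Note how the usage example exhibits the joint nature of the prediction: the name chosen for "b" influences which name scores best for "a", which is exactly the structure-and-dependence point made above.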