fragments. The average JavaScript program size is 91.7 KB.
1.3. Nice2Predict: Structured prediction framework
To facilitate faster creation of new applications (JSNice
being one example), we built a reusable framework called
Nice2Predict (found at http://nice2predict.org) which includes
all components of this work (e.g., training and inference)
except the definition of feature functions (which are application specific). To use our method, one only needs
to phrase the application in terms of a CRF model, which
is done by defining suitable feature functions (we show
such functions for JSNice later in the paper), and then
invoke the Nice2Predict training and inference mechanisms.
A recent example of this instantiation is DeGuard [9]
(http://apk-deguard.com), a system that performs Android
layout de-obfuscation by predicting method, class, and field
names erased by ProGuard [6].
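To make the idea of "phrasing an application as a CRF by defining feature functions" more concrete, the following is a minimal sketch, not the Nice2Predict API. All names (`edges`, `candidates`, `weights`, `score`, `mapInference`), the relation strings, and the weights are hypothetical. Each feature is a triple of two labels and a relation with a weight, the score of an assignment is the sum of the weights of the features that fire, and inference here is a brute-force search over two unknowns, which only works for toy inputs.

```javascript
// Hypothetical toy model; none of these names come from Nice2Predict.
// Each edge relates two program elements through a named relation.
const edges = [
  { a: "x", b: "length", rel: "assigned-from-field" },
  { a: "y", b: "x", rel: "compared-with" },
];

// Elements with known labels keep them; unknown ones get candidate labels.
const known = { length: "length" };
const candidates = { x: ["len", "count"], y: ["i", "name"] };

// Made-up weights standing in for what training would learn from a corpus.
const weights = {
  "len|length|assigned-from-field": 0.9,
  "count|length|assigned-from-field": 0.4,
  "i|len|compared-with": 0.8,
  "i|count|compared-with": 0.3,
  "name|len|compared-with": 0.1,
};

// Score of a full assignment: sum of weights of the features that fire.
function score(assignment) {
  let total = 0;
  for (const { a, b, rel } of edges) {
    const key = `${assignment[a]}|${assignment[b]}|${rel}`;
    total += weights[key] || 0;
  }
  return total;
}

// Brute-force MAP inference over the two unknown elements (toy sizes only).
function mapInference() {
  let best = null;
  let bestScore = -Infinity;
  for (const x of candidates.x) {
    for (const y of candidates.y) {
      const assignment = { ...known, x, y };
      const s = score(assignment);
      if (s > bestScore) {
        best = assignment;
        bestScore = s;
      }
    }
  }
  return { best, bestScore };
}

console.log(mapInference().best);
```

In this sketch the label chosen for `y` depends on the label chosen for `x`, so the two unknowns are predicted jointly rather than one at a time.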
2. OVERVIEW
We now provide an informal description of our probabilistic approach on a running example. Consider the JavaScript
program shown in Figure 4(a). This is a program which has
short, non-descriptive identifier names. Such names can be
produced either by an inexperienced programmer or by
an automated process known as minification (a form of layout obfuscation) which replaces identifier names with
shorter names. In the case of client-side JavaScript, minification is a common process on the Web and is used to
reduce the size of the code being transferred over the network and/or to prevent users from understanding what the
program is actually doing. In addition to obscure names,
variables in this program also lack annotated type information. It can be difficult to understand that this obfuscated
program happens to partition an input string into chunks
of given sizes, storing those chunks into consecutive entries
of an array.
Given the program in Figure 4(a), JSNice automatically
produces the program in Figure 4(e). The output program
has new identifier names and is annotated with predicted
types for the parameters, local variables, and return statement. Overall, it is easier to understand what that program
does when compared to the original. We now provide an
overview of the prediction recipe that performs this transformation. We focus on predicting names (reversing minification), but the process for predicting types is identical.
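As an illustration of the kind of transformation described above, the sketch below shows a hand-written minified function that partitions a string into chunks, next to a hand-written readable version with type annotations in JSDoc comments. Neither function is copied from Figure 4, and the names `chunkMin`, `chunkData`, `text`, `size`, `chunks`, and `start`, as well as the annotated types, are our own assumptions rather than actual JSNice output.

```javascript
// Hypothetical minified input: short names, no type information.
function chunkMin(e, t) {
  var n = [];
  var r = e.length;
  var i = 0;
  for (; i < r; i += t) {
    if (i + t < r) {
      n.push(e.substring(i, i + t));
    } else {
      n.push(e.substring(i, r));
    }
  }
  return n;
}

/**
 * Hand-written readable counterpart with descriptive names and
 * JSDoc-style type annotations (illustrative, not tool output).
 * @param {string} text
 * @param {number} size
 * @return {Array}
 */
function chunkData(text, size) {
  /** @type {Array} */
  var chunks = [];
  /** @type {number} */
  var length = text.length;
  /** @type {number} */
  var start = 0;
  for (; start < length; start += size) {
    if (start + size < length) {
      chunks.push(text.substring(start, start + size));
    } else {
      chunks.push(text.substring(start, length));
    }
  }
  return chunks;
}

// Both versions compute the same result; only names and annotations differ.
console.log(chunkMin("abcdefg", 3), chunkData("abcdefg", 3));
```

Renaming and annotating leave the behavior unchanged, which is why the two functions can be compared directly on the same input.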
2.1. Step 1: Determine known and unknown properties
Given the program in Figure 4(a), we use a simple static
(scope) analysis which determines the set of program elements
for which we would like to infer properties. These
are elements whose properties are unknown in the input
(i.e., are affected by minification). When predicting names,
this set consists of all local variables and function parameters
of the input program: e, t, n, r, and i. We also determine
the set of elements whose properties are known (not
affected by minification). These include field and method
names (e.g., the field element with name length). Both
kinds of elements are shown in Figure 4(b). The goal is to
predict the unknown properties based on: (i) the obtained
properties of different elements are often related. A useful
analogy is the ability to make joint predictions in image processing,
where the prediction of a pixel label is influenced by
the predictions of neighboring pixels.
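Returning to Step 1, the split of program elements into those with unknown and those with known names can be sketched as follows. This is a minimal sketch under stated assumptions: the element records are listed by hand, whereas a real implementation would obtain them from a parser and a scope analysis. The `kind` values, the `push` and `substring` entries, and the helper `partitionElements` are hypothetical.

```javascript
// Hypothetical element records for a small minified function. Only the
// names e, t, n, r, i and the field `length` are taken from the text.
const elements = [
  { name: "e", kind: "parameter" },
  { name: "t", kind: "parameter" },
  { name: "n", kind: "local" },
  { name: "r", kind: "local" },
  { name: "i", kind: "local" },
  { name: "length", kind: "field" },
  { name: "push", kind: "method" },
  { name: "substring", kind: "method" },
];

// Kinds of elements whose names minification rewrites (sketch assumption).
const MINIFIED_KINDS = new Set(["parameter", "local"]);

// Split elements into those to predict and those usable as evidence.
function partitionElements(elems) {
  const unknown = [];
  const known = [];
  for (const el of elems) {
    if (MINIFIED_KINDS.has(el.kind)) {
      unknown.push(el.name);
    } else {
      known.push(el.name);
    }
  }
  return { unknown, known };
}

console.log(partitionElements(elements));
```

The unknown set is what the prediction has to fill in, while the known set provides fixed labels that the prediction can condition on.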
1.2. JSNICE: Name and type inference for JavaScript
As an example of this approach, we built a system which
addresses two important challenges in JavaScript: predicting
(syntactic) identifier names and (semantic) type annotations
of variables. Such predictions have applications in software
engineering (e.g., refactoring to improve code readability),
program analysis (e.g., type inference) and security (e.g.,
deobfuscation). We focused on JavaScript for three reasons.
First, in terms of type inference, recent years have seen
extensions of JavaScript that add type annotations such as
the Google Closure Compiler [5] and TypeScript [7]. However,
these extensions rely on traditional type inference, which
does not scale to realistic programs that make use of
dynamic evaluation and complex libraries (e.g., jQuery) [13].
Our work predicts likely type annotations for real-world programs, which can then be provided to the programmer or to
a standard type checker. Second, much of JavaScript code
found on the Web is obfuscated, making it difficult to understand what the program is doing. Our approach recovers
likely identifier names, thereby making much of the code on
the Web readable again. This is enabled by a large and well-annotated corpus of JavaScript programs available in open
source repositories such as GitHub.
Since its release, JSNice has become a widely used system
with users ranging from JavaScript developers to security specialists. In a period of a year, our users deobfuscated over 9 GB
(87.7 million lines of code) of unique (non-duplicate) JavaScript
programs. Figure 3 shows a histogram of the size of these programs, indicating that users often query it with large code
Figure 2. A screenshot of http://jsnice.org/: minified code (left),
deobfuscated version (right).
Figure 3. Histogram of query sizes to http://jsnice.org/ sent by users
in the period May 10, 2015–May 10, 2016.
[Figure 3 plot. x-axis: size of the JavaScript programs given by our users (in bytes), logarithmic scale from 1 to 10.0M; y-axis: number of queries, from 0 to 15.0K.]