Our evaluation data has 3,505 type annotations for function parameters in 396
programs. After removing these annotations and reconstructing them with JSNice, the number of annotations that
are not the unknown type ? increased to 4,114 for the same programs. The reason JSNice produces more types than originally present,
despite having 66.9% recall, is that not all functions in the
original programs had manually provided types.
Interestingly, despite annotating more functions than
the original code, the output of JSNice has fewer type
errors. We summarize these findings in Figure 7. For each
of the 396 programs, we ran the typechecking pass of
Google’s Closure Compiler to discover type errors. Among
others, this pass checks for incompatible types, calls
into non-functions, conflicting or missing type declarations, and accesses to non-existent properties on objects. For our evaluation, we kept
all checks except the non-existent property check, which
fails on almost all (even valid) programs because it depends on annotating all properties of JavaScript classes.
When we ran typechecking on the input programs,
we found the majority (289) to have typechecking errors.
While surprising, this can be explained by the fact that
the developers who provided the annotations never checked them with a typechecker. Among others, we found the original code
to have misspelled type names. Most typecheck errors
occur due to missing or conflicting types. In a number
of cases, the provided types were useful as documentation but semantically wrong; for example, a parameter is a string that denotes a function
name, yet the manual annotation designates its type to
be Function. In contrast, the types reconstructed by
JSNice make the majority (227) of programs typecheck.
In 141 of the programs that originally did not typecheck,
JSNice was able to infer correct types. On the other
hand, JSNice introduced type errors in 21 programs.
We investigated some of these errors and found that
not all of them were due to wrong types; in several
cases the predicted types were rejected due to the inability
of the type system to precisely express the desired program properties without also manually providing type casts.
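As a concrete (hypothetical, our own) illustration of the documentation-versus-semantics mismatch described above, consider a parameter that carries a function name as a string but is annotated as Function:

```javascript
/**
 * Invokes the method named by `callbackName` on `obj`.
 * The @param annotation below is semantically wrong in the way the
 * text describes: the value is a string naming a function, yet it is
 * documented as a Function.
 * @param {!Object} obj
 * @param {Function} callbackName
 */
function invoke(obj, callbackName) {
  return obj[callbackName]();
}
```

At runtime the call still works (e.g., `invoke({greet: () => "hi"}, "greet")` evaluates to `"hi"`), which is how such annotations survive until the code is run through a typechecker.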
5.4. Model sizes
Our models contain 7,627,484 features for names and
70,052 features for types. Each feature is stored as a triple,
along with its weight. As a result, we need only 20 bytes per feature.
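A back-of-the-envelope estimate (our own arithmetic from the feature counts above) confirms the model stays small:

```javascript
// Each feature is a triple plus a weight, stored in about 20 bytes.
const nameFeatures = 7627484; // features for names
const typeFeatures = 70052;   // features for types
const bytesPerFeature = 20;

const totalMB = (nameFeatures + typeFeatures) * bytesPerFeature / 1e6;
console.log(totalMB.toFixed(1) + " MB"); // about 154 MB in total
```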
Name predictions. We present the accuracy of name reconstruction in the second column of Table 1. Overall, our best system exactly
recovers 63.4% of identifier names. The systems trained on
less data have significantly lower precision, showing the
importance of training data size.
Not using structured prediction also drops the accuracy
significantly and has about the same effect as an order of
magnitude less data. Finally, not changing any identifier
names produces an accuracy of 25.3%; this is because minifying the code may not rename some variables (e.g., global
variables) in order to guarantee a semantics-preserving transformation, and occasionally one-letter local variable names
stay the same (e.g., the induction variable of a loop).
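A tiny made-up example of why some names survive minification, which is what gives the no-renaming baseline its 25.3%:

```javascript
// A minifier must keep the global `totalCount` (other files may refer
// to it) and gains nothing by renaming the induction variable `i`,
// so both identifiers survive unchanged; only `sumAll` and `items`
// were renamed to `a` and `b`.
var totalCount = 0;
function a(b) {
  for (var i = 0; i < b.length; i++) totalCount += b[i];
  return totalCount;
}
```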
Type annotation predictions. Out of the 2,710 test programs,
396 have type annotations for functions in a JSDoc. For these
396, we took the minified version with no type annotations and
tried to rediscover all types in the function signatures. We first
ran the Closure compiler type inference, which produces no
types for the function parameters. Then, we ran and evaluated
JSNice on inferring these function parameter types.
JSNice does not always produce a type for each function
parameter. For example, if a function has an empty body, or a
parameter is not used, we often cannot relate the parameter
to any known program properties and as a result, no prediction can be made and the unknown type (?) is returned. To
take this effect into account, we report both recall and precision. Recall is the percentage of function parameters in the
evaluation for which JSNice made a prediction other than ?.
Precision is the percentage of cases, among those
for which JSNice made a prediction, where the prediction was exactly
equal to the manually provided JSDoc annotation of the test
programs. We note that the manual annotations are not
always precise, and as a result 100% precision is not necessarily a desired outcome.
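The two measures can be sketched as follows; the helper and the example predictions are ours, with "?" standing for the unknown type:

```javascript
// recall: fraction of parameters for which a prediction other than
// "?" was made; precision: fraction of those predictions that exactly
// match the manual JSDoc annotation.
function score(predicted, manual) {
  let made = 0, correct = 0;
  for (let i = 0; i < manual.length; i++) {
    if (predicted[i] !== "?") {
      made++;
      if (predicted[i] === manual[i]) correct++;
    }
  }
  return { recall: made / manual.length, precision: correct / made };
}

const r = score(["string", "?", "number", "string"],
                ["string", "number", "number", "Function"]);
console.log(r); // { recall: 0.75, precision: 0.666... }
```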
We present our evaluation results for types in the last two
columns of Table 1. Since we evaluate on production code
with complex relationships between program elements, the recall for predicting program types is only 66.9% for our best system. However, we
note that none of these types can be inferred by state-of-the-art forward type analysis (e.g., Facebook Flow).
Since the total number of commonly used types is not as
high as the number of names, the amount of training data
has less impact on the system precision and recall. To further increase the precision and recall of type prediction, we
hypothesize that adding more (semantic) relationships
between program elements would be of higher importance
than adding more training data. Dropping structure
increases the precision of the predicted types slightly, but at
the cost of a significantly reduced recall. The reason is that
some types are related to known properties only transitively
via other predicted types, relationships that non-structured approaches cannot capture. On the other end of the
spectrum is a baseline that outputs a prediction for every parameter, e.g., always guessing the single most frequent type. Such a system has 100% recall, but its precision is only 37.8%.
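The 100%-recall extreme mentioned above can be sketched as a trivial predictor that never outputs ? (assuming, for illustration, that string is the most frequent type):

```javascript
// Predict one fixed type for every parameter: recall is 100% by
// construction, while precision equals the fraction of parameters
// whose true type happens to be the fixed guess.
function baselinePredict(params) {
  return params.map(() => "string");
}
```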
5.3. Usefulness of predicted types
To see if the predicted types are useful, we compared them
to the original ones.
Figure 7. Evaluation results for the number of type-checking
programs with manually provided types (289 of the 396 programs have a type error) and with predicted types (169 have a type error).