project is written in an imperative procedural, imperative
scripting, or functional language. In the rest of the paper, we
use the terms procedural and scripting to indicate imperative procedural and imperative scripting respectively.
Type Checking indicates static or dynamic typing. In statically typed languages, type checking occurs at compile time,
and variable names are bound to a value and to a type. In
addition, expressions (including variables) are classified by
types that correspond to the values they might take on at run-time. In dynamically typed languages, type checking occurs
at run-time. Hence, in the latter, it is possible to bind a variable name to objects of different types in the same program.
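As a minimal illustration of the dynamic case, the Python sketch below binds one name to objects of different types within the same program. The variable name value and the helper type_names are ours and serve only as an illustration.

```python
def type_names():
    """Rebind one name to objects of three different types and record each type."""
    seen = []
    value = 5            # the name is bound to an int
    seen.append(type(value).__name__)
    value = "five"       # the same name is rebound to a str
    seen.append(type(value).__name__)
    value = [5]          # and then to a list
    seen.append(type(value).__name__)
    return seen

print(type_names())
```

In a statically typed language such as Java, a variable declared as an int could not later be assigned a string; the compiler would reject the program.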
Implicit Type Conversion allows access of an operand of
type T1 as a different type T2, without an explicit conversion.
Such implicit conversion may introduce type-confusion in
some cases, especially when it presents an operand of a specific type T1 as an instance of a different type T2. Since not
all implicit type conversions are immediately a problem, we
operationalize our definition by showing examples of the
implicit type confusion that can happen in all the languages
we identified as allowing it. For example, in languages like
JavaScript, adding a string to a number is permissible (e.g., “5” + 2 yields “52”). The same
operation yields 7 in Php. Such an operation is not permitted
in languages such as Java and Python as they do not allow
implicit conversion. In C and C++, coercion of data types can
produce unintended results; for example, int x; float y;
y = 3.5; x = y; is legal C code and results in different values
for x and y, which, depending on intent, may be a problem
downstream.a In Objective-C the data type id is a generic
object pointer, which can be used with an object of any data
type, regardless of the class.b The flexibility that such a generic
data type provides can lead to implicit type conversion and
also have unintended consequences.c Hence, we classify a
language based on whether its compiler allows or disallows
the implicit type conversion as above; the latter explicitly
detects type confusion and reports it.
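To make the "disallow" side of this classification concrete, the following Python sketch shows the interpreter reporting the string-plus-number operation instead of converting silently. The helper name add_string_and_number is hypothetical and used only for illustration.

```python
def add_string_and_number(text, number):
    """Try the implicit operation first, then perform explicit conversions."""
    try:
        implicit = text + number      # Python raises TypeError for str + int
    except TypeError:
        implicit = None               # the type confusion is detected and reported
    as_number = int(text) + number    # explicit conversion of the string to int
    as_string = text + str(number)    # explicit conversion of the number to str
    return implicit, as_number, as_string

print(add_string_and_number("5", 2))
```

The developer has to state which of the two results is intended, which is the behavior we label explicit.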
Disallowing implicit type conversion could result from
static type inference within a compiler (e.g., with Java),
using a type-inference algorithm such as Hindley10 and
Milner,17 or at run-time using a dynamic type checker. In
contrast, a type-confusion can occur silently because it is
either undetected or unreported. Either way, allowing
implicit type conversion provides flexibility but may eventually cause errors that are difficult to localize. To abbreviate, we refer to languages allowing implicit type conversion
as implicit and those that disallow it as explicit.
Memory Class indicates whether the language requires
developers to manage memory. We treat Objective-C as
unmanaged, in spite of it following a hybrid model, because
we observe many memory errors in its codebase, as discussed
in RQ4 in Section 3.
Note that we classify and study the languages as they are
colloquially used by developers in real-world software. For
example, TypeScript is intended to be used as a static language that disallows implicit type conversion. However,
in practice, we notice that developers often use the any type,
a catch-all union type (for 50% of the variables across the
TypeScript-using projects in our dataset), and thus, in
practice, TypeScript allows dynamic, implicit type conversion. To minimize confusion, we exclude TypeScript
from our language classifications and the corresponding
model (see Tables 3 and 7).
2.3. Identifying project domain
We classify the studied projects into different domains based
on their features and function using a mix of automated and
manual techniques. The projects in GitHub come with
project descriptions and README files that describe their
features. We used Latent Dirichlet Allocation (LDA)3 to analyze
this text. Given a set of documents, LDA identifies a set of topics, where each topic is represented as the probabilities of generating different words. For each document, LDA also estimates
the probability of assigning that document to each topic.
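The per-document output of LDA can be pictured as one probability vector per project. The sketch below assumes such vectors are already available and picks the most probable topic for each project. The project names, the probabilities, and the arg-max assignment rule are illustrative assumptions, not values or steps taken from our study.

```python
def most_likely_topic(topic_probs):
    """Return (topic index, probability) for the most probable topic."""
    best = max(range(len(topic_probs)), key=lambda i: topic_probs[i])
    return best, topic_probs[best]

# Hypothetical document-topic probabilities: three projects, four topics.
doc_topics = {
    "project_a": [0.70, 0.10, 0.15, 0.05],
    "project_b": [0.05, 0.20, 0.25, 0.50],
    "project_c": [0.30, 0.40, 0.20, 0.10],
}

# Map each project to its most probable topic and that topic's probability.
assignments = {name: most_likely_topic(p) for name, p in doc_topics.items()}
for name, (topic, prob) in assignments.items():
    print(name, topic, prob)
```

Other rules are possible, for example keeping every topic whose probability exceeds a threshold, so that one project may belong to several domains.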
We detect 30 distinct domains, that is, topics, and estimate
the probability that each project belongs to each domain.
Since these auto-detected domains include several project-
specific keywords, for example, facebook, it is difficult to
identify the underlying common functions. In order to assign
a meaningful name to each domain, we manually inspect
each of the 30 domains to identify projectname-independent,
domain-identifying keywords. We manually rename all of
the 30 auto-detected domains and find that the majority of
the projects fall under six domains: Application, Database,
CodeAnalyzer, Middleware, Library, and Framework. We also
find that some projects do not fall under any of the above
a Wikipedia’s article on type conversion, https://en.wikipedia.org/wiki/
Type_conversion, has more examples of unintended behavior in C.
Language classes          Categories             Languages
Programming paradigm      Imperative procedural  C, C++, C#, Objective-C, Java, Go
                          Imperative scripting   CoffeeScript, JavaScript, Python, Perl, Php, Ruby
                          Functional             Clojure, Erlang, Haskell, Scala
Type checking             Static                 C, C++, C#, Objective-C, Java, Go, Haskell, Scala
                          Dynamic                CoffeeScript, JavaScript, Python, Perl, Php, Ruby, Clojure, Erlang
Implicit type conversion  Disallow               C#, Java, Go, Python, Ruby, Clojure, Erlang, Haskell, Scala
                          Allow                  C, C++, Objective-C, CoffeeScript, JavaScript, Perl, Php
Memory class              Managed                Others
                          Unmanaged              C, C++, Objective-C
We omit TypeScript from language classification as it allows both explicit
and implicit type conversion.
Table 3. Different types of language classes.
b This Apple developer article describes the usage of “id” http://tinyurl.com/
c Some examples can be found here http://dobegin.com/objc-id-type/ and