The sign and magnitude of the estimated coefficients in
the above model relate the predictors to the outcome. The
first four variables are control variables and we are not interested in their impact on the outcome other than to say that
they are all positive and significant. The language variables
are indicator variables, viz. factor variables, for each project. The coefficient compares each language to the grand
weighted mean of all languages in all projects. The language
coefficients can be broadly grouped into three general categories. The first category is those for which the coefficient
is statistically insignificant and the modeling procedure
could not distinguish the coefficient from zero. These languages may behave similarly to the average, or they may have
wide variance. The remaining coefficients are significant
and either positive or negative. For those with positive coefficients we can expect that the language is associated with
a greater number of defect fixes. These languages include
C, C++, Objective-C, PHP, and Python. The languages
Clojure, Haskell, Ruby, and Scala, all have negative
coefficients implying that these languages are less likely
than average to result in defect fixing commits.
One should take care not to overestimate the impact of
language on defects. While the observed relationships are
statistically significant, the effects are quite small. Analysis
of deviance reveals that language accounts for less than 1%
of the total explained deviance.
To check that excessive multicollinearity is not an issue, we
compute the variance inflation factor (VIF) of each predictor in all of the models, with a conservative maximum value
of 5. We check for and remove high-leverage points through
visual examination of the residuals versus leverage plot for
each model, looking for both separation and large values of
Cook's distance.
We employ effects, or contrast, coding in our study to facilitate interpretation of the language coefficients.
Effects codes allow us to compare each language to the average effect across all languages while compensating for the
unevenness of language usage across projects.23 To test for
a relationship between two factor variables we use a Chi-square test of independence.14 After confirming a dependence,
we use Cramér's V, an r × c equivalent of the phi coefficient for
nominal data, to establish an effect size.
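The Chi-square test followed by Cramér's V can be sketched as below; the contingency table counts are made up for illustration.

```python
# Sketch: Chi-square test of independence on an r x c contingency table,
# followed by Cramer's V as an effect size. Counts are illustrative only.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[30, 10, 5],
                  [10, 25, 10]])  # r x c table of observed counts
chi2, p, dof, expected = chi2_contingency(table)

# Cramer's V generalizes the phi coefficient to r x c tables:
# V = sqrt(chi2 / (n * (min(r, c) - 1))), ranging from 0 to 1.
n = table.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2={chi2:.2f}, p={p:.4f}, Cramer's V={v:.2f}")
```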
We begin with a straightforward question that directly
addresses the core of what some fervently believe must be
true:
RQ1. Are some languages more defect-prone than others?
We use a regression model to compare the impact of each
language on the number of defect-fixing commits against
the average impact of all languages (see Table 6).
We include some variables as controls for factors that
will clearly influence the response. Project age is included as
older projects will generally have a greater number of defect
fixes. Trivially, the number of commits to a project will also
impact the response. Additionally, the number of developers who touch a project and the raw size of the project are
both expected to grow with project activity.
Table 6. Some languages induce fewer defects than other languages.
Defective commits model Coef. (Std. Err.)
(Intercept) −2.04 (0.11)***
Log age 0.06 (0.02)***
Log size 0.04 (0.01)***
Log devs 0.06 (0.01)***
Log commits 0.96 (0.01)***
C 0.11 (0.04)**
C++ 0.18 (0.04)***
C# −0.02 (0.05)
Objective-C 0.15 (0.05)**
Go −0.11 (0.06)
Java −0.06 (0.04)
CoffeeScript 0.06 (0.05)
TypeScript 0.15 (0.10)
Ruby −0.13 (0.05)**
PHP 0.10 (0.05)*
Python 0.08 (0.04)*
Perl −0.12 (0.08)
Clojure −0.30 (0.05)***
Erlang −0.03 (0.05)
Haskell −0.26 (0.06)***
Scala −0.24 (0.05)***
Response is the number of defective commits. Languages are coded with weighted
effects coding. AIC=10432, Deviance=1156, Num. obs.=1076.
***p < 0.001, **p < 0.01, *p < 0.05
              Df   Deviance    Resid. Df   Resid. dev   Pr(>Chi)
NULL                              1075     25,176.25
Log age        1    8011.52       1074     17,164.73    0
Log size       1   10,082.78      1073      7081.95     0
Log devs       1    1538.32       1072      5543.63     0
Log commits    1    4256.89       1071      1286.74     0
Language      16     130.78       1055      1155.96     0
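The weighted effects coding used for the language factor can be sketched as below. With k levels, each of the k−1 coded columns carries 1 for its own level, −(n_i/n_k) for the reference level, and 0 otherwise, so each column sums to zero over an unbalanced sample and each coefficient contrasts a language against the weighted grand mean. The labels and counts here are illustrative, not the study's data.

```python
# Sketch: weighted effects coding for an unbalanced categorical factor.
from collections import Counter

def weighted_effects_codes(labels, reference):
    counts = Counter(labels)
    levels = [lvl for lvl in counts if lvl != reference]
    rows = []
    for lab in labels:
        row = []
        for lvl in levels:
            if lab == lvl:
                row.append(1.0)                              # own level
            elif lab == reference:
                row.append(-counts[lvl] / counts[reference])  # weighted contrast
            else:
                row.append(0.0)
        rows.append(row)
    return levels, rows

langs = ["C"] * 4 + ["Haskell"] * 2 + ["Scala"] * 3  # uneven usage
levels, codes = weighted_effects_codes(langs, reference="Scala")

# Each coded column sums to zero, the defining property of weighted
# effects codes under unequal group sizes.
for j, lvl in enumerate(levels):
    print(lvl, round(sum(r[j] for r in codes), 12))
```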
We can read the model coefficients as the expected change
in the log of the response for a one-unit change in the predictor, with all other predictors held constant; that is, for a coefficient βi, a one-unit change in the corresponding predictor yields an expected multiplicative change
in the response of e^βi. For the factor variables, this expected
change is compared to the average across all languages. Thus,
if, for some number of commits, a particular project developed in an average language had four defective commits, then
the choice to use C++ would mean that we should expect one
additional defective commit, since e^0.18 × 4 = 4.79. For the same
project, choosing Haskell would mean that we should expect
about one fewer defective commit, as e^−0.26 × 4 = 3.08. The accuracy of this prediction depends on all other factors remaining
the same, a challenging proposition for all but the most trivial
of projects. All observational studies face similar limitations;
we address this concern in more detail in Section 5.
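The worked arithmetic above can be reproduced directly, assuming the Table 6 coefficients for C++ (0.18) and Haskell (−0.26) and a baseline of four defective commits under an average language.

```python
# Sketch: multiplicative effect of a language coefficient on an assumed
# baseline count of defective commits.
import math

baseline = 4  # defective commits under the weighted-average language
cpp = math.exp(0.18) * baseline      # C++ coefficient from Table 6
haskell = math.exp(-0.26) * baseline # Haskell coefficient from Table 6
print(round(cpp, 2))      # ~4.79, about one extra defective commit
print(round(haskell, 2))  # ~3.08, about one fewer defective commit
```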
Result 1: Some languages have a greater association with
defects than other languages, although the effect is small.
In the remainder of this paper we expand on this basic
result by considering how different categories of application, defect, and language lead to further insight into the
relationship between languages and defect proneness.
Software bugs usually fall under two broad categories: (1)
Domain Specific bug: specific to project function and do not