(2) Supervised classification. We use the annotated bug fix logs from the previous step as training data for supervised learning techniques to classify the remaining bug fix messages, treating them as test data. We first convert each bug fix message to a bag-of-words. We then remove words that appear only once among all of the bug fix messages; this reduces project-specific keywords. We also stem the bag-of-words using standard natural language processing techniques. Finally, we use a Support Vector Machine (SVM) to classify the test data.
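The preprocessing steps above can be sketched as follows. This is a minimal pure-Python illustration: the stemmer is a toy suffix-stripper standing in for a standard stemmer, and the final SVM step is not shown.

```python
from collections import Counter

def tokenize(message):
    # Lowercase a bug fix message and split it into a bag of words.
    return [w.strip(".,;:!?()") for w in message.lower().split()]

def stem(word):
    # Toy stemmer: strip a few common English suffixes
    # (a real pipeline would use a standard stemmer instead).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(messages):
    # One stemmed bag-of-words per message.
    bags = [Counter(stem(w) for w in tokenize(m)) for m in messages]
    # Count how often each word occurs across ALL messages.
    total = Counter()
    for bag in bags:
        total.update(bag)
    # Drop words that appear only once in the whole corpus;
    # these are often project-specific identifiers.
    return [Counter({w: c for w, c in bag.items() if total[w] > 1})
            for bag in bags]

msgs = ["fix memory leak in parser",
        "fix race condition causing deadlock",
        "memory corruption fix in parser"]
features = preprocess(msgs)
```

The resulting per-message counters would then be fed to a classifier such as an SVM.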
To evaluate the accuracy of the bug classifier, we manually annotated 180 randomly chosen bug fixes, equally distributed across all of the categories. We then compared the output of the automatic classifier with this manually annotated data set. The performance was acceptable: precision ranged from a low of 70% for performance bugs to a high of 100% for concurrency bugs, with an average of 84%; recall ranged from 69% to 91%, also with an average of 84%.
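As a reminder of how these per-category figures are computed, precision and recall follow directly from the confusion counts; the numbers below are made up for illustration, not taken from the study.

```python
def precision_recall(tp, fp, fn):
    # Precision: fraction of predicted-positive labels that are correct.
    # Recall: fraction of actual-positive labels that were found.
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical confusion counts for one bug category.
p, r = precision_recall(tp=28, fp=12, fn=4)
# p is 0.70 (e.g., the low end reported for performance bugs); r is 0.875.
```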
The result of our bug classification is shown in Table 5. Most of the defect causes are related to generic programming errors. This is not surprising, as this category covers a wide variety of programming mistakes such as type errors, typos, compilation errors, etc. Our technique could not classify 1.04% of the bug fix messages into any Cause or Impact category; we label these as Unknown.
2.5. Statistical methods
We model the number of defective commits against other factors related to software projects using regression. All models use negative binomial regression (NBR) to model counts of project attributes such as the number of commits. NBR is a type of generalized linear model used to model non-negative integer responses.
In our models we control for several per-project factors that are likely to influence the outcome. Consequently, each (language, project) pair is a row in our regression and is viewed as a sample from the population of open source projects. We log-transform dependent count variables, as doing so stabilizes the variance and usually improves the model fit.
We verify this by comparing transformed with non-transformed data using the AIC and Vuong's test for non-nested models.
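As a sketch of what NBR fits, the following pure-Python snippet evaluates the negative binomial log-likelihood that such a model maximizes, with a log link on the mean. The variable names and the tiny data set are illustrative assumptions, not the study's actual model.

```python
import math

def nb_logpmf(y, mu, theta):
    # Log-probability of count y under a negative binomial distribution
    # with mean mu and dispersion theta (variance = mu + mu**2 / theta).
    return (math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))

def nbr_loglik(beta0, beta1, commits, bugs, theta=1.0):
    # NBR with a log link: expected defect count mu_i = exp(b0 + b1 * log(commits_i)).
    # The covariate is log-transformed, as in the models described above.
    total = 0.0
    for c, y in zip(commits, bugs):
        mu = math.exp(beta0 + beta1 * math.log(c))
        total += nb_logpmf(y, mu, theta)
    return total

# Toy data: per-project commit counts and defective-commit counts.
commits = [120, 540, 1000, 80]
bugs = [10, 60, 90, 5]
ll = nbr_loglik(beta0=-2.0, beta1=1.0, commits=commits, bugs=bugs)
```

Fitting the model amounts to choosing the coefficients (and dispersion) that maximize this quantity; statistical packages do this numerically.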
Projects that do not fit cleanly into any of the identified domains are assigned to a catchall domain labeled Other. This classification of projects into domains was subsequently checked and confirmed by another member of our research group. Table 4 summarizes the domains resulting from this process.
2.4. Categorizing bugs
While fixing software bugs, developers often leave important information in the commit logs about the nature of the bugs: for example, why the bugs arise and how they are fixed. We exploit such information to categorize the bugs, similar to Tan et al.
First, we categorize the bugs based on their Cause and Impact. Causes are further classified into disjoint subcategories of errors: Algorithmic, Concurrency, Memory, generic Programming, and Unknown. The bug Impact is likewise classified into four disjoint subcategories: Security, Performance, Failure, and Other (unknown). Thus, each bug fix commit has an induced Cause and an Impact type. Table 5 describes each bug category. This classification is performed in two phases:
(1) Keyword search. We randomly choose 10% of the bug fix messages and use a keyword-based search technique to automatically categorize them into potential bug types. We apply this annotation, separately, for both Cause and Impact types. We chose a restrictive set of keywords and phrases, as shown in Table 5; such a restrictive set helps reduce false positives.
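This keyword search phase can be sketched as a simple substring match against per-category phrase lists like those in Table 5. The lists below are an abbreviated illustration, not the full set used in the study.

```python
# Abbreviated keyword lists per Cause category, following Table 5.
CAUSE_KEYWORDS = {
    "Algorithm": ["algorithm"],
    "Concurrency": ["deadlock", "race condition", "synchronization error"],
    "Memory": ["memory leak", "null pointer", "buffer overflow", "double free"],
    "Programming": ["exception handling", "type error", "typo", "compilation error"],
}

def categorize(message):
    # Return every Cause category whose keywords occur in the message.
    text = message.lower()
    return [cat for cat, keys in CAUSE_KEYWORDS.items()
            if any(k in text for k in keys)]

labels = categorize("Fix deadlock when flushing the cache")
# labels == ["Concurrency"]
```

A message matching no list is left unannotated at this phase; an analogous table drives the Impact annotation.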
Table 4. Characteristics of domains.

Domain name          Description              Example projects    Total projects
(APP) Application    End user programs        bitcoin, macvim     120
(DB)  Database       SQL and NoSQL            mysql, mongodb      43
(CA)  CodeAnalyzer   Compiler, parser, etc.   ruby, php-src       88
(MW)  Middleware     OS, VMs, etc.            linux, memcached    48
(LIB) Library        APIs, libraries, etc.    androidApis,
(FW)  Framework      SDKs, plugins            ios sdk,
(OTH) Other          -                        Arduino,
Table 5. Categories of bugs and their distribution in the whole dataset.

Bug type               Bug description                    Search keywords/phrases                   Count    % count

Cause
  Algorithm (Algo)     Algorithmic or logical errors      Algorithm                                    606       0.11
  Concurrency (Conc)   Multithreading/processing issues   Deadlock, race condition,                 11,111       1.99
                                                          synchronization error
  Memory (Mem)         Incorrect memory handling          Memory leak, null pointer, buffer         30,437       5.44
                                                          overflow, heap overflow, dangling
                                                          pointer, double free
  Programming (Prog)   Generic programming errors         Exception handling, error handling,      495,013      88.53
                                                          type error, typo, compilation error,
                                                          copy-paste error, refactoring,
                                                          missing switch case, faulty
                                                          initialization, default value

Impact
  Security (Sec)       Runs, but can be exploited         Buffer overflow, security, password,      11,235       2.01
                                                          oauth, ssl
  Performance (Perf)   Runs, but with delayed response    Optimization problem, performance          8,651       1.55
  Failure (Fail)       Crash or hang                      Reboot, crash, hang, restart              21,079       3.77

Unknown (Unkn)         Not part of the above categories   -                                          5,792       1.04