selected by the algorithm so that they are as dissimilar from each other as possible. The distance
from each feature request to each centroid is computed, and each feature
request is placed in the cluster associated with its closest centroid. Once all
feature requests are assigned to clusters, the centroids are repositioned so
as to increase their average proximity to
all feature requests in the cluster. This is followed by a series of incremental optimizations, each of which attempts to move a randomly selected feature request to the cluster that maximizes the overall cohesion gain. The process continues until no further improving move is possible. This simple post-processing step has been shown to significantly improve the quality of clustering results.6
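In outline, the procedure can be sketched as follows, assuming feature requests have already been reduced to unit-normalized TF-IDF vectors held in a NumPy matrix X; the greedy seeding heuristic and the function names are our illustration rather than the exact AFM implementation.

    import numpy as np

    def spk_means(X, k, n_iters=20, seed=0):
        """Spherical k-means sketch. X is an (n_requests, n_terms) matrix of
        unit-normalized TF-IDF vectors, so a dot product is cosine similarity."""
        rng = np.random.default_rng(seed)
        # Greedy seeding: start from a random request, then repeatedly pick
        # the request least similar to all centroids chosen so far.
        centroids = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            sims = np.max(np.stack([X @ c for c in centroids]), axis=0)
            centroids.append(X[int(np.argmin(sims))])
        C = np.stack(centroids)

        for _ in range(n_iters):
            labels = np.argmax(X @ C.T, axis=1)  # assign each request to its closest centroid
            for j in range(k):                   # reposition centroids toward their members
                members = X[labels == j]
                if len(members):
                    m = members.sum(axis=0)
                    C[j] = m / np.linalg.norm(m)
        return labels, C

    def refine(X, labels, C, max_passes=50):
        """Incremental optimization: move randomly chosen requests to the
        cluster that improves cohesion, stopping once a full pass over the
        data yields no improving move."""
        rng = np.random.default_rng(0)
        for _ in range(max_passes):
            improved = False
            for i in rng.permutation(len(X)):
                sims = C @ X[i]                  # similarity of request i to every centroid
                best = int(np.argmax(sims))
                if sims[best] > sims[labels[i]]:
                    labels[i] = best
                    improved = True
            for j in range(len(C)):              # re-center after each pass
                members = X[labels == j]
                if len(members):
                    m = members.sum(axis=0)
                    C[j] = m / np.linalg.norm(m)
            if not improved:
                break
        return labels, C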
Clustering results are strongly influenced by initial centroid selection,
meaning poor selection can lead to low-quality clusterings. However, this problem can be alleviated through a consensus technique for performing the initial
clustering, an approach that generates
multiple individual clusterings, then
uses a voting mechanism to create a final result.8 Though it has a much longer running time than SPKMeans, consensus clustering has been shown to consistently deliver clusterings of higher quality than the average standalone SPKMeans clustering.6 In the AFM system, consensus clustering is used only for the initial clustering, in order to create the best possible set of clusters.
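The article does not specify the voting mechanism, so the sketch below uses one common scheme, co-association voting: run SPKMeans from several seeds, count how often each pair of requests shares a cluster, and merge the pairs that agree in a majority of runs (reusing the spk_means sketch above).

    def consensus_cluster(X, k, n_runs=10):
        """Consensus sketch: vote over several SPKMeans runs, then take
        connected components of the majority-agreement graph as the final
        clustering (returned labels are component-root indices)."""
        n = len(X)
        votes = np.zeros((n, n))
        for run in range(n_runs):
            labels, _ = spk_means(X, k, seed=run)
            votes += labels[:, None] == labels[None, :]
        agree = (votes / n_runs) >= 0.5       # pairs that co-cluster in most runs

        parent = list(range(n))               # union-find over the agreement graph
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        for i in range(n):
            for j in range(i + 1, n):
                if agree[i, j]:
                    parent[find(i)] = find(j)
        return np.array([find(i) for i in range(n)])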
Following the arrival of each new
feature request, the algorithm recomputes the ideal granularity to determine whether a new cluster should be added.
To add a new cluster in a way that preserves the stability of existing clusters
and minimizes clustering time, the
AFM approach identifies the least-cohesive cluster, then bisects it using
SPK with K = 2. Feature requests from neighboring clusters are then reevaluated to determine whether they are closer to one of the two new centroids than to their currently assigned centroids; if so, they are reassigned to the relevant cluster. To ensure continued cluster quality, the entire data set
is re-clustered periodically following
the arrival of a fixed number of new
feature requests. Re-clustering is performed through a modified SPKMeans algorithm (which we call Stable SPKMeans) designed to minimize the movement of feature requests between clusters by reusing the current set of centroids as seeds for the new clustering.
Cluster quality is also improved through user feedback specifying whether a pair of feature requests belongs together. For example, users unhappy with the quality of a cluster can specify that a given feature request does not fit it. They might also provide additional tags to help place the feature request in a more appropriate cluster. These user constraints, along with the tag information, are then incorporated into the SPK algorithm in future clusterings, which seeks assignments that maximize the quality of the individual clusters while conforming to the user constraints. Our prior work in this area demonstrated significant improvement in cluster quality when constraints are considered.13
We conducted a series of experiments
designed to evaluate the AFM’s ability to quickly deliver cohesive, distinct,
and stable clusters. The experiments used three
data sets: the first was the SugarCRM
data set discussed earlier; the second
included 4,205 feature requests we
mined from Second Life, an Internet-based virtual world video game in
which stakeholders are represented by
interacting avatars; and the third described the features of an online Amazon-like portal designed specifically
for students. In spring 2008 we asked
36 master’s-level students enrolled in
two different advanced software-engineering classes at DePaul University to
consider the needs of a typical student,
textbook reseller, textbook author,
textbook publisher, instructor, portal
administrator, architect, project manager, and developer and create relevant
feature requests. The result was 366 feature requests.
To evaluate cluster quality, we constructed an “ideal” clustering for the
SugarCRM data by reviewing and modifying the natural discussion threads
created by SugarCRM users. Modifications included merging closely related
singleton threads, decomposing large
megathreads into smaller, more cohesive ones, and reassigning misfits to
new clusters. The answer set enabled
us to compare the quality of the generated clusters using a metric called Normalized Mutual Information (NMI), which measures the extent to which knowledge of one clustering reduces uncertainty about the other.9 NMI is measured on a scale of 0 (no shared information between the two clusterings) to 1 (identical clusterings).
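For reference, NMI between a generated clustering and the answer set can be computed directly, for example with scikit-learn; the toy labels below are purely illustrative.

    from sklearn.metrics import normalized_mutual_info_score

    # One cluster index per SugarCRM feature request: the hand-built answer
    # set versus the AFM-generated clustering (toy values for illustration).
    labels_ideal = [0, 0, 1, 1, 2, 2]
    labels_afm   = [0, 0, 1, 2, 2, 2]
    print(normalized_mutual_info_score(labels_ideal, labels_afm))  # value in [0, 1]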
Figure 4. A partial cluster with security-related requirements generated after gathering 1,000 constraints from the student data set.
(1) The system must protect stored confidential information.
(2) The system must encrypt purchase/transaction information.
(3) [...] is stored and used.
(4) Transmission of personal information must be encrypted.
(5) Transmission of financial transactions must be encrypted.
(6) The system must use both encrypt and decrypt in some fields.
(7) The system must allow users to view their previous transactions.
(8) Databases must use the TripleDES encryption standard for database security. AES is still new and has had compatibility issues with certain types of databases, including SQL Server Express Edition.
(9) The site must ensure that payment information is confidential and credit card transactions are encrypted to prevent hackers from retrieving information.
(10) Because the system will be used to buy books, we must focus on security and consider transaction control in the architecture used to build it.
(11) Correct use of cryptography techniques must be applied in the Amazon portal system to protect student information from outsiders and staff who might potentially acquire the information if left unprotected.