clustering times that, with typical parameters of 50–100 requests, could take up to two minutes. As depicted in this example, the SugarCRM data set took a total of 57.83 seconds to complete all 29 clusterings at increments of 50 feature requests. Individual clusterings are notably fast; for example, the complete Second Life data set, consisting of 4,205 requests, was clustered in 75.18 seconds using standard SPKMeans and in only 1.02 seconds using our stable approach.
The Stable SPKMeans algorithm significantly improves the performance of
the AFM, mainly because it takes less
time to converge on a solution when
quality seeds are passed forward from
the previous clustering. Smaller increments require more clusterings, so the
overall clustering time increases as the
increment size decreases. However,
our experiments found that increasing
the increment size to 25 or even 50 feature requests has negligible effect on
the quality and stability of the clusters.
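The warm-start idea behind the stable variant can be sketched in a few lines. The following is a minimal illustration only, with toy data and parameter names of our own choosing, not the authors' implementation: spherical k-means assigns unit-length term vectors by cosine similarity, and passing the previous clustering's centroids as seeds lets each incremental run converge in far fewer iterations.

```python
import numpy as np

def spherical_kmeans(X, k, init_centroids=None, max_iter=100, seed=0):
    """Spherical k-means (SPKMeans): cluster L2-normalized vectors by
    cosine similarity.  Passing init_centroids (e.g., the centroids from
    the previous clustering) warm-starts the search -- the idea behind
    the stable variant.  Illustrative sketch, not the authors' code."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    rng = np.random.default_rng(seed)
    if init_centroids is None:
        C = X[rng.choice(len(X), size=k, replace=False)].copy()
    else:
        C = init_centroids.copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        new_labels = np.argmax(X @ C.T, axis=1)   # nearest centroid (cosine)
        if np.array_equal(new_labels, labels):    # assignments stable: done
            break
        labels = new_labels
        for j in range(k):                        # recompute unit centroids
            members = X[labels == j]
            if len(members) > 0:
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)
    return labels, C

# Incremental use: cluster a first batch, then warm-start each subsequent
# clustering with the previous centroids (batch sizes here are arbitrary).
rng = np.random.default_rng(1)
docs = rng.random((300, 20))                      # toy term vectors
labels, C = spherical_kmeans(docs[:100], k=5)
for end in (200, 300):
    labels, C = spherical_kmeans(docs[:end], k=5, init_centroids=C)
```

Because the seeds already sit near good cluster centers, the assignment step changes few labels on each incremental run, which is why convergence is so much faster than reseeding from scratch.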
We have identified some of the problems experienced in organizing discussion threads in open forums. The survey we conducted in summer 2008 of several open source forums suggests that expecting users to manually create and manage threads may not be the most effective approach. In contrast, we described an automated technique, embodied in our AFM, for creating stable, high-quality clusters to anchor related discussion groups. Though no automated technique always delivers clusters that are cohesive and distinct from other clusters, our reported experiments and case studies demonstrate the advantages of using data-mining techniques to help manage discussion threads in open discussion forums. Our ongoing work aims to improve techniques for incorporating user feedback into the clustering process so clusters that appear ad hoc to users or contain multiple themes can be refined.
These findings are applicable across a range of applications, including those designed to gather comments from a product's user base, support activities (such as event planning), and capture requirements in large projects whose stakeholders are geographically dispersed. Our ongoing work focuses on the use of forums to support the gathering and prioritizing of requirements, where automated forum managers improve the allocation of feature requests to threads and use recommender systems to help include stakeholders in relevant discussion groups.3 They also improve the precision of forum search and enhance browsing capabilities by predicting and displaying stakeholders' interest in a given discussion thread.
From the user's perspective, the AFM facilitates the process of entering feature requests. Enhanced search features help users decide more accurately where to place new feature requests. Underlying data-mining functions then test the validity of the choice and, when placement is deemed incorrect, recommend moving the feature request to another existing discussion group or sometimes to an entirely new thread.
All techniques described here are being implemented in the prototype AFM tool we are developing to test and evaluate the AFM as an integral component of large-scale, distributed-requirements processes.
This work was partially funded by National Science Foundation grant CCR-0447594, including a Research Experiences for Undergraduates summer supplement to support the work of Horatiu Dumitru. We would also like to acknowledge Brenton Bade, Phik Shan Foo, and Adam Czauderna for their work developing the prototype.
Table 2. Performance measured by total time spent clustering (in seconds).
1. Basu, C., Hirsh, H., and Cohen, W. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the 15th National Conference on Artificial Intelligence (Madison, WI, July 26–30). MIT Press, Cambridge, MA, 1998.
2. Can, F. and Ozkarahan, E.A. Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases. ACM Transactions on Database Systems 15, 4 (Dec. 1990), 483–517.
3. Castro-Herrera, C., Duan, C., Cleland-Huang, J., and Mobasher, B. A recommender system for requirements elicitation in large-scale software projects. In Proceedings of the 2009 ACM Symposium on Applied Computing (Honolulu, HI, Mar. 9–12). ACM Press, New York, 2009, 1419–1426.
4. Davis, A., Dieste, O., Hickey, A., Juristo, N., and Moreno, A. Effectiveness of requirements elicitation techniques. In Proceedings of the 14th IEEE International Requirements Engineering Conference (Minneapolis, MN, Sept.). IEEE Computer Society, 2006, 179–188.
5. Dhillon, I.S. and Modha, D.S. Concept decompositions for large sparse text data using clustering. Machine Learning 42, 1–2 (Jan. 2001), 143–175.
6. Duan, C., Cleland-Huang, J., and Mobasher, B. A consensus-based approach to constrained clustering of software requirements. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management (Napa, CA, Oct. 26–30). ACM Press, New York, 2008, 1073–1082.
7. Frakes, W.B. and Baeza-Yates, R. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992.
8. Fred, A.L. and Jain, A.K. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 6 (June 2005), 835–850.
9. Second Life. Virtual 3D world; http://secondlife.com; feature requests downloaded from the Second Life issue tracker https://jira.secondlife.com/secure/
10. SourceForge. Repository of open source code and applications; feature requests for Openbravo, Zimbra, phpMyAdmin, and Mono downloaded from SourceForge.
11. SugarCRM. Commercial open source customer relationship management software; http://www.sugarcrm.com/crm/; feature requests mined from http://www.sugarcrm.com/forums/.
12. Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained K-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning (June 28–July 1). Morgan Kaufmann Publishers, Inc., San Francisco, 2001, 577–584.
                                        Increment size
                                        1          10         25   Entire set once
Student (366 feature requests)
  Stable SPKMeans                      7.49
  Standard SPKMeans                  101.82
Sugar (1,000 feature requests)
  Stable SPKMeans                     84.54
  Standard SPKMeans                2,374.31
Second Life (4,205 feature requests)
  Stable SPKMeans                  1,880.69     268.15
  Standard SPKMeans               11,409.57  12,748.63
Jane Cleland-Huang is an associate professor in the School of Computing at DePaul University, Chicago, IL.
Horatiu Dumitru is an undergraduate student studying computer science at the University of Chicago, Chicago, IL.
Chuan Duan is a post-doctoral researcher in the School of Computing at DePaul University, Chicago, IL.
Carlos Castro-Herrera is a Ph.D. student in the School of Computing at DePaul University, Chicago, IL.