The large latency of large batches raises another performance question: how does processing a large batch compare with rerunning the fast static connected components algorithm? Figure 4 shows the speed-up over static recomputation. There is a distinct knee in the curve: batches of 30,000 edge actions provide good speed-up and high aggregate performance with a latency of under 0.32 seconds. Also apparent is that we need not allocate the entire system to a single analysis kernel; peak performance is achieved with at most 20 threads. We are experimenting to see how memory and thread resources affect the performance of multiple concurrent analysis kernels.
Figure 4. Speed-up over the static connected components algorithm (speed-up over static recomputation vs. updates per second).
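To make the relationship among batch size, latency, and speed-up concrete, the following small C sketch derives the two quantities plotted in Figure 4 from wall-clock timings. The timing constants are placeholder assumptions for illustration, not measurements from STING.

    /* A minimal sketch, assuming placeholder timings: derive the speed-up
     * over static recomputation and the aggregate update rate.  The
     * constants below are illustrative assumptions, not STING results. */
    #include <stdio.h>

    int main(void)
    {
        const double batch_size      = 30000.0; /* edge actions per batch */
        const double t_batch_update  = 0.32;    /* seconds to apply one batch (assumed) */
        const double t_static_recomp = 2.5;     /* seconds to rerun the static algorithm (assumed) */

        double speedup     = t_static_recomp / t_batch_update; /* the quantity plotted in Figure 4 */
        double update_rate = batch_size / t_batch_update;      /* Figure 4's horizontal axis */

        printf("speed-up over static recomputation: %.1fx\n", speedup);
        printf("aggregate rate: %.0f edge actions per second\n", update_rate);
        return 0;
    }

Plugging in the figures quoted above, 30,000 edge actions applied in under 0.32 seconds corresponds to an aggregate rate of more than roughly 94,000 edge actions per second.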
RELATED WORK AND FUTURE DIRECTIONS
Analyzing streaming, graph-structured data combines and leverages work in many areas, from architecture to systems. The STING approach harnesses current changes in computing architectures. Highly multi-threaded architectures are widely available, ranging from the massive yet general multi-threading of the Cray XMT, to more modest threading on Intel-based systems, to the massive but more coherent threading of general-purpose graphics processing units.
Visualization has long relied on
dataflow systems like Data Explorer,
Khoros, and other tools to process
results as a batched pipeline. Tools
like IBM’s InfoSphere Streams and
Apache’s S4 are carrying this model
into data analysis as well. The current
tools focus on aggregate information
like counts but could be extended to
support graph structure.
Tools for massive analysis of graph-structured data, like Google's
Pregel and Apache’s Hadoop-based
Giraph, work on a bulk-synchronous
model. Kernels work across the entire data store in large, iterative steps.
These systems do not handle streaming
data, but do handle important issues
like fault-tolerance and truly massive
data storage. Other systems like CMU’s
GraphChi are bringing streaming data
into this model.
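As a rough illustration of that bulk-synchronous style (a generic sketch, not the actual API of Pregel, Giraph, or GraphChi), the following C code computes connected components by label propagation, sweeping the entire edge set in repeated global steps until no label changes.

    /* Generic bulk-synchronous sketch: every pass sweeps the whole edge
     * list, and iteration continues until a global fixed point.  This
     * illustrates the model only; it is not the API of Pregel or Giraph. */
    #include <stdio.h>

    struct edge { int src, dst; };

    static void bsp_connected_components(const struct edge *edges, int nedges,
                                         int nv, int *label)
    {
        for (int v = 0; v < nv; v++)
            label[v] = v;                  /* each vertex starts in its own component */

        int changed = 1;
        while (changed) {                  /* one full sweep plays the role of a superstep */
            changed = 0;
            for (int e = 0; e < nedges; e++) {
                int a = edges[e].src, b = edges[e].dst;
                if (label[a] < label[b]) { label[b] = label[a]; changed = 1; }
                else if (label[b] < label[a]) { label[a] = label[b]; changed = 1; }
            }
        }
    }

    int main(void)
    {
        struct edge edges[] = { {0, 1}, {1, 2}, {3, 4} };
        int label[5];
        bsp_connected_components(edges, 3, 5, label);
        for (int v = 0; v < 5; v++)
            printf("vertex %d -> component %d\n", v, label[v]);
        return 0;
    }

Each full pass over the edge list plays the role of one global superstep; a streaming approach instead touches only the portions of the graph affected by an incoming batch of edge actions.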
We see emerging applications using systems combining STING’s focus
on streaming, semantic data with the
massive scale and fault tolerance of
these frameworks, built as part of a
larger dataflow system on massively parallel platforms.

ACKNOWLEDGMENTS
STING development and porting have
been supported in part through the
Intel Labs Academic Research Office
for the Parallel Algorithms for Non-Numeric Computing Program, the
Center for Adaptive Supercomputing
Software - Multithreaded Architectures at Pacific Northwest National
Laboratory, and the DARPA HPCS program. Work identifying application
areas and needs has been supported
in part by these programs as well
as the National Science Foundation
through grant 1216898.
E. Jason Riedy is a faculty member in the School of Computational
Science and Engineering at Georgia Tech as a Research
Scientist II. He primarily develops algorithms and tools for
parallel analysis of dynamic social networks. His other
interests include high-performance and accurate
linear algebra, floating-point arithmetic, and parallel
combinatorial optimization. His Ph.D. in computer science
is from the University of California, Berkeley, in 2010, in
combinatorial optimization and targeted high-precision linear algebra.
David A. Bader is a full professor in the School of
Computational Science and Engineering, College of
Computing, at Georgia Institute of Technology, and
Executive Director for high performance computing. He
received his Ph.D. in 1996 from The University of Maryland.
His interests are at the intersection of high-performance
computing and real-world applications, including
computational biology and genomics and massive-scale
data analytics. He is a Fellow of the IEEE and AAAS, a
National Science Foundation CAREER Award recipient, and a
co-founder of the Graph500 list for benchmarking “Big
Data” computing platforms. Bader is recognized as a
“Rock Star” of high performance computing by insideHPC
and as one of HPCwire’s People to Watch in 2012.