visualizations of possible splits into
clusters without the need to define the
number of desired clusters a priori.
[1] S. A. Nene, S. K. Nayar, and H. Murase. Columbia Object Image Library (COIL-20). Technical Report CUCS-005-96, February 1996.
[2] L. J. P. van der Maaten and G. E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9 (2008): 2579-2605.
[3] t-SNE Homepage. https://lvdmaaten.github.io/tsne/
[4] Visualizing MNIST: An Exploration of Dimensionality Reduction.
[5] How to Use t-SNE Effectively. http://distill.
[6] t-SNE JS. https://github.com/karpathy/tsnejs
Tejas Khot is a research intern at Virginia Tech
working at the intersection of deep learning and
computer vision. He is interested in understanding how
intelligence emerges from raw data and its multi-modal
associations with memory, perception and action.
Copyright held by Owner(s)/Author(s)
This prohibits use on real-world large
datasets, for which one should instead
use the Barnes-Hut implementation,
which is O(n log n). That variant has
been used successfully on datasets of
millions of examples. In our case, we
have 1,024 dimensions, so we first run
principal component analysis (PCA),
another dimensionality reduction
algorithm, to reduce the number of
dimensions before applying the
t-SNE transformation. Figure 2
shows the t-SNE scatter plot for the
COIL-20 dataset. We can see that t-SNE
does an impressive job of finding
clusters and subclusters in the data,
which are color-coded by class label.
The best aspect of the t-SNE
representation is that it preserves the
local structure present in high
dimensions, meaning neighboring
points also appear close together in the
low-dimensional representation. Clusters
of similar classes are obvious and intuitive.
We suggest running the code on
larger datasets of your choice and
fine-tuning the parameters to obtain
interesting structural insights.
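The PCA-then-t-SNE pipeline described above can be sketched as follows with scikit-learn. This is a minimal sketch, not the article's exact code: the data here is a random stand-in for the 1,024-dimensional COIL-20 image vectors, and 50 PCA components is an assumed (though common) intermediate dimensionality.

```python
# Sketch of the PCA -> t-SNE pipeline: reduce dimensionality with PCA first,
# then embed into 2-D with t-SNE for plotting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))  # stand-in for 200 flattened COIL-20 images

# Step 1: PCA down to a modest number of dimensions (50 is a common choice).
X_reduced = PCA(n_components=50).fit_transform(X)

# Step 2: t-SNE from 50 dimensions down to 2 for a scatter plot.
X_embedded = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(X_reduced)
print(X_embedded.shape)  # (200, 2)
```

The resulting two columns of `X_embedded` can be passed directly to a scatter-plot routine, colored by class label as in Figure 2.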
Generating convincing t-SNE plots
may require testing several values
of perplexity between five and 50
over a large number of iterations
(1,000 to 5,000), depending on the
complexity of the dataset. Without a
complete understanding, t-SNE plots
can be mysterious or misleading. To
understand the intricacies and
tradeoffs involved, we recommend
reading a few online explanations
[3, 4, 5], where many more example
visualizations are presented. If you
have some data and can measure the
pairwise differences between items,
you can produce elegant, browser-based
2-D and 3-D t-SNE visualizations using
the t-SNE JS library [6]. Additionally,
you can represent any dataset
as a 2-D array and feed it to the
script to generate scatter plots.
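The same idea, running t-SNE directly on a matrix of pairwise differences, can be sketched in Python (here via scikit-learn's `metric="precomputed"` option rather than the t-SNE JS library, but the input is the same kind of square distance matrix). The dataset and the Euclidean metric are assumptions for illustration.

```python
# Sketch: compute a square pairwise-distance matrix from any 2-D data array,
# then hand it to t-SNE as a precomputed metric.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))                 # any dataset as a 2-D array
D = squareform(pdist(X, metric="euclidean"))   # 100 x 100 distance matrix

# metric="precomputed" requires init="random" (PCA init needs raw features).
emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=10, random_state=0).fit_transform(D)
print(emb.shape)  # (100, 2)
```

Any measure of dissimilarity works here, which is what makes this formulation useful for data without a natural vector representation.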
Implementations of t-SNE are available
in multiple languages and for all
platforms on the t-SNE homepage [3].
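The perplexity sweep suggested earlier can be sketched as a simple loop. The specific values and the random stand-in data are assumptions; in practice you would plot each embedding and compare the cluster structure by eye.

```python
# Sketch: run t-SNE with several perplexity values and collect the embeddings
# for side-by-side visual comparison.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 50))

embeddings = {}
for perplexity in (5, 15, 30, 50):
    embeddings[perplexity] = TSNE(n_components=2, perplexity=perplexity,
                                  random_state=0).fit_transform(X)
```

Because t-SNE is stochastic, it also helps to repeat each setting with different random seeds before trusting a structure in the plot.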
Although we have limited our analysis
here to the simple case of a
relatively small dataset, in practice
one can use the t-SNE technique
for visualizing complex structure in
different types of data, such as word
embeddings, spatial gene expression
organization, the S&P 500, and much more.
Altogether, our analysis
demonstrates that modern methods for
dimensionality reduction can find
high-quality representations of the
data in lower dimensions, enabling
Figure 1. Example images from
the Columbia Object Image Library
(COIL-20) dataset. Courtesy of the
Computer Vision Laboratory at
Columbia University.
Figure 2. Clusters and subclusters generated by the t-SNE algorithm.
Different colors correspond to different clusters. We can see some classes
are far apart and can be easily separated while others, like those at
the bottom left, are partially overlapping and are more difficult to separate.