where $\mathcal{Y}^{(t)}$ indicates the solution at iteration $t$, $\eta$ indicates the learning rate, and $\alpha(t)$ represents the
momentum at iteration $t$.
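For concreteness, a minimal NumPy sketch of one such update step is given below, written in the usual descent convention (the step moves against the gradient of $C$); the function and variable names are ours, and the values of $\eta$ and $\alpha(t)$ are left to the caller:

import numpy as np

def momentum_step(Y, Y_prev, grad, eta, alpha):
    # One update of the map points: a gradient-descent step on the cost C
    # plus a momentum term that adds a fraction alpha of the previous change.
    # Y, Y_prev : current and previous map points, arrays of shape (n, 2)
    # grad      : dC/dY evaluated at Y, same shape
    # eta       : learning rate; alpha : momentum coefficient alpha(t)
    return Y - eta * grad + alpha * (Y - Y_prev)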
In addition, in the early stages of the optimization, Gaussian noise is added to the map points after
each iteration. Gradually reducing the variance of this noise performs a type of simulated annealing
that helps the optimization to escape from poor local minima in the cost function. If the variance
of the noise changes very slowly at the critical point at which the global structure of the map starts
to form, SNE tends to find maps with a better global organization. Unfortunately, this requires
sensible choices of the initial amount of Gaussian noise and the rate at which it decays. Moreover,
these choices interact with the amount of momentum and the step size that are employed in the
gradient descent. It is therefore common to run the optimization several times on a dataset to find
appropriate values for the parameters.^4 In this respect, SNE is inferior to methods that allow convex
optimization, and it would be useful to find an optimization method that gives good results without
requiring the extra computation time and parameter choices introduced by the simulated annealing.
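A rough sketch of such a noise schedule is given below; the initial standard deviation, decay rate, and cut-off iteration are illustrative choices rather than values prescribed here:

import numpy as np

def add_annealing_noise(Y, iteration, sigma0=0.1, decay=0.99, last_noisy_iter=100, rng=None):
    # Jitter the map points with zero-mean Gaussian noise whose standard
    # deviation shrinks geometrically, so the perturbation fades out and the
    # early stages of the optimization behave like simulated annealing.
    rng = np.random.default_rng() if rng is None else rng
    if iteration < last_noisy_iter:
        sigma = sigma0 * (decay ** iteration)
        Y = Y + rng.normal(scale=sigma, size=Y.shape)
    return Y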
3. t-Distributed Stochastic Neighbor Embedding
Section 2 discussed SNE as it was presented by Hinton and Roweis (2002). Although SNE con-
structs reasonably good visualizations, it is hampered by a cost function that is difficult to optimize
and by a problem we refer to as the “crowding problem”. In this section, we present a new technique
called “t-Distributed Stochastic Neighbor Embedding” or “t-SNE” that aims to alleviate these prob-
lems. The cost function used by t-SNE differs from the one used by SNE in two ways: (1) it uses
a symmetrized version of the SNE cost function with simpler gradients that was briefly introduced
by Cook et al. (2007) and (2) it uses a Student-t distribution rather than a Gaussian to compute the
similarity between two points in the low-dimensional space. t-SNE employs a heavy-tailed distri-
bution in the low-dimensional space to alleviate both the crowding problem and the optimization
problems of SNE.
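To make the second change concrete, the sketch below computes the heavy-tailed map similarities that t-SNE uses (a Student t-distribution with a single degree of freedom, as defined in subsection 3.3); the NumPy implementation and function name are ours:

import numpy as np

def student_t_similarities(Y):
    # Joint similarities q_ij in the low-dimensional map using a Student
    # t-distribution with one degree of freedom (a Cauchy kernel):
    #   q_ij is proportional to (1 + ||y_i - y_j||^2)^(-1),
    # normalized over all pairs, with q_ii set to zero.
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()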
In this section, we first discuss the symmetric version of SNE (subsection 3.1). Subsequently, we
discuss the crowding problem (subsection 3.2), and the use of heavy-tailed distributions to address
this problem (subsection 3.3). We conclude the section by describing our approach to the optimiza-
tion of the t-SNE cost function (subsection 3.4).
3.1 Symmetric SNE
As an alternative to minimizing the sum of the Kullback-Leibler divergences between the condi-
tional probabilities $p_{j|i}$ and $q_{j|i}$, it is also possible to minimize a single Kullback-Leibler divergence
between a joint probability distribution, $P$, in the high-dimensional space and a joint probability
distribution, $Q$, in the low-dimensional space:
$$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{8}$$
where again, we set $p_{ii}$ and $q_{ii}$ to zero. We refer to this type of SNE as symmetric SNE, because it
has the property that $p_{ij} = p_{ji}$ and $q_{ij} = q_{ji}$ for all $i, j$.
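For illustration, a minimal NumPy sketch of this cost is given below, assuming $P$ and $Q$ are supplied as joint-probability matrices with zero diagonals; the small epsilon guarding the logarithm is our own addition:

import numpy as np

def symmetric_sne_cost(P, Q, eps=1e-12):
    # KL(P || Q) of Equation (8), summed over all pairs. The diagonal entries
    # p_ii and q_ii are zero, and terms with p_ij = 0 contribute nothing.
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))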
for ∀i, j. In symmetric SNE, the pairwise similarities
4. Picking the best map after several runs as a visualization of the data is not nearly as problematic as picking the model
that does best on a test set during supervised learning. In visualization, the aim is to see the structure in the training
data, not to generalize to held out test data.