o
6
o
7
o
8
o
9
o
10
o
11
o
12
o
13
o
14
o
15
o
16
o
17
o
18
ǫ
o
1
o
2
o
3
o
4
o
5
(1 + ρ)ǫ
o
6
o
7
o
8
o
9
o
10
o
11
o
12
o
14
o
15
o
16
o
17
o
1
o
2
o
3
o
4
o
5
o
6
o
7
o
8
o
9
o
10
o
11
o
12
o
14
o
15
o
16
o
17
o
1
o
2
o
3
o
4
o
5
(a) Dataset (b) Core graph (c) One possible ρ-approximate core graph
Figure 2: Illustration of DBSCAN and ρ-approximate DBSCAN (ρ = 0.5, MinPts = 3)
MinPts
, which can be regarded as a constant. Next, we review how
the clusters are formed using graph terminology.
Given a point
p ∈ P
, we use
B(p, r)
to represent the ball that is
centered at
p
, and has radius
r
. The point is said to be a core point
if
B(p, ǫ)
covers at least
MinPts
points of
P
(including
p
itself);
otherwise, it is a non-core point. To illustrate, consider the dataset
of 18 points in Figure 2a, where
ǫ
is the radius of the inner solid
circle, and
MinPts = 3
. The core points have been colored black,
while the non-core points colored white. The dashed circle can be
ignored for the time being.
DBSCAN clusters are defined in two steps. The first one focuses
exclusively on the core points, and groups them into preliminary
clusters. The second step determines how the non-core points should
be assigned to the clusters. Next, we explain the two steps in detail.
Step 1: Clustering Core Points.
It will be convenient to imagine an
undirected core graph
G
on
P
—this graph is conceptual and need
not be materialized. Specifically, each vertex of
G
corresponds
to a distinct core point in
P
. There is an edge between two core
points (a.k.a. vertices)
p
1
, p
2
if and only if
dist(p
1
, p
2
) ≤ ǫ
, where
dist(·, ·)
represents the Euclidean distance between two points. Fig-
ure 2b shows the core graph for the dataset of Figure 2a.
Each connected component (CC) of
G
constitutes a preliminary
cluster. In Figure 2b, there are 3 CCs (a.k.a. preliminary clusters).
Note that every core point belongs to exactly one preliminary cluster.
Step 2: Non-Core Assignment.
This step augments the preliminary
clusters with non-core points. For each non-core point
p
, DBSCAN
looks at every core point
p
core
∈ B(p, ǫ)
, and assigns
p
to the (only)
preliminary cluster containing
p
core
. Note that, in this manner,
p
may be assigned to zero, one, or more than one preliminary cluster.
After all the non-core points have been assigned, the preliminary
clusters become final clusters.
It should be clear from the above that the DBSCAN clusters are
uniquely defined by the parameters
ǫ
and
MinPts
, but they are
not necessarily disjoint. A non-core point may belong to multiple
clusters, while a core point must exist only in a single cluster. It is
possible that a non-core point is not in any cluster; such a point is
called noise.
In Figure 2a, there are two non-core points
o
13
and
o
18
. Since
B(o
13
, ǫ)
covers
o
14
,
o
13
is assigned to the preliminary cluster of
o
14
.
B(o
18
, ǫ)
, however, covers no core points, indicating that
o
18
is
noise. The final DBSCAN clusters are {o
1
, o
2
, ..., o
5
}, {o
6
, o
7
, ...,
o
12
}, {o
13
, o
14
, ..., o
17
}.
Remark.
DBSCAN can also be defined under the notion of “density-
reachable”; see [9]. The above graph-based definition is equiv-
alent, perhaps more intuitive, and allows a simple extension to
ρ-approximate DBSCAN, as we will see later.
Hardness of DBSCAN and USEC.
It is easy to see that the DB-
SCAN clusters on a set
P
of
n
points can be computed in
O ( n
2
)
time, noticing that the core graph
G
has
O ( n
2
)
edges. However,
clever algorithms should produce the clusters without generating all
the edges, and thus, avoid the quadratic trap. Indeed, when
d = 2
,
the clusters can be computed in only O(n log n) time [11].
It would be highly desired to find an algorithm of
˜
O ( n )
time for
d ≥ 3
, but recently Gan and Tao [10] have essentially dispelled
the possibility. They proved an
O ( n )
-time reduction from the unit-
spherical emptiness checking (USEC) problem to DBSCAN. In
other words, any
T (n)
-time DBSCAN algorithm implies that USEC
problem can be solved with O(T (n)) time.
In USEC, we are given a set
S
red
of red points and a set
S
blue
of blue points in
R
d
. All the points have distinct coordinates on
every dimension. The objective is to determine whether there exist a
red point
p
red
and a blue point
p
blue
such that
dist(p
red
, p
blue
) ≤ 1
(the distance threshold 1 can be replaced with any positive value by
scaling). The problem has a lower bound of
Ω(n
4/3
)
for
d ≥ 5
in a
broad class of algorithms [6,7]. For
d = 3
and
4
, beating the bound
has been a grand open problem in theoretical computer science, and
is widely believed [6] to be impossible. By the reduction of [10],
no DBSCAN algorithm can have running time
o(n
4/3
)
in
d ≥ 5
;
in
d = 3
and 4, this is also true unless unlikely ground-breaking
improvements could be made on 3D USEC.
Approximation and the Sandwich Guarantee.
Gan and Tao [10]
developed
ρ
-approximate DBSCAN, which returns almost the same
clusters as exact DBSCAN by offering a strong sandwich guarantee
that will be introduced shortly. In contrast to the high time complex-
ity of the latter, the approximate version takes only
O ( n )
expected
time to compute for any constant ρ > 0.
Besides the parameters
ǫ
and
MinPts
inherited from DBSCAN,
the approximate version accepts a third parameter
ρ
, which is a small
positive constant less than 1, and controls the clustering precision.
Its clusters can also be defined in the same two steps as in exact
DBSCAN, as explained below.
Step 1: Clustering Core Points.
It will also be convenient to follow
a graph-based approach. Let us define an undirected
ρ
-approximate
core graph
G
ρ
on the dataset
P
—again, this graph is conceptual
and need not be materialized. Each vertex of
G
ρ
corresponds to a
distinct core point in
P
. Given two core points
p
1
, p
2
, whether or
not G
ρ
has an edge between their vertices is determined as:
• The edge definitely exists if dist(p
1
, p
2
) ≤ ǫ.