Check out the live web app at https://ecowley-dbscan.herokuapp.com/
Credit to this page provided by scikit-learn for instruction on how to generate blobs.
Credit to the excellent wikipedia page for its pseudocode on the dbscan algorithm.
Introduction
The goal of this project was to implement the classic dbscan clustering algorithm. This was for educational purposes; the aim was to understand how it works, not to build something that can compete with the highly optimized sklearn.cluster.dbscan
.
Vocabulary
eps
is the maximum distance apart two points may be in order to be considered neighbors.minpts
is the minimum number of neighbors that a point must have in order for the group of points to be considered a cluster.
How it works
- Select a point that does not already have a label.
- Query for neighbors that are within the
eps distance
of the point - If at least
minpts
are found, add a new label on the first point and put the neighbors in aset
calledneighbors
. Else, label the point noise and go back to step 1. - For every point in
neighbors
, add the same label, then query for neighbors and add those points to theset
. Continue until all points in the set are handled. - Repeat steps 1-4 until all points are labeled.
Time Complexity
DBSCAN is O(n^2)
because it must find the distance between every two points in the dataset.
My implementation takes 4-7 seconds to run on a dataset of 700 points. sklearn's implementation takes in the milliseconds to run. This is due to optimizations that I was unable to look into in the short amount of time I had for the project.