← Back to homepage

Performing OPTICS clustering with Python and Scikit-learn

December 15, 2020 by Chris

Unsupervised Machine Learning problems involve clustering, adding samples into groups based on some measure of similarity because no labeled training data is available. There are many algorithms for clustering available today. OPTICS, or Ordering points to identify the clustering structure, is one of these algorithms. It is very similar to DBSCAN, which we already covered in another article. In this article, we'll be looking at how to use OPTICS for clustering with Python.

It is structured as follows. Firstly, in order to provide you with the necessary context, we will briefly look at clustering. We will see what it is and how it works generally speaking. Then, we'll move on to the conceptual details of OPTICS. We will take a look at its components, the algorithm and its dendogram output called a reachability plot, and how to generate clusters from the diagram.

Once we know the ins and outs of the components and the algorithm, we move forward to a practical implementation using OPTICS in Scikit-learn's sklearn.cluster module. We will see how we can generate a dataset for which we can generate clusters, and will apply OPTICS to generate them. Using this step-by-step example, you will see how you can build an OPTICS based clustering model with Python.

In other words, after reading this article, you'll both know how OPTICS works and have the skills to apply it to your own Machine Learning problem.

Let's take a look! :)

What is clustering?

Before we start looking at how OPTICS works, it is worthwhile to consider clustering in general first. Because, well, what is clustering?

Let's take a look at a definition.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Wikipedia (2004)

In other words, suppose that we have a dataset like this one:

Intuitively, we can already see that there are two groups of data: one towards the bottom left part of the plot, another towards the upper right part of the plot.

But the machine doesn't know this yet. Rather, it just has an array with samples and their corresponding \(X[0]\) and \(X[1]\) positions, allowing us to generate the plot.

Clustering algorithms are designed to select points which look like each other (and hence have high similarity) and assign them to the same group. In other words, if such an algorithm is deployed for the dataset visualized above, the desired end goal is that it finds the samples in the left corner so similar with respect to the ones in the right corner that it assigns group 0 to the bottom left ones, and 1 to the upper right ones.

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.

Wikipedia (2003)

In doing so, they have no information about the actual groups; they just have the positions. This is why clustering algorithms are called unsupervised: no pre-existing labels are there, and yet they are capable of finding patterns allowing us to group the samples. Really nice!

OPTICS is such a clustering algorithm. Now that we know about clustering in general, let's take a look at how it works :)

Introducing OPTICS: a relative of DBSCAN

Ordering points to identify the clustering structure, or OPTICS, is an algorithm for density based clustering. It's quite an old algorithm already, as it was presented in 1999. Nevertheless, it is still a good algorithm today - not everything that's no longer new and shiny must be discarded. It is similar to the DBSCAN algorithm for clustering, an extension even, and hence borrows some of its components as well as its algorithmic components.

Let's take a look at OPTICS here. Firstly, we'll take a look at OPTICS' components, followed by taking a look at its algorithm. The outcome of this algorithm is a dendrogram (which shows the tree-like structure of the data by means of the reachability distance, which is one of the components that we will cover next). Once we know the output of the algorithm, we'll move on to interpreting this diagram, answering the questions how we can generate the clusters from this reachability plot.

OPTICS components

Let's first take a look at the components of the OPTICS method.

Epsilon parameter

The first parameter is the epsilon parameter, or \(\epsilon\). It is a distance parameter in the sense that for any point \(p\), the epsilon defines a distance around the point, like this:

MinPts parameter

Another parameter is \(\text{minPts}\), or the minimum amount of points parameter. It is used together with epsilon because it illustrates how many points must be within the \(\epsilon\) distance of a point \(p\) (including the point) in order to form a cluster.

Core points

When the point \(p\) has \(\text{minPts}\) within its \(\epsilon\) distance including itself, we say that it is a core point and that it has sufficient amount of points in its \(\epsilon \text{-neighborhood}\) for becoming one. A core point always represents a cluster. Possibly, it is still in formation, meaning that it will merge with other clusters later. We'll see how this happens when we take a look at the OPTICS algorithm.

Core distance

If you have read the article about DBSCAN, you might have thought that many of these concepts are familiar. And in fact, they are! All concepts covered so far are also components of the DBSCAN algorithm. The next one, core distance, is however unique to OPTICS. Let's take a look.

Core distance is defined as follows. For any point \(p\) with some epsilon \(\epsilon\) and hence an epsilon neighborhood \(N_\epsilon(p)\):

[mathjax]

\begin{equation}
\text{core-dist}_{\epsilon , \text{minPts}}(p) =
\begin{cases}
\text{undefined}, & \text{if}\ |N_{\epsilon}(p)| < \text{minPts} \\
\text{minPts-th smallest distance in } N_{\epsilon}(p), & \text{otherwise} \
\end{cases}
\end{equation}

In other words, the core distance is undefined if the number of points in the neighborhood (including \(p\) itself) is lower than the minimum number of points required. This makes sense: if the point is no core point, it does not have a core distance.

In the other case, however, it's the \(\text{minPts-th smallest distance in } N_{\epsilon}(p)\). This is a very generic description of the point in the epsilon neighborhood located farthest away from the core point; in the case of \(\text{minPts} = 3\), it would be the 3rd smallest distance. For this reason, the core distance also describes the minimum value for epsilon in order to keep the point a core point. Using the distance, in the algorithm, we can keep merging clusters by (1) knowing that they are close to a core point and hence reachable thus part of the cluster, and (2) do so in an extensive way, growing the cluster time after time.

If this is a bit abstract to you, don't worry - we'll get back to this when describing the algorithm!

Reachability distance

While the core distance expresses the minimum distance to keep a point a core point, the reachability distance expresses the distance which is reachable from a core point.

It is expressed as follows in terms of an arbitrary point \(o\) that is reached from a point \(p\):

\begin{equation}
\text{reach-dist}_{\epsilon , \text{minPts}}(o, p) =
\begin{cases}
\text{undefined}, & \text{if}\ |N_{\epsilon}(p)| < \text{minPts} \\
\text{max}(\text{core-dist}_{\epsilon , \text{minPts}}(p), dist(p,o)), & \text{otherwise} \
\end{cases}
\end{equation}

If \(p\)'s epsilon neighborhood has insufficient points, it is not a core point and hence cannot be used in reaching another point. This is similar to direct reachability and reachability in DBSCAN. For this reason, if this happens, the reachability distance is set to undefined.

If it is a core point, the reachability distance is either the core distance or the distance between \(p\) and \(o\), whichever is bigger. In other words, any point within either the core or reachability distance can be reached from that particular core point. This allows us to continue constructing clusters.

In the example below, the reachability distance from the core point to point 1 equals the core distance, because it is bigger. However, for a random point R, the reachability distance equals the distance to that point, because that one is bigger than the core distance.

To summarize:

It's now time to look at the OPTICS algorithm itself. We shall see that from one core point forward, the algorithm will continue to search for cluster extensions by (1) identifying whether possible extensions are dense enough, by means of core distance, and (2) what their distance from the most dense parts of the cluster are, by ordering based on reachability distance.

OPTICS algorithm

Let's now take a closer look at the OPTICS algorithm. We'll start our algorithm with an ordered list that is empty. We also maintain a list with processed points.

ordered list = empty list.

processed points = empty list

Here's how OPTICS works.

The main loop

The main part of the OPTICS algorithm is the main loop (Wikipedia, 2009). It describes the optics function:

function OPTICS(DB, eps, MinPts) is
    for each point p of DB do
        p.reachability-distance = UNDEFINED
    for each unprocessed point p of DB do
        N = getNeighbors(p, eps)
        mark p as processed
        output p to the ordered list
        if core-distance(p, eps, MinPts) != UNDEFINED then
            Seeds = empty priority queue
            update(N, p, Seeds, eps, MinPts)
            for each next q in Seeds do
                N' = getNeighbors(q, eps)
                mark q as processed
                output q to the ordered list
                if core-distance(q, eps, MinPts) != UNDEFINED do
                    update(N', q, Seeds, eps, MinPts)

In the next sections, we will both describe the update function and how we can derive clusters from the output of the algorithm.

The update function

Above, at two places, a call is made to an update function which updates the Seeds queue, or in other words, the priority queue. Updating always happens because either the queue is empty (in other words, at the start of the algorithm, or when all close points have been covered and unprocessed points remain) or because new points have been found to extend the cluster with.

Updating happens in the following way:

In other words, during the extension, we continuously change the order in the priority queue, where points with lower reachability distance (and hence points closer to the cluster's center points) are added earlier.

function update(N, p, Seeds, eps, MinPts) is
    coredist = core-distance(p, eps, MinPts)
    for each o in N
        if o is not processed then
            new-reach-dist = max(coredist, dist(p,o))
            if o.reachability-distance == UNDEFINED then // o is not in Seeds
                o.reachability-distance = new-reach-dist
                Seeds.insert(o, new-reach-dist)
            else               // o in Seeds, check for improvement
                if new-reach-dist < o.reachability-distance then
                    o.reachability-distance = new-reach-dist
                    Seeds.move-up(o, new-reach-dist)

Combining the main loop and update function

Combining the main loop and the update function, we therefore see the following behavior emerge:

  1. We keep iterating until each point has been processed.
  2. For each unprocessed point, we attempt to process as many related points by taking a look at neighborhoods and extending these neighborhoods until no close points can be found anymore.
  3. All distances are added to an ordered list. This ordered list hence, for each cluster, contains reachability distances in an ascending way. Outliers represent the cutoff points for determining the clusters. We will illustrate how this works by means of the reachability plot.

Generating clusters from the reachability plot

The 'ordered list' we just covered above is displayed in the image below, a so-called reachability plot. In this plot, the reachability distances for each point are mapped. Clearly, from one point, we can see how the clusters have been extended in a reachability distance-based ascending ordering: for the blue cluster, the distance ascends until it moves towards a red one; then towards a green one.

The valleys therefore represent clusters and can be considered as being clusters, whereas the peaks represent the cutoff points, sometimes even with outliers in between.

Generating clusters from the reachability plot can therefore naïvely be performed by means of taking a cutoff value, where an arbitrary y value marks the change in cluster. If done so, the clustering algorithm works in a similar way to DBSCAN, because a threshold is taken just like DBSCAN does. However, more advanced methods exist as well (Wikipedia, 2009).

OPTICS vs DBSCAN

OPTICS shares many components with DBSCAN. The epsilon value and minimum number of points are shared and, together with the concepts of core points and reachable points (by implication of the reachability distance) overlap. One of the key differences is the way that clusters are computed: rather than by picking a fixed value (DBSCAN), a different method can be applied with OPTICS.

For this reason, OPTICS is preferable over DBSCAN when your clusters have varying density. In other cases, the choice for algorithm does not really matter.

Building an OPTICS model with Python and Scikit-learn

Now that we understand how OPTICS works, we can take a look at implementing it with Python and Scikit-learn.

With the following code, we can perform OPTICS based clustering on a random blob-like dataset. It works as follows.

from sklearn.datasets import make_blobs
from sklearn.cluster import OPTICS
import numpy as np
import matplotlib.pyplot as plt

# Configuration options
num_samples_total = 1000
cluster_centers = [(3,3), (7,7)]
num_classes = len(cluster_centers)
epsilon = 2.0
min_samples = 22
cluster_method = 'xi'
metric = 'minkowski'

# Generate data
X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5)

# Compute OPTICS
db = OPTICS(max_eps=epsilon, min_samples=min_samples, cluster_method=cluster_method, metric=metric).fit(X)
labels = db.labels_

no_clusters = len(np.unique(labels) )
no_noise = np.sum(np.array(labels) == -1, axis=0)

print('Estimated no. of clusters: %d' % no_clusters)
print('Estimated no. of noise points: %d' % no_noise)

# Generate scatter plot for training data
colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels))
plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True)
plt.title(f'OPTICS clustering')
plt.xlabel('Axis X[0]')
plt.ylabel('Axis X[1]')
plt.show()

Running the algorithm yields the following scatter plot:

We can also easily generate the reachability plot:

# Generate reachability plot
reachability = db.reachability_[db.ordering_]
plt.plot(reachability)
plt.title('Reachability plot')
plt.show()

Summary

In this article, we took a look at the OPTICS algorithm for clustering. Similar to the DBSCAN algorithm, but notably different, it can be used for clustering when the density of your clusters is different. This is something that DBSCAN cannot do so well.

First of all, we looked at what clustering is in the first place. We saw that clustering algorithms can be used to group samples in a dataset based on similarity. Then, we moved on to OPTICS, and studied its components. We saw that similar to DBSCAN, OPTICS also works with epsilon and a minimum number of points, which specifies a distance around a point and a minimum number of points (including the point itsefl) to be within this distance in order to classify the point as a core point. Core points represent dense points, and using the core distance and reachability distance, OPTICS is capable of grouping samples together.

We saw that OPTICS works by ordering based on reachability distance while expanding the clusters at the same time. The output of the OPTICS algorithm is therefore an ordered list of reachability distances, which by means of thresholds or different techniques we can split into clusters. This way, we're able of generating clusters for groups of data that have varying densities.

I hope that you have learned something from today's article. If you did, please feel free to leave a message in the comments section! 💬 Please also leave remarks and comments, or leave them through the Ask Questions button. Thank you for reading MachineCurve today and happy engineering! 😎

References

Wikipedia. (2009, April 21). OPTICS algorithm. Wikipedia, the free encyclopedia. Retrieved December 9, 2020, from https://en.wikipedia.org/wiki/OPTICS_algorithm

Scikit-learn. (n.d.). Sklearn.cluster.OPTICS — scikit-learn 0.23.2 documentation. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved December 9, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html

Wikipedia. (2004, May 21). Cluster analysis. Wikipedia, the free encyclopedia. Retrieved December 11, 2020, from https://en.wikipedia.org/wiki/Cluster_analysis

Wikipedia. (2003, May 25). Unsupervised learning. Wikipedia, the free encyclopedia. Retrieved December 11, 2020, from https://en.wikipedia.org/wiki/Unsupervised_learning

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.