
Is it possible to compute the Davies-Bouldin score from a precomputed distance matrix using sklearn?


I’m trying to compute the Davies-Bouldin score to compare different clustering approaches. I have a precomputed distance matrix (representing an edit-based distance between texts).

I’m using the scikit-learn implementations of silhouette_score, davies_bouldin_score and calinski_harabasz_score available in the sklearn.metrics module.

It works, but only silhouette_score documents support for a precomputed distance matrix (via metric="precomputed"); that’s not the case for the other two metrics.

I can pass the distance matrix to both davies_bouldin_score and calinski_harabasz_score, but they expect an n_samples x n_features matrix. My guess is that they treat the distance matrix as a feature matrix. Is it correct to compute the scores this way?
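
To illustrate what I mean (a hypothetical sanity check, separate from my actual pipeline): scoring the same labels once on the raw points and once on their Euclidean distance matrix gives different values, which is why I suspect the matrix is simply read as an n_samples x n_samples feature matrix.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, pairwise_distances

# Same labels, scored on the features and on the distance matrix
X = np.random.rand(50, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
D = pairwise_distances(X)  # n_samples x n_samples Euclidean distances

print(davies_bouldin_score(X, labels))  # computed on the actual features
print(davies_bouldin_score(D, labels))  # distance matrix treated as "features"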

This response to a previous question mentions that it is possible.

My question is then: is it OK, mathematically speaking, to use the distance matrix to compute those scores?

My code works and outputs the scores, so my question is really whether what I’m doing makes sense.

Here is a minimal working example:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

# Generate random points in a 2D space
num_points = 5  # Adjust the number of points
points = np.random.rand(num_points, 2)

# Compute the Euclidean distance matrix
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

# Clustering
thresholds = [0.25, 0.35]

for t in thresholds:
    c = AgglomerativeClustering(n_clusters=None,
                                metric="precomputed",
                                linkage="average",
                                distance_threshold=t)

    clusters = c.fit(distance_matrix)

    s_score = metrics.silhouette_score(distance_matrix,
                                       clusters.labels_,
                                       metric="precomputed")
    # Davies-Bouldin and Calinski-Harabasz have no metric="precomputed" option:
    # the distance matrix is passed as if its rows were features
    dbi_score = metrics.davies_bouldin_score(distance_matrix,
                                             clusters.labels_)
    ch_score = metrics.calinski_harabasz_score(distance_matrix,
                                               clusters.labels_)

    print(f"Threshold = {t}")
    print(f"\tSilhouette score:\t\t{s_score:.3f}")
    print(f"\tDavies-Bouldin score:\t\t{dbi_score:.3f}")
    print(f"\tCalinski-Harabasz score:\t{ch_score:.3f}")


# Output:
# Threshold = 0.25
#   Silhouette score:       0.311
#   Davies-Bouldin score:       0.114
#   Calinski-Harabasz score:    21.653
# Threshold = 0.35
#   Silhouette score:       0.533
#   Davies-Bouldin score:       0.343
#   Calinski-Harabasz score:    9.360

If you get a ValueError: Number of labels is 5, just rerun the code: with only 5 random points, the clustering sometimes puts every point in its own cluster, and these scores are only defined for 2 <= n_labels <= n_samples - 1.
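
(As an aside, a small guard inside the for loop could avoid the rerun entirely; this is just a hypothetical sketch, not what I currently do.)

n_labels = len(set(clusters.labels_))
n_samples = distance_matrix.shape[0]

# The scores are only defined for 2 <= n_labels <= n_samples - 1,
# so skip thresholds where the clustering degenerates
if 2 <= n_labels <= n_samples - 1:
    s_score = metrics.silhouette_score(distance_matrix,
                                       clusters.labels_,
                                       metric="precomputed")
else:
    print(f"Skipping threshold {t}: {n_labels} cluster(s) for {n_samples} samples")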

Thanks!


