Is it possible to compute the Davies-Bouldin score from a precomputed distance matrix using sklearn?

I’m trying to compute the Davies-Bouldin score to compare different clustering approaches. I have a precomputed distance matrix (it represents an edit-based distance between texts).

I’m using the scikit-learn implementation of silhouette_score, davies_bouldin_score and calinski_harabasz_score available in the sklearn.metrics module.

This works, but only the documentation of silhouette_score states that I can pass a distance matrix (with metric="precomputed"). That is not the case for the other two metrics.
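To make the silhouette case concrete, here is a minimal sketch assuming Euclidean distances on random 2D points. The seed, the sample size and the label assignment are arbitrary choices for illustration only. It checks that silhouette_score gives the same value from the raw points and from their precomputed distance matrix.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Illustrative data only: seed, size and labels are arbitrary assumptions
rng = np.random.default_rng(0)
points = rng.random((20, 2))
labels = np.repeat([0, 1, 2, 3], 5)

# Pairwise Euclidean distances between all points
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

# Same metric, two input forms: raw features vs. precomputed distances
from_points = silhouette_score(points, labels, metric="euclidean")
from_distances = silhouette_score(distance_matrix, labels, metric="precomputed")

print(from_points, from_distances)
```

The two values should agree up to floating-point error, which is what makes silhouette_score safe to use with a distance matrix.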

I can pass the distance matrix to both davies_bouldin_score and calinski_harabasz_score, but they expect an n_samples x n_features matrix. My guess is that they treat the distance matrix as a feature matrix. Is it correct to compute the scores this way?
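One way to check that guess is to recompute the index by hand while treating every row of the distance matrix as a feature vector. The sketch below does that; the helper name db_rows_as_features is hypothetical, the data is random and purely illustrative, and the formula is my reading of the usual Davies-Bouldin definition (mean distance to the centroid, Euclidean distance between centroids), not code taken from scikit-learn.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score


def db_rows_as_features(X, labels):
    """Davies-Bouldin index with each row of X treated as a feature vector.

    Hypothetical helper written for this illustration.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Centroid of a cluster = mean of its rows
    centroids = np.array([X[labels == k].mean(axis=0) for k in ids])
    # Scatter = mean Euclidean distance from each row to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    # Separation = Euclidean distance between centroids
    separation = np.linalg.norm(centroids[:, np.newaxis] - centroids, axis=2)
    np.fill_diagonal(separation, np.inf)  # ignore a cluster paired with itself
    ratios = (scatter[:, np.newaxis] + scatter) / separation
    # Worst ratio per cluster, averaged over clusters
    return ratios.max(axis=1).mean()


# Illustrative data only: seed, size and labels are arbitrary assumptions
rng = np.random.default_rng(0)
points = rng.random((20, 2))
labels = np.repeat([0, 1, 2, 3], 5)
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

manual = db_rows_as_features(distance_matrix, labels)
sklearn_on_distances = davies_bouldin_score(distance_matrix, labels)
sklearn_on_points = davies_bouldin_score(points, labels)

print(manual, sklearn_on_distances, sklearn_on_points)
```

If the hand-written value matches davies_bouldin_score on the distance matrix, the function is indeed working on rows as features, so each sample is described by its vector of distances to all other samples rather than by the distances themselves. The value computed from the original points is printed only for comparison.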

This response to a previous question mentions that it is possible.

My question is then: is it OK, mathematically speaking, to use the distance matrix to compute those scores?

My code runs and outputs the scores, so my question is really whether what I’m doing makes sense.
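For reference, here is a minimal sketch of a case where a distance matrix alone is mathematically sufficient. It assumes the distances are Euclidean, where the sum of squared distances to a centroid equals the sum of squared pairwise distances divided by twice the number of points. The helper name ch_from_euclidean_distances is hypothetical, and the identity is not guaranteed to carry over to an edit distance, which may have no Euclidean embedding.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score


def ch_from_euclidean_distances(D, labels):
    """Calinski-Harabasz index from a Euclidean distance matrix.

    Hypothetical helper; valid only when D holds Euclidean distances.
    """
    D2 = np.asarray(D, dtype=float) ** 2
    labels = np.asarray(labels)
    ids = np.unique(labels)
    n, k = len(labels), len(ids)
    # Total dispersion around the global centroid
    total = D2.sum() / (2 * n)
    # Within-cluster dispersion, summed over clusters
    within = sum(
        D2[np.ix_(labels == c, labels == c)].sum() / (2 * np.sum(labels == c))
        for c in ids
    )
    # Between-cluster dispersion follows from total = within + between
    between = total - within
    return (between / (k - 1)) / (within / (n - k))


# Illustrative data only: seed, size and labels are arbitrary assumptions
rng = np.random.default_rng(0)
points = rng.random((20, 2))
labels = np.repeat([0, 1, 2, 3], 5)
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

from_distances = ch_from_euclidean_distances(distance_matrix, labels)
from_points = calinski_harabasz_score(points, labels)

print(from_distances, from_points)
```

Here the score is computed from the distances themselves and should match calinski_harabasz_score on the original points, unlike passing the distance matrix as if it were a feature matrix.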

Here is a minimal working example:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

# Generate random points in a 2D space
num_points = 5  # Adjust the number of points
points = np.random.rand(num_points, 2)

# Compute the Euclidean distance matrix
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

# Clustering
thresholds = [0.25, 0.35]

for t in thresholds:
    c = AgglomerativeClustering(n_clusters=None,
                                metric="precomputed",
                                linkage="average",
                                distance_threshold=t)

    clusters = c.fit(distance_matrix)

    s_score = metrics.silhouette_score(distance_matrix,
                                       clusters.labels_,
                                       metric="precomputed")
    dbi_score = metrics.davies_bouldin_score(distance_matrix,
                                             clusters.labels_)
    ch_score = metrics.calinski_harabasz_score(distance_matrix,
                                               clusters.labels_)

    print(f"Threshold = {t}")
    print(f"\tSilhouette score:\t\t{s_score:.3f}")
    print(f"\tDavies-Bouldin score:\t\t{dbi_score:.3f}")
    print(f"\tCalinski-Harabasz score:\t{ch_score:.3f}")


# Output:
# Threshold = 0.25
#   Silhouette score:       0.311
#   Davies-Bouldin score:       0.114
#   Calinski-Harabasz score:    21.653
# Threshold = 0.35
#   Silhouette score:       0.533
#   Davies-Bouldin score:       0.343
#   Calinski-Harabasz score:    9.360

If you get "ValueError: Number of labels is 5.", just rerun the code. The error comes from the randomness of the generated test points: at that threshold each of the 5 points ended up in its own cluster, and the scores are undefined in that case.

Thanks!
