I’m trying to compute the Davies-Bouldin score to compare different clustering approaches. I have a precomputed distance matrix (it represents an edit-based distance between texts).
I’m using the scikit-learn implementations of silhouette_score, davies_bouldin_score and calinski_harabasz_score available in the sklearn.metrics module.
It works, but only the documentation for silhouette_score specifies that I can pass a distance matrix. That is not the case for the other two metrics.
I can pass the distance matrix to both davies_bouldin_score and calinski_harabasz_score, but they expect an n_samples x n_features matrix. My guess is that they treat the distance matrix as a feature matrix. Would it be correct to compute the scores this way?
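To make that guess concrete, here is a minimal sketch. It assumes the Calinski-Harabasz score is the usual ratio of between-cluster to within-cluster dispersion, computed from centroids of the rows of the input. The helper manual_calinski_harabasz, the toy points and the hand-picked labels are hypothetical and only there for illustration.

```python
import numpy as np
from sklearn import metrics


def manual_calinski_harabasz(X, labels):
    """Centroid-based Calinski-Harabasz score, treating each row of X as a feature vector."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for k in cluster_ids:
        rows = X[labels == k]
        centroid = rows.mean(axis=0)
        # Dispersion of the cluster centroid around the global mean
        between += len(rows) * np.sum((centroid - overall_mean) ** 2)
        # Dispersion of the cluster members around their own centroid
        within += np.sum((rows - centroid) ** 2)
    n_clusters = len(cluster_ids)
    return (between / (n_clusters - 1)) / (within / (n_samples - n_clusters))


# Hypothetical toy data: two well-separated groups with hand-picked labels
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

ch_sklearn = metrics.calinski_harabasz_score(distance_matrix, labels)
ch_manual = manual_calinski_harabasz(distance_matrix, labels)
ch_points = metrics.calinski_harabasz_score(points, labels)

print(f"sklearn on distance matrix:      {ch_sklearn:.3f}")
print(f"manual, rows treated as features: {ch_manual:.3f}")
print(f"sklearn on the original points:  {ch_points:.3f}")
```

If the first two numbers agree, each row of the distance matrix is being used as a point in an n_samples-dimensional feature space, which is not the same thing as scoring the original points.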
This response to a previous question mentions that it is possible.
My question is then: is it OK, mathematically speaking, to use the distance matrix to compute those scores?
My code runs and outputs the scores, so my question is really whether what I’m doing makes sense.
Here is a minimal working example:
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

# Generate random points in a 2D space
num_points = 5  # Adjust the number of points
points = np.random.rand(num_points, 2)

# Compute the Euclidean distance matrix
distance_matrix = np.linalg.norm(points[:, np.newaxis] - points, axis=2)

# Clustering
thresholds = [0.25, 0.35]
for t in thresholds:
    c = AgglomerativeClustering(n_clusters=None,
                                metric="precomputed",
                                linkage="average",
                                distance_threshold=t)
    clusters = c.fit(distance_matrix)
    s_score = metrics.silhouette_score(distance_matrix,
                                       clusters.labels_,
                                       metric="precomputed")
    dbi_score = metrics.davies_bouldin_score(distance_matrix,
                                             clusters.labels_)
    ch_score = metrics.calinski_harabasz_score(distance_matrix,
                                               clusters.labels_)
    print(f"Threshold = {t}")
    print(f"\tSilhouette score:\t\t{s_score:.3f}")
    print(f"\tDavies-Bouldin score:\t\t{dbi_score:.3f}")
    print(f"\tCalinski-Harabasz score:\t{ch_score:.3f}")

# Output:
# Threshold = 0.25
#     Silhouette score:           0.311
#     Davies-Bouldin score:       0.114
#     Calinski-Harabasz score:    21.653
# Threshold = 0.35
#     Silhouette score:           0.533
#     Davies-Bouldin score:       0.343
#     Calinski-Harabasz score:    9.360
If you get a ValueError: Number of labels is 5., just rerun the code (the problem comes from the randomness of the generated test distance matrix: with some draws, every point ends up in its own cluster).
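As a possible alternative to rerunning, the label count could be checked before scoring. This is only a sketch: has_scorable_labels is a hypothetical helper, and it assumes the metrics need at least 2 clusters and at most n_samples - 1, which is consistent with the error above appearing when 5 points yield 5 labels.

```python
import numpy as np


def has_scorable_labels(labels):
    """Return True if the number of distinct labels is between 2 and n_samples - 1."""
    labels = np.asarray(labels)
    n_labels = np.unique(labels).size
    return 1 < n_labels < labels.size


print(has_scorable_labels([0, 1, 2, 3, 4]))  # every point is its own cluster -> False
print(has_scorable_labels([0, 0, 1, 1, 2]))  # three clusters for five points -> True
```

In the loop above, the three score calls could then be skipped for a threshold whenever has_scorable_labels(clusters.labels_) is False.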
Thanks!