DBSCAN clustering geolocations beyond the epsilon value

I am trying to analyse some job latitude and longitude data. The nature of the jobs means they tend to happen at similar (although not identical) latitude/longitude locations.

To reduce the number of data points to display and analyse, I want to cluster jobs in similar geographical regions together. To do this I am using DBSCAN to cluster the jobs, and then using a job close to the centre of each cluster as the representative point.

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from geopy.distance import great_circle
from shapely.geometry import MultiPoint

def cluster_with_dbscan(jobs, radius_km, min_samples):

    # ignore jobs that are missing Lat Long coords for now
    # .copy() so that adding the 'Cluster ID' column below does not write into a slice of jobs
    jobs_cluster = jobs[['Job ID', 'Lat', 'Long']].dropna().copy()

    # run dbscan
    kms_per_radian = 6371.0088
    epsilon = radius_km / kms_per_radian
    coords = jobs_cluster[['Lat', 'Long']].values
    db = DBSCAN(eps=epsilon, min_samples=min_samples, algorithm='ball_tree', metric="haversine").fit(np.radians(coords))

    # appending cluster data onto original jobs, preserving jobs that never had location data
    jobs_cluster['Cluster ID'] = db.labels_
    jobs_with_cluster = pd.merge(jobs, jobs_cluster[['Job ID', 'Cluster ID']], how='left', on=['Job ID'])

    # capture cluster data, including centroids
    # count real clusters only; DBSCAN labels noise points as -1
    num_clusters = len(set(db.labels_) - {-1})
    clusters = pd.Series([coords[db.labels_ == n] for n in range(num_clusters)])

    def get_centermost_point(cluster):
        if len(cluster) == 0:
            return (None, None)
        # coords are (Lat, Long), so x holds the latitude and y the longitude here
        centroid_geom = MultiPoint(cluster).centroid
        centroid = (centroid_geom.x, centroid_geom.y)
        centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
        return tuple(centermost_point)

    centermost_points = clusters.map(get_centermost_point)

    lats, lons = zip(*centermost_points)
    clusters = pd.DataFrame({'Lat':lats, 'Long':lons}).reset_index().rename(columns={'index':'Cluster ID'})

    return jobs_with_cluster, clusters

Run with:

radius_km = 2
min_samples = 1  # want to keep outliers
jobs, clusters = cluster_with_dbscan(jobs, radius_km, min_samples)

When I run this I do get clustered data, but the clusters contain jobs that are far more than 2 km apart (some clusters have jobs spanning hundreds of kilometres). My understanding of DBSCAN is that every job in a cluster should be at most 2 km from the core point.

Is my understanding of DBSCAN wrong? Can clusters cover areas greater than the equivalent epsilon value? If so, is there a more appropriate clustering algorithm?

Or is my implementation of DBSCAN flawed in some way?
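
For what it's worth, the behaviour can be reproduced without the jobs data. The sketch below is only an illustration under stated assumptions, not the original pipeline: the ten points are hypothetical, placed on the equator roughly 1.5 km apart. With `min_samples=1` every point is a core point, and DBSCAN joins any core points that lie within epsilon of each other, so neighbouring points chain together into a single cluster whose ends are about 13.5 km apart by construction. For comparison, the second half uses complete-linkage hierarchical clustering from SciPy (a swapped-in technique, not DBSCAN), cut at 2 km, which should keep every pair of points in a cluster within 2 km of each other.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

KMS_PER_RADIAN = 6371.0088
radius_km = 2

# Hypothetical data: 10 points along the equator, about 1.5 km apart.
step_deg = np.degrees(1.5 / KMS_PER_RADIAN)
coords = np.array([[0.0, i * step_deg] for i in range(10)])  # (lat, lon)
coords_rad = np.radians(coords)

# Same DBSCAN settings as in the question.
db = DBSCAN(eps=radius_km / KMS_PER_RADIAN, min_samples=1,
            algorithm='ball_tree', metric='haversine').fit(coords_rad)

# Pairwise great-circle distances in km, used to measure cluster extent.
dist_km = haversine_distances(coords_rad) * KMS_PER_RADIAN

n_dbscan_clusters = len(set(db.labels_))
in_first = db.labels_ == 0
dbscan_span_km = dist_km[np.ix_(in_first, in_first)].max()
print("DBSCAN clusters:", n_dbscan_clusters)
print("Largest distance inside DBSCAN cluster 0 (km):", round(dbscan_span_km, 2))

# Alternative: complete linkage merges two groups only while the farthest
# pair across them stays within the threshold, so cutting the tree at
# radius_km bounds the diameter of every cluster.
condensed = squareform(dist_km, checks=False)
tree = linkage(condensed, method='complete')
hc_labels = fcluster(tree, t=radius_km, criterion='distance')

max_diameter_km = max(
    dist_km[np.ix_(hc_labels == k, hc_labels == k)].max()
    for k in np.unique(hc_labels)
)
print("Complete-linkage clusters:", len(np.unique(hc_labels)))
print("Largest cluster diameter (km):", round(max_diameter_km, 2))
```

Note that the complete-linkage variant builds a full pairwise distance matrix, so memory grows with the square of the number of jobs; that may or may not be acceptable depending on the size of the data.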
