Programming Computer Science Worksheet
module-8 (November 8, 2023)

Module 8: Cluster Analysis
The following tutorial contains Python examples for performing cluster analysis. You
should refer to Chapters 7 and 8 of the "Introduction to Data Mining" book to understand
some of the concepts introduced in this tutorial. The notebook can be downloaded from
http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial8/tutorial8.ipynb.
Cluster analysis seeks to partition the input data into groups of closely related instances so that
instances that belong to the same cluster are more similar to each other than to instances that
belong to other clusters. In this tutorial, we will provide examples of using different clustering
techniques provided by the scikit-learn library package.
Read the step-by-step instructions below carefully. To execute the code, click on the corresponding
cell and press the SHIFT-ENTER keys simultaneously.
[ ]: from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
[14]: import numpy as np
import pandas as pd
import math
from sklearn import cluster
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.cluster import hierarchy
from sklearn.cluster import DBSCAN, k_means, KMeans
8.1 K-means Clustering
The k-means clustering algorithm represents each cluster by its corresponding cluster centroid. The
algorithm partitions the input data into k disjoint clusters by iteratively applying the following
two steps: (1) form k clusters by assigning each instance to its nearest centroid, and (2) recompute the
centroid of each cluster.
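For intuition, here is a minimal NumPy sketch of that two-step loop (not part of the original tutorial; it assumes a 2-D array X of instances and ignores corner cases such as empty clusters):

import numpy as np

def simple_kmeans(X, k, n_iter=50, seed=1):
    # Pick k random instances as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign each instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its cluster
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

In practice we rely on scikit-learn's KMeans, which adds smarter initialization and convergence checks, as the rest of this section does.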
In this section, we perform k-means clustering on a toy example of movie ratings dataset. We first
create the dataset as follows.
[ ]: ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
movies = pd.DataFrame(ratings, columns=titles)
movies
[ ]:     user  Jaws  Star Wars  Exorcist  Omen
     0   john     5          5         2     1
     1   mary     4          5         3     2
     2    bob     4          4         4     3
     3   lisa     2          2         4     5
     4    lee     1          2         3     4
     5  harry     2          1         5     5
In this example dataset, the first 3 users liked action movies (Jaws and Star Wars) while the last 3
users enjoyed horror movies (Exorcist and Omen). Our goal is to apply k-means clustering on the
users to identify groups of users with similar movie preferences.
The example below shows how to apply k-means clustering (with k=2) to the movie ratings data.
We must remove the "user" column before applying the clustering algorithm. The cluster
assignment for each user is displayed as a dataframe object.
[ ]: data = movies.drop('user', axis=1)
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data)
labels = k_means.labels_
pd.DataFrame(labels, index=movies.user, columns=['Cluster ID'])

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
[ ]:        Cluster ID
     user
     john            1
     mary            1
     bob             1
     lisa            0
     lee             0
     harry           0
The k-means clustering algorithm assigns the first three users to one cluster and the last three users
to the second cluster. The results are consistent with our expectation. We can also display the
centroid for each of the two clusters.
[ ]: centroids = k_means.cluster_centers_
pd.DataFrame(centroids,columns=data.columns)
[ ]:        Jaws  Star Wars  Exorcist      Omen
     0  1.666667   1.666667       4.0  4.666667
     1  4.333333   4.666667       3.0  2.000000
Observe that cluster 0 has higher ratings for the horror movies whereas cluster 1 has higher ratings
for action movies. The cluster centroids can be applied to other users to determine their cluster
assignments.
[ ]: testData = np.array([[4,5,1,2],[3,2,4,4],[2,3,4,1],[3,2,3,3],[5,4,1,4]])
labels = k_means.predict(testData)
labels = labels.reshape(-1,1)
usernames = np.array(['paul','kim','liz','tom','bill']).reshape(-1,1)
cols = movies.columns.tolist()
cols.append('Cluster ID')
newusers = pd.DataFrame(np.concatenate((usernames, testData, labels), axis=1), columns=cols)
newusers

/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does
not have valid feature names, but KMeans was fitted with feature names
  warnings.warn(
[ ]:    user Jaws Star Wars Exorcist Omen Cluster ID
     0  paul    4         5        1    2          1
     1   kim    3         2        4    4          0
     2   liz    2         3        4    1          1
     3   tom    3         2        3    3          0
     4  bill    5         4        1    4          1
To determine the number of clusters in the data, we can apply k-means with the number of
clusters varying from 1 to 6 and compute the corresponding sum of squared errors (SSE), as shown in the
example below. The "elbow" in the plot of SSE versus number of clusters can be used to estimate
the number of clusters.
[ ]: numClusters = [1,2,3,4,5,6]
SSE = []
for k in numClusters:
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(data)
    SSE.append(k_means.inertia_)

plt.plot(numClusters, SSE)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
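The elbow is sometimes ambiguous. As an alternative that is not shown in the original notebook, the average silhouette score can be computed for each candidate k and the value that maximizes it chosen. A minimal sketch, assuming the same movie-ratings DataFrame data:

from sklearn.metrics import silhouette_score

# Silhouette is only defined for 2 <= k <= n_samples - 1
for k in range(2, 6):
    km = cluster.KMeans(n_clusters=k, n_init=10, random_state=1).fit(data)
    print(k, silhouette_score(data, km.labels_))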
8.2 Hierarchical Clustering
This section demonstrates examples of applying hierarchical clustering to the vertebrate dataset
used in Module 6 (Classification). Specifically, we illustrate the results of using 3 hierarchical
clustering algorithms provided by the Python scipy library: (1) single link (MIN), (2) complete
link (MAX), and (3) group average. Other hierarchical clustering algorithms provided by the
library include centroid-based and Ward’s method.
[ ]: data = pd.read_csv('/content/drive/MyDrive/datamining/vertebrate.csv', header='infer')
data
[ ]:              Name  Warm-blooded  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates       Class
     0           human             1            1                 0                0         1           0     mammals
     1          python             0            0                 0                0         0           1    reptiles
     2          salmon             0            0                 1                0         0           0      fishes
     3           whale             1            1                 1                0         0           0     mammals
     4            frog             0            0                 1                0         1           1  amphibians
     5          komodo             0            0                 0                0         1           0    reptiles
     6             bat             1            1                 0                1         1           1     mammals
     7          pigeon             1            0                 0                1         1           0       birds
     8             cat             1            1                 0                0         1           0     mammals
     9   leopard shark             0            1                 1                0         0           0      fishes
     10         turtle             0            0                 1                0         1           0    reptiles
     11        penguin             1            0                 1                0         1           0       birds
     12      porcupine             1            1                 0                0         1           1     mammals
     13            eel             0            0                 1                0         0           0      fishes
     14     salamander             0            0                 1                0         1           1  amphibians
8.2.1 Single Link (MIN)
[ ]: names = data['Name']
Y = data['Class']
X = data.drop(['Name','Class'], axis=1)
Z = hierarchy.linkage(X, 'single')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')
8.2.2 Complete Link (MAX)
[ ]: Z = hierarchy.linkage(X, 'complete')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')
8.2.3 Group Average
[ ]: Z = hierarchy.linkage(X, 'average')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')
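For completeness, here is a short sketch (not part of the original tutorial) of Ward's method, which the section intro mentions but does not demonstrate. It reuses the same X and names defined above and uses scipy's fcluster to cut the dendrogram into a chosen number of flat clusters; the choice of 4 clusters below is only an illustration.

from scipy.cluster.hierarchy import fcluster

Z = hierarchy.linkage(X, 'ward')
dn = hierarchy.dendrogram(Z, labels=names.tolist(), orientation='right')

# Cut the dendrogram into 4 flat clusters (an arbitrary choice for illustration)
flat_labels = fcluster(Z, t=4, criterion='maxclust')
print(pd.DataFrame(flat_labels, index=names, columns=['Cluster']))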
8.3 Density-Based Clustering
Density-based clustering identifies individual clusters as high-density regions that are separated
by regions of low density. DBSCAN is one of the most popular density-based clustering algorithms. In
DBSCAN, data points are classified into three types (core points, border points, and noise points) based
on the density of their local neighborhood. The local neighborhood density is defined by two
parameters: the neighborhood radius (eps) and the minimum number of points in the neighborhood
(min_samples).
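A common heuristic for choosing eps (not shown in the original notebook) is to plot every point's distance to its k-th nearest neighbor in sorted order and look for a knee in the curve. A minimal sketch using scikit-learn's NearestNeighbors, assuming a DataFrame of 2-D coordinates such as the chameleon data loaded below:

from sklearn.neighbors import NearestNeighbors

# Distance of each point to its 5th neighbor (k chosen to match min_samples;
# note the nearest "neighbor" of a training point is the point itself)
nbrs = NearestNeighbors(n_neighbors=5).fit(data)
dists, _ = nbrs.kneighbors(data)
plt.plot(np.sort(dists[:, -1]))
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 5th nearest neighbor')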
For this approach, we will use a noisy, 2-dimensional dataset originally created by Karypis et al.
[1] for evaluating their proposed CHAMELEON algorithm. The example code shown below will
load and plot the distribution of the data.
[ ]: data = pd.read_csv('/content/drive/MyDrive/datamining/chameleon.data', delimiter=' ', names=['x','y'])
data.plot.scatter(x='x', y='y')
[Figure: scatter plot of the chameleon dataset]
We apply the DBSCAN clustering algorithm to the data by setting the neighborhood radius (eps)
to 15.5 and the minimum number of points (min_samples) to 5. The clusters are assigned IDs
from 0 to 8, while the noise points are assigned a cluster ID of -1.
[ ]: print(data.shape)
db = DBSCAN(eps=15.5, min_samples=5).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = pd.DataFrame(db.labels_, columns=['Cluster ID'])
result = pd.concat((data, labels), axis=1)
result.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet')
(1971, 2)
[Figure: DBSCAN cluster assignments on the chameleon data, colored by cluster ID]
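As a follow-up not included in the original notebook, the fitted model's attributes can be used to tally how many points fall into each cluster and how many points are core, border, or noise, using the db, core_samples_mask, and result objects created in the cell above:

# Count points per cluster (-1 is noise)
print(result['Cluster ID'].value_counts().sort_index())

# Core points are indexed by db.core_sample_indices_; border points are the
# remaining clustered points; noise points carry the label -1
n_core = core_samples_mask.sum()
n_noise = (db.labels_ == -1).sum()
n_border = len(db.labels_) - n_core - n_noise
print('core:', n_core, 'border:', n_border, 'noise:', n_noise)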
8.4 Spectral Clustering
One of the main limitations of the k-means clustering algorithm is its tendency to find globular-shaped
clusters. As a result, it does not work well on datasets with arbitrarily shaped clusters
or when the cluster centroids overlap with one another. Spectral clustering can overcome this
limitation by exploiting properties of the similarity graph. To illustrate
this, consider the following two-dimensional datasets.
[ ]: import pandas as pd
data1 = pd.read_csv('/content/drive/MyDrive/datamining/2d_data.txt', delimiter=' ', names=['x','y'])
data2 = pd.read_csv('/content/drive/MyDrive/datamining/elliptical.txt', delimiter=' ', names=['x','y'])

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
data1.plot.scatter(x='x', y='y', ax=ax1)
data2.plot.scatter(x='x', y='y', ax=ax2)
[Figure: scatter plots of the two 2-dimensional datasets]
Below, we demonstrate the results of applying k-means to the datasets (with k=2).
[ ]: from sklearn import cluster

k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data1)
labels1 = pd.DataFrame(k_means.labels_, columns=['Cluster ID'])
result1 = pd.concat((data1, labels1), axis=1)

k_means2 = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means2.fit(data2)
labels2 = pd.DataFrame(k_means2.labels_, columns=['Cluster ID'])
result2 = pd.concat((data2, labels2), axis=1)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
result1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet', ax=ax1)
ax1.set_title('K-means Clustering')
result2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet', ax=ax2)
ax2.set_title('K-means Clustering')

/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
  warnings.warn(
[ ]: Text(0.5, 1.0, 'K-means Clustering')

[Figure: k-means cluster assignments on the two datasets]
The plots above show the poor performance of k-means clustering. Next, we apply spectral clustering to the datasets. Spectral clustering converts the data into a similarity graph and applies the
normalized cut graph partitioning algorithm to generate the clusters. In the example below, we
use the Gaussian radial basis function as our affinity (similarity) measure. Users need to tune the
kernel parameter (gamma) value in order to obtain the appropriate clusters for the given dataset.
[ ]: from sklearn import cluster
import pandas as pd

spectral = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=5000)
spectral.fit(data1)
labels1 = pd.DataFrame(spectral.labels_, columns=['Cluster ID'])
result1 = pd.concat((data1, labels1), axis=1)

spectral2 = cluster.SpectralClustering(n_clusters=2, random_state=1, affinity='rbf', gamma=100)
spectral2.fit(data2)
labels2 = pd.DataFrame(spectral2.labels_, columns=['Cluster ID'])
result2 = pd.concat((data2, labels2), axis=1)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
result1.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet', ax=ax1)
ax1.set_title('Spectral Clustering')
result2.plot.scatter(x='x', y='y', c='Cluster ID', colormap='jet', ax=ax2)
ax2.set_title('Spectral Clustering')
[ ]: Text(0.5, 1.0, 'Spectral Clustering')

[Figure: spectral clustering assignments on the two datasets]
8.5 Summary
This tutorial illustrated examples of using different clustering algorithms available in Python. Algorithms such as k-means, spectral clustering, and DBSCAN are designed to create disjoint partitions of the data, whereas the single-link, complete-link, and group-average algorithms
are designed to generate a hierarchy of cluster partitions.
References: [1] George Karypis, Eui-Hong Han, and Vipin Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. IEEE Computer 32(8): 68-75, 1999.
In-class Clustering Practice
1. Given college-and-university.csv, conduct a k-means clustering analysis based on Median
SAT, Acceptance Rate, Expenditures/Student, Top 10% HS, and Graduation %. Do not use
Type and School in your analysis. Find the best K value.
2. Given college-and-university.csv, conduct a hierarchical clustering analysis based on Median
SAT, Acceptance Rate, Expenditures/Student, Top 10% HS, and Graduation %. Do not use
Type and School in your analysis.
[3]: np.random.seed(42)

# Function for creating datapoints in the form of a circle
def PointsInCircum(r, n=100):
    return [(math.cos(2*math.pi/n*x)*r + np.random.normal(-30,30),
             math.sin(2*math.pi/n*x)*r + np.random.normal(-30,30)) for x in range(1, n+1)]

# Creating data points in the form of a circle
df1 = pd.DataFrame(PointsInCircum(500,1000))
df2 = pd.DataFrame(PointsInCircum(300,700))
df3 = pd.DataFrame(PointsInCircum(100,300))
# Adding noise to the dataset
df4 = pd.DataFrame([(np.random.randint(-600,600), np.random.randint(-600,600)) for i in range(300)])
df = pd.concat([df1, df2, df3, df4], axis=0)

plt.figure(figsize=(6,6))
plt.scatter(df[0], df[1], s=15, color='grey')
plt.title('Dataset', fontsize=20)
plt.xlabel('Feature 1', fontsize=14)
plt.ylabel('Feature 2', fontsize=14)
plt.show()
3. Using the data in df above, conduct k-means and density-based clustering. Provide visualizations of the clustering results. Use eps=30 and min_samples=6 for DBSCAN.
4. Using the data in df above, conduct spectral clustering and provide a visualization of the
clustering result.
Spectral Clustering
https://towardsdatascience.com/spectral-clustering-aba2640c0d5b
The data is represented as a graph.
[17]: # Adjacency Matrix
A = np.array([
[0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
# Degree Matrix
D = np.diag(A.sum(axis=1))
print(D)
# graph laplacian
L = D - A
print(L)
# eigenvalues and eigenvectors
vals, vecs = np.linalg.eig(L)
# sort these based on the eigenvalues
vecs = vecs[:,np.argsort(vals)]
vals = vals[np.argsort(vals)]
[[4 0 0 0 0 0 0 0 0 0]
[0 2 0 0 0 0 0 0 0 0]
[0 0 2 0 0 0 0 0 0 0]
[0 0 0 2 0 0 0 0 0 0]
[0 0 0 0 2 0 0 0 0 0]
[0 0 0 0 0 4 0 0 0 0]
[0 0 0 0 0 0 2 0 0 0]
[0 0 0 0 0 0 0 2 0 0]
[0 0 0 0 0 0 0 0 2 0]
[0 0 0 0 0 0 0 0 0 2]]
[[ 4 -1 -1 0 0 0 0 0 -1 -1]
[-1 2 -1 0 0 0 0 0 0 0]
[-1 -1 2 0 0 0 0 0 0 0]
[ 0 0 0 2 -1 -1 0 0 0 0]
[ 0 0 0 -1 2 -1 0 0 0 0]
[ 0 0 0 -1 -1 4 -1 -1 0 0]
[ 0 0 0 0 0 -1 2 -1 0 0]
[ 0 0 0 0 0 -1 -1 2 0 0]
[-1 0 0 0 0 0 0 0 2 -1]
[-1 0 0 0 0 0 0 0 -1 2]]
[18]: plt.scatter(range(len(vals)), vals)
plt.xlabel('order')
plt.ylabel('values')

[18]: Text(0, 0.5, 'values')
[19]: # kmeans on first three vectors with nonzero eigenvalues
kmeans = KMeans(n_clusters=4, n_init="auto")
kmeans.fit(vecs[:, 1:4])
colors = kmeans.labels_
print("Clusters:", colors)
# Clusters: [2 1 1 0 0 0 3 3 2 2]

Clusters: [0 2 2 0 0 0 3 3 1 1]
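As an optional illustration not in the original worksheet, the cluster labels can be drawn directly on the graph, assuming a recent version of the networkx package is available:

import networkx as nx
import matplotlib.pyplot as plt

# Build the graph from the adjacency matrix A and color nodes by cluster label
G = nx.from_numpy_array(A)
pos = nx.spring_layout(G, seed=1)
nx.draw(G, pos, node_color=colors, cmap='viridis', with_labels=True)
plt.show()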
Clustering (October 25, 2023)

0.1 Loading the data
[4]: import pandas as pd
import numpy as np
from scipy.stats import zscore

# Load the CSV file into a pandas DataFrame
data = pd.read_csv('Data.csv')

# Display the first few rows of the DataFrame
print(data.head(10))
                    School Type  Median  SAT  Acceptance Rate  Expenditures/Student  Top 10% HS Graduation %
    Amherst            Lib Arts    1315   22            26636                    85                       93
    Barnard            Lib Arts    1220   53            17653                    69                       80
    Bates              Lib Arts    1240   36            17554                    58                       88
    Berkeley         University    1176   37            23665                    95                       68
    Bowdoin            Lib Arts    1300   24            25703                    78                       90
    Brown            University    1281   24            24201                    80                       90
    Bryn Mawr          Lib Arts    1255   56            18847                    70                       84
    Cal Tech         University    1400   31           102262                    98                       75
    Carleton           Lib Arts    1300   40            15904                    75                       80
    Carnegie Mellon  University    1225   64            33607                    52                       77

0.2 Data cleaning

Handle the missing values
[5]: # Drop rows with any missing values
data_cleaned = data.dropna()

# Fill missing values in a specific column (e.g., 'Acceptance Rate') with the mean of that column
data['Acceptance Rate'].fillna(data['Acceptance Rate'].mean(), inplace=True)
0.2.1 Removing duplicates
[6]: # Remove duplicate rows based on all columns
data_cleaned = data.drop_duplicates()

# Remove duplicates based on specific columns
data_cleaned = data.drop_duplicates(subset=['School Type', 'Median', 'SAT', 'Acceptance Rate', 'Expenditures/Student', 'Top 10% HS Graduation %'])
0.2.2 Correcting data formats
[7]: # Convert 'Acceptance Rate' to numeric (assuming it contains numerical values)
data['Acceptance Rate'] = pd.to_numeric(data['Acceptance Rate'], errors='coerce')

# Convert 'Expenditures/Student' to numeric (assuming it contains numerical values)
data['Expenditures/Student'] = pd.to_numeric(data['Expenditures/Student'], errors='coerce')

# Convert 'Top 10% HS Graduation %' to numeric (assuming it contains numerical values)
data['Top 10% HS Graduation %'] = pd.to_numeric(data['Top 10% HS Graduation %'], errors='coerce')
0.2.3 Handling outliers
[8]: # Identify and remove outliers using Z-score for 'Acceptance Rate'
data = data[(np.abs(zscore(data['Acceptance Rate'])) < 3)]

# Replace outliers with the median for 'Expenditures/Student'
data['Expenditures/Student'] = np.where((np.abs(zscore(data['Expenditures/Student'])) < 3),
                                        data['Expenditures/Student'],
                                        data['Expenditures/Student'].median())

# Replace outliers with the median for 'Top 10% HS Graduation %'
data['Top 10% HS Graduation %'] = np.where((np.abs(zscore(data['Top 10% HS Graduation %'])) < 3),
                                           data['Top 10% HS Graduation %'],
                                           data['Top 10% HS Graduation %'].median())

0.3 Clustering analysis

K-means clustering  K-Means clustering is an iterative process that divides a dataset into K distinct, non-overlapping clusters. K centroids are initially placed at random in the feature space. Each data point is then assigned to its closest centroid, forming clusters, and the centroids are recalculated as the average of the data points in each cluster. This assignment-and-update process is repeated until the centroids stop changing appreciably or a predetermined number of iterations is reached. K-Means optimizes the cluster centroids to minimize the within-cluster sum of squares, reducing the variance inside each cluster. The final output consists of the K cluster centroids and the data points assigned to them. The elbow method or the silhouette score can be used to select an appropriate number of clusters (K), which improves the algorithm's ability to identify meaningful patterns in the data.

[18]: from sklearn.cluster import KMeans

# data_encoded is assumed to be a one-hot-encoded copy of data
# (e.g. pd.get_dummies(data, columns=['School Type'])); the cell that creates it is not shown above.

# Choose the number of clusters (K) - you can use techniques like the elbow method to find the optimal K
k = 3

# Initialize the KMeans model
kmeans = KMeans(n_clusters=k, n_init=10)

# Fit the model to your data_encoded
kmeans.fit(data_encoded)

# Get cluster labels for each data point
cluster_labels = kmeans.labels_

# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels

# Display the first few rows of the DataFrame with cluster labels
print(data_encoded.head())

          Median  SAT  Acceptance Rate  Expenditures/Student  \
Amherst     1315   22            26636                  85.0
Barnard     1220   53            17653                  69.0
Bates       1240   36            17554                  58.0
Berkeley    1176   37            23665                  95.0
Bowdoin     1300   24            25703                  78.0

          Top 10% HS Graduation %  School Type_Lib Arts  School Type_University  Cluster
Amherst                      93.0                  True                   False        0
Barnard                      80.0                  True                   False        0
Bates                        88.0                  True                   False        0
Berkeley                     68.0                 False                    True        0
Bowdoin                      90.0                  True                   False        0

0.3.1 Aspects of the k-means

Elbow Method for Optimal K:  The Elbow Method graph shows the distortion (inertia) as the number of clusters rises. It helps locate the K beyond which adding more clusters does not considerably improve the model. In the graph, we look for the point where the rate of decline changes sharply, forming an "elbow." The idea is to strike a balance between capturing the variability in the data and avoiding overfitting. A more distinct elbow indicates a clearer preference for the number of clusters.
[46]: import warnings
import matplotlib.pyplot as plt
warnings.simplefilter(action='ignore', category=FutureWarning)

distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(data_encoded)
    distortions.append(kmeanModel.inertia_)

plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

Cluster statistics  The character of each cluster can be seen by computing statistics such as the mean and median of each attribute within the clusters. Median values reveal the central tendency, which is less influenced by outliers, while mean values reveal the average behavior of the data points in a cluster. Statistical analysis of these characteristic features makes the clusters easier to compare and interpret.

[48]: cluster_stats = data_encoded.groupby('Cluster').agg(['mean', 'median'])
print(cluster_stats)

              Median          SAT                 Acceptance Rate
                mean  median        mean  median            mean   median
Cluster
0        1335.000000  1350.0   25.571429    19.0    51076.428571  48123.0
1        1246.235294  1248.5   39.470588    37.0    22194.470588  22077.0
2        1253.571429  1280.0   45.000000    45.0    36935.285714  38597.0

        Expenditures/Student          Top 10% HS Graduation %
                        mean  median                    mean  median
Cluster
0                  86.142857    90.0               89.428571    90.0
1                  71.264706    72.0               83.632353    85.0
2                  73.142857    74.0               79.857143    77.0

        School Type_Lib Arts          School Type_University
                        mean  median                    mean  median
Cluster
0                   0.000000     0.0                1.000000     1.0
1                   0.735294     1.0                0.264706     0.0
2                   0.000000     0.0                1.000000     1.0

Visualizing clusters  Visualization is essential for an intuitive understanding of how the clusters are distributed in the dataset. We can visualize clusters in 2D (or 3D) space by applying PCA to reduce the number of dimensions. Each dot in the figure represents a data point, and its color indicates the cluster it belongs to. Visualization is an effective tool for understanding cluster analysis because it frequently exposes patterns that are difficult to see in the raw data.

[49]: from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # Change to 3 for 3D visualization
reduced_features = pca.fit_transform(data_encoded)

plt.scatter(reduced_features[:,0], reduced_features[:,1], c=cluster_labels, cmap='viridis')
plt.title('PCA - K-means Clustering')
plt.show()

Cluster profiling  Cluster centroids, which reflect the average of the attributes in each cluster, are analyzed during cluster profiling. Examining these centroids reveals the key characteristics of each cluster, which makes it easier to meaningfully name and interpret the clusters. Stakeholders can then use the cluster profiles to inform their decisions.
[53]: k = 8
kmeans = KMeans(n_clusters=k, n_init=10)
kmeans.fit(data_encoded)
cluster_centers = kmeans.cluster_centers_

# Now cluster_centers will have the shape (k, 8), where k is the number of clusters
print(cluster_centers)

[[1.23568750e+03 3.99375000e+01 1.88721250e+04 6.57500000e+01 8.26250000e+01 8.75000000e-01 1.25000000e-01 1.00000000e+00]
 [1.31700000e+03 2.80000000e+01 4.65950000e+04 8.15000000e+01 8.97500000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
 [1.26544444e+03 3.58888889e+01 2.68982222e+04 7.98888889e+01 8.66111111e+01 6.66666667e-01 3.33333333e-01 1.00000000e+00]
 [1.35350000e+03 2.45000000e+01 5.46170000e+04 9.25000000e+01 8.95000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
 [1.25250000e+03 5.25000000e+01 3.22445000e+04 6.95000000e+01 8.15000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
 [1.37000000e+03 1.80000000e+01 6.19210000e+04 9.20000000e+01 8.80000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
 [1.25400000e+03 4.20000000e+01 3.88116000e+04 7.46000000e+01 7.92000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
 [1.24577778e+03 4.22222222e+01 2.33971111e+04 7.24444444e+01 8.24444444e+01 5.55555556e-01 4.44444444e-01 1.00000000e+00]]

0.4 Hierarchical clustering

Hierarchical clustering is a cluster analysis technique that creates a hierarchy of clusters. It first treats each data point as a separate cluster and then iteratively merges or divides clusters to produce a tree-like structure known as a dendrogram. The final clusters are determined by where the dendrogram is cut. Unlike K-Means, hierarchical clustering does not need a predetermined number of clusters, which makes it very helpful when the underlying data is hierarchical by nature or when the number of clusters is unknown in advance. The dendrogram is a visual representation of the links between clusters that lets analysts decide on a suitable number of clusters based on the structure of the data. Compared to K-Means, however, it can be computationally demanding for large datasets, and the interpretation of the clusters may be less objective.

[54]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Perform hierarchical clustering
hierarchical_clusters = linkage(data_encoded, method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(hierarchical_clusters, labels=data_encoded.index, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Universities')
plt.ylabel('Distance')
plt.show()

Single-linkage clustering  Single-linkage clustering, also known as the nearest-neighbor or minimum method, measures the distance between the closest points of two clusters. It is sensitive to outliers and prone to producing elongated clusters. Single-linkage can suffer from the chaining phenomenon, in which clusters are merged on the basis of just one or a few similar data points. This approach may have trouble with complex cluster shapes, but it can be effective for compact and well-separated clusters.
[57]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Perform single-linkage hierarchical clustering and create dendrogram
single_linkage_dendrogram = dendrogram(linkage(data_encoded, method='single'))

# Display the dendrogram
plt.title('Single-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

0.4.1 Complete linkage

Complete-linkage clustering, or the furthest-neighbor method, measures the distance between the farthest points of two clusters. Compared to single-linkage, it is less susceptible to outliers and tends to form tight, spherical clusters. Because it is less affected by noise and outliers, complete-linkage is especially helpful for locating dense, well-defined clusters. It can, however, have trouble handling elongated clusters.

[59]: # Perform complete-linkage hierarchical clustering and create dendrogram
complete_linkage_dendrogram = dendrogram(linkage(data_encoded, method='complete'))

# Display the dendrogram
plt.title('Complete-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

Group average  Group-average clustering computes the average distance between every pair of points drawn from two clusters. It strikes a balance between the single-linkage and complete-linkage techniques: it is less susceptible to chaining effects than single-linkage and less sensitive to noise and outliers. It is frequently chosen when the data contains a mixture of compact and elongated clusters, since it can handle clusters with different densities and shapes.

[60]: # Perform group average hierarchical clustering and create dendrogram
average_linkage_dendrogram = dendrogram(linkage(data_encoded, method='average'))

# Display the dendrogram
plt.title('Group Average Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

0.5 Density-based clustering

Density-based clustering locates clusters using the density of data points in the feature space. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the most widely used density-based clustering techniques. DBSCAN groups data points that are densely packed together and classifies points in low-density areas as outliers. Because the number of clusters does not need to be specified in advance, it can find clusters of arbitrary shape.
[61]: from sklearn.cluster import DBSCAN

# Initialize the DBSCAN model with appropriate parameters
# `eps` controls the maximum distance between two samples for one to be considered as in the neighborhood of the other
# `min_samples` is the number of samples (or total weight) in a neighborhood for a point to be considered as a core point
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit the DBSCAN model to your preprocessed data
cluster_labels = dbscan.fit_predict(data_encoded)

# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels

# Check the clusters
print(data_encoded['Cluster'].value_counts())

Cluster
-1    48
Name: count, dtype: int64

[65]: import pandas as pd
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Assuming 'data' is your DataFrame containing the features you want to cluster
# For example, if you want to cluster based on 'SAT' and 'Expenditures/Student':
features = ['SAT', 'Expenditures/Student']
X = data[features]

# Instantiate and fit DBSCAN model
dbscan = DBSCAN(eps=0.3, min_samples=5)  # You might need to adjust 'eps' and 'min_samples' based on your data
clusters = dbscan.fit_predict(X)

# Add the cluster labels to the original DataFrame
data['Cluster'] = clusters

# Plotting the clusters
plt.figure(figsize=(8, 6))
for cluster_id in data['Cluster'].unique():
    if cluster_id == -1:  # -1 represents noise points in DBSCAN
        plt.scatter(data.loc[data['Cluster'] == cluster_id, 'SAT'],
                    data.loc[data['Cluster'] == cluster_id, 'Expenditures/Student'],
                    label='Noise', color='gray', alpha=0.5)
    else:
        plt.scatter(data.loc[data['Cluster'] == cluster_id, 'SAT'],
                    data.loc[data['Cluster'] == cluster_id, 'Expenditures/Student'],
                    label=f'Cluster {cluster_id}')

plt.xlabel('SAT Scores')
plt.ylabel('Expenditures per Student')
plt.title('DBSCAN Clustering')
plt.legend()
plt.show()

0.5.1 OPTICS (Ordering Points To Identify the Clustering Structure)

OPTICS is a flexible density-based clustering technique that finds clusters of varying densities and shapes in large datasets. OPTICS is unusual in that it can find clusters without requiring the number of clusters to be known in advance, which makes it especially helpful when the underlying structure of the data is complex and poorly defined. By ordering the data points according to their reachability distance, OPTICS reveals the underlying clustering structure in the form of a reachability plot. Using OPTICS, we avoid assumptions about the sizes or shapes of the clusters and still obtain useful insight into the natural groups present in the data.
[70]: from sklearn.cluster import OPTICS

# Assuming you have defined your data matrix X
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
clusters = clusterer.fit_predict(X)

# Print unique cluster labels
print("Unique Cluster Labels:", set(clusters))

Unique Cluster Labels: {0, 1, 2, -1}

0.5.2 Visualize the clusters

[20]: from sklearn.cluster import OPTICS

# Define your feature matrix X
X = data_encoded.drop('Cluster', axis=1)

# Reset the index of the DataFrame
X.reset_index(drop=True, inplace=True)

# Initialize the OPTICS clusterer
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)

# Perform clustering
clusters = clusterer.fit_predict(X)

plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel('Median')
plt.ylabel('SAT')
plt.title('OPTICS Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()

0.5.3 Mean Shift

Mean Shift is an intuitive, non-parametric clustering approach that finds clusters without assuming a predetermined shape for them. Unlike many other clustering approaches, it does not require prior knowledge of the number of clusters, which makes it adaptable to a wide range of datasets. The method works by iteratively shifting data points toward the mode (peak) of the underlying data distribution; clusters form naturally as points converge toward local maxima. Mean Shift is robust to outliers and especially useful for capturing intricate cluster patterns. Its flexibility in finding clusters of varied shapes lets us uncover hidden patterns in the data without making strict assumptions about cluster geometry.

[75]: from sklearn.cluster import MeanShift

clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)
print(clusters)

[42 10 27 23  0 40  6 20  2 26 15 21 34 33 25 37 41  1 22 45 29 13 38 32
  3 14  9 12 30 46 39  5 44  0  4 16  7 19 24 17 31 18  8 35 11 28 36 43]

0.5.4 Visualizing the clusters

[79]: import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift

# Assuming you have defined your data matrix X as a pandas DataFrame
clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel('Median')
plt.ylabel('SAT')
plt.title('MeanShift Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()

0.6 Summary

We used a variety of approaches during the cluster analysis process to identify underlying patterns in a dataset. We started by managing missing values, converting data types, and standardizing features as part of the data preprocessing step. We then applied the partitioning technique known as k-means clustering to organize related data points into discrete clusters, and we improved the clustering model by choosing the number of clusters with the elbow method. Cluster visualizations made the results easier to grasp. After that, we explored hierarchical clustering and created dendrograms using a variety of linkage techniques, including single, complete, and group average; these dendrograms gave an understanding of the hierarchical relationships between the data points. Finally, density-based clustering techniques such as DBSCAN were used to find dense clusters of data points.
We also discussed the OPTICS and Mean Shift algorithms, which help recognize clusters of different densities and allow for more flexible cluster shapes. Through these techniques, we gained a thorough understanding of the underlying structures in the datasets, which facilitated efficient analysis and interpretation.
