Clustering is an unsupervised machine learning method for grouping similar records in a dataset so that the data can be more easily understood and manipulated. One such algorithm, k-means, partitions the data into k groups by repeatedly assigning each point to its nearest cluster center. Some real-world examples of its use include fake news identification, fantasy league stat analysis, insurance fraud detection, and customer/market segmentation.
To perform a k-means analysis, complete the following:
Access the UCI Machine Learning Repository at https://archive.ics.uci.edu/datasets. Note: There are about 120 datasets that are suitable for use in a clustering task. For this part of the exercise, you must choose two of these datasets, provided they include at least 10 attributes and 10,000 instances.
For your selected datasets, build a k-means clustering model.
- Start by choosing the number of clusters. Discuss how you would find the optimal number of clusters that best fits the dataset.
- Randomly pick k initial centroids (the points that will serve as the centers of your clusters) in d-dimensional space; they need not be points from your dataset. Try to place them near the data but distinct from one another.
- Assign each data point to its closest centroid, using the Euclidean distance. This forms your k clusters.
- Move each centroid to the average location of the data points assigned to it.
- Repeat the preceding two steps until the assignments do not change or change very little.
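The steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a prescribed solution: the function name `kmeans`, its parameters, and the tiny demo dataset are all hypothetical, and initial centroids are drawn from the data points as one simple choice among several.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialisation: k distinct data points (an assumption of this sketch;
    # the centroids do not have to come from the dataset).
    centroids = rng.sample(sorted(set(points)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared Euclidean distances.
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, wcss


# Made-up demo data: two visually separate groups in 2-D.
demo = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, labels, wcss = kmeans(demo, k=2)
print(centroids, labels, round(wcss, 3))
```

For datasets of the size required here, a vectorised or library implementation would be far faster; the loop form is used only to keep each step visible.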
Note: A key objective is to minimize the variation within the clusters, defined as the sum of squared Euclidean distances between the items and their corresponding centroids.
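One common way to write this objective is shown below. The symbols are assumptions introduced here for illustration: $x_i$ is a data point, $C_j$ is the set of points assigned to cluster $j$, $\mu_j$ is the centroid of that cluster, and $J$ is the total within-cluster variation.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

For a fixed assignment, the mean $\mu_j$ is the point that minimizes the inner sum, which is why moving each centroid to the average of its assigned points never increases $J$.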
- Explain the dataset and the type of information you wish to gain by applying a clustering method.
- Explain the k-means algorithm and how you will be using it in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions).
- Import the necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
- Clean the data and address unusual phenomena (e.g., outliers); use illustrative diagrams and plots and explain them.
- Formulate two questions that can be answered by performing a clustering analysis using k-means.
- Use the elbow method to find the optimal number of clusters for your chosen dataset. Justify your chosen (final) value of k.
- Perform k-means analysis. Explain the intuition behind each mathematical step.
- Interpret the results in the context of the questions you asked.
- Discuss how you minimized the variation within the clusters.
- Validate your model. Then, explain the results.
- Include all mathematical formulas used and graphs representing the final outcomes.
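The elbow method named in the list above can be sketched as follows: fit k-means for a range of k values, record the within-cluster sum of squares (WCSS) for each, and look for the k after which further increases yield only small reductions. This is a hedged, standard-library illustration; the helper `kmeans_wcss`, the synthetic three-blob data, the range of k, and the use of five random restarts per k are all assumptions made for the example, not part of the assignment.

```python
import math
import random


def kmeans_wcss(points, k, seed, max_iter=100):
    """Run one illustrative k-means fit and return its WCSS."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k data points as starting centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Synthetic 2-D data around three made-up centres (for illustration only).
data_rng = random.Random(42)
centres = [(0.0, 0.0), (6.0, 0.0), (3.0, 6.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in centres
    for _ in range(40)
]

# For each k, keep the best (lowest) WCSS over a few random restarts.
curve = {}
for k in range(1, 7):
    curve[k] = min(kmeans_wcss(points, k, seed=s) for s in range(5))

for k, w in curve.items():
    print(f"k={k}  WCSS={w:.1f}")
```

In a notebook, plotting `curve` as a line chart of WCSS against k makes the bend easier to see. The elbow is a visual heuristic, so the chosen k still needs to be justified in the context of the dataset.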
Prepare a comprehensive technical report as a Jupyter notebook, including all code, code comments, outputs, plots, and analysis. Make sure the project documentation contains:
a) Problem statement
b) Algorithm of the solution
c) Analysis of the findings
d) References