I believe the Q4 data is from
http://archive.ics.uci.edu/ml/datasets/Pen-Based+R…
My instructor just typed this (see the 11/28 lecture):
from sklearn import datasets
digits = datasets.load_digits()
For 6e, there are typos in the original handout: the references to 5c and 5d should be 6c and 6d.
Problem 4: Classifying digits using K-means
Grading criteria: correctness of code and thoroughness of analysis.
4a. Use the K-means classifier demonstrated in lecture on 11/30 to sort the digits dataset used in lecture on 11/28 into 10 clusters. Note that you should not enter the labels of the digits, just the 8×8
arrays.
In [ ]:
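One way 4a might look, assuming scikit-learn's `KMeans` stands in for the classifier demonstrated in lecture (the exact code and parameters from 11/30 may differ):

```python
# Sketch for 4a: cluster the digit images into 10 groups without
# ever showing the model the labels. Only digits.data (the flattened
# 8x8 arrays) is passed to fit_predict; digits.target is unused here.
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(digits.data)  # one cluster id per image

print(len(cluster_ids))
```

`random_state=0` is only there to make the run reproducible; any seed works.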
4b. Relabel the clusters so that the label of each cluster is the most common digit in that cluster.
In [ ]:
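A possible sketch for 4b. It refits the clustering from 4a so the cell runs on its own, and uses the true labels only to *name* each cluster after the fact:

```python
# Sketch for 4b: for each cluster, find the most common true digit
# among its members, then relabel every point accordingly.
import numpy as np
from collections import Counter
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()
cluster_ids = KMeans(n_clusters=10, n_init=10,
                     random_state=0).fit_predict(digits.data)

relabel = {}
for c in range(10):
    members = digits.target[cluster_ids == c]       # true digits in cluster c
    relabel[c] = Counter(members).most_common(1)[0][0]

# predicted[i] is the digit the cluster-label model assigns to image i
predicted = np.array([relabel[c] for c in cluster_ids])
```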
4c. How effective would a model be which takes an array corresponding to a digit, sorts it into a cluster, and predicts that the digit is the same as the label of that cluster? What percent of the digits
would be correctly labeled? Which digit would be correctly labeled most frequently? Least frequently?
In [ ]:
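A sketch of the computation 4c asks for, rebuilt from scratch so the cell is self-contained; the per-digit accuracies answer the most/least frequently questions:

```python
# Sketch for 4c: overall and per-digit accuracy of the model that
# predicts each image's digit to be its cluster's majority label.
import numpy as np
from collections import Counter
from sklearn import datasets
from sklearn.cluster import KMeans

digits = datasets.load_digits()
cluster_ids = KMeans(n_clusters=10, n_init=10,
                     random_state=0).fit_predict(digits.data)
relabel = {c: Counter(digits.target[cluster_ids == c]).most_common(1)[0][0]
           for c in range(10)}
predicted = np.array([relabel[c] for c in cluster_ids])

accuracy = np.mean(predicted == digits.target)
per_digit = {d: np.mean(predicted[digits.target == d] == d) for d in range(10)}

print(f"overall: {accuracy:.1%}")
print("most often correct:", max(per_digit, key=per_digit.get))
print("least often correct:", min(per_digit, key=per_digit.get))
```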
4d. How does this model compare to the support vector machine we used in lecture on 11/28? Why do you think this is?
In [ ]:
Problem 5: Visualizing digits with principal component analysis
Grading criteria: correctness of code
5a. Use principal component analysis as demonstrated in lecture on 11/30 to project the points of the digit dataset into two dimensions.
In [ ]:
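A possible 5a, assuming scikit-learn's `PCA` is the method demonstrated on 11/30:

```python
# Sketch for 5a: project the 64-dimensional digit vectors onto their
# first two principal components, giving one (x, y) point per image.
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()
projected = PCA(n_components=2).fit_transform(digits.data)
print(projected.shape)  # (1797, 2)
```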
5b. Make two different plots of this projection, one color-coded using the correct labels of the dataset and one color-coded using the labels you obtained in problem 4 using K-means classification.
In [ ]:
Problem 6: Implementing two-dimensional K-means
Grading criteria: correctness of code
The goal of this problem is to implement K-means clustering for a two-dimensional dataset. Throughout this problem, we’ll represent a point in the plane as a tuple (x,y) containing two floats.
6a. Write a Python function that, given two points in the plane, returns the distance between the two points.
In [ ]:
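A minimal sketch of 6a:

```python
# Sketch for 6a: Euclidean distance between two points given as
# (x, y) tuples of floats.
import math

def distance(p, q):
    """Distance between p = (x1, y1) and q = (x2, y2)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

print(distance((0.0, 0.0), (3.0, 4.0)))  # 5.0
```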
6b. Write a Python function that, given a list of k points [p1,…,pk] in the plane and another point q, returns the point pi in the list that is closest to q.
In [ ]:
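A sketch of 6b, with the 6a distance function repeated so the cell stands alone:

```python
# Sketch for 6b: find the point in a list nearest to a query point q.
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest(points, q):
    """Return the point in `points` nearest to q (ties go to the first)."""
    return min(points, key=lambda p: distance(p, q))

print(closest([(0, 0), (5, 5), (2, 2)], (1.9, 2.2)))  # (2, 2)
```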
6c. Write a Python function that takes as input a list of K ‘center points’ [p1,…,pk] in the plane and a list l of points in the plane. It should then return a dictionary with K keys, corresponding to the
points p1, …, pk, such that the value corresponding to the key pi is a list of all the points in l which are closer to pi than to any of the other center points.
In [ ]:
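One way to write 6c, with the helpers from 6a and 6b repeated so the cell runs on its own:

```python
# Sketch for 6c: group every point in l under its nearest center.
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest(points, q):
    return min(points, key=lambda p: distance(p, q))

def assign_to_centers(centers, l):
    """Dictionary mapping each center to the list of points in l nearest it."""
    clusters = {c: [] for c in centers}
    for point in l:
        clusters[closest(centers, point)].append(point)
    return clusters

print(assign_to_centers([(0, 0), (10, 10)], [(1, 1), (9, 9), (0, 2)]))
```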
6d. Write a Python function that, given a list l of points in the plane, returns the centroid. The x-coordinate of the centroid is the average of the x-coordinates of the points in l, and the y-coordinate of
the centroid is the average of the y-coordinates of the points in l.
In [ ]:
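A minimal 6d:

```python
# Sketch for 6d: the centroid is the coordinatewise average.
def centroid(l):
    xs = [p[0] for p in l]
    ys = [p[1] for p in l]
    return (sum(xs) / len(l), sum(ys) / len(l))

print(centroid([(0, 0), (2, 0), (1, 3)]))  # (1.0, 1.0)
```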
6e. Combining the above functions, write a Python function to implement K-means clustering. Your function should take as input a list l of points in the plane and a positive integer K. It should then
choose K random points in the plane and sort the points of l into K lists using your function from 6c. It should use your function from 6d to compute the centroid of each list, and rerun your function
from 6c using these K points. It should repeat this process until the lists stop changing, then return the K lists.
In [ ]:
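One way to assemble 6e from the earlier pieces (all helpers repeated so the cell is self-contained). Drawing the K random starting points from the bounding box of l is my choice here; the problem does not pin that down:

```python
# Sketch for 6e: full K-means loop built from the 6a-6d functions.
import math
import random

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest(points, q):
    return min(points, key=lambda p: distance(p, q))

def assign_to_centers(centers, l):
    clusters = {c: [] for c in centers}
    for point in l:
        clusters[closest(centers, point)].append(point)
    return clusters

def centroid(l):
    return (sum(p[0] for p in l) / len(l), sum(p[1] for p in l) / len(l))

def k_means(l, K):
    # K random starting centers inside the bounding box of l.
    xs, ys = [p[0] for p in l], [p[1] for p in l]
    centers = [(random.uniform(min(xs), max(xs)),
                random.uniform(min(ys), max(ys))) for _ in range(K)]
    old_lists = None
    while True:
        clusters = assign_to_centers(centers, l)
        lists = list(clusters.values())
        if lists == old_lists:          # assignments stopped changing
            return lists
        old_lists = lists
        # Move each center to its cluster's centroid; a center with an
        # empty cluster simply stays where it is.
        centers = [centroid(pts) if pts else c
                   for c, pts in clusters.items()]
```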
6f. Apply your function to the two-dimensional projection of the digits dataset that you made in 5a. Make a plot of this projection, colored using the labels generated by your function.
Machine learning with scikit-learn
A great package for machine learning in Python is scikit-learn. It is open source and works with packages like NumPy and Matplotlib that we have already worked with.
Today we’ll see an example of machine learning: predicting handwritten digits. Let’s load the digits dataset and take a look at what it contains.
In [1]: from sklearn import datasets
digits = datasets.load_digits()
In [2]: len(digits.data)
Out[2]: 1797
In [3]: digits.images[0], digits.target[0]
Out[3]: (array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
                [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
                [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
                [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
                [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
                [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
                [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
                [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]]), 0)
The dataset consists of 1797 very low resolution pictures of digits, along with the digits that are being represented. Let’s plot one of these images.
In [4]: import matplotlib.pyplot as plt
In [5]: plt.matshow(digits.images[0], interpolation="nearest", cmap=plt.cm.gray_r)
Out[5]: AxesImage(43.2, 25.92; 223.2x223.2)
Support vector classification
To build our model, we’ll use a support vector machine. Understanding exactly how these work is beyond the scope of this class, but I’ll try to give the basic idea.
Each data point in the digits dataset consists of an 8 by 8 array, so we can think of our data as sitting in 64-dimensional space. We know which points correspond to the digits 0,1,…,9, and we hope that
nearby points in this 64-dimensional space should correspond to the same digit. This seems plausible, as if two points are very close to each other the images should look similar, just with slightly
different shading.
Our goal is therefore to partition the space of all possible arrays into 10 pieces, one for each digit. A basic (and computationally simple) way to split up n-dimensional space into pieces is by using
hyperplanes. In 2-space, this means we would use a line to split the plane into two pieces. In 3-space, we would use a plane to split our space into two pieces, and in general n-space, we use a linear
(n-1)-dimensional subspace to split it into two pieces. Using multiple hyperplanes allows us to split our space into more than two pieces.
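Concretely, a hyperplane is the zero set of a linear function, and the two pieces are where that function is positive or negative. A toy illustration (my own example, not from lecture):

```python
# Toy illustration: the line x + y = 1 splits the plane in two, and the
# sign of x + y - 1 tells you which side of the line a point is on.
def side(point, w=(1.0, 1.0), b=-1.0):
    """+1 if point is on the positive side of the hyperplane w.x + b = 0."""
    value = w[0] * point[0] + w[1] * point[1] + b
    return 1 if value > 0 else -1

print(side((2.0, 2.0)))  # 1  (above the line)
print(side((0.0, 0.0)))  # -1 (below the line)
```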
Given a pair of digits, we hope to find a hyperplane in 64-space that completely separates the pair. For example, we might be able to separate 0 from 8 by seeing if the center of the array has large or
small numbers. There are two natural questions that we have to think about at this point:
• If there are multiple hyperplanes that separate two digits, which one should we choose?
• What do we do if there are no hyperplanes separating the digits?
To choose the best hyperplane separating two classes of data, we need to decide what “best” should mean. There are plenty of valid ways to define it in this situation. One possibility is to choose
the hyperplane that is as far as possible from the nearest data point on each side. This is demonstrated by the following picture from Wikipedia.
[Picture from Wikipedia: two classes of points in the plane, axes X1 and X2, with several candidate separating hyperplanes labeled H1, H2, H3.]
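In scikit-learn the support vector classifier lives in `sklearn.svm`. A minimal sketch of training it on the digits data (the `gamma` value here is a common choice for this dataset, not necessarily the one used in lecture):

```python
# Sketch: train an SVM on most of the digits, test on the last 100.
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001)
clf.fit(digits.data[:-100], digits.target[:-100])   # train on all but 100
predictions = clf.predict(digits.data[-100:])       # held-out images
accuracy = (predictions == digits.target[-100:]).mean()
print(f"held-out accuracy: {accuracy:.1%}")
```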