Brooklyn College Programming Worksheet

Programming Assignment: PCA (Machine Learning)
Total points: 100
Note: This assignment is for each individual student to complete on his or her
own.
In this assignment, you will implement PCA to compress images. To get started,
you will need to download the starter code and unzip its contents to the directory
where you wish to complete the assignment.
The problem considered in this assignment is to compress the image downloaded
from https://leafyplace.com/types-of-birds/
You are required to complete the following steps:
1. Implement PCA to reduce the dimension of the data.
2. Construct the compressed image.
To get started, open the main script assignmentPCA.m. You are required to
modify this script as well as the other three scripts:
• findPCs.m – Function to find the principal components
• PCAtransform.m – Function to transform data into the PC space
• PCAtransform_inv.m – Function to transform data back to the
original space
What to submit?
A zip file that includes the following items:
1) All code (70 points)
2) A report that includes (30 points):
a. (20 points) Compressed images when K = 5, 30, 100
b. (5 points) Explain the impact of K
c. (5 points) Describe how your current implementation can be
potentially improved to achieve better performance.
1  Introduction to Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of
input features while capturing the intrinsic dimensionality of the data.
It has the following advantages:
• It enables the compression of the data and thus reduces the storage
space required.
• It reduces the computational cost: fewer features lead to less
computation.
• It allows the training data to be visualized for better analysis, if
the number of dimensions is reduced to 2-D or 3-D.
• It can improve the performance of machine learning models by
removing redundant features that are correlated with the principal
ones.
There are two types of dimensionality reduction algorithms: feature selection and feature extraction. In the following lecture,
we will introduce some of the most widely used feature selection and
feature extraction algorithms.
2  Feature Selection
Feature selection is the process of selecting a subset of features for
model training. It is often used as a filter that excludes irrelevant and
redundant features, which do not contribute to, or may even decrease, the
accuracy of a predictive model.
The objectives of feature selection include 1) improving the speed
and predictive accuracy of the predictive models, and 2) enhancing the
simplicity and comprehensibility of the results.
There are three general classes of feature selection methods:
• Filter methods: e.g., ranking methods, clustering, etc.
• Wrapper methods: e.g., recursive feature elimination, genetic
algorithms, simulated annealing, etc.
• Embedded methods: e.g., ridge regression, LASSO, random
forests, etc.

2.1  Filter Methods
Filter methods use a statistical measure to assess the merit of each
feature. The features are then ranked by their merits and those with
low merits are removed from the data.
The ranking methods assess the relevance (or importance) of each
feature to the output and select the top k features that are most
relevant to the output.
2.1.1  Pearson's Correlation
Suppose X_i ∈ R^{N×1} is the i-th feature vector and Y ∈ R^{N×1} is the
output vector. The Pearson correlation coefficient ρ(X_i, Y) can be
computed by
$$\rho(X_i, Y) = \frac{\mathrm{cov}(X_i, Y)}{\sigma_{X_i}\,\sigma_Y}
= \frac{\sum_{j=1}^{N}(x_{ij}-\bar{x}_i)(y_j-\bar{y})}
{\sqrt{\sum_{j=1}^{N}(x_{ij}-\bar{x}_i)^2}\;\sqrt{\sum_{j=1}^{N}(y_j-\bar{y})^2}}$$
where
• x_{ij} is the j-th element of X_i
• x̄_i = (1/N) Σ_{j=1}^{N} x_{ij}
• ȳ = (1/N) Σ_{j=1}^{N} y_j.
This measure describes how correlated the i-th feature and the
output are: the larger its absolute value, the stronger the correlation.
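As an illustration (not part of the assignment starter code), here is a minimal
MATLAB sketch that ranks features by the absolute value of their Pearson
correlation with the output; the data matrix X and output Y below are made up
for the example.

    % X: N-by-p feature matrix, Y: N-by-1 output vector (hypothetical data)
    X = randn(100, 5);
    Y = 2*X(:,3) - X(:,1) + 0.1*randn(100, 1);

    p = size(X, 2);
    rho = zeros(p, 1);
    for i = 1:p
        xc = X(:,i) - mean(X(:,i));          % center the i-th feature
        yc = Y - mean(Y);                    % center the output
        rho(i) = sum(xc .* yc) / (sqrt(sum(xc.^2)) * sqrt(sum(yc.^2)));
    end
    [~, order] = sort(abs(rho), 'descend');  % rank features by |rho|
    topFeatures = order(1:2);                % keep the top k = 2 features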
2.1.2  Mutual Information
Mutual information measures the amount of information a random
variable contains about another random variable. It equals zero if
and only if the two random variables are independent.
Entropy: a measure that quantifies the notion of information.
The entropy of a continuous random variable X with pdf f_X(x) is
$$H(X) = -\int_{\mathbb{R}} f_X(x)\,\log f_X(x)\,dx = -E[\log f_X(x)]$$
For discrete random variables, the integral is replaced by a sum. The entropy
measures the expected uncertainty in X and describes how much information
we learn on average from X.
Example 1 Consider a Bernoulli random variable X, which takes two
possible values 0 and 1, with P (X = 1) = p. Calculate the entropy of
X.
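For reference, a short worked solution using the discrete form of the entropy
defined above:
$$H(X) = -\sum_{x\in\{0,1\}} P(X=x)\log P(X=x) = -p\log p - (1-p)\log(1-p)$$
which equals 0 when p = 0 or p = 1, and is maximized at p = 1/2 (one bit, if
the logarithm is taken in base 2).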
Now consider two random variables X, Y jointly distributed with
joint pdf fXY (x, y).
Joint entropy:
$$H(X, Y) = -\int\!\!\int f_{XY}(x, y)\,\log f_{XY}(x, y)\,dx\,dy$$
which measures how much uncertainty is in the two random variables
taken together.
Conditional entropy of X given Y:
$$H(X|Y) = -\int\!\!\int f_{XY}(x, y)\,\log f_{X|Y}(x|y)\,dx\,dy$$
which measures how much uncertainty remains in the random variable X
when we know the value of Y.
Mutual information of the two random variables:
$$I(X; Y) = \int\!\!\int \log\frac{f_{XY}(x, y)}{f_X(x) f_Y(y)}\, f_{XY}(x, y)\,dx\,dy$$
$$= \int\!\!\int \log\frac{f_{Y|X}(y|x)}{f_Y(y)}\, f_{XY}(x, y)\,dx\,dy$$
$$= \int\!\!\int \left[\log f_{Y|X}(y|x) - \log f_Y(y)\right] f_{XY}(x, y)\,dx\,dy$$
$$= \int\!\!\int \log f_{Y|X}(y|x)\, f_{XY}(x, y)\,dx\,dy - \int\!\!\int \log f_Y(y)\, f_{XY}(x, y)\,dx\,dy$$
$$= H(Y) - H(Y|X)$$
We can also obtain
$$I(X; Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y) = H(X, Y) - H(X|Y) - H(Y|X)$$
Example 2 Consider a data set that describes the relationship between
the blood type X and the chance of skin cancer Y. The joint probabilities
P(X, Y) are provided in Table 1. Calculate the conditional entropies
H(X|Y) and H(Y|X) and the mutual information I(X; Y).

              A      B      AB     O
  Very low   1/8    1/16   1/32   1/32
  Low        1/16   1/8    1/32   1/32
  Medium     1/16   1/16   1/16   1/16
  High       1/4     0      0      0

Table 1: Joint probabilities P(X, Y).
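As a sanity check (not part of the original notes), here is a minimal MATLAB
sketch that computes these quantities from Table 1, taking logarithms in base 2:

    % joint probability table P(X,Y): rows = cancer risk Y, columns = blood type X
    P = [1/8  1/16 1/32 1/32;
         1/16 1/8  1/32 1/32;
         1/16 1/16 1/16 1/16;
         1/4  0    0    0];

    Px = sum(P, 1);                              % marginal distribution of X
    Py = sum(P, 2);                              % marginal distribution of Y

    Hxy = -sum(P(P>0)   .* log2(P(P>0)));        % joint entropy H(X,Y)
    Hx  = -sum(Px(Px>0) .* log2(Px(Px>0)));      % H(X)
    Hy  = -sum(Py(Py>0) .* log2(Py(Py>0)));      % H(Y)

    HxGivenY = Hxy - Hy;                         % H(X|Y) = H(X,Y) - H(Y)
    HyGivenX = Hxy - Hx;                         % H(Y|X) = H(X,Y) - H(X)
    Ixy      = Hx + Hy - Hxy;                    % mutual information I(X;Y)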
The advantage of the ranking methods is that they are fast.
The disadvantage is that the ranking methods consider each feature separately and ignore dependencies among the features. Therefore, the selected top k features may not be the best ones.
2.2  Wrapper Methods
Wrapper methods formulate the selection of a subset of features as
a search problem, in which different subsets of features are compared
according to their contribution to the accuracy of the predictive model.
In particular, suppose w ∈ Rn×1 is a vector with the i-th element
w[i] equal to 1 if the i-th feature belongs to the selected set of features.
Otherwise, w[i] = 0. n is the total number of features.
The wrappers aim to find the optimal vector w* that minimizes the
prediction error of the corresponding model, i.e.,
$$w^* = \arg\min_{w} \mathrm{Error}_w$$
where Error_w represents the generalization error of the model built
using the set of features described by w.
As the number of possible values of w grows exponentially with the
total number of features n, exhaustive search is too computationally
expensive for large n. Heuristic search strategies are therefore typically
used.
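For illustration only, here is a minimal MATLAB sketch of the exhaustive search
over all 2^n − 1 nonempty feature subsets; evalError is a hypothetical
placeholder for training and validating a model on the selected features.

    n = 4;                                   % total number of features
    evalError = @(w) rand();                 % hypothetical error estimate

    bestError = inf;
    bestW = false(1, n);
    for code = 1:(2^n - 1)                   % enumerate all nonempty subsets
        w = bitget(code, 1:n) == 1;          % binary indicator vector w
        err = evalError(w);
        if err < bestError
            bestError = err;
            bestW = w;
        end
    end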
Three typical greedy search strategies:
• Forward selection: The forward selection approach starts from
an empty set and adds features one by one. In each iteration, the
feature to be added into the set is the one, when added, that leads
to the greatest performance improvement (i.e., lowest error). The
iteration stops when no or little improvement is achieved.
• Backward selection: In contrast to the forward selection approach, the backward selection approach starts from the full set
and progressively removes features from the set. Similarly, in each
iteration, the feature to be removed is the one, when removed, that
leads to the greatest performance improvement.
• Stepwise selection: This approach combines the forward and
backward selection approaches. At each iteration, it allows adding
or removing a feature.
As an example, we provide the procedure of the forward selection
approach as follows:
1. Initialize the set of features as F = ∅
2. At each iteration,
• find the best new feature
$$x_j = \arg\min_{i} \mathrm{Error}(F \cup x_i)$$
• if Error(F ∪ x_j) < Error(F), set F = F ∪ x_j
• otherwise, stop the iteration

2.3  Embedded Methods

Embedded methods perform feature selection while the predictive model
is being created. Such methods are usually specific to given learning
algorithms. The most common type of embedded feature selection methods
are regularization methods, such as ridge regression and LASSO. The
regularization methods introduce constraints on the model coefficients
and penalize large coefficients to bias the model toward simpler models
with fewer coefficients.

2.3.1  Ridge Regression

Let's briefly summarize ridge regression here. The least-squares solution
of the ridge regression is
$$\hat{\beta} = \arg\min_{b}\left\{\sum_{i=1}^{N}(y_i - x_i^T b)^2 + \lambda\sum_{j=1}^{p} b_j^2\right\}
= \arg\min_{b}\left\{(Y - Xb)^T(Y - Xb) + \lambda b^T b\right\}
= (X^T X + \lambda I)^{-1} X^T Y$$
where λ > 0 is a tuning parameter that controls the level of penalty.
A larger λ leads to a simpler model.
We can also formulate the ridge regression as follows
$$\hat{\beta} = \arg\min_{b}\left\{\sum_{i=1}^{N}(y_i - x_i^T b)^2\right\}
\quad \text{subject to} \quad \sum_{j=1}^{p} b_j^2 \le L$$
where L is a constant related to λ.
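A minimal MATLAB sketch of the closed-form ridge solution given above; the data
matrix X (N-by-p) and output Y (N-by-1) below are made up for the example.

    % hypothetical data: N samples, p features
    N = 50; p = 4;
    X = randn(N, p);
    Y = X * [1; 0; -2; 0.5] + 0.1*randn(N, 1);

    lambda = 0.1;                                  % penalty level
    betaRidge = (X'*X + lambda*eye(p)) \ (X'*Y);   % (X'X + lambda*I)^(-1) X'Y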
2.3.2  LASSO
The Least Absolute Shrinkage and Selection Operator (LASSO) estimate is
given by
$$\hat{\beta} = \arg\min_{b}\left\{\sum_{i=1}^{N}(y_i - x_i^T b)^2\right\}
\quad \text{subject to} \quad \sum_{j=1}^{p} |b_j| \le L$$
The 1-norm penalty imposes a stronger constraint on the coefficients.
As the problem no longer has a closed-form solution, a quadratic
programming algorithm needs to be applied to find the solution.
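A minimal sketch, assuming the lasso function from MATLAB's Statistics and
Machine Learning Toolbox is available (it solves the equivalent penalized form
of the problem rather than the constrained form written above); X and Y are the
hypothetical data from the ridge sketch.

    lambdaPen = 0.1;                                  % penalty level
    [B, FitInfo] = lasso(X, Y, 'Lambda', lambdaPen);  % B holds the LASSO coefficients
    intercept = FitInfo.Intercept;                    % intercept is fitted separately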
3  Feature Extraction

The feature extraction approaches aim to convert the original p feature
vectors x_i, i = 1, 2, ..., p into k < p new vectors z_j, j = 1, 2, ..., k.
The most widely used feature extraction method is principal component
analysis (PCA), which is a linear projection method, i.e., each new feature
vector z_j is a linear combination of the original feature vectors x_i.

3.1  Preliminary Knowledge

3.1.1  Projection

The vector projection of a vector x on a nonzero vector w is a vector
parallel to w, i.e.,
$$z = \tilde{x}\,\frac{w}{\|w\|}$$
where
$$\tilde{x} = \|x\|\cos\theta = \frac{x \cdot w}{\|w\|}$$
is the length of the projection. If w is a unit vector, i.e., ‖w‖ = 1, the
length of the projection equals x̃ = x · w = w^T x.

3.1.2  Constrained Optimization using the Lagrange Multiplier

The Lagrange multiplier is frequently used to find the local maxima and
minima of a function subject to equality constraints. Consider an
optimization problem formulated as follows:
$$\text{maximize } f(x, y) \quad \text{subject to } g(x, y) = 0$$
where f and g have continuous partial derivatives. The Lagrange function
is then defined by
$$L(x, y, \lambda) = f(x, y) - \lambda\, g(x, y)$$
where λ is called the Lagrange multiplier. The maxima or minima are found
by taking the partial derivatives of L with respect to each variable and
setting the derivatives to zero.

3.1.3  Eigenvalues and Eigenvectors

Let M be a square matrix, λ a constant, and v a nonzero unit column vector
with the same number of rows as M. Then λ is an eigenvalue of M and v is
the corresponding eigenvector of M if
$$Mv = \lambda v$$
The above equation tells us that (M − λI)v = 0. For a nonzero vector v,
this holds when the determinant of M − λI is 0. The determinant of a 2 × 2
matrix is
$$\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$

Example 3 Find the eigenvalues and eigenvectors of the matrix
$$A = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}.$$

Example 4 Find the eigenvalues and eigenvectors of the matrix
$$A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}.$$

3.2  Principal Component Analysis (PCA)

The key idea of PCA is to find a mapping from inputs in the original
high-dimensional space to a new low-dimensional space, so that the loss
of information is minimized. This is achieved by maximizing the variance
of the inputs in the low-dimensional space.

Consider a random vector x = [x_1, x_2, ..., x_p]^T with mean μ and
covariance matrix Σ = E[(x − μ)(x − μ)^T]. PCA tries to find, in order,
the most informative k < p linear combinations of the variables
x_1, x_2, ..., x_p, which are called principal components and are denoted
here as w_1, w_2, ..., w_k. Information is interpreted as a percentage of
the total variation in Σ.

For instance, w_1 is the first principal component, such that the sample,
after projection onto w_1, is most spread out, i.e., has the highest
variance. Note that if w_1 is a unit vector with ‖w_1‖ = 1, the length of
the projection of x on the direction of w_1 is
$$z_1 = w_1^T x$$
We then want to maximize the variance of z_1, given by
$$V(z_1) = w_1^T \Sigma w_1$$

To maximize V(z_1) subject to ‖w_1‖ = 1, i.e., w_1^T w_1 = 1, we formulate
the problem as a Lagrange function
$$L(w_1, \lambda) = w_1^T \Sigma w_1 - \lambda(w_1^T w_1 - 1)$$
and then take the derivative of L(w_1, λ) with respect to w_1 and set it
to 0, i.e.,
$$\nabla_{w_1} L(w_1, \lambda) = 2\Sigma w_1 - 2\lambda w_1 = 0$$
Therefore
$$\Sigma w_1 = \lambda w_1$$
which holds when w_1 is an eigenvector of Σ and λ is the corresponding
eigenvalue. To maximize V(z_1), we then have
$$\max\, w_1^T \Sigma w_1 = \max\, \lambda w_1^T w_1 = \max\, \lambda$$
which indicates that λ should be the largest eigenvalue, and thus w_1
should be the corresponding eigenvector.

The second principal component w_2 can be found in a similar way; it
should also maximize the variance, be orthogonal to w_1, i.e.,
w_2^T w_1 = 0, and have unit length.
In particular, the projection of x on the direction of w_2 is
z_2 = w_2^T x, and we want to maximize its variance
$$V(z_2) = w_2^T \Sigma w_2$$
subject to the constraints
$$w_2^T w_1 = 0, \qquad w_2^T w_2 = 1$$
Formulated as a Lagrange problem, we have
$$L(w_2, \lambda, \alpha) = w_2^T \Sigma w_2 - \lambda(w_2^T w_2 - 1) - \alpha(w_2^T w_1 - 0)$$
Taking the derivative with respect to w_2 and setting it to 0, we have
$$2\Sigma w_2 - 2\lambda w_2 - \alpha w_1 = 0 \qquad (1)$$
Multiplying by w_1^T, we have
$$2 w_1^T \Sigma w_2 - 2\lambda w_1^T w_2 - \alpha w_1^T w_1 = 0$$
Note that w_1^T w_2 = 0 and w_1^T Σ w_2 = w_2^T Σ w_1 (w_1^T Σ w_2 is a
scalar). Since w_1 is an eigenvector, we have Σ w_1 = λ_1 w_1, so
$$w_1^T \Sigma w_2 = w_2^T \Sigma w_1 = \lambda_1 w_2^T w_1 = 0$$
Therefore, α = 0, and Equation (1) becomes
$$2\Sigma w_2 - 2\lambda w_2 = 0$$
so Σ w_2 = λ w_2, which implies that w_2 should also be an eigenvector of
the matrix Σ, with λ the corresponding eigenvalue. To maximize
$$V(z_2) = w_2^T \Sigma w_2 = \lambda w_2^T w_2 = \lambda,$$
λ should be the second largest eigenvalue (the largest eigenvalue
corresponds to w_1), and w_2 is the corresponding eigenvector.

The third and further principal components are computed similarly. Note
that since Σ is symmetric, the eigenvectors corresponding to two different
eigenvalues are orthogonal.

3.2.1  Interpretation of PCA

In PCA, we know that w_1, named the first principal component, is the
eigenvector corresponding to the largest eigenvalue. It explains the
largest part of the variance; the second explains the second largest, and
so on.

Define z as the random vector after performing PCA on x, i.e.,
$$z = W^T (x - \bar{x})$$
where x̄ is the sample mean. The i-th column of W ∈ R^{p×k} is the
eigenvector of the matrix Σ corresponding to the i-th largest eigenvalue.
This linear transformation projects x of dimension p onto a k-dimensional
space. The k dimensions are defined by the eigenvectors, and the variances
over these dimensions equal the eigenvalues.

3.2.2  Selecting the Value of k

We want to select the top k largest eigenvalues that explain most of the
variance. To measure how well the top k eigenvalues explain the variance,
we define the proportion of variance metric as follows:
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}
{\lambda_1 + \lambda_2 + \cdots + \lambda_k + \cdots + \lambda_p} \qquad (2)$$
We select k so that the proportion of variance exceeds a threshold.

3.2.3  Procedure of PCA

We here summarize the procedure to perform PCA.

1. Compute the estimated covariance matrix of x using
$$S = E\left[(X - \bar{X})^T (X - \bar{X})\right] =
\begin{bmatrix}
\mathrm{cov}(x_1, x_1) & \mathrm{cov}(x_1, x_2) & \ldots & \mathrm{cov}(x_1, x_p) \\
\mathrm{cov}(x_2, x_1) & \mathrm{cov}(x_2, x_2) & \ldots & \mathrm{cov}(x_2, x_p) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(x_p, x_1) & \mathrm{cov}(x_p, x_2) & \ldots & \mathrm{cov}(x_p, x_p)
\end{bmatrix}$$
where X ∈ R^{N×p} is the matrix formed by the sample data of size N, X̄ is
the sample mean, and, for example,
$$\mathrm{cov}(x_1, x_1) = \frac{\sum_{i=1}^{N}(X_{i1} - \bar{X}_1)(X_{i1} - \bar{X}_1)}{N-1},
\qquad
\mathrm{cov}(x_1, x_2) = \frac{\sum_{i=1}^{N}(X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{N-1}$$
where X_{ij} is the j-th feature of the i-th data point and X̄_j is the
sample mean of the j-th feature.

2. Compute the eigenvalues and corresponding eigenvectors of S, and sort
the eigenvalues in descending order.

3. Starting from k = 1, calculate the proportion of variance using
Equation (2) with increasing values of k. Stop the iteration when the
proportion exceeds a pre-determined threshold (0.9 is typically used).

4. Transform X to the k-dimensional space using Z = (W^T X^T)^T = XW,
where Z ∈ R^{N×k} and W = [w_1, w_2, ..., w_k].

Example 5 Find the principal components of the following dataset, and then
project the dataset onto the directions of the principal components.
$$X = (x_1, x_2) = \{(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)\}$$
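For reference, here is a minimal MATLAB sketch that applies the procedure above
to the Example 5 dataset; the data are centered before projection, matching the
interpretation z = W^T(x − x̄), and a recent MATLAB release with implicit
expansion is assumed.

    % Example 5 dataset: each row is a data point (x1, x2)
    X = [1 2; 3 3; 3 5; 5 4; 5 6; 6 5; 8 7; 9 8];

    S = cov(X);                                 % sample covariance matrix
    [V, D] = eig(S);                            % columns of V are eigenvectors
    [lambda, idx] = sort(diag(D), 'descend');   % eigenvalues, largest first
    W = V(:, idx);                              % principal component directions

    propVar = cumsum(lambda) / sum(lambda);     % proportion of variance, Eq. (2)

    Xc = X - mean(X, 1);                        % center the data
    Z  = Xc * W;                                % project onto the principal components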
