
INFO370 Problem Set: Trees, matrices
January 22, 2024
Instructions
1 Which firm goes out of business? (40pt)
The first question is a basic ML task: use decision trees to model whether a firm goes bankrupt,
and tune the hyperparameters to get the best model. We use Taiwanese bankruptcy data (tw-bankruptcy.csv.bz2).
It contains a large number of features; see my data repo for the list of those and (incomplete) explanations.
We recommend using the sklearn library for the tasks here.
1.1 Prepare (8pt)
First, let’s prepare data and understand what it is.
1. (1pt) Load data. Check that it looks good. What is its dimension?
2. (1pt) What are the data types? Are there any non-numeric variables?
3. (1pt) Are there any missing values?
4. (1pt) Create the design matrix X and outcome vector y.
5. (1pt) Split these into training/validation parts (80/20).
6. (3pt) What is the accuracy of the naive estimator that predicts the majority category for every
case? (See Lecture Notes, Exercise 4.2, p. ≈ 195.)
Compute this on validation data.
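The naive baseline can be sketched as below; this is a minimal example on toy labels, assuming the outcome vectors from your 80/20 split are called y_train and y_valid (the names are mine, not required by the assignment):

```python
# Minimal sketch of the naive (majority-class) baseline, using toy stand-in
# labels; replace y_train / y_valid with your real outcome vectors.
import numpy as np

y_train = np.array([0, 0, 0, 1, 0, 1, 0, 0])  # toy training labels
y_valid = np.array([0, 1, 0, 0])              # toy validation labels

majority = np.bincount(y_train).argmax()      # most frequent class in training data
accuracy = np.mean(y_valid == majority)       # share of validation cases that match it
print(accuracy)                               # 0.75 on this toy data
```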
1.2 Logistic regression (15pt)
First, let’s do logistic regression and see how it compares to the naive model.
1. (4pt) What is the dimensionality of the feature space of this model?
2. (4pt) How does the logistic regression’s decision boundary look in the feature space?
3. (5pt) What do you think: given the values of the features, how much uncertainty is there about
whether the firm goes bankrupt or not?
4. (1pt) Fit a logistic regression model (on training data). Compute accuracy on validation data.
5. (1pt) How well does logistic regression perform, compared to the naive model above?
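Steps 4 and 5 can be sketched roughly as follows. This uses randomly generated toy data so it runs stand-alone; with the real data you would use the X/y split from section 1.1 instead:

```python
# Sketch of fitting logistic regression and computing validation accuracy.
# The toy X and y below stand in for the real bankruptcy data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                             # toy features
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=1)                  # 80/20 split

m = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = m.score(X_valid, y_valid)                           # validation accuracy
print(acc)
```

Comparing `acc` with the naive baseline accuracy answers step 5.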
1.3 Decision trees (17pt)
Now it is time to see if trees work any better. First, let’s just fit a tree and see how well it works, and
thereafter tune the max depth. As above, always compute both training and validation accuracy.
1. (3pt) What do you expect: how might the decision boundary look in this feature space when
using decision trees?
2. (2pt) Explain what the maximum depth parameter does. Do large or small values of maximum
depth cause overfitting?
Hint: check out sklearn’s documentation.
3. (2pt) Run a series of decision tree models of different maximum depth in a loop. Start with a small
depth, and increase it until the model is clearly in overfitting territory. Each time
store both validation and training accuracy.
4. (3pt) Make a plot showing how both training and validation accuracy depend on maximum
depth. Try to make the graph so that the differences are easily visible.
5. (2pt) What is the best validation accuracy you get? What is the corresponding max depth?
6. (5pt) Discuss your findings: can you see where the model is overfitting? What is the optimal depth?
Do trees work better than logistic regression?
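The depth-tuning loop in steps 3–5 can be sketched like this; the toy non-linear data is a stand-in for the real split, and the variable names are my own:

```python
# Sketch of tuning max_depth in a loop, storing training and validation
# accuracy each round. Toy data stands in for the real bankruptcy split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # non-linear toy problem
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=2)

depths = list(range(1, 21))
train_acc, valid_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0)
    tree.fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))
    valid_acc.append(tree.score(X_valid, y_valid))

best_depth = depths[int(np.argmax(valid_acc))]  # depth with best validation accuracy
print(best_depth, max(valid_acc))
```

Plotting `train_acc` and `valid_acc` against `depths` gives the figure asked for in step 4; the gap between the two curves at large depths is the overfitting to discuss in step 6.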
2 Skin color (30pt)
Here we use a skin tone dataset skin-nonskin.csv. It contains a large number of colors (as R, G, B),
and a label for skin/non-skin tone (“1” = skin, “2” = non-skin). Here is an example of colors and the
corresponding labels:
166,109,54: 1
170,200,200: 2
6,32,31: 2
182,123,67: 1
247,186,165: 1
65,113,115: 2
90,139,133: 2
211,155,118: 1
210,153,110: 1
69,127,128: 2
125,171,171: 2
198,138,101: 1
138,81,62: 1
50,59,58: 2
135,180,183: 2
249,186,145: 1
126,88,77: 1
88,138,137: 2
0,255,0: 2
127,82,76: 1
226,172,138: 1
158,196,197: 2
153,194,198: 2
233,184,169: 1
233,181,167: 1
162,197,199: 2
0,3,4: 2
139,93,44: 1
129,89,81: 1
147,167,166: 2
164,198,200: 2
220,146,99: 1
225,172,158: 1
126,174,176: 2
131,175,178: 2
240,180,156: 1
186,117,52: 1
2,0,1: 2
250,250,248: 2
199,127,87: 1
2.1 Explore and prepare (8pt)
1. (1pt) Load data. Find:
(a) number of rows/columns
(b) print a few lines of data
(c) does the dataset contain any missing values?
(d) what are the possible labels?
2. (3pt) Note that the feature space here is essentially a color space; you have probably seen
color selectors (for example, the one in GIMP) that work on such a space.
(a) What is the dimensionality of the feature space?
(b) What do you think: what shape might the subset of feasible skin tones have in this feature
space?
(c) What do you think: given the R, G, B values, is there any uncertainty about whether the
given tone is a possible skin tone?
3. (1pt) Create the design matrix X (the R, G, B values) and the outcome vector y (the labels).
4. (1pt) Split both of the above into training and validation sets (80% for training, 20% for validation).
5. (2pt) What is the accuracy of the naive estimator that predicts the majority category for every
case (on validation data)?
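The preparation steps can be sketched as below. The column names are an assumption on my part (the actual file may use different headers or none at all), and the tiny inline frame stands in for `pd.read_csv("skin-nonskin.csv")`:

```python
# Sketch of loading and preparing the skin data. The toy DataFrame stands in
# for pd.read_csv("skin-nonskin.csv"); column names here are assumptions.
import pandas as pd

df = pd.DataFrame({"R": [166, 170, 6], "G": [109, 200, 32],
                   "B": [54, 200, 31], "label": [1, 2, 2]})

print(df.shape)                      # number of rows / columns
print(df.head())                     # a few lines of data
print(df.isna().sum().sum())         # total count of missing values
print(sorted(df["label"].unique()))  # the possible labels

X = df[["R", "G", "B"]].to_numpy()   # design matrix
y = df["label"].to_numpy()           # outcome vector
```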
2.2 Logistic regression (4pt)
As above, let’s start using logistic regression.
1. (1pt) Fit a logistic regression model. Compute accuracy on validation data.
2. (1pt) How well does logistic regression perform, compared to the naive model above?
3. (3pt) What does the decision boundary look like in this feature space?
2.3 Decision trees: maximum depth (18pt)
Now let’s do decision trees again and tune the max depth parameter.
1. (4pt) What do you expect: how might the decision boundary look in this feature space when
using decision trees? Explain it in words (or make a quick sketch).
2. (2pt) Run a series of decision tree models of different maximum depth in a loop. Start with a small
depth, and increase it into a large one, until the model does not change any more. Each time store
both validation and training accuracy.
If the loop runs too slowly, you may not want to test every parameter value. For instance, you
may go from depth 1 to 100, but only test every 5th or 10th value.
3. (3pt) Make a plot showing how both training and validation accuracy depend on the maximum depth. A plain plot of accuracy may not be easy to read. You may try plotting
log(1 − A) instead of just A, or use some other tricks. But state clearly what tricks you use!
4. (1pt) What is the best validation accuracy you get? What is the corresponding max depth?
5. (3pt) Discuss your findings: can you spot overfitting? Why do you think the model
behaves as it does? What is the optimal depth? Are the results better than with logistic
regression?
6. (5pt) How do you explain the fact that on the bankruptcy data, the validation accuracy peaked at
depth 5 or so and fell thereafter, while on the skin color data the accuracy improves and stays
constant from depth 30 or so onward?
Hint: think about a 2-D case where one dataset is noisy and the other is not. What would good
decision boundaries look like in these cases?
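The log(1 − A) trick from step 3 can be sketched as below: when accuracies bunch up near 1, plotting the log of the error rate spreads the differences out. The accuracy values here are made up purely for illustration, not results you should expect:

```python
# Sketch of the log(1 - A) plotting trick. The valid_acc values below are
# invented for illustration; use your own stored accuracies instead.
import numpy as np
import matplotlib
matplotlib.use("Agg")                # non-interactive backend, saves to file
import matplotlib.pyplot as plt

depths = np.arange(1, 41, 5)
valid_acc = np.array([0.9, 0.95, 0.98, 0.992, 0.997, 0.998, 0.998, 0.998])

log_err = np.log10(1 - valid_acc)    # log error rate: more negative = better

fig, ax = plt.subplots()
ax.plot(depths, log_err, marker="o", label="validation")
ax.set_xlabel("max depth")
ax.set_ylabel("log10(1 - accuracy)")
ax.legend()
fig.savefig("depth-accuracy.png")
```

Differences between, say, 0.997 and 0.998 are invisible on a plain accuracy axis but clearly separated on the log-error axis.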
3 Ensemble estimator (30pt)
Your final task is to create an ensemble estimator yourself. You should include logistic regression, k-NN,
and trees in this ensemble. We use the same bankruptcy dataset as in Question 1.
3.1 Prepare (2pt)
1. (2pt) Load data, create the design matrix and outcome vector, and split these into training and
validation chunks (80/20).
You already did all that in Question 1.1.
3.2 Baseline models (6pt)
Now it is time to do baseline models. Use Logistic, k-NN, and trees. For all these models, pick some kind
of parameters (defaults are fine), but use the same parameters below in Q 3.3.
1. (2pt) Use logistic regression to fit the model and compute accuracy on the validation data.
2. (2pt) Repeat with k-NN. You have to pick a reasonable k.
3. (2pt) Repeat with decision trees. Pick a reasonable depth.
3.3 Ensemble (22pt)
Now it is time to create an ensemble estimator. It should contain a fit method and a score method (let’s
implement these just as functions); you can call them something else.
The fit function should take training data Xt and yt. It should then fit all three baseline
models and return a tuple containing these three models.
The score function should take the tuple of three models, the validation matrix Xv, and the validation
outcome yv as arguments. It should
• for each of the three models, predict the outcome using Xv;
• for each case in the validation data, pick the majority value of the three predicted values (this is the
ensemble prediction);
• compute accuracy;
• return the average accuracy over the validation data.
1. (3pt) Create such a fit function.
2. (4pt) Create such a score function.
Hint: in this task, you can compute the majority category in a very simple way. As we are predicting just
“0” or “1” here, you can add up all the values and check whether the sum is below 2, or 2 or more. Something
along the lines of:
yhat = (yhat1 + yhat2 + yhat3) > 1.5
where yhat1 is your predicted value from the first model.
3. (3pt) Explain why the formula above works for our 3-model ensemble. Why is 1.5 a good threshold?
4. (3pt) Compute accuracy with your ensemble score estimator.
5. (3pt) Now add a few more estimators to your ensemble. You can include k-NN with different k
and trees with different parameters; you can also use other models, such as SVMs.
6. (3pt) Comment on your results: does the ensemble estimator give you better results than the individual
models? Does adding additional models to your ensemble improve the results?
7. (3pt) How does your ensemble score compare to the baseline scores in Question 3.2? Does it improve
if you add more models?
Note: if you did a new training-validation split, the comparison with results in Question 1 may be
off.
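The fit/score pair described above could be sketched as follows; the function names `ens_fit`/`ens_score`, the toy data, and the particular hyperparameters are my own choices, not part of the assignment:

```python
# Sketch of an ensemble fit/score pair with majority voting over three models.
# Toy 0/1 data stands in for the real bankruptcy split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def ens_fit(Xt, yt):
    """Fit all three baseline models; return them as a tuple."""
    return (LogisticRegression(max_iter=1000).fit(Xt, yt),
            KNeighborsClassifier(n_neighbors=5).fit(Xt, yt),
            DecisionTreeClassifier(max_depth=5, random_state=0).fit(Xt, yt))

def ens_score(models, Xv, yv):
    """Majority vote of the three 0/1 predictions; return accuracy."""
    m1, m2, m3 = models
    yhat = (m1.predict(Xv) + m2.predict(Xv) + m3.predict(Xv)) > 1.5
    return np.mean(yhat.astype(int) == yv)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)           # toy 0/1 outcome
models = ens_fit(X[:160], y[:160])      # "train" on the first 160 cases
acc = ens_score(models, X[160:], y[160:])
print(acc)
```

Since each prediction is 0 or 1, the sum of three predictions is at least 2 exactly when at least two models predict 1, so the 1.5 threshold implements majority voting.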
Finally…
How much time did you spend on this PS?
