Info371: Tree Lab
January 22, 2024
Introduction
This lab asks you to experiment with regression and classification trees and to find the best combination of
hyperparameters. We use Wisconsin Diagnostic Breast Cancer (WDBC) data for classification and
Boston housing data for regression. The two tasks are fairly similar.
The lab has two aims:
1. to give you some experience with trees and related hyperparameters;
2. to refresh the basic ML workflow for model selection.
1 Classification
In this task you work with WDBC data. As a reminder, your task is to predict diagnosis (“M” = cancer,
“B” = no cancer).
1.1 Prepare
1. Load wdbc data and ensure it looks good.
2. Create your design matrix X and outcome vector y. The former should contain all 30 features,
i.e. everything except diagnosis and id. The latter should be diagnosis, converted to either a logical or
a numeric variable (otherwise sklearn will fail).
3. Split your data into training and validation chunks (or do cross validation below, but that is slower).
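The preparation steps above can be sketched as follows. This is only an illustration under an assumption: it uses the copy of WDBC bundled with scikit-learn (`load_breast_cancer`) as a stand-in for the course data file, so there are no id or diagnosis columns to drop, and the diagnosis is already numeric. The 80/20 split and the `random_state` value are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: sklearn's bundled WDBC copy stands in for the course file.
# With the course file you would instead drop the id and diagnosis columns.
X, target = load_breast_cancer(return_X_y=True)

# sklearn codes malignant as 0 and benign as 1; recode so that 1 = "M" (cancer).
y = (target == 0).astype(int)

# Quick sanity checks: dimensions and share of malignant cases.
print(X.shape)
print(y.mean())

# Hold out 20% of the rows for validation (arbitrary choice).
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)
```

With the course file, the conversion in step 2 would be something along the lines of comparing the diagnosis column to "M".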
1.2 Tune the model
Now everything should be ready for a few classification trees. Your task is to analyze the effect of
two hyperparameters of DecisionTreeClassifier: max_depth and min_samples_split. Both of these
hyperparameters can be used to avoid overfitting.
1. Explain what these hyperparameters do.
2. Fit a decision tree (on training data), and compute accuracy (on validation data). Use a combination
of both hyperparameters when defining the model. As a refresher, you can create it along these
lines:
m = DecisionTreeClassifier(max_depth=7, min_samples_split=…, …)
and you can compute accuracy on validation data as
m.score(Xv, yv)
where Xv and yv are your validation X and y.
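As a rough illustration of how the two hyperparameters constrain a tree, the sketch below fits one unrestricted and one restricted tree and reports their size and accuracy. It assumes scikit-learn's bundled WDBC copy as a stand-in for the course file; the values `max_depth=3` and `min_samples_split=20` are arbitrary examples, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled WDBC copy stands in for the course file.
X, target = load_breast_cancer(return_X_y=True)
y = (target == 0).astype(int)  # 1 = malignant
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)

# A tree with default settings grows until its leaves are pure.
unrestricted = DecisionTreeClassifier(random_state=1).fit(Xt, yt)

# max_depth caps the number of levels; min_samples_split forbids splitting
# nodes that contain fewer than the given number of observations.
restricted = DecisionTreeClassifier(
    max_depth=3, min_samples_split=20, random_state=1
).fit(Xt, yt)

for name, m in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(name,
          "depth:", m.get_depth(),
          "leaves:", m.get_n_leaves(),
          "train acc:", m.score(Xt, yt),
          "valid acc:", m.score(Xv, yv))
```

Comparing training and validation accuracy for the two trees is one way to see whether the unrestricted tree overfits.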
Now it is time to do a more thorough search through hyperparameters by performing a 2-D grid
search.
3. Write a nested loop where the outer loop runs over max depth and the inner loop runs over min
samples split. Use a meaningful set of values for each of these. For instance, I am using:
depths = range(1,6)
splits = [2,5,10,20,50,100]
Inside the loop, define a decision tree classifier using these parameters, fit it on training data, and
compute accuracy on validation data. Essentially you repeat question 1.2.2, just inside the loop.
I recommend you start with only a few combinations (e.g. three values for each parameter) to speed up your
work. If you still have time left over, you can search over a wider range at the end.
4. Find the best accuracy and the corresponding hyperparameter combination your loop can detect.
You can just check inside the innermost loop if the current accuracy is better than the previous best
accuracy.
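The grid search described above might look like the following minimal sketch. It again assumes scikit-learn's bundled WDBC copy in place of the course file; the `results` dictionary and the variable names `best_acc` and `best_params` are illustrative choices, not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: sklearn's bundled WDBC copy stands in for the course file.
X, target = load_breast_cancer(return_X_y=True)
y = (target == 0).astype(int)  # 1 = malignant
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)

depths = range(1, 6)
splits = [2, 5, 10, 20, 50, 100]

best_acc, best_params = -1.0, None
results = {}  # (depth, split) -> validation accuracy

for depth in depths:            # outer loop: max_depth
    for split in splits:        # inner loop: min_samples_split
        m = DecisionTreeClassifier(
            max_depth=depth, min_samples_split=split, random_state=1
        )
        m.fit(Xt, yt)
        acc = m.score(Xv, yv)
        results[(depth, split)] = acc
        # Strict ">" keeps the first combination that reaches the best accuracy.
        if acc > best_acc:
            best_acc, best_params = acc, (depth, split)

print("best accuracy:", best_acc)
print("best (max_depth, min_samples_split):", best_params)
```

Several combinations often tie for the best validation accuracy, so the reported pair depends on the loop order and on the particular train/validation split.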
2 Boston housing: regression tree
This task is very similar to the previous one, except that you fit a regression model instead of a classification
model. You can therefore copy most of your code and modify it slightly. We use Boston housing
data and predict the median value (medv) using all other attributes. Instead of accuracy, we are now using
RMSE, and instead of comparing the result with logistic regression, we compare it with linear regression.
1. Load boston data and ensure it looks good.
2. Create your design matrix X and outcome vector y. The former should contain all features, except
medv, and the latter is medv.
3. Split your data into training and validation chunks (or do cross validation below, but that is slower).
4. Fit a regression tree (on training data), and compute RMSE (on validation data). Use a combination
of the same hyperparameters when defining the model.
As a refresher, RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \tag{1}$$
5. Write a similar nested loop over both hyperparameters.
Inside the loop, define a regression tree (DecisionTreeRegressor) using these parameters, fit it on training data,
and compute RMSE on validation data. Essentially you repeat question 2.4, just inside the loop.
6. Find the best (lowest) RMSE and the corresponding hyperparameter combination your loop can detect.
You can just check inside the innermost loop if the current RMSE is lower than the previous best
RMSE.
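The regression steps can be sketched in the same way. Because the location and format of the Boston file are not given here, the example below uses synthetic data from `make_regression` with the same dimensions (506 rows, 13 features) purely as a stand-in; with the real data, X would hold all columns except medv and y would be medv. The helper `rmse` is a hypothetical name that implements the RMSE definition directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Assumption: synthetic stand-in data, NOT the Boston housing data.
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)
Xt, Xv, yt, yv = train_test_split(X, y, test_size=0.2, random_state=1)

depths = range(1, 6)
splits = [2, 5, 10, 20, 50, 100]

best_rmse, best_params = float("inf"), None
results = {}  # (depth, split) -> validation RMSE

for depth in depths:
    for split in splits:
        m = DecisionTreeRegressor(
            max_depth=depth, min_samples_split=split, random_state=1
        )
        m.fit(Xt, yt)
        err = rmse(yv, m.predict(Xv))
        results[(depth, split)] = err
        # Lower RMSE is better, so the comparison is reversed relative to accuracy.
        if err < best_rmse:
            best_rmse, best_params = err, (depth, split)

print("best RMSE:", best_rmse)
print("best (max_depth, min_samples_split):", best_params)
```

Note that `m.score` on a regressor returns R², not RMSE, which is why the error is computed from `m.predict(Xv)` here.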