Yeabin Moon, Ph.D.
BUS 212A-2 Spring 2023
April 25, 2023
Homework 5
Submission Instructions
1) You have to use Jupyter Notebook
2) Click the Save button at the top of the Jupyter Notebook.
3) Select Cell → All Output → Clear. This will clear all the outputs from all cells (but will keep the
content of all cells).
4) Select Cell → Run All. This will run all the cells in order, and will take several minutes.
5) Once you’ve rerun everything, select File → Download as → PDF via LaTeX. (If you have trouble
with “PDF via LaTeX”, you can also save the webpage as a PDF. Make sure all your solutions,
especially the coding parts, are displayed in the PDF; it’s okay if the provided code gets cut off
because lines are not wrapped in code cells.)
6) Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is
the only thing your graders will see!
7) Submit your PDF on Latte.
Question 1. Is a node’s Gini impurity generally lower or higher than its parent’s?
Question 2. Should we try scaling the input features if a decision tree is underfitting the training set?
Question 3. Train and fine-tune a decision tree for the moons dataset by following
1) Use the following code to generate a moons dataset
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=10000, noise=0.4)
2) Use train_test_split() to split the dataset into a training set and a test set
3) Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. (Try various values for max_leaf_nodes)
4) Train it on the full training set using these hyperparameters, and measure your model’s performance
on the test set. You should get roughly 85% to 87% accuracy.
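As a starting point, the four steps above can be sketched as a single pipeline. The search space below (max_leaf_nodes from 2 to 99) and the random_state values are illustrative assumptions, not required settings; you are expected to experiment with the hyperparameters yourself.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Step 1: generate the moons dataset
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# Step 2: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 3: grid search over max_leaf_nodes with 3-fold cross-validation
params = {"max_leaf_nodes": list(range(2, 100))}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
grid.fit(X_train, y_train)

# Step 4: GridSearchCV refits the best model on the full training set
# by default, so we can evaluate it on the test set directly
test_acc = grid.score(X_test, y_test)
print(grid.best_params_, test_acc)
```

With noise=0.4 the test accuracy should land roughly in the 85%–87% range stated above.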
Question 4. Load the MNIST dataset, and split it into a training set, a validation set, and a test set
(e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various
classifiers, such as a random forest classifier, an extra-trees classifier, and an SVM classifier. Next, try
to combine them into an ensemble that outperforms each individual classifier on the validation set, using
soft or hard voting. Once you have found one, try it on the test set. How much better does it perform
compared to the individual classifiers?
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
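The ensemble workflow can be sketched as follows. To keep the sketch quick to run, it uses scikit-learn's small built-in digits dataset as a stand-in for MNIST (an assumption for illustration only; your submission must use the full mnist_784 data and the 50,000/10,000/10,000 split). It uses hard voting; for soft voting the SVC would need probability=True.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              VotingClassifier)
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in dataset so the sketch runs quickly (use mnist_784 for the homework)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Combine the three individual classifiers with hard (majority) voting
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("et", ExtraTreesClassifier(random_state=42)),
        ("svm", SVC(random_state=42)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
ens_acc = ensemble.score(X_test, y_test)
print(ens_acc)
```

For the homework, compare ens_acc against each fitted estimator's individual score on the validation set before touching the test set.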
Question 5. (graded by your effort) We now know that random forests have a similar level of bias and a
lower level of variance compared to decision trees. Does this statement still hold true when we use sampling
without replacement for the random forest?