Business trad and big data/ python coding


The assignment is attached. Please accept as soon as possible if you have knowledge of or experience with Python coding.

IS 7935 Assignment 2


Description and Deliverable

In this assignment, we will practice what we learned in classification. Please follow the instructions below and provide the required answers and screenshots. Python code will be submitted as a separate file.

The sections where you enter your answers are marked in blue. Please leave the color unchanged. Because you will submit the Python code as a separate file, no Python code needs to be pasted into this document.

This assignment has three sections. In the first section, we will try to understand the classification task. In our second section, we will use Python to develop and evaluate classification models. In the last section, we will compare and interpret our evaluation results.

Python is required to finish the assignment. Even the best experts run into unexpected bugs that turn out to be time-consuming. Please start early and test early to avoid last-minute bugs that will cost you a late submission penalty.

Please download the following CSV file uploaded together with this instruction:

‘heart.csv’

and put it in your project folder (i.e., the folder where you run your code) before getting started. Please avoid accidentally putting the CSV file in the .venv folder. Putting the data in the .venv folder will lead to an error.
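A quick check can confirm that the script sees the file before any model code runs. The sketch below assumes the script is launched from the project folder; the helper name find_data_file is illustrative and is not part of the course materials.

```python
from pathlib import Path


def find_data_file(filename: str = "heart.csv") -> Path:
    """Return the path to the data file in the current working directory.

    Raises FileNotFoundError with a hint if the file is missing.
    """
    path = Path.cwd() / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {Path.cwd()}. "
            "Place it in the project folder, not inside .venv."
        )
    return path


if __name__ == "__main__":
    try:
        print("Found data file:", find_data_file())
    except FileNotFoundError as err:
        print(err)
```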

The original data is collected from Kaggle. The target of the prediction is to determine whether an individual has heart disease or not. In this dataset, relevant features have already been extracted by experts based on medical knowledge.
For the purpose of this assignment, there is no need to understand the medical terms. We may directly use the features in the dataset (i.e., all columns outside the target column) for prediction.
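Separating the features from the target might look like the sketch below. It assumes pandas is used to read the data (the course script may do this differently), and the four rows with three feature columns are made-up stand-ins so the example runs without the real file.

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real script would call
# pd.read_csv("heart.csv") instead of reading from this string.
SAMPLE_CSV = """age,sex,cholesterol,target
63,1,233,1
37,1,250,1
41,0,204,0
56,1,236,0
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Features: every column except the target. Output: the target column.
X = df.drop(columns=["target"])
y = df["target"]

print(X.shape)     # (4, 3)
print(y.tolist())  # [1, 1, 0, 0]
```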

For your ease of reference, descriptions for columns are copied from Kaggle as follows:

Column Name | Description
age | Age of the person in years
sex | Gender of the person (1 = male, 0 = female)
chest_pain_type | Type of the chest pain (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
resting_bp | Blood pressure while resting (in mm Hg)
cholesterol | A person’s serum cholesterol in mg/dl
fasting_blood_sugar | Whether the blood sugar while fasting is greater than 120 mg/dl (1 = true; 0 = false)
restecg | ECG (electrocardiographic) results while resting (0 = normal; 1 = having ST-T wave abnormality; 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
max_hr | Maximum heart rate achieved
exang | Exercise-induced angina (1 = yes; 0 = no)
oldpeak | ST depression induced by exercise relative to rest
slope | The slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
num_major_vessels | Number of major vessels (0-3) colored by fluoroscopy
thal | Thalassemia (0 = normal, 1 = fixed defect, 2 = reversible defect)
target | The outcome we are interested in predicting (0 = no disease; 1 = disease)

As a hint, please use the classification_practice.py from our course website as a starting point to develop the script for this assignment. Changes will be needed, so please make sure to watch the Python Classification lecture video, so you understand the meaning of code in classification_practice.py.

When you are ready to work on the assignment, please follow the instructions and finish the questions in order.

Deliverable

There are two deliverables for Assignment 2:

· Please follow the instructions and respond in this document. Please name your finished Word document “[FirstName] [LastName] IS 7935 Assignment 2 x”

· Please save the script you used to train and evaluate models as “[FirstName]_[LastName]_assignment2.py” (for example, my submission should be named “Yolanda_Li_assignment2.py”)

Please submit both files to our assignment folder by the deadline. If you make multiple submissions, the last submission will be used for grading.


Part 1 – Understanding the Task (18 Points)

For this first part, let’s try to understand this task. Please answer the following questions.

1.1. Why is this task considered a classification instead of a clustering task?

(3 points)

[Insert your answer here]

1.2. Is this classification task a binary, multi-class, or multi-label classification? Why?

(3 points)

[Insert your answer here]

1.3. Given that 1 represents disease (positive) and 0 represents no disease (negative) in our prediction outcome, please provide intuitive explanations for the following. Please make sure to refer to the context in your interpretation.

(4 points each. 12 points total)

· Accuracy: [Insert your answer here]

· Precision: [Insert your answer here]

· Recall: [Insert your answer here]
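As a reminder of the standard definitions behind these three metrics, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from heart.csv.

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp = 40  # predicted disease, patient actually has disease
fp = 10  # predicted disease, patient actually has no disease
fn = 5   # predicted no disease, patient actually has disease
tn = 45  # predicted no disease, patient actually has no disease

# Share of all predictions that are correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)
# Share of predicted-disease cases that really have disease.
precision = tp / (tp + fp)
# Share of real disease cases that the model catches.
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.850 precision=0.800 recall=0.889
```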

Part 2 – Develop and Evaluate the Model in Python (60 Points)

Please use the classification_practice.py from our course website as a starting point to develop the script for this assignment. Changes will be needed to read in our ‘heart.csv’ file and to specify the column that contains our output (i.e., the target column). Please carefully consider each line in the code to determine whether it is needed at all and whether it requires changes.

Please use the number in your KSU email address to set the random_state. For example, Yolanda’s email address is yli60@kennesaw.edu. She should use 60 as her random_state number wherever applicable. If your email address does not have a number, please use 0 for your random state. Setting the random_state in your code is important. It ensures that we can replicate your results when grading.
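The rule above can be written out as a small helper. This is only a sketch: the function name random_state_from_email is made up for illustration, and simply typing the number into the script by hand works just as well.

```python
import re


def random_state_from_email(email: str) -> int:
    """Return the first number in the part before '@', or 0 if there is none."""
    local_part = email.split("@", 1)[0]
    match = re.search(r"\d+", local_part)
    return int(match.group()) if match else 0


RANDOM_STATE = random_state_from_email("yli60@kennesaw.edu")
print(RANDOM_STATE)  # 60
```

The resulting value would then be passed as random_state to every scikit-learn call that accepts it, such as the train/test split and the classifiers.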

Please separate the train and test set, then train and evaluate three classifiers:

· A Logistic Regression classifier (please name this classifier lg_clf and the predicted results from this classifier as lg_y_pred)

· A Decision Tree classifier (please name this classifier dt_clf and the predicted results from this classifier as dt_y_pred)

· A Random Forest classifier (please name this classifier rf_clf and the predicted results from this classifier as rf_y_pred)

For the purpose of this assignment, there is no need to change the parameter setting of any classifier. Using the default setting for each is sufficient.

For each classifier above, please print out its performance on the test set. The accuracy, precision, recall, and F1 score need to be printed.
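One way the three classifiers could sit side by side in a single script is sketched below. It assumes scikit-learn, uses synthetic stand-in data from make_classification instead of heart.csv, and relies on the default split size of train_test_split; classification_practice.py may differ on each of these points. The helper print_scores is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

RANDOM_STATE = 0  # replace with the number from your email address

# Synthetic stand-in data; the real script would use the features and
# target column read from heart.csv.
X, y = make_classification(n_samples=300, n_features=13, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RANDOM_STATE)


def print_scores(name, y_true, y_pred):
    """Print and return accuracy, precision, recall, and F1 for one model."""
    scores = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
    print(name)
    for metric, value in scores.items():
        print(f"  {metric}: {value:.3f}")
    return scores


# Each classifier and its predictions get their own names, so all three
# can run in the same script without overwriting one another.
lg_clf = LogisticRegression(random_state=RANDOM_STATE)
lg_clf.fit(X_train, y_train)
lg_y_pred = lg_clf.predict(X_test)
lg_scores = print_scores("Logistic Regression", y_test, lg_y_pred)

dt_clf = DecisionTreeClassifier(random_state=RANDOM_STATE)
dt_clf.fit(X_train, y_train)
dt_y_pred = dt_clf.predict(X_test)
dt_scores = print_scores("Decision Tree", y_test, dt_y_pred)

rf_clf = RandomForestClassifier(random_state=RANDOM_STATE)
rf_clf.fit(X_train, y_train)
rf_y_pred = rf_clf.predict(X_test)
rf_scores = print_scores("Random Forest", y_test, rf_y_pred)
```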

In addition, please print out the accuracy, precision, recall, and F1 score achieved by the majority baseline.
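A majority baseline always predicts the most frequent class seen in the training set. The sketch below uses scikit-learn's DummyClassifier with strategy="most_frequent", which is an assumption about tooling since the course script may build the baseline another way. The labels are hypothetical, and the name baseline_clf is illustrative.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels, for illustration only (1 = disease, 0 = no disease).
y_train = [1, 1, 1, 0, 0, 1]
y_test = [1, 0, 1, 1, 0]

# DummyClassifier ignores feature values, so placeholder features suffice here.
X_train = [[0]] * len(y_train)
X_test = [[0]] * len(y_test)

baseline_clf = DummyClassifier(strategy="most_frequent")
baseline_clf.fit(X_train, y_train)
baseline_y_pred = baseline_clf.predict(X_test)  # every prediction is 1

# zero_division=0 avoids a warning when the majority class is 0 and the
# baseline therefore never predicts the positive class.
baseline_accuracy = accuracy_score(y_test, baseline_y_pred)
baseline_precision = precision_score(y_test, baseline_y_pred, zero_division=0)
baseline_recall = recall_score(y_test, baseline_y_pred, zero_division=0)
baseline_f1 = f1_score(y_test, baseline_y_pred, zero_division=0)

print(f"accuracy:  {baseline_accuracy:.2f}")   # 0.60
print(f"precision: {baseline_precision:.2f}")  # 0.60
print(f"recall:    {baseline_recall:.2f}")     # 1.00
print(f"f1:        {baseline_f1:.2f}")         # 0.75
```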

Your code will be graded based on the Python file you submit. When grading your assignment, we will run the Python script you submit. Please make sure that your code prints out all results for different classifiers. Below are some examples that are NOT considered printing out all the results:

· Commenting out part of the code so that only the result of one classifier is printed out.

· Changing the classifier while you run the code, so that it only prints out the results of one classifier each time you run.

When we click run in PyCharm on your code, all numbers reported in the screenshots below should be printed. The classifiers need to be added to the same script (i.e., you only submit one .py file). As long as you name the classifiers and their prediction results differently, they will have no conflict in the same script. You may copy and paste the code, give the classifiers different names, give the predicted results different names, and print out using their new names.

Non-runnable code, incorrect code, incorrectly specified random state, code that does not print out all results, or code that generates inconsistent results with your screenshots will result in point loss.

(15 points each for Logistic Regression, Decision Tree, Random Forest, and Majority Baseline; 60 points total)

2.1. Please insert the screenshot of your Logistic Regression evaluation results below

[Insert screenshot here]

2.2. Please insert the screenshot of your Decision Tree evaluation results below

[Insert screenshot here]

2.3. Please insert the screenshot of your Random Forest evaluation results below

[Insert screenshot here]

2.4. Please insert the screenshot of your Majority Baseline evaluation results below

[Insert screenshot here]

Part 3 – Comparing the Model Performance (22 Points)

In this part, we will look at the evaluation results above to understand and compare the models.

3.1. How do all three models compare with the majority baseline? Please intuitively interpret the evaluation results for the majority baseline and then make the comparison considering all evaluation metrics (i.e., accuracy, precision, recall, and F1 score) of the models.

(8 points)

[Insert your answer here]

3.2. For this context of heart disease prediction, should we put more emphasis on precision or recall? Why? Which method performs the best in this metric? Can we confidently say that it is the best method?

(8 points)

[Insert your answer here]

3.3. Taking other metrics into account, which method do you believe to have the best overall performance? Why do you say that?

(6 points)

[Insert your answer here]
