Final Project: Data Mining
November 19, 2023
1 Background
A new title, “The Art History of Florence”, is ready for release by The Charles Book Club (“CBC”).
To design targeted marketing strategies, CBC has sent a test mailing to a random sample of 4,000
customers from its customer base. The customer responses have been collated with past purchase
data.
You may work in a group of 2 members for this project.
2 Dataset
Each row in the spreadsheet corresponds to one market test customer. Each column is a variable
with the header row giving the name of the variable. The variable names and descriptions are given
below:
• Seq#: Sequence number in the partition
• ID#: Identification number in the full (unpartitioned) market test data set
• Gender: 0=Male, 1=Female
• M: Monetary – Total money spent on books
• R: Recency – Months since last purchase
• F: Frequency – Total number of purchases
• FirstPurch: Months since first purchase
• ChildBks: Number of purchases from the category: Child books
• YouthBks: Number of purchases from the category: Youth books
• CookBks: Number of purchases from the category: Cookbooks
• DoItYBks: Number of purchases from the category: Do It Yourself books
• RefBks: Number of purchases from the category: Reference books (Atlases, Encyclopedias,
Dictionaries)
• ArtBks: Number of purchases from the category: Art books
• GeoBks: Number of purchases from the category: Geography books
• ItalCook: Number of purchases of book title: “Secrets of Italian Cooking.”
• ItalAtlas: Number of purchases of book title: “Historical Atlas of Italy.”
• ItalArt: Number of purchases of book title: “Italian Art.”
• Florence: =1 if “The Art History of Florence” was bought, =0 if not
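The variable list above can be turned into a small schema check before any modeling. The following is a minimal sketch using only the Python standard library. It assumes the CSV header uses exactly the variable names listed above (this should be verified against the real files), and the two sample rows are made up purely for illustration.

```python
import csv
import io

# Column names as listed in the handout; the exact header spelling in the
# real CSV files is an assumption and should be checked against the data.
ID_COLUMNS = ["Seq#", "ID#"]
FEATURE_COLUMNS = [
    "Gender", "M", "R", "F", "FirstPurch",
    "ChildBks", "YouthBks", "CookBks", "DoItYBks", "RefBks",
    "ArtBks", "GeoBks", "ItalCook", "ItalAtlas", "ItalArt",
]
TARGET_COLUMN = "Florence"


def load_rows(handle):
    """Read CSV rows and return (features, targets) as numeric lists."""
    reader = csv.DictReader(handle)
    expected = ID_COLUMNS + FEATURE_COLUMNS + [TARGET_COLUMN]
    missing = [name for name in expected if name not in reader.fieldnames]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    features, targets = [], []
    for row in reader:
        # The identifier columns are dropped: they carry no predictive signal.
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        targets.append(int(row[TARGET_COLUMN]))
    return features, targets


# Two made-up rows, for illustration only (not taken from the real data).
header = ",".join(ID_COLUMNS + FEATURE_COLUMNS + [TARGET_COLUMN])
sample = io.StringIO(
    header + "\n"
    "1,101,0,150,6,3,20,1,0,1,0,0,1,0,0,0,0,1\n"
    "2,102,1,80,12,1,12,0,0,1,0,0,0,0,0,0,0,0\n"
)
X, y = load_rows(sample)
```

With a real file, the same function could be called on an open file handle instead of the in-memory sample.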
3 Project Goal
Which team’s (or submission’s) data mining model can most accurately predict whether a customer will
buy “The Art History of Florence”?
Training, Validation, and Testing
In machine learning, the dataset is divided into three sets: the training set, used to train the model;
the validation set, employed for hyperparameter tuning and performance evaluation during training;
and the test set, reserved for a final unbiased assessment of the model’s generalization to new data.
The training set is the largest, teaching the model by exposing it to diverse examples. The validation
set aids in preventing overfitting, guiding adjustments to hyperparameters, while the test set serves
as an independent benchmark for evaluating the model’s real-world performance, ensuring it hasn’t
memorized the training data but can generalize effectively.
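One way to carve a validation set out of the training data is a stratified split, which keeps the proportion of buyers similar in both parts. The sketch below uses only the standard library. The function name `stratified_split`, the 20% validation fraction, and the toy labels are assumptions for illustration, not part of the assignment. Libraries such as scikit-learn offer an equivalent through `train_test_split` with its `stratify` argument.

```python
import random


def stratified_split(labels, validation_fraction=0.2, seed=42):
    """Return (train_idx, valid_idx) with similar class ratios in both sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_valid = round(len(indices) * validation_fraction)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy labels: 10 buyers (1) among 100 customers, purely illustrative.
labels = [1] * 10 + [0] * 90
train_idx, valid_idx = stratified_split(labels)
```

The test set plays no part in this split; it is only scored once, after the model and its hyperparameters are fixed.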
Use CBC 3200.csv for training and validating your models, and CBC 800.csv for testing only
after a model is trained. Use Area Under the Curve (AUC) for reporting model performance.
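To make the AUC metric concrete, the sketch below builds ROC points by sweeping the decision threshold over the predicted scores and integrates them with the trapezoidal rule. It is a minimal standard-library illustration; the function names and the toy labels and scores are assumptions. In a notebook, `roc_curve` and `roc_auc_score` from scikit-learn compute the same quantities.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR) obtained by lowering the threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all customers tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy validation labels and predicted probabilities, purely illustrative.
y_valid = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
curve = roc_points(y_valid, scores)
area = auc(curve)  # 5/9 here: 5 of the 9 buyer/non-buyer pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of buyers above non-buyers.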
4 Requirements
Each team will submit a final notebook report detailing its analysis. The final report must
contain the following:
• Describe any exploratory analysis performed. For each analysis, include why it is done, the
findings, and whether or how it impacts later project stages.
• Describe any changes or pre-processing you applied to the data set, for example handling missing values, transforming variables, binning variables, handling class imbalance, or eliminating outliers. Explain why these operations were performed. Again, this is an open-ended
question.
• Perform and summarize each broad class of models that you tried, for example clustering-based, regression-based, ensemble-based, neural nets, and any
other types of data mining models. Plot at least one receiver operating characteristic (ROC)
curve.
• Describe the actual model that you selected to produce your “best” results. Summarize the steps
followed to achieve the best performance.
• SHAP (SHapley Additive exPlanations) values are a way to explain the output of any
machine learning model. SHAP uses a game-theoretic approach that measures each player’s
(here, each feature’s) contribution to the outcome. Given the best model above, perform SHAP analysis to find
the importance of the features. One resource for SHAP: A Novel Approach to Feature
Importance — Shapley Additive Explanations.
• Provide two actionable insights that you learned from this analysis and that the company could use
to improve its operations.
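The game-theoretic idea behind the SHAP requirement above can be illustrated with exact Shapley values on a tiny example: each feature's value is its marginal contribution averaged over every order in which features could be revealed. The sketch below is a standard-library illustration, not the `shap` package itself. The feature names come from the variable list, while `toy_value` and its payoffs are hypothetical and made up for the example.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {player: total / len(orderings) for player, total in totals.items()}


# Hypothetical toy "model": its output depends on which features are known.
# The payoffs are invented for illustration and are not fitted to any data.
def toy_value(coalition):
    score = 0.0
    if "ArtBks" in coalition:
        score += 0.2
    if "ItalArt" in coalition:
        score += 0.1
    if "ArtBks" in coalition and "ItalArt" in coalition:
        score += 0.1  # interaction effect, shared equally by both features
    return score


phi = shapley_values(["ArtBks", "ItalArt", "Gender"], toy_value)
# phi is approximately {"ArtBks": 0.25, "ItalArt": 0.15, "Gender": 0.0}
```

The values sum to the output for the full feature set (0.4 here), which is the additivity property that SHAP plots rely on. Exact enumeration grows factorially with the number of features, so for the real model the `shap` package's approximations are the practical route.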
5 Assignment Evaluation
Your grade on the assignment will be based on the scope, depth, and clarity of your analysis, the
organization of your notebook, the quality of your write-up, and the performance of the best model.
6 Presentation
• Make an unlisted YouTube presentation and submit the unlisted video link.
• The video should be about 5-8 minutes long.
• The presentation slides should be clearly visible, and the presenter(s) should preferably be
visible, if possible.
• You may use Bb Zoom Meeting, Bb Collaborate Ultra, or another platform of your
choice.
• Content: describe the process of building and testing the best model from beginning to
end.
7 References
Use the IEEE citation format: numerical citations in square brackets to refer to all resources and
provide straightforward formatting for references. See IEEE Citation Guidelines.