Republican and Democratic parties
Campaign organizers for both the Republican and Democratic parties are interested in identifying individual undecided voters who would consider voting for their party in an upcoming election. A non-partisan group has collected data on a sample of voters with tracked variables, including whether or not they are undecided regarding their candidate preference, age, whether they own a home, gender, marital status, household size, income, years of education, and whether they attend church. Use logistic regression to classify observations as undecided (or decided) using Age, HomeOwner, Female, Married, HouseholdSize, Income, Education, and Church as input variables and Undecided as the target (or response) variable.
In the Data tab of the Rattle GUI – R window, click inside the box next to Filename: and navigate to the location of the file BlueOrRedTrain.csv. Select the file BlueOrRedTrain.csv, click Open, then click the Execute button. Uncheck the box next to Partition. For the Voter variable, select the Ident button. For the Age, HomeOwner, Female, Married, HouseholdSize, Income, Education, and Church variables, select the Input button. For the Undecided variable, select the Target button. Next, click the Execute button. In the Model tab, and in the Type: row, select the button next to Linear, and then select the button next to Logistic. Click the Execute button.
To evaluate the performance of a logistic regression model on a validation set (or a test set), click the Evaluate tab. In the Model: row, select the box next to Linear, and in the Data: row select CSV File. Click inside the box next to CSV File and navigate to the location of the file BlueOrRedValidation.csv. Select the file BlueOrRedValidation.csv, and click Open.
To generate the error matrix, in the Evaluate tab, select Error Matrix in the Type: row, and click Execute. To generate the ROC chart in the Plots pane of the RStudio interface, in the Evaluate tab, select ROC in the Type: row, and click Execute.
Click on the datafile logo to reference the data.
(a)Constructing models on the training set, evaluate a small set of candidate models based on their predictive performance on the validation set. Recommend a final model and express the model as a mathematical equation relating the target (or response) variable to the input variables. Iteratively remove the least significant independent variable one at a time until all independent variables remaining in the model are significant at the 0.01 level of significance. If required, round your answers to four decimal places. If a variable is not used in the model, enter “0” before the corresponding variable. For subtractive or negative numbers use a minus sign even if there is a + sign before the blank. (Example: -300)Log odds of being undecided = + Age + HomeOwner + Female + Married + HouseholdSize + Income + Education + Church
(c)Using a default cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test set for the final model from part (a)? If required, round your answer to one decimal place.