Computer Science Question

You will train a variety of tree-based models and evaluate each one using 5-fold cross-validation. Using your best performing model, you will run inference on a test set and submit the predicted labels.

Save Time On Research and Writing
Hire a Pro to Write You a 100% Plagiarism-Free Paper.
Get My Paper

Dataset Description:

You will use the news dataset from Quiz. As before, the dataset contains five categories (sport, business, politics, entertainment, tech). The task is to classify documents into one of these five categories. You will be provided with the following datasets:

● Raw training data attached with labels:

● The dataset contains the raw text of

Save Time On Research and Writing
Hire a Pro to Write You a 100% Plagiarism-Free Paper.
Get My Paper

10

63 news articles and the article category. Each row is a document.

● The raw file is a .csv with three columns: ArticleId, Text, Category

● The “Category” column are the labels you will use for training

● Raw test data attached without labels

● This dataset contains the raw text of 735 news articles. Each row is a document.

● The raw file is a .csv with two columns: ArticleId,Text.

● The labels are not provided

Your job:

  1. Preprocess the raw training data. You can use your code from the HW0. Additionally, you can use the code from the posted solution from the HW0. You may also construct other features, such n-grams or keyword extractions. Feel free to use any other features you feel may be relevant.
  2. Evaluate the decision tree model on your pre-processed data. (25pt)Randomly select 80% data instances as training, and the remaining 20% data instances as validation. Change the parameter setting on criterion (“gini”, “entropy”}. Draw a bar chart showing the

    training accuracy

    and validation accuracy w.r.t. different parameter values. (5pt)

Example:

  1. Evaluate the decision tree using 5-fold cross-validation (see the example code for a different task here) w.r.t

    min_samples_leaf

    :Report the average training and validation accuracy, and their standard deviation for different parameter values (organize the results in a table). (5pt)

Example:

min_samples_leaftraining accuracy

10

……

testing accuracy

0.839

0.723

50

0.899

0.923

200

0.702

0.792

  1. Draw a line figure showing the training and validation result, x-axis should be the parameter values, y-axis should be the training and validation accuracy. (5pt)

Example:

  1. Evaluate the decision tree using 5-fold cross-validation w.r.t max_features:Report the average training and validation accuracy, and their standard deviation for different parameter values (organize the results in a table). (5pt)Draw a line figure showing the training and validation result, x-axis should be the parameter values, y-axis should be the training and validation accuracy. (5pt)
  1. Evaluate random forests model on pre-processed training data. (25 pt)Describe your parameter setting. (5pt)Use 5-fold cross-validation to evaluate the performance w.r.t. the number of trees (n_estimators):Report the average training and validation accuracy, and their standard deviation for different parameter values (organize the results in a table). (5pt)Draw a line figure showing the training and validation result, x-axis should be the parameter values, y-axis should be the training and validation accuracy. (5pt)Use 5-fold cross-validation to evaluate the performance w.r.t. the minimum number of samples required to be at a leaf node (min_samples_leaf)Report the average training and validation accuracy, and their standard deviation for different parameter values (organize the results in a table). (5pt)Draw a line figure showing the training and validation result, x-axis should be the parameter values, y-axis should be the training and validation accuracy. (5pt)
  2. Predict the labels for the testing data (using raw training data and raw testing data). (50pt)Describe how you pre-process the data to generate features. (5pt)Describe how you choose the model and parameters. (5pt)Describe the performance of your chosen model and parameter on the training data. (5pt)The final classification models to be used in this question are limited to decision trees, random forests, and boosting trees (AdaBoost, or GradientBoostingTree). It is OK to use other models/methods to do feature engineering (e.g., using word embeddings). (25pt)Note that this question will be graded based on your accuracy on our test data. You should try to think of better features and try different models and parameters in order to get a higher accuracy.

What to submit:

You need to submit three files:

1. code.ipynb – The notebook containing all the code for the questions. Please do not include notebook cells that had no use randomly. For each cell in the notebook, you should include a description of what it does. This will help improve your code writing skills in general.

2. description.pdf – The description of the results for all questions

3. labels.csv, this is the predicted labels for Q4. Each row of the file will be a comma-separated string denoting the article ID and predicted label. For example, if the predicted label for article number 2 is politics, then the row in the file would be “2,politics”. Make sure that your .csv file does not have a header row.

Still stressed with your coursework?
Get quality coursework help from an expert!