
The goal of this assignment is to classify documents in a corpus. You will train a variety of linear models and evaluate each one using 5-fold cross-validation. Using your best-performing model, you
will run inference on a test set and submit the predicted labels.
Dataset description:
You will use the news dataset from Quiz. As before, the dataset contains five categories (sport,
business, politics, entertainment, tech). The task is to assign each document to one of these five
categories. You will be provided with the following datasets:
● Raw training data (link) with labels:
  ● The dataset contains the raw text of 1063 news articles and the article category. Each row is a document.
  ● The raw file is a .csv with three columns: ArticleId, Text, Category.
  ● The “Category” column contains the labels you will use for training.
● Raw test data (link) without labels:
  ● This dataset contains the raw text of 735 news articles. Each row is a document.
  ● The raw file is a .csv with two columns: ArticleId, Text.
  ● The labels are not provided.
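As a minimal sketch of reading the two files described above, the helper below uses only Python's standard csv module. The function name load_articles and the file names in the usage note are hypothetical; only the column names ArticleId, Text, and Category come from the dataset description.

```python
import csv


def load_articles(fileobj, has_labels=True):
    """Read articles from an open CSV file object.

    Returns (ids, texts, labels); labels is an empty list when
    has_labels is False (the test file has no Category column).
    """
    reader = csv.DictReader(fileobj)
    expected = ["ArticleId", "Text"] + (["Category"] if has_labels else [])
    missing = [c for c in expected if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    ids, texts, labels = [], [], []
    for row in reader:
        ids.append(row["ArticleId"])  # kept as a string; convert if needed
        texts.append(row["Text"])
        if has_labels:
            labels.append(row["Category"])
    return ids, texts, labels
```

A possible call, assuming the training file was saved locally as train.csv (a placeholder name): `with open("train.csv", newline="", encoding="utf-8") as f: ids, texts, labels = load_articles(f)`. For the test file, pass `has_labels=False`.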
Example code for PyTorch and TensorFlow.
Your job:
1. Preprocess the raw training data. You can use the code from Homework 1. You are required
to construct other features, such as n-grams or extracted keywords. (15pt)
a. Run a neural network with 2 hidden layers of 128 neurons each, using the features
extracted by CountVectorizer() as the original features. Use 5-fold cross-validation to
evaluate the performance.
b. Feature exploration. Use other features such as TF-IDF, or word embeddings
provided by other packages, such as GloVe with gensim, or BERT. Use 5-fold
cross-validation to evaluate the performance of your neural network.
c. Describe how you generate features. (5pt)
d. Report the average training and validation accuracy, and their standard deviations, for
each feature construction method (organize the results in a table). (5pt)
Example:
Feature method       training accuracy   testing accuracy
CountVectorizer()    0.839               0.723
GloVe                0.899               0.923
...                  ...                 ...
BERT                 0.702               0.792
e. Draw a bar chart showing the training and validation results. The x-axis should be the
parameter values (here, the feature methods) and the y-axis the training and validation accuracy. (5pt)
Example:
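One way Question 1 could be approached is sketched below with scikit-learn, which the assignment does not mandate. It is an illustration under stated assumptions, not the reference solution: MLPClassifier stands in for the PyTorch or TensorFlow network, the helper name compare_features is hypothetical, and GloVe or BERT features are left out because they need extra downloads. Each vectorizer sits inside the pipeline so that it is refit on the training folds only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Candidate feature constructions (illustrative choices, not a required set).
FEATURE_METHODS = {
    "CountVectorizer()": lambda: CountVectorizer(),
    "CountVectorizer(1-2 grams)": lambda: CountVectorizer(ngram_range=(1, 2)),
    "TfidfVectorizer()": lambda: TfidfVectorizer(),
}


def compare_features(texts, labels, max_iter=200, seed=0):
    """Return (name, train_mean, train_std, val_mean, val_std) per feature method."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rows = []
    for name, make_vectorizer in FEATURE_METHODS.items():
        # Two hidden layers with 128 neurons each, as in part (a).
        model = make_pipeline(
            make_vectorizer(),
            MLPClassifier(hidden_layer_sizes=(128, 128),
                          max_iter=max_iter, random_state=seed),
        )
        scores = cross_validate(model, texts, labels, cv=cv,
                                scoring="accuracy", return_train_score=True)
        rows.append((name,
                     scores["train_score"].mean(), scores["train_score"].std(),
                     scores["test_score"].mean(), scores["test_score"].std()))
    return rows
```

The returned rows map directly onto the table in part (d): in scikit-learn's output, "test_score" is the accuracy on the held-out validation fold, not on the unlabeled test set.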
2. Explore the Neural Network model on pre-processed training data. (25pt)
a. Describe your parameter setting. (5pt)
b. Use 5-fold cross-validation to evaluate the performance w.r.t. the learning rate (𝜂).
You may use the feature engineering method that performed best in
Question 1. Recommended candidate values: [0.0001, 0.0003, 0.001, 0.003, 0.01,
0.03, 0.1]
1. Report the average training and validation accuracy, and their standard
deviation for different parameter values (organize the results in a table). (5pt)
Example:
Learning rate   training accuracy   testing accuracy
0.0001          0.839               0.723
0.0003          0.899               0.923
...             ...                 ...
0.1             0.702               0.792
2. Draw a line chart showing the training and validation results. The x-axis should be
the parameter values and the y-axis the training and validation accuracy.
(5pt)
Example:
c. Use 5-fold cross-validation to evaluate the performance w.r.t. the optimizer. You may
use the feature engineering method that performed best in Question
1. Recommended candidate values: [SGD, Adam, RMSprop] (see PyTorch or
TensorFlow)
1. Report the average training and validation accuracy, and their standard
deviation for different parameter values (organize the results in a table). (5pt)
2. Draw a bar chart showing the training and validation results. The x-axis should be
the parameter values and the y-axis the training and validation accuracy.
(5pt)
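The learning-rate and optimizer experiments in Question 2 could be organized as in the PyTorch sketch below. Everything except the 128-neuron two-hidden-layer architecture, the 5 folds, and the three optimizer names is an assumption made for illustration: the helper names, full-batch training, the epoch count, and the requirement that features arrive as a dense float tensor X with integer class labels y.

```python
import torch
from torch import nn

OPTIMIZERS = {
    "SGD": torch.optim.SGD,
    "Adam": torch.optim.Adam,
    "RMSprop": torch.optim.RMSprop,
}


def build_mlp(n_features, n_classes, hidden=128):
    """Two hidden layers of `hidden` neurons each, with ReLU activations."""
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, n_classes),
    )


def train_and_score(X_tr, y_tr, X_val, y_val, n_classes, optimizer_name, lr, epochs=30):
    """Train on one fold (full-batch, for brevity); return (train_acc, val_acc)."""
    model = build_mlp(X_tr.shape[1], n_classes)
    optimizer = OPTIMIZERS[optimizer_name](model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X_tr), y_tr)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        train_acc = (model(X_tr).argmax(dim=1) == y_tr).float().mean().item()
        val_acc = (model(X_val).argmax(dim=1) == y_val).float().mean().item()
    return train_acc, val_acc


def cv_scores(X, y, n_classes, optimizer_name="Adam", lr=1e-3, k=5, seed=0):
    """k-fold cross-validation; returns mean and std of train/validation accuracy."""
    torch.manual_seed(seed)  # same folds and initial weights for every setting
    folds = torch.tensor_split(torch.randperm(len(X)), k)
    results = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = torch.cat([f for j, f in enumerate(folds) if j != i])
        results.append(train_and_score(X[train_idx], y[train_idx],
                                       X[val_idx], y[val_idx],
                                       n_classes, optimizer_name, lr))
    results = torch.tensor(results)  # shape (k, 2)
    mean, std = results.mean(dim=0), results.std(dim=0)  # sample std (n - 1)
    return {"train_mean": mean[0].item(), "train_std": std[0].item(),
            "val_mean": mean[1].item(), "val_std": std[1].item()}
```

For part (b), one would loop over the recommended learning rates with a fixed optimizer, for example `cv_scores(X, y, 5, "Adam", lr)` for each `lr`; for part (c), loop over the keys of OPTIMIZERS with a fixed learning rate. Each returned dictionary supplies one row of the results table and one point or bar in the figure.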
3. Predict the labels for the testing data (using raw training data and raw testing data). (60pt)
a. Describe how you pre-process the data to generate features. (5pt)
b. Describe how you choose the model and parameters. (5pt)
c. Describe the performance of your chosen model and parameters on the training data.
(5pt)
d. The final classification models to be used in this question are limited to random
forest, neural networks, and ensemble methods. It is OK to use other models to do
feature engineering. (45pt)
1. Note that this question will be graded based on your accuracy. You should try
to think of better features and try different models and parameters in order to
get a higher accuracy.
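As one possible starting point for Question 3, the sketch below combines a random forest and a neural network in a soft-voting ensemble over TF-IDF features, using scikit-learn. The model choice, hyperparameters, and helper names are assumptions for illustration only; nothing here is claimed to give the best accuracy, and the settings would still need to be selected by cross-validation.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline


def build_ensemble(seed=0):
    """TF-IDF features followed by a soft-voting ensemble (random forest + MLP)."""
    voter = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=seed)),
            ("mlp", MLPClassifier(hidden_layer_sizes=(128, 128),
                                  max_iter=200, random_state=seed)),
        ],
        voting="soft",  # average the predicted class probabilities
    )
    return make_pipeline(TfidfVectorizer(sublinear_tf=True), voter)


def fit_and_predict(train_texts, train_labels, test_texts, seed=0):
    """Fit on all labeled training articles and predict labels for the test articles."""
    model = build_ensemble(seed)
    model.fit(train_texts, train_labels)
    return model.predict(test_texts).tolist()
```

Both final classifiers fall within the allowed families (random forest, neural network, ensemble), and the vectorizer is fit on the training texts only, so no information from the test set leaks into the features.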
What to submit:
You need to submit three files:
1. code.ipynb – The notebook containing all the code for the questions. Please do not include
unused notebook cells. For each cell in the notebook, you should include a
description of what it does. This will help improve your code writing skills in general.
2. description.pdf – The description of the results for all questions
3. labels.csv – The predicted labels for Q3. Each row of the file will be a comma-separated
string denoting the article ID and predicted label. For example, if the predicted
label for article number 2 is politics, then the row in the file would be “2,politics”. Make sure
that your .csv file does not have a header row.
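The labels.csv format described above can be produced with the standard csv module, as in this minimal sketch. The function name write_labels is hypothetical; the row format ("2,politics", no header row) comes from the submission instructions.

```python
import csv


def write_labels(fileobj, article_ids, predicted_labels):
    """Write one 'ArticleId,label' row per article, with no header row."""
    if len(article_ids) != len(predicted_labels):
        raise ValueError("article_ids and predicted_labels differ in length")
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerows(zip(article_ids, predicted_labels))
```

A possible call: `with open("labels.csv", "w", newline="", encoding="utf-8") as f: write_labels(f, test_ids, predictions)`, where test_ids and predictions are placeholder names for the test article IDs and the model's predicted labels in the same order.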
Note:
● Remember to submit the three files by clicking on “add another file” in Canvas, instead of
submitting one zipped file of the aforementioned three files.
● Submit all the files to HW 2 on Canvas.
● If you have any questions about the assignment, please email the TA.
● The late submission penalty will be strictly enforced (see syllabus). Submissions after the
deadline but less than 24 hours late are accepted but penalized 10%, and submissions more
than 24 hours but less than 48 hours late are penalized 30%. No submissions are accepted
more than 48 hours late. The assignment should be completed independently.
000
1
10
100
11
12
13
14
15
16
18
2
20
2000
2001
2002
2003
2004
2005
2006
24
25
3
30
4
40
5
50
abil
abl
accept
access
accord
account
accus
achiev
across
act
action
activ
actor
actress
ad
add
address
admit
affect
africa
age
agenc
ago
agre
agreement
ahead
aid
aim
air
airlin
album
alleg
allow
almost
alreadi
also
although
alway
america
american
among
amount
analyst
announc
annual
anoth
answer
anyth
appeal
appear
appl
approach
approv
april
area
argu
around
arrest
arsen
artist
ask
associ
asylum
athlet
attack
attempt
attend
attract
audienc
australia
australian
author
avail
averag
aviat
avoid
award
away
back
bad
ball
ban
band
bank
bankruptci
base
battl
bbc
beat
becam
becom
begin
behind
believ
benefit
best
better
bid
big
biggest
bill
billion
bit
black
blair
blog
board
bodi
book
boost
boss
bought
box
brand
break
bring
britain
british
broadband
broadcast
brown
bt
budget
build
bush
busi
buy
call
came
camera
campaign
campbel
car
card
care
career
carri
case
cash
categori
caus
celebr
centr
central
ceremoni
chairman
challeng
champion
championship
chanc
chancellor
chang
channel
charg
charl
chart
chelsea
chief
children
china
choic
christma
citi
claim
clark
clear
close
club
coach
code
collect
combin
come
comedi
comment
commiss
commit
committe
common
commun
compani
compar
compet
competit
complet
comput
concern
confer
confid
confirm
connect
conserv
consid
consol
consum
content
contest
continu
contract
control
controversi
copi
corpor
cost
could
council
countri
coupl
cours
court
creat
credit
crime
crimin
critic
criticis
cup
current
custom
cut
damag
data
date
davi
david
day
de
deal
death
debat
debt
debut
decemb
decid
decis
declin
defeat
defenc
defend
deficit
deliv
dem
demand
democrat
deni
depart
describ
design
despit
detail
develop
devic
die
differ
difficult
digit
direct
director
disappoint
disast
discuss
distribut
dollar
domin
done
doubl
doubt
download
dr
drive
drop
drug
due
dvd
earli
earlier
earn
econom
economi
educ
effect
effort
eight
either
elect
electron
email
emerg
employ
end
engin
england
enough
ensur
enter
entertain
estim
eu
euro
europ
european
even
event
ever
everi
everyon
everyth
evid
exampl
exchang
execut
exist
expect
experi
expert
explain
export
extra
face
fact
fail
fall
famili
fan
far
favour
favourit
fear
featur
februari
feder
feel
fell
felt
festiv
field
fight
figur
file
film
final
financ
financi
find
fine
finish
firm
first
fit
five
follow
footbal
forc
forecast
foreign
form
former
forward
found
four
fourth
franc
fraud
free
french
friday
friend
front
full
fund
futur
gadget
gain
game
gave
gener
german
germani
get
giant
give
given
global
go
goal
gold
golden
gone
good
googl
gordon
got
govern
grand
great
ground
group
grow
growth
half
hand
handset
happen
happi
hard
head
health
hear
held
help
high
higher
histori
hit
hold
hollywood
home
honour
hope
host
hour
hous
howard
howev
huge
human
hunt
id
idea
illeg
imag
immigr
impact
import
impress
improv
includ
incom
increas
independ
india
indian
individu
industri
inflat
inform
initi
injuri
insist
instead
interest
intern
internet
introduc
invest
investig
investor
involv
iraq
ireland
irish
issu
itali
jame
januari
japan
job
john
johnson
join
jone
judg
june
keep
kennedi
key
kick
know
known
labour
lack
larg
largest
last
late
later
latest
launch
law
lawyer
lead
leader
leagu
least
leav
led
left
legal
less
let
level
lib
liber
life
light
like
limit
line
link
list
listen
littl
live
liverpool
local
london
long
look
lord
lose
loss
lost
lot
love
low
lower
machin
made
magazin
main
major
make
maker
man
manag
manchest
mani
manufactur
march
mark
market
martin
match
matter
may
mean
meanwhil
measur
media
meet
member
men
messag
met
michael
microsoft
might
mike
million
mini
minist
minut
miss
mobil
model
moment
monday
money
monitor
month
move
movi
mp
mr
ms
much
music
must
name
nation
need
net
network
never
new
news
newspap
next
night
nomin
north
noth
novemb
number
octob
offer
offic
offici
often
oil
old
olymp
one
onlin
open
oper
opportun
opposit
order
organis
origin
oscar
other
outsid
owner
paid
pair
paper
parent
park
parliament
part
parti
particularli
pass
past
paul
pay
pc
penalti
pension
peopl
per
perform
period
person
peter
phone
pick
pictur
place
plan
play
player
pledg
point
polic
polici
polit
poll
poor
pop
popular
portabl
posit
possibl
post
potenti
power
predict
premiership
prepar
present
presid
press
pressur
previou
price
prime
privat
prize
probabl
problem
process
produc
product
profit
program
programm
project
promis
promot
properti
propos
protect
prove
provid
public
publish
push
put
qualiti
quarter
question
quit
race
radio
rais
rang
rate
rather
reach
read
real
realli
reason
receiv
recent
record
reduc
refere
reflect
reform
refus
region
reject
releas
remain
replac
report
repres
requir
research
respect
respond
respons
rest
result
retail
return
reveal
revenu
richard
right
rise
risk
rival
road
robert
robinson
rock
roddick
role
rose
round
row
rugbi
rule
run
russia
russian
said
sale
saturday
save
saw
say
scheme
school
score
scotland
scottish
screen
search
season
seat
second
secretari
sector
secur
see
seed
seek
seem
seen
sell
send
senior
sent
septemb
seri
seriou
serv
servic
set
seven
sever
share
sharehold
short
shot
show
shown
side
sign
signific
similar
simpli
sinc
singer
singl
sir
site
situat
six
slow
small
softwar
sold
someth
song
soni
soon
sound
sourc
south
spain
spam
speak
special
specul
speech
speed
spend
spent
spokesman
sport
squad
stage
stand
standard
star
start
state
statement
station
stay
step
still
stock
stop
store
stori
street
strong
struggl
student
studi
studio
success
suffer
suggest
summer
sunday
support
sure
surpris
survey
system
tackl
take
taken
talk
target
tax
team
technolog
televis
tell
term
terror
test
theatr
thing
think
third
though
thought
thousand
threat
three
thursday
time
titl
today
togeth
told
toni
took
tool
top
tori
total
tough
tour
track
trade
train
travel
tri
trial
trust
tsunami
tuesday
turn
tv
two
uk
ukip
understand
union
unit
univers
unlik
unveil
us
use
user
v
valu
version
via
victim
victori
video
view
viru
visit
vote
voter
wage
wait
wale
want
war
warn
watch
way
web
websit
wednesday
week
weekend
well
went
west
whether
whole
whose
wide
william
win
window
winner
within
without
women
word
work
worker
world
worri
worth
would
write
wrong
year
yet
york
young
yuko
zealand
