- Using the 20 newsgroup data, do the following:
  - Do the pre-processing. This step is application dependent, so read to the end of the task description before deciding which pre-processing steps to apply.
  - Create plots, using matplotlib, to show the following (for each topic in the data separately, saving the plots to file):
    - Most frequent words, bigrams and trigrams
    - Word cloud plots
    - Histogram of word and sentence length
  - Use both Matrix Factorization (LSA) and the LDA algorithms to do topic modelling. The output is a sequence of 10 words for each topic.
  - Compare your topics between LSA and LDA, and prepare yourself for questions about it (and other subjects) during your presentation.
  - Use the labels provided in the dataset to measure the performance of both algorithms based on both accuracy and the F1 score.
  - LSA and LDA are unsupervised algorithms. In this part, try to apply logistic regression to this problem to see if you can predict the topic in a supervised fashion. Note that this is no longer a binary classification problem; you have to find a way to convert it to binary classification.
NOTE 1: The 20 newsgroup dataset (Kaggle) comes in two parts when you download it: a train file and a test file. All the items in this project should be done on the train dataset. The test dataset should only be used to measure and illustrate the performance of your model; do not report performance measured on the train dataset.
NOTE 2: You will be required to run your project during the presentation.
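
One possible way to combine the train/test rule from the first note with the accuracy and F1 requirement is sketched below. Since LSA and LDA produce topic ids rather than newsgroup labels, this sketch assumes a majority-vote mapping from topic id to label learnt on the train set only, and then scores the mapped predictions on the test set. The mapping strategy, the function names, and the toy arrays are all assumptions made for illustration; macro-averaged F1 is likewise one choice among several.

```python
from collections import Counter, defaultdict


def majority_label_map(topic_ids, labels):
    """Map each topic id to the most common true label among the documents assigned to it."""
    by_topic = defaultdict(Counter)
    for topic, label in zip(topic_ids, labels):
        by_topic[topic][label] += 1
    return {topic: counts.most_common(1)[0][0] for topic, counts in by_topic.items()}


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical dominant-topic assignments and true labels (toy data, not real results).
train_topics = [0, 0, 0, 1, 1, 2, 2, 2]
train_labels = ["space", "space", "hockey", "hockey", "hockey",
                "graphics", "graphics", "space"]
test_topics = [0, 1, 2, 2, 1, 0]
test_labels = ["space", "hockey", "graphics", "space", "hockey", "hockey"]

# Learn the topic-to-label mapping on the train set only.
mapping = majority_label_map(train_topics, train_labels)

# Score on the test set; a topic never seen in training maps to None and counts as an error.
test_pred = [mapping.get(topic) for topic in test_topics]
test_accuracy = accuracy(test_labels, test_pred)
test_macro_f1 = macro_f1(test_labels, test_pred)
print(f"accuracy={test_accuracy:.3f} macro_f1={test_macro_f1:.3f}")
```

The same two scoring functions can be reused unchanged for the supervised logistic regression predictions, which keeps the unsupervised and supervised results comparable.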