A Naive Bayes classifier is not a single algorithm but a family of probabilistic classification algorithms based on Bayes' theorem, all of which assume that the features are conditionally independent given the class. It is also simple to implement. Real-world applications include spam filtering, document classification, text analysis, and medical diagnosis.
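For orientation, the decision rule shared by Naive Bayes classifiers can be written as below. The symbols are assumptions introduced here for illustration: c is a class label (for example, a sentiment), x_1, ..., x_n are the features of one instance (for example, word counts), and \hat{c} is the predicted class.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product over features is where the "naive" conditional-independence assumption enters; the Gaussian, Multinomial, and Bernoulli variants differ only in how P(x_i | c) is modeled.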
To perform sentiment analysis using a Naive Bayes algorithm, complete the following:
- Access the resources related to sentiment analysis, located in the topic Resources (https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment)
Note: There are about 50 datasets that are suitable for use in a sentiment analysis task. For this part of the exercise, you must choose one of these datasets, provided it includes at least 10,000 instances.
- Ensure that the datasets are suitable for classification using this method.
- You may search for data in other repositories, such as Data.gov, Kaggle, or scikit-learn.
For your selected dataset, build a classification model as follows, in Python:
- Explain the dataset and the type of information you wish to gain by applying a classification method.
- Explain the Naive Bayes algorithm and how you will be using it in your analysis (list the steps, the intuition behind the mathematical representation, and address its assumptions).
- Import the necessary libraries, then read the dataset into a data frame and perform initial statistical exploration.
- Clean the data and address any irregularities (e.g., through normalization, feature scaling, or outlier handling); use illustrative diagrams and plots, and explain them.
- Formulate two questions that can be answered by applying a Naive Bayes classification method.
- Choose one of the Naive Bayes variants (Gaussian, Multinomial, or Bernoulli Naive Bayes) and explain your reasoning.
- Split the data into independent and dependent variables (that is, features and labels).
- Vectorize the text into numbers.
- Train the Naive Bayes classifier on the training set.
- Make classification predictions.
- Interpret the results in the context of the questions you asked.
- Validate your model using a confusion matrix, accuracy score, ROC curve with AUC, and k-fold cross-validation. Then, explain the results.
- Include all mathematical formulas used and graphs representing the final outcomes.
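As a starting point, the vectorizing, training, prediction, and validation steps above might be sketched as follows. This is a minimal sketch, assuming scikit-learn is installed. The tiny in-memory corpus is hypothetical and only stands in for your chosen dataset, and Multinomial Naive Bayes on raw word counts is one possible choice, not a required one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace it with the text and label columns of your dataset.
positive = [
    "great flight and friendly crew",
    "loved the service on board",
    "smooth landing and on time arrival",
    "very comfortable seats thank you",
    "amazing staff and quick boarding",
    "best airline experience so far",
    "happy with the fast check in",
    "wonderful crew and good food",
    "thank you for the great upgrade",
    "pleasant flight and helpful staff",
]
negative = [
    "flight delayed again terrible service",
    "lost my luggage and no help",
    "rude staff and long wait",
    "worst airline experience ever",
    "cancelled flight with no refund",
    "horrible delay and dirty seats",
    "stuck on hold for two hours",
    "bad food and late departure",
    "my bag was damaged and nobody cared",
    "awful customer service never again",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)  # 1 = positive, 0 = negative

# Hold out a stratified test set so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

# Vectorize the text into word counts, then fit Multinomial Naive Bayes.
# Keeping both steps in one pipeline means the vocabulary is learned
# from the training data only.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# Class predictions and the predicted probability of the positive class.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

# Validation metrics on the held-out test set.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

# Stratified 5-fold cross-validation over the full corpus.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print("Confusion matrix (rows = true, columns = predicted):")
print(cm)
print("Accuracy:", acc)
print("ROC AUC:", auc)
print("5-fold CV accuracy:", cv_scores.mean())
```

On a real dataset of 10,000 or more instances, the same structure applies; only the data loading and any text cleaning before the pipeline would change. Plotting the ROC curve itself is left out of this sketch.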
From the work done above, prepare a comprehensive technical report as a Jupyter notebook, including all code, code comments, outputs, plots, and analysis. Make sure the project documentation contains:
a) Problem statement
b) Algorithm of the solution
c) Analysis of the findings
d) References