CS 4435/5435 and DASE 4435 - Project 2

Instructions

NO EXTENSION will be given for submission. You will be graded on the solution you have on the due date.

The assignment should be submitted via Canvas. Submit a file called Project2.zip containing the files listed under Deliverables.

Each student must work individually on each assignment and submit their own work. Violations of this policy will be reported to the Office of Student Conduct.
Late Submission: Submissions are due by 11:59 pm on the due date. Late submissions lose 5 points per hour until the score reaches zero. No waivers will be granted except for documented illness or a substantial impediment beyond your control.
Ensure submitted plots are readable at normal zoom level.
Follow the naming conventions strictly, or incur a penalty of up to 10 points.

Deliverables

The files expected in your submission folder are listed below. You will lose considerable points if any files are missing:

Project2Report-[lastname].pdf
20_newsgroups_Train (All Train data in this folder)
20_newsgroups_Test (All Test data in this folder)
eliminate.txt
BagofWords.txt
naive_bayes.ipynb or naive_bayes.Rmd
neural_networks.ipynb or neural_networks.Rmd
support_vm.ipynb or support_vm.Rmd

Introduction

This project involves classifying 20,000 messages into 20 categories using three algorithms: Naïve Bayes, Neural Networks, and Support Vector Machine (SVM). Proper formatting of the dataset is essential for success.

Task 1: Data Download

Download the Twenty Newsgroups Data Set and extract the 20_newsgroups.tar.gz file.

Task 2: Create Training and Test Dataset (5 points)

Split the dataset with 60% as training data and 40% as test data.

Training dataset folder: 20_newsgroups_Train

Test dataset folder: 20_newsgroups_Test
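
One possible way to produce the split is to shuffle the files of each newsgroup and copy the first 60% into the training folder and the remainder into the test folder. The sketch below is only an illustration: the helper names `split_files` and `split_dataset`, the fixed random seed, and the choice to split each category separately are assumptions, not requirements of the assignment.

```python
import random
import shutil
from pathlib import Path


def split_files(names, train_fraction=0.6, seed=0):
    """Return (train, test) lists drawn from a list of file names."""
    shuffled = sorted(names)               # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (seed is an assumption)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def split_dataset(source_dir, train_dir="20_newsgroups_Train",
                  test_dir="20_newsgroups_Test"):
    """Copy every newsgroup folder under source_dir into train/test folders."""
    for category in sorted(Path(source_dir).iterdir()):
        if not category.is_dir():
            continue
        names = [p.name for p in category.iterdir() if p.is_file()]
        train, test = split_files(names)
        for target, subset in ((train_dir, train), (test_dir, test)):
            out = Path(target) / category.name
            out.mkdir(parents=True, exist_ok=True)
            for name in subset:
                shutil.copy(category / name, out / name)
```

Splitting per category keeps roughly the same class balance in both folders, which makes the later accuracy and recall figures easier to compare.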

Task 3: Create the Bag of Words (15 points)

Create a dictionary (Bag of Words) containing unique words arranged by term weight.

Remove punctuation and stop words.
Compile removed words into eliminate.txt (5 points)
Save the Bag of Words in BagofWords.txt.
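
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: "term weight" is taken to mean raw term frequency, digits and symbols are treated like punctuation, the stop-word list is a tiny placeholder, and the function name `build_bag_of_words` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


def build_bag_of_words(documents, stop_words=STOP_WORDS):
    """Return (bag, eliminated) for an iterable of message strings.

    bag        -- list of (word, count), highest count first
    eliminated -- sorted list of removed stop words and non-letter characters
    """
    counts = Counter()
    eliminated = set()
    for text in documents:
        lowered = text.lower()
        # Every character that is neither a letter nor whitespace is removed.
        eliminated.update(re.findall(r"[^a-z\s]", lowered))
        for word in re.findall(r"[a-z]+", lowered):
            if word in stop_words:
                eliminated.add(word)
            else:
                counts[word] += 1
    # Sort by count (descending), then alphabetically for a stable order.
    bag = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return bag, sorted(eliminated)
```

The two return values correspond to the contents of BagofWords.txt and eliminate.txt; writing them to disk is left out of the sketch.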

Here is an example of the Bag of Words:
[Figure: Bag of Words Example]

Task 4: Naive Bayes (NB) Algorithm (30 points)

Implement a function in Python or R to train and apply a Naive Bayes model. Use the following function signature:

			naive_bayes(<training_path>, <test_path>)

Note: Implement the classifier without using any external libraries for Naive Bayes.
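
As a rough illustration of what a library-free classifier involves, the sketch below trains and applies a multinomial Naive Bayes model on token lists. The multinomial variant, add-one (Laplace) smoothing, and the helper names `train_nb` and `predict_nb` are assumptions; the assignment does not prescribe them. Reading the folders at `<training_path>` and `<test_path>` into token lists is omitted.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: iterable of (label, list_of_tokens) pairs."""
    doc_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, word_counts, vocabulary


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen (class, word) pairs non-zero.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when a message contains hundreds of words, since the product of many small probabilities becomes a sum.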

Task 5: Neural Networks (10 points)

Implement a function in Python or R to train and apply a Neural Network model. Use the following function signature:

			neural_networks(<training_path>, <test_path>)

Library: Any neural networks library is permitted. BONUS: 7 extra points for implementing without using a library.

Task 6: Support Vector Machine (SVM) (10 points)

Implement a function in Python or R to train and apply an SVM model. Use the following function signature:

			support_vm(<training_path>, <test_path>)

Library: You may use libsvm or any other SVM library. BONUS: 7 extra points for implementing without using a library.
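
Both neural_networks and support_vm need each message as a fixed-length numeric vector before a library can train on it. A minimal sketch, assuming the features are raw counts of the dictionary words (the helper name `vectorize` and the three-word dictionary are hypothetical):

```python
from collections import Counter


def vectorize(tokens, vocabulary):
    """Map a token list to a count vector ordered like `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Hypothetical top-3 dictionary; the feature size is len(vocabulary).
vocabulary = ["cat", "hat", "orbit"]
vector = vectorize(["cat", "orbit", "cat", "dog"], vocabulary)
```

Under this assumption the feature size equals the number of dictionary words kept, so truncating the dictionary is one way to shrink the feature size.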

Task 7: Report (30 points)

Report the accuracy, number of misclassified messages, recall, and running time for each algorithm (9 points).
Specify the feature size for Neural Networks and SVM (1 point).
List the libraries used for Neural Networks and SVM (1 point).
Report accuracy, misclassifications, recall, and running time with Bag of Words sizes of 70,000, 50,000, 30,000, and 10,000 for Naive Bayes (12 points).
Comment on performance trends with different dictionary sizes (3 points).
Reduce the feature size of Neural Networks and SVM by 10,000 and report the results (4 points).

Example Report Table

Dictionary Size	Accuracy	Number of Misclassified	Recall	Running Time
70,000
50,000
30,000
10,000
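
The columns of the report table can be computed with a small helper such as the one sketched below. It is an illustration only: recall is macro-averaged over the classes, which is an assumption because the assignment does not say how recall should be aggregated, and the names `evaluate` and `timed` are hypothetical.

```python
import time
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Return (accuracy, misclassified_count, macro_recall)."""
    pairs = list(zip(true_labels, predicted_labels))
    misclassified = sum(1 for truth, guess in pairs if truth != guess)
    accuracy = 1 - misclassified / len(pairs)
    per_class_total = Counter(truth for truth, _ in pairs)
    per_class_hit = Counter(truth for truth, guess in pairs if truth == guess)
    # Macro recall: average of per-class recall (an assumed convention).
    recall = sum(per_class_hit[label] / total
                 for label, total in per_class_total.items()) / len(per_class_total)
    return accuracy, misclassified, recall


def timed(function, *args):
    """Run function(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start
```

For example, wrapping a call as `timed(naive_bayes, training_path, test_path)` would give the running time for one row of the table, assuming the function returns its predictions.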