Instructions
- The assignment should be submitted via Canvas. Submit a file called Project1.zip, containing the following two files:
- Project1Report-[lastname].pdf, containing your written report with answers to the tasks, explanations, and any necessary analysis. This should include any figures or tables generated as part of your work. Only PDF files will be accepted. All text should be typed, and figures should be computer-generated. Scans of handwritten answers will NOT be accepted.
- DTvsRFvsNB.Rmd or DTvsRFvsNB.ipynb, containing your main code in RMarkdown or Jupyter Notebook format. This file should include all code required for the programming tasks and any outputs that need to be generated. Make sure that your code runs independently and includes any necessary comments for clarity.
Your function should be invoked as follows (e.g., in RMarkdown or Jupyter Notebook):
DTvsRFvsNB(<training_file>, <test_file>)
where:
- The first argument, <training_file>, is the path name of the training file where the training data is stored.
- The second argument, <test_file>, is the path name of the test file where the test data is stored.
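A minimal Python sketch of this entry point (assuming a Jupyter Notebook submission; the body shown is illustrative and simply delegates to the three per-algorithm functions defined in Task 3):

# Sketch of the required entry point; the body is illustrative.
def DTvsRFvsNB(training_file, test_file):
    # Run all three classifiers on the same training/test files (Task 3 functions).
    decision_tree(training_file, test_file)
    random_forest(training_file, test_file)
    naive_bayes(training_file, test_file)

# Example invocation:
# DTvsRFvsNB("census_trainset.txt", "census_testset.txt")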
- Each student is expected to work on each assignment INDIVIDUALLY and submit their own work. The instructor will report all violations or suspicions of violations to the Office of Student Conduct.
- Late Submission: Submissions are due by 11:59 pm on the due date and are automatically managed by Canvas. If you submit after the deadline, you will automatically lose 5 points per hour until the score reaches zero. Exceptions: Illness or a substantial impediment beyond your control, supported by documented proof, is the only acceptable reason for a waiver.
- Ensure any submitted plots are easy to read at normal zoom levels.
- Adherence to the stipulated file naming conventions is mandatory. Failure to follow these specifications may result in a penalty of up to 10 points.
- The written report should be no more than 2 pages in length.
Deliverables
The files expected in your submission folder are listed below. You will lose significant points if you submit fewer files or additional files.
- Project1Report-[lastname].pdf — Contains your written answers and analysis.
- DTvsRFvsNB.Rmd or DTvsRFvsNB.ipynb — Your main code file in RMarkdown or Jupyter Notebook format.
- census_trainset.txt — Training dataset for the Census Income data (Task 2).
- credit_trainset.txt — Training dataset for the Credit Approval data (Task 2).
- census_testset.txt — Test dataset for the Census Income data (Task 2).
- credit_testset.txt — Test dataset for the Credit Approval data (Task 2).
- Task6_credit_trainset.txt — Training dataset for the Credit Approval data after handling missing values (Task 6).
- Task6_credit_testset.txt — Test dataset for the Credit Approval data after handling missing values (Task 6).
Task 1: Data Download
Download the Census Income dataset and the Credit Approval dataset from the UCI Machine Learning Repository to answer the questions in Tasks 2-5.
Census Income (CI) Data Set: This dataset is used to predict whether income exceeds $50K/yr based on census data. It contains 14 attributes. Note: download only the adult.data file.
Credit Approval (CA) Data Set: This dataset concerns credit card applications. It contains 15 attributes. Note: download only the crx.data file.
Task 2: Create Training and Test Datasets (5 points)
In this task, create a training dataset and a test dataset for each of the two datasets above. The training dataset should contain 80% of the instances in the original dataset and the test dataset the remaining 20%.
For how to perform the splitting, see the Lecture 4-6 course slides on Random Sampling Without Replacement (a minimal sketch is also shown after the file list below). Once you have carried out the splitting, name the generated datasets as follows:
Training dataset: census_trainset.txt, credit_trainset.txt
Test dataset: census_testset.txt, credit_testset.txt
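A minimal sketch of an 80/20 split by random sampling without replacement, assuming Python with pandas (the function name split_80_20 and the fixed seed are illustrative choices, not part of the assignment):

import pandas as pd

def split_80_20(input_path, train_path, test_path, seed=1):
    # Read the raw data file; these datasets have no header row.
    data = pd.read_csv(input_path, header=None)
    # Sample 80% of the rows without replacement for training.
    train = data.sample(frac=0.8, replace=False, random_state=seed)
    # The remaining 20% of the rows form the test set.
    test = data.drop(train.index)
    train.to_csv(train_path, header=False, index=False)
    test.to_csv(test_path, header=False, index=False)

# Example: split_80_20("adult.data", "census_trainset.txt", "census_testset.txt")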
Task 3: Decision Tree (DT), Random Forest (RF), and Naive Bayes (NB) Algorithms (45 points)
In this task, you will implement an R, MATLAB, or Python function that uses Decision Trees, Random Forests, and Naive Bayes to train models, and then applies each model to classify your test data.
Your functions should be invoked as follows (MATLAB, R, or Python):
decision_tree(<training_file>, <test_file>)
random_forest(<training_file>, <test_file>)
naive_bayes(<training_file>, <test_file>)
- The first argument, <training_file>, is the path name of the training file where the training data is stored.
- The second argument, <test_file>, is the path name of the test file where the test data is stored.
All of these functions should be implemented in the file named DTvsRFvsNB.[extension of language].
Note: Your code should also work with ANY OTHER training and test files using the same format as the files in the UCI datasets directory.
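A minimal sketch of one of these functions, assuming Python with scikit-learn (the helper load_xy, the ordinal encoding of categorical attributes, and the choice of GaussianNB are illustrative assumptions; decision_tree and random_forest would follow the same pattern with DecisionTreeClassifier and RandomForestClassifier):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import GaussianNB

def load_xy(path):
    # Read a comma-separated file with no header; the last column is the class label.
    data = pd.read_csv(path, header=None)
    return data.iloc[:, :-1], data.iloc[:, -1]

def naive_bayes(training_file, test_file):
    X_train, y_train = load_xy(training_file)
    X_test, y_test = load_xy(test_file)
    # Encode categorical attributes numerically; unseen test categories map to -1.
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X_train = enc.fit_transform(X_train.astype(str))
    X_test = enc.transform(X_test.astype(str))
    model = GaussianNB().fit(X_train, y_train)
    # Return predictions and true labels for the per-instance printing in Task 4.
    return model.predict(X_test), y_test.to_numpy()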
Task 4: Classification (5 points)
For each test data instance, you should print a line containing the following information:
- Object ID. This is the line number where that object occurs in the test file. Start with 1 in numbering the objects, not with 0.
- Predicted class (the result of the classification)
- True class (from the last column of the test file).
- Accuracy. Defined as follows:
- If Predicted Class = True Class, accuracy is 1.
- If Predicted Class ≠ True Class, accuracy is 0.
- Format as follows (MATLAB):
fprintf('ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f\n', object_id, predicted_class, true_class, accuracy);
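If you work in Python rather than MATLAB, an equivalent output statement might look like the sketch below (variable names mirror the MATLAB example; use %s instead of %d if your class labels are strings rather than integers):

# Sketch of the same output line in Python; variable names are illustrative.
print("ID=%5d, predicted=%3d, true=%3d, accuracy=%4.2f"
      % (object_id, predicted_class, true_class, accuracy))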
Task 5: Report (15 points)
- (a) Report the number of instances in the created training and test datasets for both the CI and CA datasets.
- (b) Report the accuracy, precision, and recall on both datasets for the three algorithms.
- (c) Report the running times of the three algorithms.
- (d) Comment on the computational performance and the actual run-time of the three algorithms (in seconds, minutes, or hours) when applied to both datasets.
Note: For each dataset, present the results for items (b) and (c) above in tabular format for ease of reading (a sketch of how these quantities can be computed follows below).
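A minimal sketch of how accuracy, precision, recall, and running time could be collected for each classifier, assuming Python with scikit-learn and Task 3 functions that return (predicted, true) label arrays (the helper evaluate and the positive_label parameter are illustrative assumptions; the positive label depends on how the class attribute is encoded in each dataset, and the "+" in the example call assumes the Credit Approval class attribute uses "+"/"-" labels):

import time
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(classifier_fn, training_file, test_file, positive_label):
    # Time the full train-and-classify call for item (c).
    start = time.time()
    predicted, true = classifier_fn(training_file, test_file)
    elapsed = time.time() - start
    # Metrics for item (b); precision and recall need a designated positive class.
    return {
        "accuracy": accuracy_score(true, predicted),
        "precision": precision_score(true, predicted, pos_label=positive_label),
        "recall": recall_score(true, predicted, pos_label=positive_label),
        "runtime_seconds": elapsed,
    }

# Example: evaluate(naive_bayes, "credit_trainset.txt", "credit_testset.txt", "+")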
Task 6: Handling Missing Data (30 points)
In this task, you will pick ONE of the three techniques for handling missing data described in the Lecture 4-6 course slides, and use it to resolve the missing values in some attributes of the Credit Approval Dataset (a sketch of one possible technique appears after the report items below).
Once you have decided on the technique to use to handle the missing data, perform Tasks 2 and 3 on the Credit Approval dataset only, naming the generated files Task6_credit_trainset.txt and Task6_credit_testset.txt.
Report the following:
- (a) State the technique you used to handle the missing data.
- (b) Compare your results with the results reported for Credit Approval in Task 5 [compare in a tabular format for ease of reading].
- (c) Comment on the similarities or differences in the results in (b) above.
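A minimal sketch of one possible technique, replacing each missing value with the most frequent value of its attribute, assuming Python with pandas (the function name impute_most_frequent is illustrative, and this is only one option; use whichever technique from the Lecture 4-6 slides you choose):

import pandas as pd

def impute_most_frequent(input_path, output_path):
    # crx.data marks missing values with '?'; read them as NaN.
    data = pd.read_csv(input_path, header=None, na_values="?")
    # Replace each missing value with the most frequent value of its column.
    data = data.fillna(data.mode().iloc[0])
    data.to_csv(output_path, header=False, index=False)

# Example: impute_most_frequent("crx.data", "crx_no_missing.data")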