CS 4435/5435 and DASE 4435 - Project 1

Instructions


Deliverables

The files expected in your submission folder are listed below. You will lose significant points if you submit fewer or additional files.

Task 1: Data Download

Download the Census Income dataset and the Credit Approval dataset from the UCI Machine Learning Repository. You will use them to answer the questions in Tasks 2-5.

  • Census Income (CI) Data Set: The task for this dataset is to predict whether income exceeds $50K/yr based on census data. It contains 14 attributes. Note: download only the adult.data file.
  • Credit Approval (CA) Data Set: This dataset concerns credit card applications. It contains 15 attributes. Note: download only the crx.data file.

    Task 2: Create Training and Test Dataset (5 points)

    In this task, create a training dataset and a test dataset for each of the datasets above. The training and test datasets should contain 80% and 20% of the instances in the original dataset, respectively. For how to perform the split, see the course slides for Lecture 4-6 on Random Sampling Without Replacement. Once you have carried out the split, name the generated files as follows:

    Training dataset: census_trainset.txt, credit_trainset.txt
    Test dataset: census_testset.txt, credit_testset.txt
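
The 80/20 split described above can be sketched in Python as follows. This is a minimal illustration, not the method from the course slides: the function names `split_dataset` and `split_file`, the `seed` parameter, and the choice to skip blank lines are all assumptions made for the example.

```python
import random


def split_dataset(lines, train_fraction=0.8, seed=None):
    """Split a list of records into (train, test) by random sampling
    without replacement. Hypothetical helper, not part of the handout."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * len(lines)))
    # random.sample draws distinct indices, so no record is picked twice.
    train_idx = set(rng.sample(range(len(lines)), n_train))
    train = [line for i, line in enumerate(lines) if i in train_idx]
    test = [line for i, line in enumerate(lines) if i not in train_idx]
    return train, test


def split_file(source_path, train_path, test_path, train_fraction=0.8, seed=None):
    """Read a data file, split it, and write the two output files.
    Returns the number of training and test instances."""
    with open(source_path) as f:
        # Assumption: blank lines are not instances and are dropped.
        lines = [line for line in f.read().splitlines() if line.strip()]
    train, test = split_dataset(lines, train_fraction, seed)
    with open(train_path, "w") as f:
        f.write("\n".join(train) + "\n")
    with open(test_path, "w") as f:
        f.write("\n".join(test) + "\n")
    return len(train), len(test)
```

For example, `split_file("adult.data", "census_trainset.txt", "census_testset.txt", seed=0)` would produce the two census files. Fixing the seed makes the split reproducible, which helps when comparing results across runs.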

    Task 3: Decision Tree (DT), Random Forest (RF), and Naive Bayes (NB) Algorithms (45 points)

    In this task, you will implement R, Matlab, or Python functions that use Decision Trees, Random Forest, and Naive Bayes to train a model and then apply the model to classify your test data.
    Your functions should be invoked as follows (Matlab, R, or Python):
    	decision_tree(<training_file>, <test_file>)
    	random_forest(<training_file>, <test_file>)
    	naive_bayes(<training_file>, <test_file>)
    	
    All of these functions should be implemented in a single file named: DTvsRFvsNB.[extension of language]

    Note: Your code should also work with ANY OTHER training and test files using the same format as the files in the UCI datasets directory.
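
To work with any file in the same format, the three functions can share a single loader. The sketch below assumes the layout used by adult.data and crx.data: one instance per line, comma-separated values, the class label in the last column, and "?" marking a missing value. The names `parse_uci_lines` and `load_uci_file` are hypothetical and not required by the assignment.

```python
import csv


def parse_uci_lines(lines):
    """Parse an iterable of comma-separated lines into (features, labels).
    Assumes the class label is the last value on each line."""
    features, labels = [], []
    # skipinitialspace handles the ", " separator used in adult.data.
    for row in csv.reader(lines, skipinitialspace=True):
        row = [value.strip() for value in row]
        if not row or all(value == "" for value in row):
            continue  # skip blank lines
        features.append(row[:-1])
        labels.append(row[-1])
    return features, labels


def load_uci_file(path):
    """Load a training or test file from disk."""
    with open(path, newline="") as f:
        return parse_uci_lines(f)
```

All values are returned as strings, so each classifier still needs to decide which columns to treat as numeric and which as categorical.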


    Task 4: Classification (5 points)

    For each test data instance you should print a line containing the following info:

    Task 5: Report (15 points)

    1. Report the number of instances in the created training and test datasets for both the CI and CA datasets.
    2. Report the accuracy, precision, and recall of the three algorithms on both datasets.
    3. Report the running times of the three algorithms.
    4. Comment on the computational performance and the actual run-time of the three algorithms (in seconds, minutes, or hours) when applied to both datasets.
    Note: For each dataset, present items 2 and 3 above in tabular format for ease of reading.
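
The metrics and running times requested in this task can be computed along the following lines. This is a sketch under stated assumptions: the helper names `binary_metrics` and `timed` are hypothetical, the choice of which class counts as "positive" is left to the caller, and a metric with a zero denominator is reported as 0.0.

```python
import time


def binary_metrics(true_labels, predicted_labels, positive_label):
    """Return accuracy, precision, and recall for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive_label:
            if truth == positive_label:
                tp += 1  # correctly predicted positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive_label:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # correctly predicted negative
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start
```

For instance, `timed(decision_tree, "credit_trainset.txt", "credit_testset.txt")` would give the wall-clock time of one full train-and-classify run, which is the figure the table in item 3 needs.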

    Task 6: Handling Missing Data (30 points)

    In this task, you will pick ONE of the three ways of handling missing data described in the course slides for Lecture 4-6 and use it to resolve the missing values in some attributes of the Credit Approval dataset.
    Once you have decided on the technique for handling the missing data, perform Tasks 2 and 3 on the Credit Approval dataset only.
    Report the following:
    1. State the technique you used to handle the missing data.
    2. Compare your results with the results reported for Credit Approval in Task 5. [Compare in tabular format for ease of reading.]
    3. Comment on the similarities or differences in the results in item 2 above.
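
As an illustration of one possible missing-data technique, the sketch below replaces each missing value with the most frequent observed value in its column (mode imputation). Whether this is one of the three ways covered in Lecture 4-6 is an assumption; check the slides before using it. The function name `impute_with_mode` is hypothetical, and the "?" marker is the one used in crx.data.

```python
from collections import Counter

MISSING = "?"  # missing-value marker used in crx.data


def impute_with_mode(rows, missing=MISSING):
    """Return a copy of rows with each missing value replaced by the
    most frequent non-missing value of its column."""
    if not rows:
        return []
    modes = []
    for col in range(len(rows[0])):
        counts = Counter(row[col] for row in rows if row[col] != missing)
        # If a column has no observed values, leave the marker unchanged.
        modes.append(counts.most_common(1)[0][0] if counts else missing)
    return [
        [modes[col] if value == missing else value for col, value in enumerate(row)]
        for row in rows
    ]
```

The mode is a natural fit for categorical attributes. For continuous attributes, the column mean or median is the more usual replacement, and simply deleting instances with missing values is another common alternative.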