CS 4435/5435 and DASE 4435 - Project 3

Instructions

NO EXTENSION will be given for submission. You will be graded on the solution you have on the due date.

The assignment should be submitted via Canvas. Submit a file called Project3.zip containing the files listed under Deliverables.

Note: You can download and install Anaconda Navigator to make sure that you have the required Python and R versions and dependencies.

Deliverables

The files expected in your submission folder are listed below. You will lose considerable points if you submit fewer files:

Introduction

Clustering involves finding similarities between data objects according to the characteristics found in the data and grouping similar objects into clusters. For this clustering project, you will analyze the Asian Religious and Biblical Texts Data Set.

Goal of this Project: The goal is to use clustering algorithms to determine which documents belong in the same group, based on the document term matrices. Ultimately, you want documents from the same religious group to end up in the same cluster.


Task 1: Data Download (5 Points)

Download the Asian Religious and Biblical Texts Data Set from the UCI Repository to use for this project. The dataset authors have already pre-processed the documents and generated the Document Term Matrices (DTM), so you do not need to do the extraction yourself. The two files to use from the download are:

  • AllBooks_baseline_DTM_Unlabelled.csv: This contains the DTM without labels. Use this dataset for the clustering project.
  • AllBooks_baseline_DTM_Labelled.csv: This contains the DTM with labels. Use this to answer the questions in the report if necessary.
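A minimal sketch of loading either file with pandas. It assumes that the first row of each CSV holds the term names, that each following row is one document, and that the labelled file carries the document label in its first column. These layout assumptions and the helper name `load_dtm` are illustrative only and should be checked against the downloaded files.

```python
import pandas as pd


def load_dtm(path, labelled=False):
    """Read a document-term matrix CSV and return (matrix, labels).

    Assumption: the first row holds the term names. When `labelled` is
    True, the first column is assumed to hold one label per document
    instead of a term count.
    """
    frame = pd.read_csv(path, index_col=0 if labelled else None)
    labels = frame.index.to_list() if labelled else None
    # One row per document, one column per term, as floats for clustering.
    return frame.to_numpy(dtype=float), labels
```

Under those assumptions, `X, _ = load_dtm("AllBooks_baseline_DTM_Unlabelled.csv")` would give the matrix to cluster, and `load_dtm("AllBooks_baseline_DTM_Labelled.csv", labelled=True)` would also return the labels for the report.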

    Task 2: Cluster Analysis (5 points)

    All functions and code created in Task 3 must be placed in a file named:

    cluster_analysis.[extension of language], containing your R, Matlab or Python code for the programming part.

    Your code should be invoked as follows (Matlab):

        cluster_analysis(<input_data>, <number of clusters>)
    			

    Your code should be invoked as follows (Python):

        python3 cluster_analysis.py <input_data> <number of clusters>
        	   
    where:
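As an illustration of the Python invocation above, the two positional arguments could be read with the standard `argparse` module. This is only a sketch: the argument names `input_data` and `number_of_clusters` and the `main` function are assumptions, and the hand-off to the clustering functions is left as a placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments of the Python invocation."""
    parser = argparse.ArgumentParser(prog="cluster_analysis.py")
    parser.add_argument("input_data", help="path to the DTM CSV file")
    parser.add_argument("number_of_clusters", type=int,
                        help="number of clusters (k)")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Hypothetical hand-off: the Task 3 functions would be called here
    # with args.input_data and args.number_of_clusters.
    print(f"input={args.input_data} k={args.number_of_clusters}")
    return args
```

Ending the file with the usual `if __name__ == "__main__": main()` guard would let the script run from the command line as shown in the invocation.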

    Task 3: Create functions for K-Means, K-Medoids and Hierarchical Clustering (50 points)

    Write functions that implement the k-means, k-medoids and hierarchical clustering algorithms.
    Library: You may use any library available in your language of choice to solve this.
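For a Python submission, the three functions might look like the sketch below. It is not a reference solution. It assumes scikit-learn and NumPy are installed, uses Ward linkage and Euclidean distance as assumed defaults, and writes k-medoids out by hand as a simple alternating scheme (assign points to the nearest medoid, then re-pick each medoid), which is a lighter variant than full PAM. The `seed` and `max_iter` parameters are additions for reproducibility.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import pairwise_distances


def kmeans_dm(X, k, seed=0):
    """k-means via scikit-learn; returns one cluster label per row."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)


def hierarchicalclustering_dm(X, k):
    """Agglomerative clustering; Ward linkage is an assumed default."""
    return AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)


def kmedoids_dm(X, k, max_iter=100, seed=0):
    """Alternating k-medoids on Euclidean distances (not full PAM)."""
    D = pairwise_distances(np.asarray(X, dtype=float))
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Update step: the member with the smallest total
                # distance to the other members becomes the medoid.
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1)
```

Each function takes the data matrix and the number of clusters and returns an array of cluster labels, so the three can be swapped freely in later evaluation code.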

    Your code should be invoked as follows (Matlab):

        kmeans_dm(<input_data>, <number of clusters>)
        kmedoids_dm(<input_data>, <number of clusters>)
        hierarchicalclustering_dm(<input_data>, <number of clusters>)
    			
    where:

    Task 4: Report (40 points)

    1. Use ten (10) different numbers of clusters (k) and evaluate the quality of the clustering results using the Davies-Bouldin index (DBI) and the Silhouette index (SI) for the three algorithms. Present your results in a plot. (20 points)
    2. Comment on the performance of the algorithms across the different values of k. (10 points)
    3. At what value of k was the best clustering quality achieved, and what are the running times of the three algorithms for this k? (5 points)
    4. Which of the algorithms would you recommend for text clustering? (5 points for undergrads / 2.5 points for grads)
    5. Which books were poorly clustered? Can you explain why? (5 bonus points for undergrads / 2.5 points for grads)
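The evaluation in items 1 and 3 could be organized roughly as follows. This is a sketch: it assumes scikit-learn's `davies_bouldin_score` and `silhouette_score` with their default Euclidean distance, and that every clustering function takes `(X, k)` and returns labels. The names `evaluate`, `plot_scores` and `example_kmeans` are hypothetical; `example_kmeans` only stands in for one of the Task 3 functions. A lower DBI and a higher SI indicate better clustering.

```python
import time

from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score


def example_kmeans(X, k):
    """Stand-in for any Task 3 function with the (data, k) signature."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)


def evaluate(cluster_fn, X, k_values):
    """Return one (k, DBI, SI, seconds) tuple per number of clusters."""
    rows = []
    for k in k_values:
        start = time.perf_counter()
        labels = cluster_fn(X, k)
        elapsed = time.perf_counter() - start  # clustering time only
        rows.append((k, davies_bouldin_score(X, labels),
                     silhouette_score(X, labels), elapsed))
    return rows


def plot_scores(results, path="scores.png"):
    """Plot DBI and SI against k; `results` maps algorithm name to rows."""
    # Imported lazily so that evaluate() works without matplotlib.
    import matplotlib.pyplot as plt

    fig, (ax_dbi, ax_si) = plt.subplots(1, 2, figsize=(10, 4))
    for name, rows in results.items():
        ks = [r[0] for r in rows]
        ax_dbi.plot(ks, [r[1] for r in rows], marker="o", label=name)
        ax_si.plot(ks, [r[2] for r in rows], marker="o", label=name)
    ax_dbi.set(xlabel="k", ylabel="Davies-Bouldin index (lower is better)")
    ax_si.set(xlabel="k", ylabel="Silhouette index (higher is better)")
    ax_dbi.legend()
    fig.savefig(path)
```

Calling `evaluate` once per algorithm with the same ten values of k and passing the collected rows to `plot_scores` would produce one curve per algorithm on each index.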

    Please Note:

  • Do not let your knowledge of the number of religious books in the labelled data influence your choice of the number of clusters (k).
  • Remember that in most clustering problems the exact number of clusters is not known in advance.
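In keeping with these notes, k can be chosen from the internal quality indices alone, without looking at the labels. The sketch below shows one possible rule, which is an assumption rather than a requirement of the assignment: prefer the highest Silhouette index and break ties by the lowest Davies-Bouldin index. The function name `best_k` and the `(k, dbi, si)` tuple layout are hypothetical.

```python
def best_k(rows):
    """Pick k from (k, dbi, si) tuples using internal indices only.

    Assumed rule: the highest silhouette index wins, and ties are
    broken by the lower Davies-Bouldin index.
    """
    return max(rows, key=lambda r: (r[2], -r[1]))[0]
```

Other rules, such as ranking by DBI first, are equally defensible. Whichever rule is used should be stated in the report.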