CS 4435/5435 and DASE 4435 - Project 3

Instructions

NO EXTENSIONS will be given. You will be graded on whatever you have submitted by the due date.

Submit the assignment via Canvas as a single file called Project3.zip containing the files listed under Deliverables.


Deliverables

Important: You must use either R Markdown or Jupyter Notebook to complete this assignment.

Description of Deliverables:


Introduction

Clustering groups similar data objects together according to characteristics found in the data. For this clustering project, you will analyze the Asian Religious and Biblical Texts Data Set.

Goal of this Project: The goal is to use clustering algorithms to determine which documents belong in the same group based on their document-term matrices (DTMs). Ultimately, you want documents from the same religious group to end up in the same cluster.


Task 1: Data Download (5 Points)

Download the Asian Religious and Biblical Texts Data Set from the UCI Repository to use for this project. The two datasets to use from the download are:


Task 2: Create Functions for K-Means, K-Medoids, and Hierarchical Clustering (50 points)

Write one function for each of the k-means, k-medoids, and hierarchical clustering algorithms.

Your functions and arguments should be formatted as follows:

				kmeans_dm(<input_data>, <number of clusters>)
				kmedoids_dm(<input_data>, <number of clusters>)
				hierarchicalclustering_dm(<input_data>, <number of clusters>)
			
where:

Library: You may use any library available in your language of choice to implement these functions.

Your code should be invoked and executed within R Markdown or Jupyter Notebook, and all results should be documented in the respective notebook format.
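To illustrate the required call shape, here is a minimal sketch of kmeans_dm in plain Python, assuming input_data is a list of equal-length numeric rows and that the function returns one cluster label per row. The max_iter parameter, the _sq_dist helper, and the farthest-first initialisation are illustrative assumptions, not requirements of this assignment; a library implementation is equally acceptable.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_dm(input_data, k, max_iter=100):
    """Lloyd-style k-means; returns one label in 0..k-1 per input row.

    max_iter is an illustrative extra, not part of the required signature.
    """
    rows = [list(map(float, r)) for r in input_data]

    # Assumption: deterministic farthest-first initialisation. Start from the
    # first row, then repeatedly add the row farthest from the chosen centroids.
    centroids = [rows[0]]
    while len(centroids) < k:
        centroids.append(
            max(rows, key=lambda r: min(_sq_dist(r, c) for c in centroids))
        )

    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: _sq_dist(r, centroids[c])) for r in rows
        ]
        if new_labels == labels:  # no change, so the algorithm has converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = kmeans_dm(toy, 2)
```

On this toy input the two well-separated groups of three rows receive different labels. Random or k-means++ initialisation is the more common choice in practice; the deterministic start here only keeps the sketch reproducible.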


Task 3: Report (45 points)

  1. For each of the three algorithms, run ten (10) different values of k (the number of clusters) and evaluate the quality of the clustering results using the Davies-Bouldin index (DBI) and the Silhouette Index (SI). Present your results in a plot. (20 points)
  2. Comment on how each algorithm performs as k varies. (5 points)
  3. At what value of k was the best clustering quality achieved, and what are the running times of the three algorithms for this k? (5 points)
  4. Based on your results, which of the algorithms would you recommend for text clustering, and why? (8 points) Note: You can use the Labelled DTM dataset to answer this.
  5. Which books were poorly clustered, and can you explain why? (7 points) Note: You can use the Labelled DTM dataset to answer this.
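As a reminder of what the two quality measures in item 1 compute, the sketch below implements both in plain Python under stated assumptions: rows are numeric lists, distances are Euclidean, and every cluster centroid is distinct. The function names davies_bouldin and silhouette and the helper _group are hypothetical; library versions of both indices may be used instead. A lower DBI and a higher SI indicate better clustering.

```python
import math


def _group(data, labels):
    """Map each label to the list of rows carrying that label."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    return clusters


def davies_bouldin(data, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio."""
    clusters = _group(data, labels)
    cents = {lab: [sum(col) / len(m) for col in zip(*m)]
             for lab, m in clusters.items()}
    # S_i: average distance from the members of cluster i to its centroid.
    scatter = {lab: sum(math.dist(r, cents[lab]) for r in m) / len(m)
               for lab, m in clusters.items()}
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)


def silhouette(data, labels):
    """Mean of (b - a) / max(a, b) over all rows; singletons score 0."""
    clusters = _group(data, labels)
    scores = []
    for row, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the row's own cluster.
        a = sum(math.dist(row, o) for o in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(row, o) for o in m) / len(m)
                for other, m in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


square = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = [0, 0, 1, 1]   # groups the two nearby pairs together
bad = [0, 1, 0, 1]    # splits each nearby pair across clusters
```

Comparing the two labelings of the toy data shows the expected direction of each index: the good labeling has the lower DBI and the higher SI.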

Please Note:

  • Do not let the number of religious books known from the labelled data dictate your choice of the number of clusters (k).
  • Remember that in most clustering problems the true number of clusters is not known in advance, although rough domain knowledge can help.
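Since k is not assumed to be known, one way to organise the experiments is to run each clustering function over a range of k values and record the running time of every call. The sketch below is only an illustration: time_over_k and the stand-in clusterer round_robin_dm are hypothetical names, and any function with the (input_data, number of clusters) call shape from Task 2 can be passed in.

```python
import time


def time_over_k(cluster_fn, input_data, k_values):
    """Run cluster_fn for each k and return {k: (labels, seconds)}."""
    results = {}
    for k in k_values:
        start = time.perf_counter()
        labels = cluster_fn(input_data, k)
        results[k] = (labels, time.perf_counter() - start)
    return results


def round_robin_dm(input_data, k):
    """Stand-in clusterer for illustration only: labels rows 0, 1, ..., k-1 in turn."""
    return [i % k for i in range(len(input_data))]


toy_rows = [[0.0], [1.0], [2.0], [3.0]]
runs = time_over_k(round_robin_dm, toy_rows, range(2, 4))
```

The recorded labels can then be scored with DBI and SI for each k, and the timings reported for the k that gives the best quality.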