Instructions
NO EXTENSION will be given for submission. You will be graded on the solution you have on the due date.
Submit the assignment via Canvas as a single file called Project3.zip containing the files listed under Deliverables. The following policies apply:
- Each student is expected to work on each assignment INDIVIDUALLY and submit their own work. The instructor will report to the Office of Student Conduct all violations of this policy, as well as all cases in which such a violation is suspected.
- Late Submission: Everything is due by 11:59 pm on the due date. The deadline is automatically managed by Canvas. You can still turn in an assignment or project after the deadline. However, you automatically lose 5 points per hour after the due time until your score reaches 0 (each individual assignment is worth 100 points). We cannot waive the penalty unless there was a case of illness or other substantial impediment beyond your control, supported by documentation.
Note: Reasons such as a meeting at work, attending a wedding, or travel are not acceptable grounds for waiving the penalty for a late or missing submission.
- Make sure the plots, if any, that you submit are easy to read at a normal zoom level.
- The stipulated naming conventions are mandatory. Non-adherence to these specifications can incur a penalty of up to 10 points.
Deliverables
Important: You must use either R Markdown or Jupyter Notebook to complete this assignment.
- Project3Report-[lastname].pdf
- cluster_analysis.Rmd or cluster_analysis.ipynb
Description of Deliverables:
- Project3Report-[lastname].pdf: Contains the report of your results for the programming task. Only PDF files will be accepted. All text should be typed, and if any figures are present, they should be computer-generated.
- Scans of handwritten answers will NOT be accepted.
- cluster_analysis.Rmd or cluster_analysis.ipynb: The main code implementation must be written in R Markdown or Jupyter Notebook, so that the code and its results are integrated in one reproducible document.
Introduction
Clustering involves finding similarities among data objects based on their characteristics and grouping similar objects into clusters. For this clustering project, you will analyze the Asian Religious and Biblical Texts Data Set.
Goal of this Project: The goal is to use clustering algorithms to determine which documents belong in the same group, based on the document-term matrix (DTM). Ultimately, you want documents from the same religious group to end up in the same cluster.
Task 1: Data Download (5 Points)
Download the Asian Religious and Biblical Texts Data Set from the UCI Repository to use for this project. The two datasets to use from the download are:
- AllBooks_baseline_DTM_Unlabelled.csv: This contains the DTM with no labels. Use this dataset for the clustering project.
- AllBooks_baseline_DTM_Labelled.csv: This contains the DTM with labels. Use this to answer the questions in the report if necessary.
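As an illustration of getting the unlabelled DTM into memory, here is a minimal Python sketch using only the standard library. It assumes the CSV has one header row of terms followed by one purely numeric row per document; the function name `load_dtm` is hypothetical and not part of the assignment. The labelled file would need its label column separated first, and its exact layout is not described here.

```python
import csv


def load_dtm(path):
    """Read a document-term matrix CSV.

    Assumed layout: first row holds the vocabulary terms, every
    following row holds the numeric term counts of one document.
    Returns (terms, rows) where rows is a list of lists of floats.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        terms = next(reader)  # header row: one term per column
        # Skip blank lines and convert every cell to a number.
        rows = [[float(cell) for cell in row] for row in reader if row]
    return terms, rows
```

A call such as `terms, dtm = load_dtm("AllBooks_baseline_DTM_Unlabelled.csv")` would then give one list of counts per document. In a notebook, a data-frame library can do the same job in one line.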
Task 2: Create Functions for K-Means, K-Medoids, and Hierarchical Clustering (50 points)
Write three functions, one for each of the k-means, k-medoids, and hierarchical clustering algorithms.
Your functions and arguments should be formatted as follows:
kmeans_dm(<input_data>, <number of clusters>)
kmedoids_dm(<input_data>, <number of clusters>)
hierarchicalclustering_dm(<input_data>, <number of clusters>)
where:
- The first argument, <input_data>, is the input data.
- The second argument, <number of clusters>, is the number of clusters passed as input.
Library: You can use any library from your language of choice to solve this.
Your code should be invoked and executed within R Markdown or Jupyter Notebook, and all results should be documented in the respective notebook format.
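To illustrate the calling convention, here is a minimal Python sketch of one function with the required two-argument signature. It uses Lloyd's algorithm for k-means written with the standard library only. The extra keyword arguments `max_iter` and `seed`, the helper `_sq_dist`, and the choice to return a list of integer labels are assumptions made for this illustration; the assignment does not prescribe them, and a library-based implementation is equally acceptable.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_dm(input_data, n_clusters, max_iter=100, seed=0):
    """Cluster the rows of input_data into n_clusters groups.

    input_data is a list of numeric rows. Returns one label in
    0..n_clusters-1 per row. max_iter and seed are illustrative extras.
    """
    rng = random.Random(seed)
    # Initialise centroids with n_clusters distinct rows chosen at random.
    centroids = [list(row) for row in rng.sample(input_data, n_clusters)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda c: _sq_dist(row, centroids[c]))
            for row in input_data
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [r for r, lab in zip(input_data, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

The other two functions would follow the same pattern: accept the data and the number of clusters, and return one cluster label per document.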
Task 3: Report (45 points)
- Use ten (10) different numbers of clusters (k) and evaluate the quality of the clustering results using the Davies-Bouldin index (DBI) and Silhouette Index (SI) for the three algorithms. Present your results with a plot. (20 points)
- Comment on the performance of the algorithms for the different values of k. (5 points)
- At what k value was the best clustering quality achieved, and what are the running times of the three algorithms for this k? (5 points)
- Which of the algorithms would you recommend for solving text clustering based on your results and why? (8 points) Note: You can use the Labelled DTM dataset to answer this.
- Which books were poorly clustered? Any explanation why? (7 points) Note: You can use the Labelled DTM dataset to answer this.
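For reference, the two quality measures named above can be sketched directly from their standard definitions. The Python below is a minimal illustration using Euclidean distance and the standard library only; the function names `silhouette_index` and `davies_bouldin_index` and the helpers are hypothetical, and both functions assume at least two clusters with distinct centroids. Established libraries provide equivalent metrics (for example scikit-learn's `silhouette_score` and `davies_bouldin_score`), and using those is fine.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _groups(data, labels):
    """Map each cluster label to the list of rows assigned to it."""
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    return groups


def silhouette_index(data, labels):
    """Mean silhouette over all rows (needs at least two clusters)."""
    groups = _groups(data, labels)
    scores = []
    for row, lab in zip(data, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the row's own cluster
        # (the row's distance to itself is 0, so dividing by len - 1 is right).
        a = sum(_dist(row, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(_dist(row, other) for other in members) / len(members)
            for other_lab, members in groups.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(data, labels):
    """Davies-Bouldin index (needs two or more distinct centroids)."""
    groups = _groups(data, labels)
    centroids = {
        lab: [sum(col) / len(m) for col in zip(*m)] for lab, m in groups.items()
    }
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        lab: sum(_dist(r, centroids[lab]) for r in m) / len(m)
        for lab, m in groups.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
            for j in groups
            if j != i
        )
        for i in groups
    ]
    return sum(worst) / len(worst)
```

When reading the plots, recall that SI lies between -1 and 1 and higher is better, whereas for DBI lower values indicate more compact, better separated clusters.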
Please Note:
Do not let your knowledge of the number of religious books in the labelled data influence your choice of the number of clusters (k).
Remember that for most clustering problems you do not know the exact number of clusters in advance, although rough domain knowledge can be helpful.
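Since k is not known in advance, one way to organize the experiments is to sweep a range of candidate values and record the running time of each run. The Python sketch below shows one possible harness; the name `time_clustering` is hypothetical, and it assumes each clustering function follows the two-argument convention from Task 2 and returns its cluster labels.

```python
import time


def time_clustering(cluster_fn, input_data, k_values):
    """Run cluster_fn(input_data, k) for every k in k_values.

    Returns a dict mapping k to (labels, elapsed_seconds), so that
    quality metrics and running times can be reported side by side.
    """
    results = {}
    for k in k_values:
        start = time.perf_counter()  # high-resolution wall-clock timer
        labels = cluster_fn(input_data, k)
        elapsed = time.perf_counter() - start
        results[k] = (labels, elapsed)
    return results
```

For example, `time_clustering(kmeans_dm, dtm, range(2, 12))` would cover ten values of k. That particular range is only an illustration, and `dtm` stands for whatever variable holds your loaded data.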