Instructions
NO EXTENSION will be given for submission. You will be graded on the solution you have on the due date.
The assignment should be submitted via Canvas. Submit a file called Project3.zip containing the files listed under Deliverables. The following policies apply:
- Each student is expected to work on each assignment INDIVIDUALLY and submit his or her own work. The instructor will report all violations of this policy, and all suspected violations, to the Office of Student Conduct.
- Late Submission: Everything is due by 11:59 pm on the due date. The deadline is automatically managed by Canvas. You can still turn in an assignment or project after the deadline. However, you automatically lose 5 points per hour after the due time until you reach 0 (each individual assignment is worth 100 points). We cannot waive the penalty unless there was a case of illness or other substantial impediment beyond your control, supported by documentation.
Note: Reasons such as a meeting at work, attending a wedding, or travel are not acceptable grounds for waiving the penalty for a late or missing submission.
- Make sure the plots, if any, that you submit are easy to read at a normal zoom level.
- The stipulated naming conventions are mandatory; non-adherence to these specifications can incur a penalty of up to 10 points.
Note: You can download and install ANACONDA NAVIGATOR to make sure that you have the required Python and R versions and dependencies.
Deliverables
The files expected in your submission folder are listed below. You will lose considerable points if you submit fewer files.
- Project3Report-[lastname].pdf
- cluster_analysis.[extension of language]
where:
- Project3Report-[lastname].pdf contains the report of your results for the report questions in Task 4. Only PDF files will be accepted. All text should be typed, and any figures should be computer-generated.
- Scans of handwritten answers will NOT be accepted.
Introduction
Clustering involves finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters.
For this clustering project, you will analyze the Asian Religious and Biblical Texts Data Set.
Goal of this Project: The goal is to use clustering algorithms to determine which documents should be in the same group based on the document term matrices. Ultimately, you want documents from the same religious group to end up in the same cluster.
Task 1: Data Download (5 Points)
Download the Asian Religious and Biblical Texts Data Set from the UCI Repository to use for this project.
For this project, the authors have pre-processed the documents and generated the Document Term Matrices (DTM).
So you do not need to do the extraction yourself. The two datasets to use from the download are:
AllBooks_baseline_DTM_Unlabelled.csv: This contains the DTM with no labels. Use this dataset for the clustering project.
AllBooks_baseline_DTM_Labelled.csv: This contains the DTM with labels. Use this to support your comments in the report if necessary.
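As a rough illustration, the unlabelled DTM could be loaded in Python as sketched below. This assumes the first row of the CSV holds the term names and every remaining cell is a numeric count; check the downloaded file before relying on that. The function name load_dtm is hypothetical and not part of the required interface.

```python
import csv

import numpy as np


def load_dtm(path):
    """Read a document-term matrix CSV and return (terms, matrix).

    Assumption: row 1 is a header of term names and all other cells
    are numeric counts (one row per document).
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        terms = next(reader)  # header row with the term names
        rows = [[float(cell) for cell in row] for row in reader if row]
    return terms, np.asarray(rows, dtype=float)
```

The returned matrix has one row per document and one column per term, which is the shape the clustering functions in Task 3 would typically expect.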
Task 2: Cluster Analysis (5 points)
All the functions and code you create in Task 3 should be in a file named:
cluster_analysis.[extension of language] containing your R, Matlab or Python code for the programming part.
Your code should be invoked as follows (Matlab):
cluster_analysis(<input_data>, <number of clusters>)
Your code should be invoked as follows (Python):
python3 cluster_analysis.py <input_data> <number of clusters>
where:
- The first argument, <input_data>, is the input data.
- The second argument, <number of clusters>, is the number of clusters passed as input.
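One possible way to read the two command-line arguments in Python is sketched below using the standard argparse module. The names parse_args, main, input_data, and num_clusters are illustrative assumptions; only the invocation format above is prescribed.

```python
import argparse


def parse_args(argv=None):
    """Parse <input_data> and <number of clusters> from the command line."""
    parser = argparse.ArgumentParser(description="Cluster a document-term matrix.")
    parser.add_argument("input_data", help="path to the DTM CSV file")
    parser.add_argument("num_clusters", type=int, help="number of clusters (k)")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point: with argv=None, argparse reads sys.argv[1:]."""
    args = parse_args(argv)
    print(f"input: {args.input_data}, k = {args.num_clusters}")
    return args
```

In cluster_analysis.py, calling main() at the bottom of the file would then handle an invocation such as python3 cluster_analysis.py AllBooks_baseline_DTM_Unlabelled.csv 8.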
Task 3: Create functions for K-Means, K-Medoids and Hierarchical Clustering (50 points)
Write functions that implement the k-means, k-medoids, and hierarchical clustering algorithms.
Library: You can use any library available in your language of choice to solve this.
Your code should be invoked as follows (Matlab):
kmeans_dm(<input_data>, <number of clusters>)
kmedoids_dm(<input_data>, <number of clusters>)
hierarchicalclustering_dm(<input_data>, <number of clusters>)
where:
- The first argument, <input_data>, is the input data.
- The second argument, <number of clusters>, is the number of clusters passed as input.
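For illustration only, here is a minimal NumPy sketch of one k-medoids variant (alternating between assigning points to the nearest medoid and re-picking each cluster's medoid). Ready-made implementations of k-means and hierarchical clustering exist in common libraries, for example KMeans and AgglomerativeClustering in scikit-learn, so this sketch covers only k-medoids. The Euclidean distance, the max_iter and seed parameters, and the return value (one integer label per document) are assumptions, not requirements of the assignment.

```python
import numpy as np


def kmedoids_dm(input_data, num_clusters, max_iter=100, seed=0):
    """Cluster rows of input_data around num_clusters medoids; return labels."""
    X = np.asarray(input_data, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances via the Gram matrix (memory-friendly
    # for wide matrices such as a DTM).
    sq = np.sum(X * X, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    np.fill_diagonal(dist, 0.0)

    # Start from randomly chosen, distinct documents as medoids.
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=num_clusters, replace=False)

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)

        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = medoids.copy()
        for c in range(num_clusters):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]

        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids did not change
        medoids = new_medoids

    return np.argmin(dist[:, medoids], axis=1)
```

Like k-means, this procedure can settle in a local optimum that depends on the initial medoids, so results may vary with the seed.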
Task 4: Report (40 points)
- Use ten (10) different cluster sizes (k) and evaluate the quality of the clustering results using the Davies-Bouldin index (DBI) and Silhouette Index (SI) for the three algorithms. Present your results with a plot. (20 points)
- Comment on the performance of the algorithms across the different cluster sizes. (10 points)
- At what k value was the best clustering quality achieved, and what are the running times of the three algorithms for this k? (5 points)
- Which of the algorithms would you recommend for solving text clustering? (5 points for undergrads / 2.5 points for grads)
- Which books were poorly clustered? Can you explain why? (5 bonus points for undergrads / 2.5 points for grads)
Please Note:
Don't let the fact that you know the specific number of religious books, from the labelled data, affect your choice of the number of clusters (k).
Remember that for most clustering problems, you do not know for certain how many clusters there should be.
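The evaluation in Task 4 could be organized along the lines of the sketch below, which computes the Davies-Bouldin index (lower is better) and the mean silhouette coefficient (closer to 1 is better) with NumPy and times each clustering run. This is a sketch under assumptions: Euclidean distance is used, at least two clusters are present, and cluster_fn is a hypothetical callable taking (X, k) and returning one label per row. Library equivalents also exist, for example davies_bouldin_score and silhouette_score in sklearn.metrics.

```python
import time

import numpy as np


def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst scatter/separation ratio."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Mean distance of each cluster's points to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    separation = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(separation, np.inf)  # ignore a cluster paired with itself
    ratios = (scatter[:, None] + scatter[None, :]) / separation
    return float(ratios.max(axis=1).mean())


def silhouette(X, labels):
    """Mean silhouette coefficient over all points (singletons score 0)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    ids = np.unique(labels)
    sq = np.sum(X * X, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    np.fill_diagonal(dist, 0.0)
    scores = np.zeros(len(X))
    for c in ids:
        own = labels == c
        size = int(own.sum())
        if size == 1:
            continue
        # a: mean distance to the other members of the same cluster.
        a = dist[np.ix_(own, own)].sum(axis=1) / (size - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = np.min(
            [dist[np.ix_(own, labels == o)].mean(axis=1) for o in ids if o != c],
            axis=0,
        )
        scores[own] = (b - a) / np.maximum(a, b)
    return float(scores.mean())


def evaluate(X, cluster_fn, k_values):
    """Return (k, DBI, SI, seconds) for each k, timing only the clustering call."""
    rows = []
    for k in k_values:
        start = time.perf_counter()
        labels = np.asarray(cluster_fn(X, k))
        elapsed = time.perf_counter() - start
        rows.append((k, davies_bouldin(X, labels), silhouette(X, labels), elapsed))
    return rows
```

For example, evaluate(X, some_cluster_fn, range(2, 12)) would cover ten cluster sizes; the resulting rows can then be plotted per algorithm, with k on the horizontal axis and DBI or SI on the vertical axis.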