
QBUS1040

Tutorial 4

Semester 2, 2022

Exercise 1: k-means algorithm: code it up!

In this exercise, you will implement the k-means algorithm using Python.

(a) Create 15 2-vectors. Provide an initial guess of the group representatives and choose the number of groups k to be 3. Plot your data to see what you are working with. (You may try to change the initial guess later on and see how it affects your results.)

(b) Write Python code that assigns the vectors to different groups.

(c) Write Python code that computes the new centroids.

(d) Compute the clustering objective value Jclust.

(e) Perform five iterations using the code you have written in parts (b) - (d).

(f) Define parts (b) - (d) as functions and write a wrapper function that runs k-means until convergence.

(g) Run the k-means algorithm on the initial data and plot your results.  Does your result make sense?
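
As a rough guide to how parts (a) - (f) might fit together, here is a minimal sketch using NumPy. It is one possible layout, not the official solution. The function names (assign_groups, update_centroids, j_clust, kmeans_sketch), the synthetic data, the choice of initial representatives, and the stopping rule (stop when the assignments no longer change) are all assumptions made for illustration.

```python
import numpy as np

def assign_groups(X, Z):
    """Part (b): for each row of X, the index of the nearest representative in Z."""
    # dists[i, j] = squared distance between vector i and representative j
    dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def update_centroids(X, groups, Z):
    """Part (c): mean of each group; an empty group keeps its old representative."""
    Z_new = Z.copy()
    for j in range(len(Z)):
        members = X[groups == j]
        if len(members) > 0:
            Z_new[j] = members.mean(axis=0)
    return Z_new

def j_clust(X, groups, Z):
    """Part (d): mean squared distance from each vector to its representative."""
    return np.mean(np.sum((X - Z[groups]) ** 2, axis=1))

def kmeans_sketch(X, Z0, max_iter=100):
    """Part (f): alternate (b) and (c) until the assignments stop changing."""
    Z = np.array(Z0, dtype=float)  # copy so the caller's initial guess is untouched
    groups = None
    history = []  # objective value after each iteration
    for _ in range(max_iter):
        new_groups = assign_groups(X, Z)
        if groups is not None and np.array_equal(new_groups, groups):
            break  # assignments unchanged, so the centroids would not move either
        groups = new_groups
        Z = update_centroids(X, groups, Z)
        history.append(j_clust(X, groups, Z))
    return groups, Z, history

# Part (a): 15 synthetic 2-vectors scattered around three assumed centres
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((5, 2)) for c in centres])
Z0 = X[[0, 5, 10]]  # initial guess: one data point per assumed centre, k = 3

groups, Z, history = kmeans_sketch(X, Z0)
print(groups)
print(history)
```

The returned history list holds one Jclust value per iteration, so it can be passed to a plotting routine directly. Trying a different Z0 (for example three points from the same cloud) is a quick way to see how the initial guess affects the result.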

Exercise 2: Topic discovery via k-means

In this exercise, you will use the k-means algorithm to cluster 300 Wikipedia articles selected from 5 broad groups of topics. The CSV file article_histograms.csv contains the histograms of all 300 articles as a list of 300 1000-vectors. Each element (x_i)_j of vector x_i is the number of times word j from the dictionary appears in article i. The CSV file article_titles.csv provides the list of article titles, and the CSV file dictionary.csv contains the list of 1000 words used to create the histograms. The Jupyter notebook file Tutorial4_Problem2_Student_version.ipynb includes the code to import the data into Python.
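
To make the histogram representation concrete, the following sketch builds a word-count vector for a short piece of text. The three-word dictionary, the sample sentence, the function name word_histogram, and the simple lowercase-and-split tokenisation are all assumptions for illustration; the provided CSV files were prepared in advance and may have been tokenised differently.

```python
from collections import Counter

def word_histogram(text, dictionary):
    """Return a list whose j-th entry counts occurrences of dictionary[j] in text."""
    counts = Counter(text.lower().split())  # naive tokenisation: split on whitespace
    return [counts[word] for word in dictionary]  # missing words count as 0

# Hypothetical toy dictionary and article text
dictionary = ["data", "model", "music"]
article = "The model fits the data and the data fits the model"

x = word_histogram(article, dictionary)
print(x)
```

In the exercise, each article is represented the same way, but with a 1000-word dictionary, so every x_i has 1000 entries.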

(a) Use the Kmeans_alg function you have written in Exercise 1 to perform k-means clustering.

(b) (If time allows) For each of k = 2, k = 5, and k = 10 run k-means twice, and plot Jclust (vertically) versus iteration (horizontally) for the two runs on the same plot. Comment briefly on your results.

(c) (If time allows) Choose a value of k from part (b) and investigate your results by looking at the words and article titles associated with each centroid. Feel free to visit Wikipedia if an article’s content is unclear from its title. Give a short description of the topics your clustering discovered, along with the three most common words from each topic. If the topics do not make sense, pick another value of k.
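
One way to inspect a centroid in part (c) is to sort its entries and look up the corresponding dictionary words, and to list the articles closest to it. The sketch below does this on a tiny made-up data set. The six-word dictionary, the four titles, the counts, and the helper names top_words and closest_titles are hypothetical, and the centroids are simply computed as group means here rather than taken from an actual k-means run.

```python
import numpy as np

# Hypothetical toy stand-ins for dictionary.csv, article_titles.csv
# and article_histograms.csv
dictionary = ["goal", "team", "match", "cell", "gene", "protein"]
titles = ["Football", "Rugby", "DNA", "Enzyme"]
X = np.array([
    [5, 4, 3, 0, 0, 0],
    [3, 5, 4, 0, 1, 0],
    [0, 0, 0, 4, 6, 3],
    [0, 1, 0, 3, 2, 6],
], dtype=float)

# Pretend these two centroids came out of a k-means run with k = 2
Z = np.array([X[:2].mean(axis=0), X[2:].mean(axis=0)])

def top_words(z, dictionary, n=3):
    """Words whose entries in centroid z are largest, in decreasing order."""
    order = np.argsort(z)[::-1][:n]
    return [dictionary[i] for i in order]

def closest_titles(z, X, titles, n=2):
    """Titles of the n articles whose histograms are nearest to centroid z."""
    d = np.sum((X - z) ** 2, axis=1)
    return [titles[i] for i in np.argsort(d)[:n]]

for j, z in enumerate(Z):
    print(j, top_words(z, dictionary), closest_titles(z, X, titles))
```

With the real data, the same two helpers could be applied to each row of the centroid array returned by your own k-means function.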