Many people say a computer can only do what it's been told to do. Well, it's true, it may start off that way, but it is only the start. A computer can be made to learn. – attributed to Alan Turing in Hugh Whitemore's play Breaking the Code

K-Means

K-Means is a simple clustering algorithm for unsupervised learning. Because AIMA Python does not include an implementation of this algorithm, and the book does not describe it, you will use a more mainstream implementation supported by the SciPy library.
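As a starting point, here is a minimal sketch of one-dimensional clustering with SciPy's kmeans2. The data and parameters below are invented for illustration; they are not the ones used in lab1a.py.

```python
# Minimal 1-D K-Means sketch using SciPy (illustrative data, not lab1a.py's).
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Two well-separated 1-D clusters; SciPy expects a 2-D (n_points, n_dims) array.
data = np.concatenate((rng.normal(0.0, 0.5, 100),
                       rng.normal(10.0, 0.5, 100))).reshape(-1, 1)

# centroids are the learned cluster centers; labels classify each data point.
centroids, labels = kmeans2(data, 2, minit='points', seed=0)
print(sorted(centroids.ravel()))  # should be roughly [0, 10]
```

Note that kmeans2 assigns every point to exactly one center (a hard classification), which is the behavior EM will later relax.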

Exercise 9.1

Do the following exercises with K-Means clustering.

  1. Download the following sample code, which implements a one-dimensional K-means clustering problem: lab1a.py. Make sure that you can answer the following questions:

    1. What are the data and what do the cluster centers and classifications represent?
    2. Sometimes this code throws an error. What is the error and why do you get it?
    3. What happens if you configure K-Means to look for 2 cluster centers? 4? Why?
  2. Download this two-dimensional clustering problem: lab1b.py and, again, make sure that you can answer the following questions:

    1. What are the data and what do the cluster centers and classifications represent?
    2. How would you decide what the right number of clusters should be? How about what the best solution is?
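One common way to approach the "right number of clusters" question is the elbow heuristic: run K-Means for several values of k and watch how the distortion falls. The sketch below uses SciPy's kmeans on synthetic two-dimensional data; the cluster positions and counts are illustrative, not taken from lab1b.py.

```python
# Elbow-heuristic sketch: distortion drops sharply until k matches the
# true number of clusters, then levels off. (Illustrative data only.)
import numpy as np
from scipy.cluster.vq import kmeans, whiten

rng = np.random.default_rng(1)
# Two well-separated 2-D clusters.
data = np.concatenate((rng.normal((0.0, 0.0), 0.5, (100, 2)),
                       rng.normal((5.0, 5.0), 0.5, (100, 2))))

obs = whiten(data)  # rescale each column to unit variance, as SciPy recommends
distortions = []
for k in range(1, 5):
    codebook, distortion = kmeans(obs, k, seed=2)
    distortions.append(distortion)
    print(k, round(distortion, 3))
```

The big drop from k=1 to k=2, followed by only small gains, suggests that two clusters fit these data best.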

You can find more detailed explanations of K-Means elsewhere: k-means clustering; Thrun’s lecture on Unsupervised Learning.

Expectation Maximization (EM)

The EM algorithm generalizes K-Means: instead of assigning each point to exactly one cluster, it computes soft, probabilistic assignments (responsibilities) in the E-step and re-estimates the mixture parameters from them in the M-step.
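To make the connection concrete, here is a minimal NumPy sketch of EM for a two-component one-dimensional Gaussian mixture. Everything here (data, initial guesses, iteration count) is illustrative; lab2a.py may be structured quite differently.

```python
# Minimal 1-D EM sketch for a two-component Gaussian mixture (illustrative).
import numpy as np

rng = np.random.default_rng(3)
data = np.concatenate((rng.normal(-4.0, 1.0, 200), rng.normal(4.0, 1.0, 200)))

# Initial guesses for the mixture weights, means, and variances.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: soft responsibilities instead of K-Means' hard assignments.
    dens = w / np.sqrt(2 * np.pi * var) * \
        np.exp(-(data[:, None] - mu) ** 2 / (2 * var))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters from the responsibility-weighted data.
    nk = resp.sum(axis=0)
    w = nk / len(data)
    mu = (resp * data[:, None]).sum(axis=0) / nk
    var = (resp * (data[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.round(np.sort(mu), 1))  # means should be close to [-4, 4]
```

If you replaced the soft responsibilities with 0/1 assignments and kept the variances fixed and equal, this loop would reduce to K-Means.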

Exercise 9.2

Do the following exercises with EM clustering.

  1. Download the following sample code, which implements a simple Gaussian mixture model: lab2a.py. Make sure that you can answer the following questions:

    1. How many components does this Gaussian mixture model contain?
    2. How are the data points generated? Try modifying the mixture to make sure you understand how it is built.
    3. What are the hidden variables for this Gaussian mixture model?
    4. How close does EM get to the true mixture model?
  2. Modify the code from the previous exercise to match the example output shown in the text in Figure 20.11. You can make Gaussian mixture models with more than one component by:

    • Changing the value of components.
    • Adding random data points for each new component and concatenating them into one data set. For example, the following code produces data drawn from a two-component Gaussian mixture:
      numpy.concatenate((sigma1 * numpy.random.randn(n1, 2) + mu1, 
                         sigma2 * numpy.random.randn(n2, 2) + mu2))

      The mu and sigma values for each cluster can differ, and the number of points generated for each cluster determines the mixture weights.

    You won't be able to match the model exactly, but you should produce something close to what is shown in the figure.
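For reference, here is a self-contained version of the data-generation snippet above, with illustrative values filled in for the mu, sigma, and point-count variables (these are not the values from Figure 20.11):

```python
# Self-contained two-component mixture data (illustrative parameter values).
import numpy

n1, n2 = 300, 100                 # 3:1 ratio => mixture weights of about 0.75 / 0.25
mu1, sigma1 = numpy.array([0.0, 0.0]), 1.0
mu2, sigma2 = numpy.array([4.0, 4.0]), 0.5

data = numpy.concatenate((sigma1 * numpy.random.randn(n1, 2) + mu1,
                          sigma2 * numpy.random.randn(n2, 2) + mu2))
print(data.shape)  # (400, 2)
```

Adding a third component is just one more `randn` term in the concatenation, plus its own mu and sigma.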

Now try EM on a real dataset: the well-known iris dataset, which is also included in the AIMA data distribution.

Exercise 9.3

Download the following sample code, lab3.py, which loads the iris dataset from a pickled file, iris.txt. Modify the code to use a Gaussian mixture model to classify the basic flower types represented in the iris data.

Do your learned weights, means, and variances match the actual structure of the data? How would you check this?

This exercise uses pickle, a Python tool for dumping and loading data. You’ll use this in the homework as well.
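If pickle is new to you, the whole round trip looks like this; the filename and data below are made up for the demo and have nothing to do with iris.txt.

```python
# Minimal pickle round trip: dump an object to disk, then load it back.
import pickle

data = {"setosa": [5.1, 3.5, 1.4, 0.2]}   # hypothetical record, for illustration
with open("pickle_demo.pkl", "wb") as f:
    pickle.dump(data, f)                   # serialize to a binary file

with open("pickle_demo.pkl", "rb") as f:
    loaded = pickle.load(f)                # deserialize back into Python objects
print(loaded == data)  # True
```

Note that files must be opened in binary mode ("wb"/"rb") for pickle to work.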

Checking in

Submit your source code as specified above in Moodle under lab 9.
