Build a spam filter based on Paul Graham’s A Plan for Spam. You’ll find a sketch of his statistical algorithm roughly one-fifth of the way through the article.
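
As a rough illustration of the scoring step in that sketch, the following Python combines per-word spam probabilities into a single score for a message. The function names, the choice to score distinct tokens, and the sample table are assumptions made for illustration. The 15-token limit and the 0.4 default for unseen words are taken from the article's description and should be checked against it.

```python
from math import prod


def combined_probability(probs):
    """Combine individual word probabilities as p / (p + q),
    where p is the product of the probabilities and q is the
    product of their complements."""
    p = prod(probs)
    q = prod(1.0 - x for x in probs)
    return p / (p + q)


def message_spam_probability(tokens, table, unknown=0.4, top_n=15):
    """Score a message from a word -> probability table.

    Assumption: each distinct token is scored once; words missing
    from the table get the `unknown` probability.
    """
    probs = [table.get(token, unknown) for token in set(tokens)]
    # "Interesting" tokens are those farthest from the neutral value 0.5.
    probs.sort(key=lambda x: abs(x - 0.5), reverse=True)
    return combined_probability(probs[:top_n])


# Hypothetical table, for demonstration only.
example_table = {"spam": 0.99, "eggs": 0.01, "like": 0.33}
print(message_spam_probability(["spam", "spam", "like"], example_table))
```

A message would then be classified as spam when this score exceeds a chosen cutoff; the article uses 0.9.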

Include in your solution, as one test case, the probability tables for the words in the following hard-coded SPAM/HAM corpus (and only this corpus) using a minimum count threshold of 1 (rather than the 5 used in the algorithm):
```
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]
```
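One way the per-word probability table might be computed for this corpus is sketched below. It follows the article's formula as described there: ham counts are doubled, each ratio uses the number of messages in the corpus as its denominator and is capped at 1, and the result is clamped to the range 0.01 to 0.99. The function name `word_probabilities` and the decision to treat "I" and "i" as different tokens are assumptions; treat this as a starting point rather than a reference solution.

```python
from collections import Counter

spam_corpus = [["I", "am", "spam", "spam", "I", "am"],
               ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"],
              ["i", "do"]]


def word_probabilities(spam, ham, min_count=1):
    """Return a dict mapping each word to its estimated spam probability."""
    bad = Counter(word for message in spam for word in message)
    good = Counter(word for message in ham for word in message)
    nbad, ngood = len(spam), len(ham)  # number of messages, not words

    table = {}
    for word in set(bad) | set(good):
        g = 2 * good[word]  # ham counts are doubled to bias against false positives
        b = bad[word]
        if g + b < min_count:
            continue  # too few occurrences to estimate a probability
        bad_ratio = min(1.0, b / nbad)
        good_ratio = min(1.0, g / ngood)
        p = bad_ratio / (good_ratio + bad_ratio)
        table[word] = max(0.01, min(0.99, p))
    return table


table = word_probabilities(spam_corpus, ham_corpus, min_count=1)
for word in sorted(table):
    print(f"{word:10s} {table[word]:.4f}")
```

Words that appear only in the spam corpus end up at the 0.99 ceiling and words that appear only in the ham corpus at the 0.01 floor, which makes the table easy to check by hand.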
Graham argues that this is a Bayesian approach to SPAM. What makes it Bayesian?

Do the following exercises based on the Bayesian network shown in Figure 14.12a:
1. Implement the network using the AIMA Python tools.
2. Compute the number of independent values in the full joint probability distribution for this domain. Assume that no conditional independence relations are known to hold among the variables.
3. Compute the number of independent values in the Bayesian network for this domain. Assume the conditional independence relations implied by the Bayesian network.
4. Compute probabilities for the following:
  1. P(Cloudy)
  2. P(Sprinkler | cloudy)
  3. P(Cloudy | the sprinkler is running and it’s not raining)
  4. P(WetGrass | it’s cloudy, the sprinkler is running and it’s raining)
  5. P(Cloudy | the grass is not wet)
  Provide both computer-generated solutions and hand-worked derivations of how these numbers are computed.
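
To cross-check the hand-worked derivations, a dependency-free sketch of inference by enumeration is shown below. It does not use the AIMA Python tools that the exercise asks for (the aima-python `probability` module provides `BayesNet` and `enumeration_ask` for that purpose). The network structure and the conditional probability values here are assumptions about what the figure shows and must be verified against Figure 14.12a. The names `parents`, `cpt`, `joint`, and `ask` are hypothetical.

```python
from itertools import product

# Assumed structure: Cloudy -> Sprinkler, Cloudy -> Rain,
# and Sprinkler, Rain -> WetGrass.
order = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

# ASSUMED values of P(variable = True | parent values); check the figure.
cpt = {
    "Cloudy": {(): 0.50},
    "Sprinkler": {(True,): 0.10, (False,): 0.50},
    "Rain": {(True,): 0.80, (False,): 0.20},
    "WetGrass": {(True, True): 0.99, (True, False): 0.90,
                 (False, True): 0.90, (False, False): 0.00},
}


def joint(world):
    """Probability of one complete assignment, via the chain rule."""
    p = 1.0
    for var in order:
        p_true = cpt[var][tuple(world[parent] for parent in parents[var])]
        p *= p_true if world[var] else 1.0 - p_true
    return p


def ask(query, evidence):
    """Return P(query | evidence) by summing the joint over all
    assignments consistent with the evidence, then normalizing."""
    dist = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(order)):
        world = dict(zip(order, values))
        if all(world[var] == val for var, val in evidence.items()):
            dist[world[query]] += joint(world)
    total = dist[True] + dist[False]
    return {value: p / total for value, p in dist.items()}


# Independent values: the full joint needs 2^n - 1 numbers; the network
# needs one number per row of each Boolean variable's table.
full_joint_values = 2 ** len(order) - 1
network_values = sum(2 ** len(p) for p in parents.values())

print(full_joint_values, network_values)
print(ask("Cloudy", {"Sprinkler": True, "Rain": False}))
print(ask("Cloudy", {"WetGrass": False}))
```

Enumerating all 16 assignments is only practical because the network is tiny, but it mirrors the hand derivation directly: each query is a ratio of sums of chain-rule products.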

Checking in

Submit a Jupyter notebook (homework2.ipynb). We will grade your work according to the following criteria:

75% — Exercise 1
25% — Exercise 2

See the policies page for homework due dates and times.