The purpose of today's lab is to practice writing and using functions, and to reinforce the use of selection and repetition. Throughout the course of the exercises we will be working in a single file creating functions common to text analysis.

Textual Analysis of Green Eggs and Ham

The first step in text analysis is to find and format an appropriate text. Your analysis will focus on the popular children’s book Green Eggs and Ham by Theo Geisel, aka Dr. Seuss, but the techniques can be applied to any text.

Exercise 6.1

  1. Begin by creating a lab06 folder in your cs108 project if you have not already done so.
  2. Create a new python file called text_analysis.py
  3. Add appropriate documentation at the top of the file.
  4. Copy this definition of a list of strings into your program: GreenEggsAndHam.txt

Verify that your definition is usable by adding code that prints out the total number of words in the list. Mark this test in your file with a comment above the test that says "Exercise 6.1".

The plan for today is to implement a number of a different tasks that are common in textual analysis. For each task, we will create a general purpose function that could be used for whichever text in which we are interested. We will then test our function using several different lists, and finally use the functions to analyze Green Eggs and Ham.

Test Lists

When thinking about writing functions that will operate on data they receive, it is good practice to consider as many different kinds of data as we can. Start by asking these questions:

Exercise 6.2

The functions we write today are intended to work on lists of strings. Go to the top of your file (above your definition of the green eggs and ham list) and create 3 more lists that will be used for testing your functions. Include an empty list (i.e. [] ), a list of 1 element, and a then a list of 3 or 4 elements. Put a comment above these definitions that indicates these are for Exercise 6.2.

We are now ready to start thinking about the kinds of textual analysis we would like to have available.

Linear Search

Searching a list is such a common computational task that many languages (including Python) include a means to perform this operation. However, it is also a great opportunity to practice selection, repetition and functions, so today we will create our own version.

A linear search algorithm searches for a target in a list of elements, returning the position of the first occurrence of the target in the list. If it doesn't find the target, it returns -1. It works as follows:

  1. Given: a list to search, called str_list, and a target to find, called target
  2. Initialize position indicator to 0.
  3. Loop for each element in list:
    1. If target is the same as the current element then
      1. Return position indicator.
    2. Increment position indicator.
  4. Return -1.

Notice that this algorithm will appropriately return -1 for an empty list as well as when the target word is not present.

Exercise 6.3
  1. Write a function called search that implements the linear search algorithm above to search through a list of words looking for a given target word. Put this function below the definition of the green eggs and ham text. Your function should receive a list of strings and a target string from the calling program and return an index at which it finds the target word or -1 if it doesn't find the target word. Be sure to document your function appropriately. Note that this function does not return the indices of all instances of the target word, but rather just the first.
  2. To test that your search function is correct, create a function called test_search that will perform several tests of our search function. This function should not receive any inputs, and then print out the results of our tests.
  3. Add a couple tests that call the search function using your test lists (created in exercise 6.2) and targets such that the result should be -1 and where the result should give an index in the list. There should be at least 4 calls to the search function in this test function.
  4. To actually run your tests, add a call to the test search function at the very bottom of your file (below the computation of the length of green eggs and ham).
  5. Finally, create a function for doing the analysis of Green Eggs and Ham. Name this function green_eggs_and_ham_analysis and put it below your definition of test_search(). It should not receive any arguments, and should print out an index for each of the following words:
    • i
    • boat
    • fox
    • thank
    • book
    Call this analysis function at the bottom of your file, after calling the test function.

This is the approach we will take for each of our analyses. We will create a function to do the analysis, a function to test the analysis, and update a function to run the analysis on the text of green eggs and ham. At the bottom of the file, we will run our tests, and then the analysis of interest.

Computing Statistics

There are a number of different statistics than we would like to be able to compute for a text.

Exercise 6.4
  1. Write an algorithm for a function called longest_and_shortest that receives a list of words and prints the length of the shortest and the longest word in the list. The function does not return anything. Put your algorithm in your code file as a multi-line comment. When you’re satisfied that your algorithm is correct, implement it directly below your algorithm.
  2. When you have an implementation, write a test function that calls the function on each of your test lists. Note:
    • The length of the shortest and longest words should be the same for the list of one word.
    • An empty list should print an error message such as "The given list was empty".
  3. Finally, add a call to this function in your green_eggs_and_ham_analysis function.

Natural languages use certain utility words such as "a" "an" and "the" over and over again. These are commonly called stop words. Search engines and other text analysis tools generally ignore them because they tend to be peripheral to the central meaning of the text.

Exercise 6.5
  1. Start by copying this definition of a list of stop words: stopWords.txt. Place this definition below your definition of the Green Eggs and Ham list, but before any function definitions.
  2. Write an algorithm for a function called count_words_not_present that counts the number of words in one list that are not found in another list. Your function should receive two lists of words and should return the number of words in the first list that are not in the second list. You do not need to keep track of words that you’ve seen before; if a word in the first list occurs more than once, count it multiple times.
    Make use of the search method you implemented in Exercise 6.3 for this algorithm!!!
    As in Exercise 6.4, put your algorithm in a multi-line comment in your code file. When you are confident in your algorithm, implement it and leave your algorithm in place just above the function definition.
  3. When you have an implementation, write a test function that calls the function using each of your test lists. Since the function requires two lists as arguments, you will need to use your lists in combination (you may use the stop words lists as you like). Finally, use this function in your green_eggs_and_ham_analysis function to count the number of words in green eggs and ham that are not stop-words.

Extra Credit Exercise

Green Eggs and Ham is famous for only using 50 words (51 if you count the proper name “Sam-i-am”). This statistic can be computed as well, as shown in this optional exercise.

Exercise 6.6 (Extra-credit)

Write a function that counts the number of unique words in the text. Consider all the words, including stop words. This problem requires that you keep track of the unique words you’ve seen so far. The following algorithm implements this behavior:

  1. Given: a list of strings, called str_list.
  2. Set count = 0;
  3. Create an empty list called unique_words
  4. Loop for all words in str_list
    1. If the current word is not already in uniqueWords then
      1. Add the current word to unique_words.
      2. Increment count by 1.
  5. Return count.

Notice a couple things about this algorithm before implementing it:

Implement this algorithm in a function called count_unique_words and create an appropriate test function. Document your function in the standard way (not with the algorithm), and add a call to the analysis of green eggs and ham to verify the number of unique words in the text.

Checking In

Submit all the code and supporting files for the exercises in this lab. We will grade this exercise according to the following criteria:

Submit your solutions to these lab exercises using your appropriate submission environment.

If you worked with a partner, make sure you both have a copy of the files.

If you’re working on a lab computer, don’t forget to log off of your machine when you are finished!