The purpose of today's lab is to practice writing and using functions, and to reinforce the use of selection and repetition. Throughout the course of the exercises we will be working in a single file creating functions common to text analysis.
The first step in text analysis is to find and format an appropriate text. Your analysis will focus on the popular children’s book Green Eggs and Ham by Theo Geisel, aka Dr. Seuss, but the techniques can be applied to any text.
lab06
folder in your cs108
project if you have not already done so.
text_analysis.py
Verify that your definition is usable by adding code that prints out the total number of words in the list. Mark this test in your file with a comment above the test that says "Exercise 6.1".
The plan for today is to implement a number of a different tasks that are common in textual analysis. For each task, we will create a general purpose function that could be used for whichever text in which we are interested. We will then test our function using several different lists, and finally use the functions to analyze Green Eggs and Ham.
When thinking about writing functions that will operate on data they receive, it is good practice to consider as many different kinds of data as we can. Start by asking these questions:
The functions we write today are intended to work on lists of
strings. Go to the top of your file (above your definition of the
green eggs and ham list) and create 3 more lists that will be used
for testing your functions. Include an empty list (i.e. []
), a list of 1 element, and a then a list of 3 or 4 elements. Put a
comment above these definitions that indicates these are for Exercise
6.2.
We are now ready to start thinking about the kinds of textual analysis we would like to have available.
Searching a list is such a common computational task that many languages (including Python) include a means to perform this operation. However, it is also a great opportunity to practice selection, repetition and functions, so today we will create our own version.
A linear search algorithm searches for a target in a list of elements, returning the position of the first occurrence of the target in the list. If it doesn't find the target, it returns -1. It works as follows:
str_list
, and a target to find, called target
Notice that this algorithm will appropriately return -1 for an empty list as well as when the target word is not present.
search
that implements the linear search algorithm above to search through
a list of words looking for a given target word. Put this function
below the definition of the green eggs and ham text. Your function
should receive a list of strings and a target string from the
calling program and return an index at which it finds the target
word or -1 if it doesn't find the target word. Be sure to document
your function appropriately. Note that this function does not return
the indices of all instances of the target word, but rather just the
first.
test_search
that will perform several tests of our search function.
This function should not receive any inputs, and then print out the
results of our tests.
green_eggs_and_ham_analysis
and put it below your definition of test_search()
. It
should not receive any arguments, and should print out an index for
each of the following words:
This is the approach we will take for each of our analyses. We will create a function to do the analysis, a function to test the analysis, and update a function to run the analysis on the text of green eggs and ham. At the bottom of the file, we will run our tests, and then the analysis of interest.
There are a number of different statistics than we would like to be able to compute for a text.
longest_and_shortest
that receives a list of words and prints the length
of the shortest and the longest word in the list. The function does
not return anything. Put your algorithm in your code file as a
multi-line comment. When you’re satisfied that your algorithm
is correct, implement it directly below your algorithm.
green_eggs_and_ham_analysis
function.
Natural languages use certain utility words such as "a" "an" and "the" over and over again. These are commonly called stop words. Search engines and other text analysis tools generally ignore them because they tend to be peripheral to the central meaning of the text.
count_words_not_present
that counts the number of words in one list that are not found in
another list. Your function should receive two lists of words and
should return the number of words in the first list that are not in
the second list. You do not need to keep track of words that you’ve
seen before; if a word in the first list occurs more than once,
count it multiple times. green_eggs_and_ham_analysis
function to count the number of words in green eggs and ham that are
not stop-words.
Green Eggs and Ham is famous for only using 50 words (51 if you count the proper name “Sam-i-am”). This statistic can be computed as well, as shown in this optional exercise.
Write a function that counts the number of unique words in the text. Consider all the words, including stop words. This problem requires that you keep track of the unique words you’ve seen so far. The following algorithm implements this behavior:
str_list
.count
= 0;unique_words
str_list
uniqueWords
then
count
.Notice a couple things about this algorithm before implementing it:
Implement this algorithm in a function called count_unique_words
and create an appropriate test function. Document your function in
the standard way (not with the algorithm), and add a call to the
analysis of green eggs and ham to verify the number of unique words
in the text.
Submit all the code and supporting files for the exercises in this lab. We will grade this exercise according to the following criteria:
Submit your solutions to these lab exercises using your appropriate submission environment.
If you worked with a partner, make sure you both have a copy of the files.
If you’re working on a lab computer, don’t forget to log off of your machine when you are finished!