cs108 → strings → homework07

Your instructor may assign one or more of the following problems. You need not do all of them; you are only required to do those that are explicitly assigned via Moodle.

Biologists use a sequence of the letters A, C, T and G to model a genome. A gene is a substring of a genome that starts after a triplet ATG and ends before a triplet TAG, TAA or TGA. Furthermore, the length of a gene string is a multiple of 3, and the gene does not contain any of the triplets ATG, TAG, TAA or TGA. Write a program called find_genes.py that prompts the user to enter a genome and displays all genes in the genome. If no gene is found in the input sequence, the program displays “no gene is found”. Below are a couple of sample runs:
```
	==================================================================
	Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT
	Gene 1: TTT
	Gene 2: GGGCGT
	==================================================================
	Enter a genome string: TGTGTGTATAT
	no gene is found
	==================================================================
```
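One possible approach to the gene search is sketched below; it is a starting point under stated assumptions, not the required solution. For every ATG it looks for the nearest following stop triplet and keeps the text in between if it is non-empty, has a length that is a multiple of 3, and contains no further ATG. The names `find_genes`, `format_report` and `main` are hypothetical, and the sketch assumes that the forbidden triplets may not appear at any offset inside a gene, that empty genes are not reported, and that the genome is entered in uppercase.

```python
STOP_TRIPLETS = ("TAG", "TAA", "TGA")


def find_genes(genome):
    """Return a list of the genes found in genome (assumed uppercase)."""
    genes = []
    start = genome.find("ATG")
    while start != -1:
        gene_start = start + 3
        # Position of the nearest stop triplet at or after gene_start.
        stops = [genome.find(stop, gene_start) for stop in STOP_TRIPLETS]
        stops = [pos for pos in stops if pos != -1]
        if stops:
            candidate = genome[gene_start:min(stops)]
            # Assumption: skip empty candidates and any that contain ATG.
            if candidate and len(candidate) % 3 == 0 and "ATG" not in candidate:
                genes.append(candidate)
        start = genome.find("ATG", start + 1)
    return genes


def format_report(genome):
    """Return the output lines for one run, in the sample-run format."""
    genes = find_genes(genome)
    if not genes:
        return ["no gene is found"]
    return [f"Gene {number}: {gene}" for number, gene in enumerate(genes, start=1)]


def main():
    """Prompt for a genome and print the report; call main() to run it."""
    genome = input("Enter a genome string: ")
    for line in format_report(genome):
        print(line)
```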
Write a function that assigns the appropriate part of speech (POS) to each word in a sentence. It should receive a sentence like “John kicked the dog.”, and return the list of POS tags: ['n', 'v', 'd', 'n'], indicating that John is a noun, kicked is a verb, the is a determiner and dog is another noun. Include “unknown” for words that don’t match any known words. Helpful functions for this include string.split(), which creates a list of strings corresponding to the words in the sentence, and string.endswith(subString), which checks to see if a string ends with a given sub-string.

The sentence could contain different forms of the words. For example, the nouns could include “dog” and “dogs” and the verbs could include “kick”, “kicks” and “kicked”. Thus, your system should stem each word, that is, remove known suffixes in order to find the unadorned stem of the word. Ignore all punctuation in the input sentence.

Your system should support the following words:
- Nouns: John, Mary, dog, cat
- Verbs: kick, help, call, need
- Determiners: a, an, the
It should also support the following suffixes: -s, -es, -ed.

To make this problem easier to solve, you may make the following assumptions:
- Consider only the words listed above; ignore irregular forms, which don’t use standard stems (e.g., is/am/are, ox/oxen), and words that end in “e” or “s”, which would mess up the simple stemming algorithm you’re building (e.g., loves → lov, Chris → Chri).
- Assume that any suffix could work on any word, regardless of whether it is a noun, verb or determiner, e.g., your system would label Johned (→ John) as a noun, even though -ed is not a valid suffix for nouns.
- Don’t worry about word order, e.g., “John Mary helps.” will return ['n', 'n', 'v'] even though the sentence is ungrammatical; you’re focusing on isolated POS tagging only.
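
The tagging and stemming rules above could be sketched roughly as follows; this is one illustrative design, not the expected solution. The names `LEXICON`, `SUFFIXES`, `stem` and `tag_sentence` are hypothetical. The sketch assumes that matching is case-insensitive (so “The” and “the” get the same tag) and that only ASCII punctuation needs to be removed. A word is looked up unchanged before any suffix is stripped, so that “need” is not reduced to “ne”.

```python
import string

# Known words mapped to their POS tags (stored in lowercase by assumption).
LEXICON = {
    "john": "n", "mary": "n", "dog": "n", "cat": "n",
    "kick": "v", "help": "v", "call": "v", "need": "v",
    "a": "d", "an": "d", "the": "d",
}
SUFFIXES = ("es", "ed", "s")


def stem(word):
    """Return the known stem of word, or word itself if none is found."""
    if word in LEXICON:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in LEXICON:
            return word[:-len(suffix)]
    return word


def tag_sentence(sentence):
    """Return the list of POS tags for the words in sentence."""
    # Remove ASCII punctuation, then split the sentence on whitespace.
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return [LEXICON.get(stem(word.lower()), "unknown") for word in cleaned.split()]


print(tag_sentence("John kicked the dog."))
```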

Write a function that returns the longest common prefix of two strings. This function should receive two strings and return a string. If the strings have no common prefix, the function should return an empty string. Put this function in a program called find_prefix.py that prompts the user to enter two strings and then displays their common prefix.
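
A minimal sketch of the prefix comparison is shown below, offered as one way to think about the problem rather than the required solution. It walks both strings in parallel and stops at the first mismatch. The names `common_prefix` and `main`, and the wording of the prompts, are assumptions; the comparison is case-sensitive.

```python
def common_prefix(first, second):
    """Return the longest common prefix of first and second."""
    length = 0
    # zip stops at the end of the shorter string.
    for char_a, char_b in zip(first, second):
        if char_a != char_b:
            break
        length += 1
    return first[:length]


def main():
    """Prompt for two strings and print their prefix; call main() to run it."""
    first = input("Enter the first string: ")
    second = input("Enter the second string: ")
    print("Common prefix:", common_prefix(first, second))
```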

Checking In

Submit all appropriate files for grading, including code files, screen captures, supplemental files (e.g., image files), and text files. We will grade this exercise according to the following criteria:

Correctness:
- 25% - Interactive behavior is as specified.
- 60% - Basic functionality - Include the basic behavior required by the problem specification.
Understandability:
- 5% - Code Documentation - Separate the logical blocks of your program with useful comments and white space.
- 5% - Header Documentation - Document the code’s basic purpose, authors and assignment number.
- 5% - Documentation strings - Include appropriate documentation strings for all functions.