Calvin University CS 112 Project 9

CS 112 Project 9: Binary Search Trees

Objectives:

Practice building BST operations.
Practice building recursive operations.
Learn more about binary search trees.

Introduction

This week's project is a single person project. The project is to:

Add a new method to your BST class:
- bst.getHeight() should return the height of the tree in bst. For example, if bst is empty, it should return 0; if bst contains a single item, it should return 1; if it contains 2 items, it should return 2; if it is balanced and contains 3 items, it should return 2, otherwise, it should return 3; if it is balanced and contains 4 items, it should return 3, otherwise, it should return 4; etc.
- Note that your implementation should not assume that the tree is balanced. Your algorithm must be correct for all trees.
Convert your BST into a template; and
Use your BST<Item> to conduct an experiment.

As usual, you should use test-drive development to accomplish these tasks.

The Experiment

The height of a BST determines the speed of operations like insert() and contains(). If we build a "large" BST of N items (e.g., ~2²⁰), and that tree is balanced, then its height should be the minimal O(lg(N)) (e.g., 20).

This week's problem is to conduct a simple experiment to see how "tall" BSTs get, given random values. Our objective is to see how close we come to lg(N) height if we build large trees of N randomly ordered values.

In the Maroon lab, the folder:

/home/cs/112/proj/09

contains ten data files (randomInts01.txt, randomInts02.txt, ..., randomInts10.txt), each containing 2²⁰-1 = 1,048,575 randomly generated integer values. If these values were ordered in exactly the right way, we could build trees of height 20. (On the other hand, if they were ordered in exactly the wrong way, we could build trees of height 1,048,575!)

Your task is to write a simple program that prompts the user for the name of an input file, reads each value from that file, and inserts it into a BST. Your program should then display the height of the resulting BST.

There are three minor details to keep in mind:

There may be duplicate values within an input file, and inserting a duplicate value into our BST will throw an exception. Your program should count and output the number of duplicate values in a given file by catching and handling this exception.
Each of these input files is about 20 MB in size. If you are working on a machine in the lab (perhaps via remote.cs.calvin.edu), then to avoid killing your disk-space quota, your program should open and read the files directly from that folder. (Hint: use a file's full file-name to open it: e.g., /home/cs/112/proj/09/randomInts01.txt.) If you are working on your own computer, then you can download all the files here:
- Linux / Mac: proj09randomints.zip
- Windows: proj09randomints_windows.zip
You'll have to unzip it and put the files in your project directory/folder. Note that this zip file is still about 100 MB in size.
These files were generated in the Maroon lab (whose workstations are 64-bit machines), and so the integers in these files are 64-bit integers. However, the C++ int and unsigned types only store integers in 32 bits; the long type stores integers in 64 bits, so use it to store these 64-bit integers.

Since there are ten input files, you will need to run your program ten times. (Or find a clever way to automate this.) For each file, record the corresponding tree-height in a spreadsheet, and the number of duplicate values that were found. Given all ten heights, use the spreadsheet to calculate the minimum height, the maximum height, the average height, the median height, and the standard deviation of the ten heights. (Your spreadsheet will make each of these computations easy.) Finally, use your spreadsheet to create a bar-chart showing each of the ten heights, as a visualization of your data. Save this chart in the worksheet where you recorded and analyzed your data.

Finally

Answer the following questions in your spreadsheet, below the chart:

Is lg(N) or N a better approximation for your measured heights? Why?
How much variance is there in the height of these trees? Is this surprising? Why or why not?
How many duplicate values were there (on average)? Is this surprising? Why or why not?
The "big Oh" notation O(f(n)) implies that there are constants a and b, such that a * f(n) + b describes the reality for which O(f(n)) is an approximation. A balanced BST of N items has height O(lg(N)). Theoretically, there should be values a and b such that a * lg(N) + b describes the actual height of a BST built using random data. Using your experimental results, what is a reasonable value for a?

Preparing to Submit

Create a script file in which you use ls to list all your project's files, use cat to display the contents of each file, and build+run your project to demonstrate that it works correctly. Make certain that your script file and your spreadsheet are located in your project folder.

Our grader will grade your submission based on the categories given in this grade sheet, so you should make certain all of those categories are covered.

Submit

An electronic copy of your proj09 folder, copied to your /home/cs/112/current/yourUserName/ folder.

CS > 112 > Projects > 09

This page maintained by Joel Adams.