Homework 10: Plotting MALDI-TOF Data using matplotlib

Warning: This is draft content. Do not start work on this assignment yet.

For this project, your program will read in a data file containing data from a run of the biochem department's MALDI-TOF device. The device shoots lasers at a sample and measures the time-of-flight of ionized molecules. In essence, the result is a kind of mass spectrometry.

Your program will put the MALDI-TOF data into two different lists and will use a package called matplotlib to graph the values.

The file is a CSV file. The first few lines look like:

time,intensity
2668.909,12
2669.102,9
2669.296,5
2669.490,3
2669.684,2

We will use time as the x-value: it is the time at which the sensors in the MALDI-TOF device recorded a reading. The y-value is the intensity of that reading.

Your program must read this file into 2 lists -- a list of the x-values and a list of the y-values -- as required by matplotlib.
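As a rough illustration of the two-list idea, here is a minimal sketch using csv.reader. The helper name read_two_columns and the SAMPLE string are assumptions made for this example; the sample rows are copied from the excerpt above, and your own program would read from the real file instead.

```python
import csv
import io

# Sample rows copied from the excerpt above (assumption: the real file
# has the same two-column layout with one header row).
SAMPLE = """time,intensity
2668.909,12
2669.102,9
2669.296,5
"""


def read_two_columns(lines):
    """Return (time_data, intensity_data) from an iterable of CSV lines.

    Hypothetical helper for illustration only.
    """
    time_data = []
    intensity_data = []
    reader = csv.reader(lines)
    next(reader)  # skip the header row: time,intensity
    for row in reader:
        time_data.append(float(row[0]))       # first column: x-value
        intensity_data.append(float(row[1]))  # second column: y-value
    return time_data, intensity_data


time_data, intensity_data = read_two_columns(io.StringIO(SAMPLE))
print(time_data)       # [2668.909, 2669.102, 2669.296]
print(intensity_data)  # [12.0, 9.0, 5.0]
```

With a real file, the same function could be given an open file object, for example inside a "with open(filename, newline='') as f:" block.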

The data file malditof_data.csv is here. Download that file, as described below.

Step 1. Set up

In Thonny, create a new folder called malditof. Then, create a new file called malditof.py.

Download the malditof_data.csv file (if you haven't already) and put the file in this new folder. Open the file and inspect it a little bit to get a feel for what it looks like.

We’re going to supply the name of the file to use as a command-line argument. Enable “Program Arguments” on the “View” menu in Thonny to show the text box, then put malditof_data.csv in that text box.

Step 2. Set up the main file's areas

Create the areas of the file where you will put imports, CONSTANTS, function declarations, and the main code. Do this the same way we have in previous weeks.

To access the command-line argument:

  1. In the imports section, import sys.
  2. In the main code, use filename = sys.argv[1]
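The two steps above can be sketched as follows. The get_filename helper and the usage message are assumptions added for illustration; the assignment itself only asks for filename = sys.argv[1]. The check simply guards against running the program with the Program Arguments box left empty.

```python
import sys


def get_filename(argv):
    """Return the first command-line argument, or None if it is missing.

    Hypothetical helper: argv[0] is the script name, argv[1] the data file.
    """
    if len(argv) < 2:
        return None
    return argv[1]


filename = get_filename(sys.argv)
if filename is None:
    print("usage: python malditof.py <data file>")
else:
    print("reading from", filename)
```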

Step 3. Write the code

The code should be fairly similar to the code you wrote to plot climate data. You will need to import the matplotlib package like this:

import matplotlib.pyplot as plt

Write the code to display the MALDI-TOF data as a line graph. To do this, read the values from the file into two lists, time_data and intensity_data. time_data holds the first value from each line of the data file, and intensity_data holds the second. Then, call:

plt.plot(time_data, intensity_data, color='red', marker='.', linestyle='-')

Note that when you run your program it may take a while to read in the thousands and thousands of data points and plot them.  Be patient.

Remember to show your plot.

Step 4: Adding a Smoothed Plot

When you look closely at the plot of the data, you'll see the data shows 6 distinct peaks. Or does it? Zoom in on the largest peak, and you will see that the peak is not just one value, but a set of values that jump up and down a bit at the top. Let’s smooth that out.

[Figure: smoothed plot]

You need to add code to your program that puts a second line on the same plot, showing smoothed values of the data in blue. For each point i in the graph, you will display the average of the y-values from i-5 to i+5 -- in other words, the 11 y-values centered on point i. The goal is to "smooth out" the values so that you can better find real individual peaks.

Note: this sounds easy, doesn't it?  But, there is at least one gotcha involved.  What do you do when processing data at the beginning and end where there is no i-5 or i+5 intensity_data value? For this exercise, we’ll only smooth the values that exist: for the point at index 0, we’ll average points 0 through 5; for point 1, we’ll average points 0 to 6, etc.

To do this, it’ll be helpful to make a function that computes the average of a list of numbers:

Name
compute_average(x)
Purpose
Computes the average of a list of numbers.
Parameters
x: a list of floating-point numbers
Return Value
the average of that list (a floating-point number)
Example
compute_average([1.0, 2.0, 3.0]) returns 2.0
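One possible shape for this function is sketched below. It assumes the list is never empty, which the specification above does not address; an empty list would raise a ZeroDivisionError here.

```python
def compute_average(x):
    """Return the average of a list of numbers.

    Assumption: x contains at least one value.
    """
    total = 0.0
    for value in x:
        total += value
    return total / len(x)


print(compute_average([1.0, 2.0, 3.0]))  # 2.0
```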

Hint 1: Create a second list called smoothed_intensity_data, and fill it by averaging the data from intensity_data. Then, your second plot is time_data vs. smoothed_intensity_data.

Hint 2: To get both plots on the same screen, call plt.plot() twice, and then do plt.show() afterward.

Hint 3: The easiest way to do this is to do an index-based loop over intensity_data. In the loop, compute the beginning index and ending index for the values to average together to get the smoothed value. The beginning index is i-5 and the ending index is i+6 if the point is in the middle of the list. (Because a slice excludes its ending index, this covers 5 points on either side of i.) If the point is near the beginning of the list, the beginning index is 0. If the point is near the end of the list, the ending index is len(intensity_data). (Hint: max(idx, limit) and min(idx, limit) can clamp an index to the list.) Once you have the indices computed, you can use a slice to get the sub-list that you will average.
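The index arithmetic in Hint 3 might look something like the sketch below. The function name smooth and the half_width parameter are assumptions for this example, and the average is computed inline so the block runs on its own; in your program you would call your own averaging function on the slice.

```python
def smooth(values, half_width=5):
    """Return a list where each entry averages the neighbors that exist.

    Hypothetical sketch: half_width=5 gives the 11-point window described
    in the text; near the edges the window simply shrinks.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(i - half_width, 0)                  # clamp at the front
        end = min(i + half_width + 1, len(values))      # clamp at the back
        window = values[start:end]                      # slice excludes end
        smoothed.append(sum(window) / len(window))
    return smoothed


# A tiny list with half_width=1 makes the edge handling easy to check by hand.
print(smooth([1.0, 2.0, 3.0, 4.0], half_width=1))  # [1.5, 2.0, 3.0, 3.5]
```

In the small example, index 0 averages only [1.0, 2.0] and index 3 averages only [3.0, 4.0], which mirrors how the real data is treated at its two ends.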

Hint 4: Although you might be able to find code on the Internet or in numpy to accomplish this task, for this task try to actually write it yourself in plain Python.

Test your smoothing code by trying it on a few specially crafted lists. Execute your tests only if DEBUGging is enabled; disable it before you submit.
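A DEBUG-guarded test section could be organized roughly as follows. The names DEBUG, run_smoothing_checks, and the stand-in smooth function are assumptions for this sketch; substitute your own smoothing code. The crafted lists were chosen so the expected answers can be worked out by hand.

```python
DEBUG = True  # assumption: a module-level constant; set to False before submitting


def smooth(values, half_width=5):
    """Stand-in smoother so this sketch runs on its own; use your own code."""
    result = []
    for i in range(len(values)):
        window = values[max(i - half_width, 0):min(i + half_width + 1, len(values))]
        result.append(sum(window) / len(window))
    return result


def run_smoothing_checks():
    """Run hand-checked cases and return how many were checked."""
    # 1. A constant list should come back unchanged.
    assert smooth([4.0] * 20) == [4.0] * 20
    # 2. A list shorter than the window averages everything that exists.
    assert smooth([2.0, 4.0]) == [3.0, 3.0]
    # 3. A single spike of 11.0 at index 10 spreads to 1.0 over indices 5..15.
    spike = [0.0] * 10 + [11.0] + [0.0] * 10
    assert smooth(spike) == [0.0] * 5 + [1.0] * 11 + [0.0] * 5
    return 3


if DEBUG:
    print("smoothing checks passed:", run_smoothing_checks())
```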

Grading Rubric:

Category                                              Max points
Program produces correct plot of data                 4
Program produces correct 2nd plot of smoothed data    4
Good variable names and comments                      2
TOTAL                                                 10

Submit to Moodle only malditof.py. Do not submit the data file.

Extension

Can you do this exercise without using csv.reader()?

In particular, this exercise originally had you work with a space-separated file and .split() the lines yourself. Try downloading the original space-separated file and see if you can adapt your code to work with it. Note that it doesn’t have a header row.
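For the space-separated variant, each line can be taken apart with .split() instead of csv.reader. The sketch below is only an illustration: parse_line is a hypothetical helper, and the sample lines reuse the values from the CSV excerpt with spaces in place of commas, on the assumption that the original file has the same two columns.

```python
def parse_line(line):
    """Split one whitespace-separated line into (time, intensity) floats.

    Hypothetical helper; assumes exactly two numeric fields per line.
    """
    fields = line.split()  # split() with no argument also drops the newline
    return float(fields[0]), float(fields[1])


# Assumed sample lines (no header row, as noted above).
sample_lines = ["2668.909 12\n", "2669.102 9\n", "2669.296 5\n"]

time_data = []
intensity_data = []
for line in sample_lines:
    time_value, intensity_value = parse_line(line)
    time_data.append(time_value)
    intensity_data.append(intensity_value)

print(time_data)       # [2668.909, 2669.102, 2669.296]
print(intensity_data)  # [12.0, 9.0, 5.0]
```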

Acknowledgments

This exercise is based on an exercise developed by Prof. Vic Norman.