Loading and Plotting Data

CS 106

Takeaways of the week

  • Libraries like numpy and matplotlib provide pre-packaged useful functionality
  • Using them builds on fundamental concepts you learned earlier (data types, Boolean expressions, etc.)
  • Files are just named strings; they need to be interpreted (including data types)

This week

Fundamentals of the Python scientific computing ecosystem

  • numpy
  • matplotlib

Fundamentals of data storage

  • CSV files

Array Programming: numpy

aka np, because it’s canonically imported as:

import numpy as np

numpy

  • Numerical computing library for Python
  • Provides the array data type. Like a list but:
    • Automatic for loops!
    • Supports multiple dimensions
  • …and lots of utilities
    • arange: range that makes arrays
    • zeros / ones / full: make new arrays
    • lots of math functions
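
A minimal sketch of the utilities listed above. The variable names and the particular values are made up for illustration.

```python
import numpy as np

# arange: like range, but the result is an array
steps = np.arange(0, 10, 2)        # 0, 2, 4, 6, 8

# zeros / ones / full: new arrays of a given length, pre-filled
blank = np.zeros(3)                # three 0.0 values
units = np.ones(3)                 # three 1.0 values
sevens = np.full(3, 7.0)           # three 7.0 values

# math functions apply to every element
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))   # 1.0, 2.0, 3.0

print(steps, blank, units, sevens, roots)
```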

example

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
x + y
  • [4., 4., 4.]
  • [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
  • error
import numpy as np
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])
x + y
  • [4., 4., 4.]
  • [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
  • error
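
To check your answers, the two cases can be run side by side. This is a small sketch; the names x_list, y_list, x_arr, and y_arr are made up here to keep the list and array versions apart.

```python
import numpy as np

x_list = [1.0, 2.0, 3.0]
y_list = [3.0, 2.0, 1.0]
joined = x_list + y_list            # lists: + concatenates

x_arr = np.array(x_list)
y_arr = np.array(y_list)
summed = x_arr + y_arr              # arrays: + adds element by element

print(joined)   # [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
print(summed)   # [4. 4. 4.]
```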

Arrays have consistent data types

All ints:

np.array([1, 2, 3])
array([1, 2, 3])

All floats:

np.array([1, 2, 3.1])
array([1. , 2. , 3.1])
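
One way to see the data type NumPy picked is the dtype attribute. The sketch below also converts on purpose with astype, which is an addition to the slide; the exact integer type can differ between systems.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.1])      # one float makes the whole array float

print(ints.dtype)    # an integer type (int64 on most systems)
print(mixed.dtype)   # float64

# Converting on purpose with astype; float -> int drops the fractional part
truncated = mixed.astype(int)
print(truncated)     # [1 2 3]
```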

np.arange

Like range, but:

  • makes NumPy arrays
  • allows floats
x = np.arange(0.0, 2.0, .5)
x[:5]
array([0. , 0.5, 1. , 1.5])

Broadcasting (automatic for loops!)

x = np.linspace(0.0, 1.0, 5)
x
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

array plus scalar:

x + 1
array([1.  , 1.25, 1.5 , 1.75, 2.  ])

array plus array:

x + x
array([0. , 0.5, 1. , 1.5, 2. ])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y
array([ 0.0000000e+00,  1.0000000e+00,  1.2246468e-16, -1.0000000e+00,
       -2.4492936e-16])
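
To make the "automatic for loops" idea concrete, here is a sketch that does the same computation twice: once with an explicit loop and once with array arithmetic. The formula value * 2 + 1 is arbitrary.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)

# Explicit loop, collecting results in a plain list
looped = []
for value in x:
    looped.append(value * 2 + 1)

# Same computation, written once for the whole array
vectorized = x * 2 + 1

print(looped)
print(vectorized)   # [1.  1.5 2.  2.5 3. ]
```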

Reduction operations

Reduce the dimensionality of an array (e.g., summing over an axis)

x = np.linspace(0.0, 1.0, 5)
x
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
x.sum()
np.float64(2.5)
x.mean()
np.float64(0.5)
x.max()
np.float64(1.0)
np.argmax(x)
np.int64(4)
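
The examples above reduce a 1-D array to a single number. With a 2-D array you can choose which axis to collapse. A small sketch, using a made-up scores array:

```python
import numpy as np

# Hypothetical data: 2 rows (students) x 3 columns (quizzes)
scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

total = scores.sum()                 # every element -> one number
per_column = scores.sum(axis=0)      # collapse the rows -> one sum per column
per_row = scores.sum(axis=1)         # collapse the columns -> one sum per row

print(total)        # 21.0
print(per_column)   # [5. 7. 9.]
print(per_row)      # [ 6. 15.]
```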

Example: computing error

Suppose we’re fitting a model to some data. The true values are:

y_true = np.array([1., 2., 3.])

And the model predicts:

y_pred = np.array([1.5, 1.5, 3.5])

MAE: mean absolute error: average of absolute differences

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]

MSE: mean squared error: average of squared differences. Its square root is the RMSE: root mean squared error.

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Can you write NumPy code to compute these? (Use np.abs for the absolute value. Python’s built-in abs also works on arrays, but np.abs makes it clear you are working with an array.)

mae = np.abs(y_true - y_pred).mean()
mae
np.float64(0.5)
mse = ((y_true - y_pred) ** 2).mean()
mse
np.float64(0.25)
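
RMSE is not computed above. One way to get it, assuming the same y_true and y_pred, is to take the square root of the MSE:

```python
import numpy as np

y_true = np.array([1., 2., 3.])
y_pred = np.array([1.5, 1.5, 3.5])

errors = y_true - y_pred             # [-0.5, 0.5, -0.5]
mse = (errors ** 2).mean()           # mean of [0.25, 0.25, 0.25]
rmse = np.sqrt(mse)                  # back in the same units as y

print(mse, rmse)   # 0.25 0.5
```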

matplotlib

Import:

import matplotlib.pyplot as plt

Example

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.show()

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)

What type of thing is y?

Variations

x = np.arange(0.0, 2.0, .1)
y = np.sin(2 * np.pi * x)
plt.plot(x, y, 'o')

Write code that makes this plot.

plt.plot([0., 1., -1])
plt.show()

Matplotlib Recipe

  • Import…
    • import matplotlib.pyplot as plt
    • Optionally, import numpy as np
  • Make list(s) of data (all the same length)
  • Call plt.plot() to construct plot
    • or plt.hist(), plt.scatter(), plt.bar(), …
  • Add labels, legends, etc.
  • plt.savefig('filename.png')
  • plt.show()
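
The recipe can be sketched end to end as follows. The cosine data, the "signal" label, and the file name recipe_demo.png are all made up for illustration. The Agg line is only there so the sketch runs on a machine without a display; in Thonny you would leave it out and call plt.show() at the end.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")               # assumption: no display; omit this line in Thonny
import matplotlib.pyplot as plt
import numpy as np

# Data: two arrays of the same length
x = np.arange(0.0, 2.0, 0.01)
y = np.cos(2 * np.pi * x)

# Construct the plot, then add labels and a legend
plt.plot(x, y, label="cos")
plt.xlabel("time (s)")
plt.ylabel("signal")
plt.legend()

# Save the figure; "recipe_demo.png" is a made-up file name
out_path = os.path.join(tempfile.gettempdir(), "recipe_demo.png")
plt.savefig(out_path)
# plt.show() would go here when a display is available
plt.close()
```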

What will this plot?

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
plt.plot(x, x)
plt.show()
  • a parabola
  • a diagonal line
  • syntax error
  • label error

matplotlib details

  • Data doesn’t have to be numpy arrays, but it’s often handy.
  • Many different plot types!

plt.hist

import random
random.seed(5)
numbers = []
for _ in range(100):
    numbers.append(random.random())
plt.hist(numbers, bins=10, edgecolor='black');

plt.hist(numbers, bins=100);

Files

Files

  • full path (/Users/me/Documents/Research/experiment1.csv), which includes:
    • name (make it meaningful, not untitled5023.py)
    • extension: indicates its type (e.g., .csv, .py, .docx)
    • path: where to find it
  • contents: sequence of bytes (basically a string)

Aside: organize your files!

  • Meaningful folders (by project, by class)
  • Backed up! (easiest: use OneDrive or similar)

Reading Files: all at once

filename = "data.csv"
text = open(filename).read()
type(text)
str
text[:100]
'Name,Location,URL,Students\nWestminster College,"Salt Lake City, UT",westminstercollege.edu,2135\nMuhl'
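
A common variation, not used on these slides, is to open the file in a with block so that it is closed automatically. The sketch below first writes a small stand-in file (with_demo.csv, a made-up name) to a temporary folder so it can run anywhere.

```python
import os
import tempfile

# Assumption: a small stand-in file, so the sketch runs anywhere
path = os.path.join(tempfile.gettempdir(), "with_demo.csv")
with open(path, "w") as f:
    f.write("Name,Students\nMuhlenberg College,2330\n")

# Read it back; the file is closed automatically when the block ends
with open(path) as f:
    text = f.read()

print(type(text))    # <class 'str'>
print(repr(text))
print(f.closed)      # True
```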

File not found?

  • URL: world-wide unambiguous name for a file (but on a website, often needs to be downloaded)
  • Full path: unambiguous name for a file — on a specific computer!
  • Relative path: depends on working folder (aka current directory)
    • "data.csv" actually means os.getcwd() + "/" + "data.csv"
    • Thonny sets working folder to the folder containing the .py file
  • So either:
    • Use a full path to the file
    • or (preferred): put the file you need right next to your script
    • … or in a data folder and use data/data.csv
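
To see how a relative name is resolved on your own computer, a short sketch like this can help. The existence check at the end is an optional extra.

```python
import os

relative = "data.csv"
working_folder = os.getcwd()

# What the relative name resolves to on this computer
full_path = os.path.join(working_folder, relative)
print(full_path)
print(os.path.abspath(relative) == full_path)   # True

# Check before opening to get a friendlier message than a traceback
if not os.path.exists(relative):
    print("No", relative, "in", working_folder)
```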

What results?

filename = "data.csv"
text = open(filename).read()
type(text)

Reading Files

The content of a file is just a string (maybe a really long one).

text = open("data.csv").read()
print(repr(text[:50])) # <- show just the first 50 characters
'Name,Location,URL,Students\nWestminster College,"Sa'

We can split it into lines:

lines = text.splitlines() # <- almost the same as text.split('\n')
for line in lines:
  print(line)
Name,Location,URL,Students
Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
University of Maine,"Orono, ME",umaine.edu,8677
James Madison University,"Harrisonburg, VA",jmu.edu,19019
Michigan State University,"East Lansing, MI",msu.edu,38853

Reading Files: line by line

lines = open("data.csv").read().splitlines()
for line in lines:
  print(line.split(','))
['Name', 'Location', 'URL', 'Students']
['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135']
['Muhlenberg College', '"Allentown', ' PA"', 'muhlenberg.edu', '2330']
['University of Maine', '"Orono', ' ME"', 'umaine.edu', '8677']
['James Madison University', '"Harrisonburg', ' VA"', 'jmu.edu', '19019']
['Michigan State University', '"East Lansing', ' MI"', 'msu.edu', '38853']

What’s wrong?

Reading Files: CSV

import csv
csv_data = list(csv.reader(open("data.csv")))
for row in csv_data:
  print(row)
['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']
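
To see what the csv module handles for us, compare it with a plain split on a single line that contains a quoted comma. This is a small sketch; csv.reader accepts any iterable of lines, so a one-element list stands in for the file.

```python
import csv

line = 'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135'

naive = line.split(',')              # splits inside the quotes too
parsed = next(csv.reader([line]))    # understands the quoted field

print(len(naive), naive)     # 5 fields: the location is cut in two
print(len(parsed), parsed)   # 4 fields: 'Salt Lake City, UT' stays whole
```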

CSV Column Names

import csv
csv_data = list(csv.reader(open("data.csv")))
names = csv_data[0]
print("Column names:", names)
for row in csv_data[1:]:
  print(row)
Column names: ['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']

DictReader

import csv, pprint
csv_data = list(csv.DictReader(open("data.csv")))
pprint.pprint(csv_data[0])
{'Location': 'Salt Lake City, UT',
 'Name': 'Westminster College',
 'Students': '2135',
 'URL': 'westminstercollege.edu'}
  • open("data.csv")
  • open(data.csv)

File names are strings.

Looping through CSV data

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = row['Students']
  print(repr(num_students))
'2135'
'2330'
'8677'
'19019'
'38853'

Converting Data Types

Values read from a file are always just strings. If we want ints, we need to convert:

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = int(row['Students'])
  print(repr(num_students))
2135
2330
8677
19019
38853
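
Why the conversion matters: the same operator behaves differently on strings and on numbers. A short sketch, using one value from the file and a made-up bad value 'n/a':

```python
raw = '2135'                          # the kind of value the csv module gives us

doubled_text = raw + raw              # + on strings concatenates
doubled_number = int(raw) + int(raw)  # + on ints adds
half = float(raw) / 2                 # float() works the same way

print(doubled_text)     # 21352135
print(doubled_number)   # 4270
print(half)             # 1067.5

# Text that is not a number cannot be converted
try:
    int('n/a')
except ValueError as err:
    print("could not convert:", err)
```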

Getting a whole column

import csv
csv_data = list(csv.DictReader(open("data.csv")))

student_counts = []
for row in csv_data:
  students = int(row['Students'])
  student_counts.append(students)
print(student_counts)
[2135, 2330, 8677, 19019, 38853]
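
Once a column is a list of numbers, it can be turned into a NumPy array to use the reductions from earlier in the week. In this sketch the file contents are pasted inline and wrapped in io.StringIO (an assumption made so the code runs without data.csv), and the loop is written as a list comprehension.

```python
import csv
import io

import numpy as np

# Assumption: inline text stands in for data.csv so the sketch is self-contained
text = """Name,Location,URL,Students
Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
University of Maine,"Orono, ME",umaine.edu,8677
James Madison University,"Harrisonburg, VA",jmu.edu,19019
Michigan State University,"East Lansing, MI",msu.edu,38853
"""
csv_data = list(csv.DictReader(io.StringIO(text)))

# Same column extraction as the loop, written as a list comprehension
student_counts = np.array([int(row['Students']) for row in csv_data])

average = student_counts.mean()
largest = csv_data[np.argmax(student_counts)]['Name']

print(average)   # 14202.8
print(largest)   # Michigan State University
```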

Debug and add comments to the following code.

import csv
import matplotlib.pyplot as plt

# 1:
csv_data = list(csv.reader(open(1880-2022.csv)))
names = csv_data[0]

# 2:
for row in csv_data:
    years = row[0]
    temp_anomalies = row[1]

# 3: 
plt.plot(years, temp_anomalies)
plt.show()

Easier way: pandas

import pandas as pd
data = pd.read_csv("data.csv")
data
                        Name            Location                     URL  Students
0        Westminster College  Salt Lake City, UT  westminstercollege.edu      2135
1         Muhlenberg College       Allentown, PA          muhlenberg.edu      2330
2        University of Maine           Orono, ME              umaine.edu      8677
3   James Madison University    Harrisonburg, VA                 jmu.edu     19019
4  Michigan State University    East Lansing, MI                 msu.edu     38853

We’ll study this next week.