Loading and Plotting Data

CS 106

Takeaways of the week

Libraries like numpy and matplotlib provide pre-packaged useful functionality
Using them builds on fundamental concepts you learned earlier (data types, Boolean expressions, etc.)
Files are just named strings; they need to be interpreted (including data types)

This week

Fundamentals of the Python scientific computing ecosystem

numpy
matplotlib

Fundamentals of data storage

CSV files

Array Programming: `numpy`

aka np, because it’s canonically imported as:

import numpy as np

`numpy`

Numerical computing library for Python
Provides the array data type. Like a list but:
- Automatic for loops!
- Supports multiple dimensions
…and lots of utilities
- arange: range that makes arrays
- zeros / ones / full: make new arrays
- lots of math functions

example

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
x + y

[4., 4., 4.]
[1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
error

import numpy as np
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])
x + y

[4., 4., 4.]
[1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
error
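To check your answers to the two questions above, here is a minimal sketch that runs both versions side by side. The names `x_list`, `list_sum`, `x_arr`, and `array_sum` are made up for this comparison and do not appear elsewhere in the slides.

```python
import numpy as np

# Plain Python lists: + glues the two lists together (concatenation).
x_list = [1.0, 2.0, 3.0]
y_list = [3.0, 2.0, 1.0]
list_sum = x_list + y_list
print(list_sum)

# NumPy arrays: + adds matching elements (an "automatic for loop").
x_arr = np.array(x_list)
y_arr = np.array(y_list)
array_sum = x_arr + y_arr
print(array_sum)
```

The same `+` symbol does two different jobs depending on the data type of its operands.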

Arrays have consistent data types

All ints:

np.array([1, 2, 3])

array([1, 2, 3])

All floats:

np.array([1, 2, 3.1])

array([1. , 2. , 3.1])
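One way to see the shared data type is to look at an array's `dtype` attribute. The sketch below assumes nothing beyond NumPy itself; the names `ints` and `mixed` are illustrative, and the exact integer width printed can differ between platforms.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.1])

# Every array has one dtype shared by all of its elements.
print(ints.dtype)   # an integer type (exact width depends on your platform)
print(mixed.dtype)  # a float type: the ints were converted to floats

# Check the *kind* of dtype without assuming a specific width.
ints_are_integers = np.issubdtype(ints.dtype, np.integer)
mixed_are_floats = np.issubdtype(mixed.dtype, np.floating)
print(ints_are_integers, mixed_are_floats)
```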

`np.arange`

Like range, but:

makes NumPy arrays
allows floats

x = np.arange(0.0, 2.0, .5)
x[:5]

array([0. , 0.5, 1. , 1.5])
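To see why "allows floats" matters, this small sketch tries the same call with the built-in `range` first. The flag `range_worked` is a made-up name used only to record what happened.

```python
import numpy as np

# The built-in range only accepts integers, so float arguments fail.
try:
    range(0.0, 2.0, 0.5)
    range_worked = True
except TypeError as err:
    range_worked = False
    print("range failed:", err)

# np.arange accepts floats and returns an array. Like range, it stops
# *before* the end value, so 2.0 itself is not included.
x = np.arange(0.0, 2.0, 0.5)
print(x)
```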

Broadcasting (automatic `for` loops!)

x = np.linspace(0.0, 1.0, 5)
x

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

array plus scalar:

x + 1

array([1.  , 1.25, 1.5 , 1.75, 2.  ])

array plus array:

x + x

array([0. , 0.5, 1. , 1.5, 2. ])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y

array([ 0.0000000e+00,  1.0000000e+00,  1.2246468e-16, -1.0000000e+00,
       -2.4492936e-16])
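What "automatic for loop" means can be sketched by writing the loop out by hand and comparing. The names `broadcast_result` and `loop_result` are illustrative.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)

# Broadcasting: one expression, no loop written by hand.
broadcast_result = x + 1

# The same computation with an explicit for loop over the elements.
loop_result = []
for value in x:
    loop_result.append(value + 1)

print(broadcast_result)
print(loop_result)
```

Both versions compute the same numbers; the broadcast version is shorter and gives back an array instead of a list.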

Reduction operations

Reduce the dimensionality of an array (e.g., summing over an axis)

x = np.linspace(0.0, 1.0, 5)
x

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

x.sum()

np.float64(2.5)

x.mean()

np.float64(0.5)

x.max()

np.float64(1.0)

np.argmax(x)

np.int64(4)
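The examples above reduce a 1-D array to a single number. As a sketch of "summing over an axis", here is the same idea on a small 2-D array; the array `grid` and its values are made up for illustration.

```python
import numpy as np

# A small 2-D array: 2 rows, 3 columns.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

total = grid.sum()              # reduce everything to a single number
column_sums = grid.sum(axis=0)  # collapse the rows: one sum per column
row_sums = grid.sum(axis=1)     # collapse the columns: one sum per row

print(total)
print(column_sums)
print(row_sums)
```

Each reduction removes one dimension: the 2-D array becomes a 1-D array when you sum along one axis.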

Example: computing error

Suppose we’re fitting a model to some data. The true values are:

y_true = np.array([1., 2., 3.])

And the model predicts:

y_pred = np.array([1.5, 1.5, 3.5])

MAE: mean absolute error: average of absolute differences

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]

MSE: mean squared error (RMSE, root mean squared error, is its square root)

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Can you write NumPy code to compute these? (Hint: np.abs takes the absolute value of every element of an array.)

mae = np.abs(y_true - y_pred).mean()
mae

np.float64(0.5)

mse = ((y_true - y_pred) ** 2).mean()
mse

np.float64(0.25)
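RMSE is the square root of MSE, which puts the error back in the same units as the data. A minimal sketch using the same `y_true` and `y_pred` as above (the name `rmse` is ours):

```python
import numpy as np

y_true = np.array([1., 2., 3.])
y_pred = np.array([1.5, 1.5, 3.5])

# Mean squared error, as computed above.
mse = ((y_true - y_pred) ** 2).mean()

# Root mean squared error: take the square root of the MSE.
rmse = np.sqrt(mse)
print(rmse)
```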

`matplotlib`

Import:

import matplotlib.pyplot as plt

Example

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.show()

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)

What type of thing is y?

Variations

x = np.arange(0.0, 2.0, .1)
y = np.sin(2 * np.pi * x)
plt.plot(x, y, 'o')

Write code that makes this plot.

plt.plot([0., 1., -1])
plt.show()

Matplotlib Recipe

Import…
- import matplotlib.pyplot as plt
- Optionally, import numpy as np
Make list(s) of data (all the same length)
Call plt.plot() to construct plot
- or plt.hist(), plt.scatter(), plt.bar(), …
Add labels, legends, etc.
plt.savefig('filename.png')
plt.show()
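The recipe above can be sketched as one short script. The legend label, the file name `sine_example.png`, and the choice of the system temp folder are placeholders, not part of the recipe.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

# 1. Make the data (x and y have the same length).
x = np.arange(0.0, 2.0, 0.01)
y = np.sin(2 * np.pi * x)

# 2. Construct the plot.
plt.plot(x, y, label="sine wave")

# 3. Add labels and a legend.
plt.xlabel("time (s)")
plt.ylabel("voltage (mV)")
plt.legend()

# 4. Save the figure. The file name and temp folder are placeholders.
out_path = Path(tempfile.gettempdir()) / "sine_example.png"
plt.savefig(out_path)
print("saved to", out_path)

# 5. In an interactive session, finish with plt.show() to open the window.
```

Saving before showing is the safer order, since the figure may be cleared once the plot window is closed.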

What will this plot?

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
plt.plot(x, x)
plt.show()

a parabola
a diagonal line
syntax error
label error

matplotlib details

Data doesn’t have to be numpy arrays, but it’s often handy.
Many different plot types!

`plt.hist`ogram

import random
random.seed(5)
numbers = []
for _ in range(100):
    numbers.append(random.random())

plt.hist(numbers, bins=10, edgecolor='black');

plt.hist(numbers, bins=100);
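A histogram is a count of how many values fall into each bin. As a sketch of that counting step on its own, `np.histogram` returns the counts and the bin edges without drawing anything. The names `counts` and `bin_edges` are illustrative.

```python
import random

import numpy as np

random.seed(5)
numbers = []
for _ in range(100):
    numbers.append(random.random())

# np.histogram does the counting step without drawing anything.
counts, bin_edges = np.histogram(numbers, bins=10)

print(counts)        # how many numbers landed in each of the 10 bins
print(bin_edges)     # 11 edges mark the boundaries of 10 bins
print(counts.sum())  # every number is counted exactly once
```

With more bins, each bin covers a narrower range and holds fewer numbers, which is why the 100-bin plot looks noisier.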

Files

Full path (e.g., /Users/me/Documents/Research/experiment1.csv) includes:
- name (make it meaningful, not untitled5023.py)
- extension: indicates its type (e.g., .csv, .py, .docx)
- path: where to find it
Contents: a sequence of bytes (basically a string)

Aside: organize your files!

Meaningful folders (by project, by class)
Backed up! (easiest: use OneDrive or similar)

Reading Files: all at once

filename = "data.csv"
text = open(filename).read()
type(text)

str

text[:100]

'Name,Location,URL,Students\nWestminster College,"Salt Lake City, UT",westminstercollege.edu,2135\nMuhl'
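A common variation is to open the file in a `with` block, which closes the file for you when the block ends. The sketch below first writes a tiny stand-in file to a temporary folder so it can run anywhere; that stand-in and the name `folder` are assumptions for the demo, not the real data.csv.

```python
import tempfile
from pathlib import Path

# Create a tiny stand-in for data.csv in a temporary folder (hypothetical file).
folder = Path(tempfile.mkdtemp())
filename = folder / "data.csv"
filename.write_text("Name,Location,URL,Students\n", encoding="utf-8")

# The with block closes the file automatically when it ends.
with open(filename, encoding="utf-8") as f:
    text = f.read()

print(type(text))
print(repr(text))
```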

File not found?

URL: world-wide unambiguous name for a file (but on a website, often needs to be downloaded)
Full path: unambiguous name for a file — on a specific computer!
Relative path: depends on working folder (aka current directory)
- "data.csv" actually means os.getcwd() + "/" + "data.csv"
- Thonny sets working folder to the folder containing the .py file
So either:
- Use a full path to the file
- or (preferred): put the file you need right next to your script
- … or in a data folder and use data/data.csv
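When a file is not found, it can help to print where Python is actually looking. A minimal sketch using the standard `os` module; the messages and the names `relative_path`, `working_folder`, and `full_path` are illustrative.

```python
import os

relative_path = "data.csv"

# A relative path is resolved against the current working folder.
working_folder = os.getcwd()
full_path = os.path.join(working_folder, relative_path)

print("Working folder:", working_folder)
print("Python will look for:", full_path)

# Check before opening, to get a clearer message than a traceback.
if os.path.exists(full_path):
    print("Found it!")
else:
    print("Not found. Is the file next to your script?")
```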

What results?

filename = "data.csv"
text = open(filename).read()
type(text)

Reading Files

The content of a file is just a string (maybe a really long one).

text = open("data.csv").read()
print(repr(text[:50])) # <- show just the first 50 characters

'Name,Location,URL,Students\nWestminster College,"Sa'

We can split it into lines:

lines = text.splitlines() # <- almost the same as text.split('\n')
for line in lines:
  print(line)

Name,Location,URL,Students
Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
University of Maine,"Orono, ME",umaine.edu,8677
James Madison University,"Harrisonburg, VA",jmu.edu,19019
Michigan State University,"East Lansing, MI",msu.edu,38853
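Why "almost" the same? One difference shows up when the text ends with a newline, as many files do. The short string below is a made-up sample, not the real file.

```python
# A made-up two-line sample; note the final newline.
text = "Name,Location\nWestminster College,UT\n"

with_splitlines = text.splitlines()
with_split = text.split("\n")

print(with_splitlines)
print(with_split)  # has one extra empty string at the end
```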

Reading Files: line by line

lines = open("data.csv").read().splitlines()
for line in lines:
  print(line.split(','))

['Name', 'Location', 'URL', 'Students']
['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135']
['Muhlenberg College', '"Allentown', ' PA"', 'muhlenberg.edu', '2330']
['University of Maine', '"Orono', ' ME"', 'umaine.edu', '8677']
['James Madison University', '"Harrisonburg', ' VA"', 'jmu.edu', '19019']
['Michigan State University', '"East Lansing', ' MI"', 'msu.edu', '38853']

What’s wrong?

Reading Files: CSV

import csv
csv_data = list(csv.reader(open("data.csv")))
for row in csv_data:
  print(row)

['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']
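The difference between the two approaches can be sketched on a single line of the file. Here `io.StringIO` is used only so the example runs without data.csv on disk; the names `naive_fields` and `csv_fields` are illustrative.

```python
import csv
import io

# One line from data.csv; the location contains a comma inside quotes.
line = 'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135'

# Splitting on every comma also cuts the quoted location in two.
naive_fields = line.split(",")
print(len(naive_fields), naive_fields)

# io.StringIO lets csv.reader treat a string like an open file.
csv_fields = next(csv.reader(io.StringIO(line)))
print(len(csv_fields), csv_fields)
```

The `csv` module understands that a comma inside double quotes is part of the value, not a separator.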

CSV Column Names

import csv
csv_data = list(csv.reader(open("data.csv")))
names = csv_data[0]
print("Column names:", names)
for row in csv_data[1:]:
  print(row)

Column names: ['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']

`DictReader`

import csv, pprint
csv_data = list(csv.DictReader(open("data.csv")))
pprint.pprint(csv_data[0])

{'Location': 'Salt Lake City, UT',
 'Name': 'Westminster College',
 'Students': '2135',
 'URL': 'westminstercollege.edu'}

open("data.csv")
open(data.csv)

File names are strings.

Looping through CSV data

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = row['Students']
  print(repr(num_students))

'2135'
'2330'
'8677'
'19019'
'38853'

Converting Data Types

Files are always just strings. If we want ints, we need to convert:

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = int(row['Students'])
  print(repr(num_students))
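Why the conversion matters can be sketched with two of the values from the file: the same `+` gives very different results on strings and on ints. The names `as_text` and `as_numbers` are illustrative.

```python
a = "2135"
b = "2330"

# Strings: + joins the text together.
as_text = a + b
print(repr(as_text))

# Ints: + does arithmetic.
as_numbers = int(a) + int(b)
print(as_numbers)
```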

Getting a whole column

import csv
csv_data = list(csv.DictReader(open("data.csv")))

student_counts = []
for row in csv_data:
  students = int(row['Students'])
  student_counts.append(students)
print(student_counts)

[2135, 2330, 8677, 19019, 38853]
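Once a column is a list of ints, it can be turned into a NumPy array and summarized with the reduction operations from earlier this week. This sketch uses `io.StringIO` with the rows shown above as a stand-in for `open("data.csv")`; the names `sample` and `counts` are illustrative.

```python
import csv
import io

import numpy as np

# Stand-in for open("data.csv"), using the rows shown in these slides.
sample = io.StringIO(
    "Name,Location,URL,Students\n"
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
    'University of Maine,"Orono, ME",umaine.edu,8677\n'
    'James Madison University,"Harrisonburg, VA",jmu.edu,19019\n'
    'Michigan State University,"East Lansing, MI",msu.edu,38853\n'
)
csv_data = list(csv.DictReader(sample))

# Collect the Students column as ints, as in the loop above.
student_counts = []
for row in csv_data:
    student_counts.append(int(row["Students"]))

# Convert the list to an array, then reduce it.
counts = np.array(student_counts)
print(counts.sum())
print(counts.mean())
print(counts.max())
```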

Debug and add comments to the following code.

import csv
import matplotlib.pyplot as plt

# 1:
csv_data = list(csv.reader(open(1880-2022.csv)))
names = csv_data[0]

# 2:
for row in csv_data:
    years = row[0]
    temp_anomalies = row[1]

# 3: 
plt.plot(years, temp_anomalies)
plt.show()

Easier way: `pandas`

import pandas as pd
data = pd.read_csv("data.csv")
data

                        Name            Location                     URL  Students
0        Westminster College  Salt Lake City, UT  westminstercollege.edu      2135
1         Muhlenberg College       Allentown, PA          muhlenberg.edu      2330
2        University of Maine           Orono, ME              umaine.edu      8677
3   James Madison University    Harrisonburg, VA                 jmu.edu     19019
4  Michigan State University    East Lansing, MI                 msu.edu     38853

We’ll study this next week.
