Loading and Plotting Data

CS 106

Takeaways of the week

Libraries like numpy and matplotlib provide pre-packaged useful functionality
Using them builds on fundamental concepts you learned earlier (data types, Boolean expressions, etc.)
Files are just named sequences of bytes; your code has to interpret them (including converting data types)

This week

Fundamentals of the Python scientific computing ecosystem

numpy
matplotlib

Fundamentals of data storage

CSV files

`numpy`

aka np, because it’s canonically imported as:

import numpy as np

`numpy`

Numerical computing library for Python
Provides the array data type. Like a list but:
- Automatic for loops!
- Fancy indexing (e.g., multiple dimensions at once)
…and lots of utilities
- arange: range that makes arrays
- zeros / ones / full: make new arrays
- lots of math functions
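A minimal sketch of the array-making utilities listed above. The variable names and the fill value 7.0 are chosen only for illustration.

```python
import numpy as np

# Each helper takes the desired length (or shape) as its first argument.
zeros = np.zeros(3)          # three 0.0 values
ones = np.ones(3)            # three 1.0 values
sevens = np.full(3, 7.0)     # three copies of a value we choose
steps = np.arange(0, 10, 2)  # like range(0, 10, 2), but the result is an array

print(zeros)
print(ones)
print(sevens)
print(steps)
```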

example

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
x + y

[4., 4., 4.]
[1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
error

import numpy as np
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])
x + y

[4., 4., 4.]
[1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
error
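To check your answers to the two questions above, a short sketch like this one runs both versions side by side. The names x_list, x_arr, and so on are illustrative only.

```python
import numpy as np

x_list = [1.0, 2.0, 3.0]
y_list = [3.0, 2.0, 1.0]
list_result = x_list + y_list    # + on lists joins them end to end

x_arr = np.array(x_list)
y_arr = np.array(y_list)
array_result = x_arr + y_arr     # + on arrays adds element by element

print(list_result)
print(array_result)
```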

Arrays have consistent data types

All ints:

np.array([1, 2, 3])

array([1, 2, 3])

All floats:

np.array([1, 2, 3.1])

array([1. , 2. , 3.1])

All complex:

np.array([2j, 2, 3.0])

array([0.+2.j, 2.+0.j, 3.+0.j])
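One way to confirm which type NumPy picked is to look at each array's dtype attribute. This is a small sketch using the same three arrays; the exact integer type name can vary by platform.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1, 2, 3.1])
complexes = np.array([2j, 2, 3.0])

# dtype names the single type shared by every element of the array.
print(ints.dtype)        # an integer type (int64 on most systems)
print(floats.dtype)      # float64
print(complexes.dtype)   # complex128
```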

Arrays support fancy indexing

x = np.array([1.0, -2.0, 3.0])
x[0]

np.float64(1.0)

y = x > 0
y

array([ True, False,  True])

y is a boolean array. We can use it to index into x

x[y]

array([1., 3.])

x[~y]

array([-2.])

We usually put these together without a separate y:

x[x > 0]

array([1., 3.])
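Boolean arrays can do a little more than select elements. The sketch below shows three common follow-ups (counting, combining conditions, and assigning through a mask); it goes slightly beyond the slides, so treat it as optional.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
mask = x > 0

num_positive = np.sum(mask)        # True counts as 1, False as 0
in_range = x[(x > 0) & (x < 2)]    # combine conditions with &; the parentheses are required

x_clipped = x.copy()               # copy so the original x is unchanged
x_clipped[x_clipped < 0] = 0.0     # replace every negative value with 0

print(num_positive)
print(in_range)
print(x_clipped)
```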

`np.arange`

Like range, but:

makes NumPy arrays
allows floats

x = np.arange(0.0, 2.0, .5)
x[:5]

array([0. , 0.5, 1. , 1.5])

Broadcasting (automatic `for` loops!)

array plus scalar:

x + 1

array([1. , 1.5, 2. , 2.5])

array plus array:

x + x

array([0., 1., 2., 3.])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y

array([ 0.0000000e+00,  1.2246468e-16, -2.4492936e-16,  3.6739404e-16])

Reduction operations:

np.mean(x), np.std(x), np.sum(x), np.argmax(x)

(np.float64(0.75),
 np.float64(0.5590169943749475),
 np.float64(3.0),
 np.int64(3))
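To see why broadcasting is described as an automatic for loop, this sketch writes the loop out by hand and compares it with x + 1. The name looped is illustrative.

```python
import numpy as np

x = np.arange(0.0, 2.0, 0.5)

# The explicit loop that x + 1 saves us from writing.
looped = []
for value in x:
    looped.append(value + 1)

broadcast = x + 1

# Both approaches produce the same numbers.
print(np.array_equal(np.array(looped), broadcast))
```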

`matplotlib`

Import:

import matplotlib.pyplot as plt

Example

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.show()

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)

What type of thing is y?

Variations

x = np.arange(0.0, 2.0, .1)
y = np.sin(2 * np.pi * x)
plt.plot(x, y, 'o')

Write code that makes this plot.

plt.plot([0., 1., -1])
plt.show()

Matplotlib Recipe

Import…
- import matplotlib.pyplot as plt
- Optionally, import numpy as np
Make list(s) of data (all the same length)
Call plt.plot() to construct plot
- or plt.hist(), plt.scatter(), plt.bar(), …
Add labels, legends, etc.
plt.savefig('filename.png')
plt.show()
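The recipe above might look like this end to end. The cosine curve, the title, the axis labels, and the output file name recipe_example.png are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Step 1: make the data (x and both y arrays have the same length).
x = np.arange(0.0, 2.0, 0.01)
y_sin = np.sin(2 * np.pi * x)
y_cos = np.cos(2 * np.pi * x)

# Step 2: construct the plot; label= gives each line a legend entry.
plt.plot(x, y_sin, label="sine")
plt.plot(x, y_cos, label="cosine")

# Step 3: add labels, a title, and a legend.
plt.xlabel("time (s)")
plt.ylabel("signal (mV)")
plt.title("Two signals")
plt.legend()

# Step 4: save the figure to a file (hypothetical file name).
plt.savefig("recipe_example.png")
# When running interactively, finish with plt.show() to open the plot window.
```

Saving before showing matters with some setups, because closing the plot window can discard the figure.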

What will this plot?

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
plt.plot(x, x)
plt.show()

a parabola
a diagonal line
syntax error
label error

matplotlib details

Data doesn’t have to be numpy arrays, but it’s often handy.
Many different plot types!

`plt.hist`ogram

import random
random.seed(5)
numbers = []
for _ in range(100):
    numbers.append(random.random())

plt.hist(numbers, bins=10, edgecolor='black');

plt.hist(numbers, bins=100);

Files

Full path (/Users/me/Documents/Research/experiment1.csv) includes:
- name (make it meaningful, not untitled5023.py)
- extension: indicates its type (e.g., .csv, .py, .docx)
- path: where to find it
Contents: sequence of bytes (basically a string)
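As a sketch of how those parts fit together, Python's standard pathlib module can split the example path into its pieces. PurePosixPath is used here so the result is the same on any operating system.

```python
from pathlib import PurePosixPath

full_path = PurePosixPath("/Users/me/Documents/Research/experiment1.csv")

print(full_path.name)     # experiment1.csv
print(full_path.suffix)   # .csv
print(full_path.parent)   # /Users/me/Documents/Research
```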

Aside: organize your files!

Meaningful folders (by project, by class)
Backed up! (easiest: use OneDrive or similar)

Reading Files: all at once

filename = "data.csv"
text = open(filename).read()
type(text)

str

text[:100]

'Name,Location,URL,Students\nWestminster College,"Salt Lake City, UT",westminstercollege.edu,2135\nMuhl'

File not found?

URL: worldwide unambiguous name for a file (but it lives on a website, so it often needs to be downloaded first)
Full path: unambiguous name for a file — on a specific computer!
Relative path: depends on working folder (aka current directory)
- "data.csv" actually means os.getcwd() + "/" + "data.csv"
- Thonny sets working folder to the folder containing the .py file
So either:
- Use a full path to the file
- or (preferred): put the file you need right next to your script
- … or in a data folder and use data/data.csv
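When you hit a "file not found" error, a quick diagnostic along these lines shows where Python is actually looking. It is a sketch; the name data.csv comes from the slides, and the file does not need to exist for the code to run.

```python
import os

relative = "data.csv"
working_folder = os.getcwd()                   # the current working folder
full = os.path.join(working_folder, relative)  # where Python will look

print("Working folder:", working_folder)
print("Python will look for:", full)
print("File exists?", os.path.exists(full))
```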

What results?

filename = "data.csv"
text = open(filename).read()
type(text)

Reading Files: line by line

A file behaves like a list of strings, one per line:

lines = list(open("data.csv"))
for line in lines:
  print(line)

Name,Location,URL,Students

Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135

Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330

University of Maine,"Orono, ME",umaine.edu,8677

James Madison University,"Harrisonburg, VA",jmu.edu,19019

Michigan State University,"East Lansing, MI",msu.edu,38853

Why the blank lines between?

Reading Files: line by line

lines = list(open("data.csv"))
for line in lines:
  print(line, end='!')

Name,Location,URL,Students
!Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
!Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
!University of Maine,"Orono, ME",umaine.edu,8677
!James Madison University,"Harrisonburg, VA",jmu.edu,19019
!Michigan State University,"East Lansing, MI",msu.edu,38853
!

Note: We’re simplifying some details here around file handles; see the textbook for details.
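One of the simplified details is closing the file when you are done with it. A common pattern, sketched below, is a with block, which closes the file automatically; the sketch also strips the trailing newline from each line. The sample file and its contents are made up so the code runs anywhere.

```python
import os
import tempfile

# Write a small, hypothetical sample file so this sketch is self-contained.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "sample.csv")
with open(filename, "w") as f:
    f.write("Name,Students\nMuhlenberg College,2330\n")

clean_lines = []
with open(filename) as f:                       # the file is closed when this block ends
    for line in f:
        clean_lines.append(line.rstrip("\n"))   # drop the newline at the end of each line

print(clean_lines)
print(f.closed)
```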

Reading Files: line by line

lines = list(open("data.csv"))
for line in lines:
  print(line.split(','))

['Name', 'Location', 'URL', 'Students\n']
['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135\n']
['Muhlenberg College', '"Allentown', ' PA"', 'muhlenberg.edu', '2330\n']
['University of Maine', '"Orono', ' ME"', 'umaine.edu', '8677\n']
['James Madison University', '"Harrisonburg', ' VA"', 'jmu.edu', '19019\n']
['Michigan State University', '"East Lansing', ' MI"', 'msu.edu', '38853\n']

What’s wrong?

Reading Files: CSV

import csv
csv_data = list(csv.reader(open("data.csv")))
for row in csv_data:
  print(row)

['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']
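To compare the two approaches on a single line, this sketch parses one row from the file both ways. io.StringIO stands in for an open file so no data.csv is needed.

```python
import csv
import io

line = 'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'

naive = line.split(',')                        # splits on every comma, even inside quotes
parsed = next(csv.reader(io.StringIO(line)))   # understands quoted fields

print(len(naive), naive)
print(len(parsed), parsed)
```

The split version yields five pieces and keeps the quote characters and the newline; csv.reader yields the four intended fields.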

CSV Column Names

import csv
csv_data = list(csv.reader(open("data.csv")))
names = csv_data[0]
print("Column names:", names)
for row in csv_data[1:]:
  print(row)

Column names: ['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']

`DictReader`

import csv, pprint
csv_data = list(csv.DictReader(open("data.csv")))
pprint.pprint(csv_data[0])

{'Location': 'Salt Lake City, UT',
 'Name': 'Westminster College',
 'Students': '2135',
 'URL': 'westminstercollege.edu'}

open("data.csv")
open(data.csv)

File names are strings, so they need quotes. Without the quotes, Python looks for a variable named data instead of a file.

Looping through CSV data

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = row['Students']
  print(repr(num_students))

'2135'
'2330'
'8677'
'19019'
'38853'

Converting Data Types

Everything read from a file is a string. If we want ints, we need to convert:

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = int(row['Students'])
  print(repr(num_students))
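A short sketch of why the conversion matters: the same + means different things for strings and for ints, and int() raises an error on text that is not a number. The values are taken from the slides; the variable names are illustrative.

```python
text_count = '2135'

doubled_text = text_count + text_count               # string concatenation
doubled_number = int(text_count) + int(text_count)   # integer addition

print(doubled_text)      # 21352135
print(doubled_number)    # 4270

try:
    int('Students')      # header text is not a number
except ValueError as error:
    print("Cannot convert:", error)
```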

Getting a whole column

import csv
csv_data = list(csv.DictReader(open("data.csv")))

student_counts = []
for row in csv_data:
  students = int(row['Students'])
  student_counts.append(students)
print(student_counts)

[2135, 2330, 8677, 19019, 38853]
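Once a column is a list of ints, it can be turned into a NumPy array and used with the tools from earlier this week. A minimal sketch, with the list copied from the output above:

```python
import numpy as np

student_counts = [2135, 2330, 8677, 19019, 38853]
counts = np.array(student_counts)

total = counts.sum()              # reduction over the whole column
largest = counts.max()
big_schools = counts[counts > 5000]   # boolean indexing

print(total)         # 71014
print(largest)       # 38853
print(big_schools)   # [ 8677 19019 38853]
```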

Debug and add comments to the following code.

import csv
import matplotlib.pyplot as plt

# 1:
csv_data = list(csv.reader(open(1880-2022.csv)))
names = csv_data[0]

# 2:
for row in csv_data:
    years = row[0]
    temp_anomalies = row[1]

# 3: 
plt.plot(years, temp_anomalies)
plt.show()

Easier way: `pandas`

import pandas as pd
data = pd.read_csv("data.csv")
data

                        Name            Location                     URL  Students
0        Westminster College  Salt Lake City, UT  westminstercollege.edu      2135
1         Muhlenberg College       Allentown, PA          muhlenberg.edu      2330
2        University of Maine           Orono, ME              umaine.edu      8677
3   James Madison University    Harrisonburg, VA                 jmu.edu     19019
4  Michigan State University    East Lansing, MI                 msu.edu     38853

We’ll study this next week.
