Loading and Plotting Data

CS 106

Takeaways of the week

  • Libraries like numpy and matplotlib provide pre-packaged useful functionality
  • Using them builds on fundamental concepts you learned earlier (data types, Boolean expressions, etc.)
  • Files are just named strings; to use their contents, code has to interpret them (including converting data types)

This week

Fundamentals of the Python scientific computing ecosystem

  • numpy
  • matplotlib

Fundamentals of data storage

  • CSV files

numpy

aka np, because it’s canonically imported as:

import numpy as np

  • Numerical computing library for Python
  • Provides the array data type. Like a list but:
    • Automatic for loops!
    • Fancy indexing (e.g., multiple dimensions at once)
  • …and lots of utilities
    • arange: range that makes arrays
    • zeros / ones / full: make new arrays
    • lots of math functions
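
The utilities listed above can be tried out directly. A minimal sketch (the lengths and fill values are arbitrary choices for illustration, not from the course materials):

```python
import numpy as np

# arange: like range, but returns an array
steps = np.arange(5)          # 0, 1, 2, 3, 4

# zeros / ones / full: new arrays of a given length
blank = np.zeros(3)           # three 0.0 values
units = np.ones(3)            # three 1.0 values
sevens = np.full(3, 7.0)      # three copies of 7.0

# math functions apply to every element
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))

print(steps, blank, units, sevens, roots)
```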

example

x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
x + y
  • [4., 4., 4.]
  • [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
  • error
import numpy as np
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])
x + y
  • [4., 4., 4.]
  • [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
  • error

Arrays have consistent data types

All ints:

np.array([1, 2, 3])
array([1, 2, 3])

All floats:

np.array([1, 2, 3.1])
array([1. , 2. , 3.1])

All complex:

np.array([2j, 2, 3.0])
array([0.+2.j, 2.+0.j, 3.+0.j])
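
One way to check which type an array settled on is its dtype attribute. A small sketch using the same three arrays; the exact dtype name (for example int64) can vary by platform, so this prints only the one-letter kind:

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1, 2, 3.1])
complexes = np.array([2j, 2, 3.0])

# dtype.kind is a one-letter code: 'i' = integer, 'f' = float, 'c' = complex
print(ints.dtype.kind, floats.dtype.kind, complexes.dtype.kind)
```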

Arrays support fancy indexing

x = np.array([1.0, -2.0, 3.0])
x[0]
np.float64(1.0)
y = x > 0
y
array([ True, False,  True])

y is a boolean array. We can use it to index into x:

x[y]
array([1., 3.])
x[~y]
array([-2.])

We usually put these together without a separate y:

x[x > 0]
array([1., 3.])
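
The numpy overview mentioned indexing multiple dimensions at once. A small sketch with a made-up 2-D array (the name grid and its values are purely illustrative):

```python
import numpy as np

grid = np.array([[1.0, -2.0, 3.0],
                 [-4.0, 5.0, -6.0]])

corner = grid[1, 2]         # row 1, column 2 in one step
first_column = grid[:, 0]   # every row, column 0
positives = grid[grid > 0]  # boolean indexing also works in 2-D

print(corner, first_column, positives)
```

Note that boolean indexing on a 2-D array gives back a flat 1-D array of the selected values.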

np.arange

Like range, but:

  • makes NumPy arrays
  • allows floats
x = np.arange(0.0, 2.0, .5)
x[:5]
array([0. , 0.5, 1. , 1.5])

Broadcasting (automatic for loops!)

array plus scalar:

x + 1
array([1. , 1.5, 2. , 2.5])

array plus array:

x + x
array([0., 1., 2., 3.])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y
array([ 0.0000000e+00,  1.2246468e-16, -2.4492936e-16,  3.6739404e-16])

Reduction operations:

np.mean(x), np.std(x), np.sum(x), np.argmax(x)
(np.float64(0.75),
 np.float64(0.5590169943749475),
 np.float64(3.0),
 np.int64(3))
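
To see what "automatic for loops" saves us, here is a sketch that writes out by hand the loop that x + 1 and np.mean(x) stand in for (the names looped and manual_mean are just for this illustration):

```python
import numpy as np

x = np.arange(0.0, 2.0, .5)

# The loop that x + 1 performs for us
looped = []
for value in x:
    looped.append(value + 1)

broadcast = x + 1

# The arithmetic behind np.mean(x)
manual_mean = sum(x) / len(x)

print(looped)
print(broadcast)
print(manual_mean, np.mean(x))
```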

matplotlib

Import:

import matplotlib.pyplot as plt

Example

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.show()

x = np.arange(0.0, 2.0, .01)
y = np.sin(2 * np.pi * x)
plt.plot(x, y)

What type of thing is y?

Variations

x = np.arange(0.0, 2.0, .1)
y = np.sin(2 * np.pi * x)
plt.plot(x, y, 'o')

Write code that makes this plot.

plt.plot([0., 1., -1])
plt.show()

Matplotlib Recipe

  • Import…
    • import matplotlib.pyplot as plt
    • Optionally, import numpy as np
  • Make list(s) of data (all the same length)
  • Call plt.plot() to construct plot
    • or plt.hist(), plt.scatter(), plt.bar(), …
  • Add labels, legends, etc.
  • plt.savefig('filename.png')
  • plt.show()
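
The recipe above, followed step by step. This is only a sketch: the data values and the file name steps.png are made up for illustration.

```python
import matplotlib.pyplot as plt

# Make lists of data (same length); these values are made up
days = [1, 2, 3, 4, 5]
steps_walked = [4200, 5100, 3900, 6100, 5800]

# Construct the plot: 'o-' draws markers joined by a line
plt.plot(days, steps_walked, 'o-', label='steps')

# Add labels and a legend
plt.xlabel('day')
plt.ylabel('steps walked')
plt.legend()

# Save first, then show
plt.savefig('steps.png')
plt.show()
```

Saving before showing follows the order in the recipe; in a plain script, saving after the plot window has been closed can produce an empty image.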

What will this plot?

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0.0, 2.0, .01)
plt.plot(x, x)
plt.show()
  • a parabola
  • a diagonal line
  • syntax error
  • label error

matplotlib details

  • Data doesn’t have to be in numpy arrays, but arrays are often handy.
  • Many different plot types!

plt.hist

import random
random.seed(5)
numbers = []
for _ in range(100):
    numbers.append(random.random())
plt.hist(numbers, bins=10, edgecolor='black');

plt.hist(numbers, bins=100);

Files

  • full path (/Users/me/Documents/Research/experiment1.csv), which includes:
    • name (make it meaningful, not untitled5023.py)
    • extension: indicates its type (e.g., .csv, .py, .docx)
    • path: where to find it
  • contents: sequence of bytes (basically a string)
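
Python can pull a full path apart into the pieces listed above. A minimal sketch using the standard pathlib module; PurePosixPath is used here so the example behaves the same on any operating system, and it only manipulates the text of the path without touching any real file:

```python
from pathlib import PurePosixPath

full_path = PurePosixPath("/Users/me/Documents/Research/experiment1.csv")

print(full_path.name)    # the file's name, including extension
print(full_path.suffix)  # just the extension
print(full_path.parent)  # the folder that contains the file
```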

Aside: organize your files!

  • Meaningful folders (by project, by class)
  • Backed up! (easiest: use OneDrive or similar)

Reading Files: all at once

filename = "data.csv"
text = open(filename).read()
type(text)
str
text[:100]
'Name,Location,URL,Students\nWestminster College,"Salt Lake City, UT",westminstercollege.edu,2135\nMuhl'

File not found?

  • URL: world-wide unambiguous name for a file (but the file lives on a website, so it often needs to be downloaded first)
  • Full path: unambiguous name for a file — on a specific computer!
  • Relative path: depends on working folder (aka current directory)
    • "data.csv" actually means os.getcwd() + "/" + "data.csv"
    • Thonny sets working folder to the folder containing the .py file
  • So either:
    • Use a full path to the file
    • or (preferred): put the file you need right next to your script
    • … or in a data folder and use data/data.csv

What results?

filename = "data.csv"
text = open(filename).read()
type(text)

Reading Files: line by line

A file behaves like a list of strings, one per line:

lines = list(open("data.csv"))
for line in lines:
  print(line)
Name,Location,URL,Students

Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135

Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330

University of Maine,"Orono, ME",umaine.edu,8677

James Madison University,"Harrisonburg, VA",jmu.edu,19019

Michigan State University,"East Lansing, MI",msu.edu,38853

Why the blank lines between?

Reading Files: line by line

lines = list(open("data.csv"))
for line in lines:
  print(line, end='!')
Name,Location,URL,Students
!Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
!Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
!University of Maine,"Orono, ME",umaine.edu,8677
!James Madison University,"Harrisonburg, VA",jmu.edu,19019
!Michigan State University,"East Lansing, MI",msu.edu,38853
!

Note: We’re simplifying some details here around file handles; see the textbook for details.
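
One of the simplified details is closing the file. A common pattern is a with block, which closes the file automatically. The sketch below first writes a tiny file (sample.csv, a hypothetical name, holding the first two lines shown above) into a temporary folder so that it can run anywhere:

```python
import os
import tempfile

# Set up: write a small sample file (hypothetical name) to a temp folder
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "sample.csv")
with open(filename, "w") as f:
    f.write("Name,Location,URL,Students\n")
    f.write('Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n')

# Read line by line; the file is closed when the with block ends
lines = []
with open(filename) as f:
    for line in f:
        lines.append(line.rstrip("\n"))  # drop the trailing newline

print(lines)
print(f.closed)
```

Stripping the trailing newline from each line is also one way to avoid the blank lines seen when printing.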

Reading Files: line by line

lines = list(open("data.csv"))
for line in lines:
  print(line.split(','))
['Name', 'Location', 'URL', 'Students\n']
['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135\n']
['Muhlenberg College', '"Allentown', ' PA"', 'muhlenberg.edu', '2330\n']
['University of Maine', '"Orono', ' ME"', 'umaine.edu', '8677\n']
['James Madison University', '"Harrisonburg', ' VA"', 'jmu.edu', '19019\n']
['Michigan State University', '"East Lansing', ' MI"', 'msu.edu', '38853\n']

What’s wrong?

Reading Files: CSV

import csv
csv_data = list(csv.reader(open("data.csv")))
for row in csv_data:
  print(row)
['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']

CSV Column Names

import csv
csv_data = list(csv.reader(open("data.csv")))
names = csv_data[0]
print("Column names:", names)
for row in csv_data[1:]:
  print(row)
Column names: ['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']

DictReader

import csv, pprint
csv_data = list(csv.DictReader(open("data.csv")))
pprint.pprint(csv_data[0])
{'Location': 'Salt Lake City, UT',
 'Name': 'Westminster College',
 'Students': '2135',
 'URL': 'westminstercollege.edu'}

Which one opens the file?

  • open("data.csv")
  • open(data.csv)

File names are strings.

Looping through CSV data

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = row['Students']
  print(repr(num_students))
'2135'
'2330'
'8677'
'19019'
'38853'

Converting Data Types

Data read from a file is always a string. If we want ints, we need to convert:

import csv
csv_data = list(csv.DictReader(open("data.csv")))

for row in csv_data:
  num_students = int(row['Students'])
  print(repr(num_students))
2135
2330
8677
19019
38853
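
Why the conversion matters: the same operator does different things to strings and ints. A tiny sketch using two of the values above:

```python
a = '2135'
b = '2330'

print(a + b)            # strings are joined end to end
print(int(a) + int(b))  # ints are added
```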

Getting a whole column

import csv
csv_data = list(csv.DictReader(open("data.csv")))

student_counts = []
for row in csv_data:
  students = int(row['Students'])
  student_counts.append(students)
print(student_counts)
[2135, 2330, 8677, 19019, 38853]
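
Once the column is a list of ints, ordinary list tools work on it. The sketch below is one possible variation, not the only way: it uses io.StringIO as an in-memory stand-in for open("data.csv") so it can run without the file, and writes the loop as a list comprehension.

```python
import csv
import io

# Stand-in for open("data.csv"): the same rows shown earlier, held in memory
text = '''Name,Location,URL,Students
Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
University of Maine,"Orono, ME",umaine.edu,8677
James Madison University,"Harrisonburg, VA",jmu.edu,19019
Michigan State University,"East Lansing, MI",msu.edu,38853
'''
csv_data = list(csv.DictReader(io.StringIO(text)))

# Same loop as above, written as a list comprehension
student_counts = [int(row['Students']) for row in csv_data]

print(max(student_counts))                        # largest school
print(sum(student_counts) / len(student_counts))  # average size
```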

Debug and add comments to the following code.

import csv
import matplotlib.pyplot as plt

# 1:
csv_data = list(csv.reader(open(1880-2022.csv)))
names = csv_data[0]

# 2:
for row in csv_data:
    years = row[0]
    temp_anomalies = row[1]

# 3: 
plt.plot(years, temp_anomalies)
plt.show()

Easier way: pandas

import pandas as pd
data = pd.read_csv("data.csv")
data
                        Name            Location                     URL  Students
0        Westminster College  Salt Lake City, UT  westminstercollege.edu      2135
1         Muhlenberg College       Allentown, PA          muhlenberg.edu      2330
2        University of Maine           Orono, ME              umaine.edu      8677
3   James Madison University    Harrisonburg, VA                 jmu.edu     19019
4  Michigan State University    East Lansing, MI                 msu.edu     38853

We’ll study this next week.