numpy and matplotlib provide pre-packaged useful functionality
Fundamentals of the Python scientific computing ecosystem
- numpy
- matplotlib
Fundamentals of data storage
numpy
aka np, because it’s canonically imported as:
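A minimal sketch of that conventional import, followed by a small array built from an ordinary list. The values are made up for illustration.

```python
import numpy as np  # "np" is the conventional short alias

# Build an array from an ordinary Python list (illustrative values).
a = np.array([1, 2, 3])
print(a)        # [1 2 3]
print(type(a))  # <class 'numpy.ndarray'>
```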
numpy
array data type. Like a list but:
- you can do math on every element at once, without for loops!
- arange: range that makes arrays
- zeros / ones / full: make new arrays
All ints:
All floats:
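The array-making functions and the "all ints" / "all floats" distinction can be sketched as follows. The sizes and values here are illustrative assumptions, not taken from the original examples.

```python
import numpy as np

ints = np.arange(5)            # like range(5), but an array: [0 1 2 3 4]
zeros = np.zeros(3)            # floats by default: [0. 0. 0.]
ones = np.ones(3, dtype=int)   # ask for ints explicitly: [1 1 1]
sevens = np.full(3, 7)         # fill with any value: [7 7 7]

print(ints.dtype.kind)   # i  (integer)
print(zeros.dtype.kind)  # f  (float)

# An array holds a single element type: one float makes them all floats.
mixed = np.array([1, 2.5, 3])
print(mixed)             # [1.  2.5 3. ]
```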
np.arange
Like range, but:
- it returns arrays
- it works with floats
- you can do math on the result directly (no for loops!)
array plus scalar:
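A small sketch of np.arange with a float step, then a scalar combined with an array. The numbers are illustrative.

```python
import numpy as np

steps = np.arange(0, 1, 0.25)  # float step, which plain range() does not allow
print(steps)                   # [0.   0.25 0.5  0.75]

a = np.arange(4)   # [0 1 2 3]
b = a + 10         # the scalar is added to every element
print(b)           # [10 11 12 13]
print(a * 0.5)     # [0.  0.5 1.  1.5]
```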
array plus array:
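With two arrays of the same shape, arithmetic pairs up elements by position. A sketch with illustrative values:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

total = a + b
product = a * b
print(total)    # [11 22 33]  (element by element)
print(product)  # [10 40 90]  (also element by element, not a matrix product)
```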
Applying a function to every element:
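For instance, np.sqrt takes a whole array and returns a new array of the same shape. The input values are illustrative.

```python
import numpy as np

a = np.array([1, 4, 9])
roots = np.sqrt(a)   # square root of each element, no loop needed
print(roots)         # [1. 2. 3.]
```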
Reduce the dimensionality of an array (e.g., summing over an axis)
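A sketch of reductions on a small 2-D array (illustrative values). The axis argument picks which dimension gets collapsed.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])    # shape (2, 3)

print(m.sum())         # 21       every element, down to a single number
print(m.sum(axis=0))   # [5 7 9]  collapse the rows: one value per column
print(m.sum(axis=1))   # [ 6 15]  collapse the columns: one value per row
print(m.mean())        # 3.5
```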
Suppose we’re fitting a model to some data. The true values are:
And the model predicts:
MAE: mean absolute error: average of absolute differences
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]
MSE: mean squared error (average of squared differences). RMSE: root mean squared error (the square root of the MSE).
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
Can you write NumPy code to compute these? (Use np.abs for the absolute values; the built-in abs also works on arrays, but np.abs is the usual choice.)
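One possible answer, as a sketch. The arrays y_true and y_pred are hypothetical stand-ins for the true values and the predictions; their names and numbers are assumptions made for this example.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # hypothetical true values
y_pred = np.array([2.0, 5.0, 4.0, 8.0])  # hypothetical predictions

errors = y_true - y_pred           # elementwise differences
mae = np.mean(np.abs(errors))      # mean absolute error
mse = np.mean(errors ** 2)         # mean squared error
rmse = np.sqrt(mse)                # root mean squared error

print(mae)   # 1.0
print(mse)   # 1.5
print(rmse)
```

Each formula becomes one line because the subtraction, absolute value, and squaring all apply to every element at once.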
matplotlib
Import:
What type of thing is y?
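A sketch that can be used to explore the question, assuming y is computed from a numpy array. The sine curve, the variable names, and the output file name are all hypothetical choices for this example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: run without a display; remove to use plt.show()
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)  # 100 evenly spaced points
y = np.sin(x)                       # a function applied to every element
print(type(y))                      # <class 'numpy.ndarray'>

plt.plot(x, y)            # construct the plot
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.savefig("sine.png")   # write the figure to a file
```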

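import matplotlib.pyplot as plt
import numpy as np
- plt.plot() to construct plot
- plt.hist(), plt.scatter(), plt.bar(), …
- plt.savefig('filename.png')
- plt.show()
- The data doesn't have to be numpy arrays, but it’s often handy.
- plt.histogram (pyplot has no function by this name; histograms are drawn with plt.hist())
A file has a path (e.g., /Users/me/Documents/Research/experiment1.csv). The path includes:
- the file name (ideally something more descriptive than untitled5023.py)
- the extension (.csv, .py, .docx)
- all of it is just text (a string)
Aside: organize your files!
- "data.csv" actually means os.getcwd() + "/" + "data.csv"
- Simplest: keep the data file in the same folder as your .py file
- Or make a data folder and use data/data.csv
The content of a file is just a string. (maybe really long)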
'Name,Location,URL,Students\nWestminster College,"Sa'
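A sketch of how a string like the one above can be obtained. So that it runs on its own, the example first writes a short copy of the data to a temporary data.csv. The temporary folder and the variable names are assumptions.

```python
import os
import tempfile

# Recreate the first few lines of the example file (so the sketch is self-contained).
sample = (
    'Name,Location,URL,Students\n'
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
)
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w") as f:
    f.write(sample)

# Read the whole file back as one (possibly very long) string.
with open(path) as f:
    contents = f.read()

print(type(contents))       # <class 'str'>
print(repr(contents[:50]))  # the first 50 characters, newline shown as \n
```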
We can split it into lines:
Name,Location,URL,Students
Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135
Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330
University of Maine,"Orono, ME",umaine.edu,8677
James Madison University,"Harrisonburg, VA",jmu.edu,19019
Michigan State University,"East Lansing, MI",msu.edu,38853
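Splitting into lines can be sketched with str.splitlines(). Here contents is a shortened stand-in for the file's text.

```python
# Shortened stand-in for the text read from the file.
contents = (
    'Name,Location,URL,Students\n'
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
)

lines = contents.splitlines()  # one string per line, newlines removed
for line in lines:
    print(line)
```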
['Name', 'Location', 'URL', 'Students']
['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135']
['Muhlenberg College', '"Allentown', ' PA"', 'muhlenberg.edu', '2330']
['University of Maine', '"Orono', ' ME"', 'umaine.edu', '8677']
['James Madison University', '"Harrisonburg', ' VA"', 'jmu.edu', '19019']
['Michigan State University', '"East Lansing', ' MI"', 'msu.edu', '38853']
What’s wrong?
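One way to see the problem is to split a single line on commas. The location is quoted because it contains a comma, and a plain split cuts it in two, leaving five fields where the header has four.

```python
line = 'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135'

fields = line.split(",")  # naive: splits on every comma, even inside quotes
print(fields)
# ['Westminster College', '"Salt Lake City', ' UT"', 'westminstercollege.edu', '2135']
print(len(fields))        # 5, but there are only 4 columns
```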
['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']
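Correctly split rows like these are what Python's standard csv module produces, since it understands quoted fields. A sketch, using an in-memory string in place of the file. With a real file you would pass the object returned by open("data.csv", newline="").

```python
import csv
import io

# Shortened stand-in for the contents of data.csv.
text = (
    'Name,Location,URL,Students\n'
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
)

rows = list(csv.reader(io.StringIO(text)))  # each row is a list of strings
for row in rows:
    print(row)
```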
Column names: ['Name', 'Location', 'URL', 'Students']
['Westminster College', 'Salt Lake City, UT', 'westminstercollege.edu', '2135']
['Muhlenberg College', 'Allentown, PA', 'muhlenberg.edu', '2330']
['University of Maine', 'Orono, ME', 'umaine.edu', '8677']
['James Madison University', 'Harrisonburg, VA', 'jmu.edu', '19019']
['Michigan State University', 'East Lansing, MI', 'msu.edu', '38853']
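To treat the first row as column names, one option is to pull it off the reader with next() before looping. A sketch on a shortened in-memory copy of the data:

```python
import csv
import io

text = (
    'Name,Location,URL,Students\n'
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
)

reader = csv.reader(io.StringIO(text))
header = next(reader)        # consume the first row: the column names
print("Column names:", header)

data_rows = list(reader)     # everything after the header
for row in data_rows:
    print(row)
```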
DictReader
open("data.csv") is correct; open(data.csv) is not. File names are strings.
Files are always just strings. If we want ints, we need to convert:
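A sketch combining csv.DictReader with the conversion. DictReader uses the header row as keys, so each row is a dictionary, and int() turns the Students text into a number. The in-memory string stands in for the file, and the running total is just an illustration.

```python
import csv
import io

text = (
    'Name,Location,URL,Students\n'
    'Westminster College,"Salt Lake City, UT",westminstercollege.edu,2135\n'
    'Muhlenberg College,"Allentown, PA",muhlenberg.edu,2330\n'
)

total = 0
for row in csv.DictReader(io.StringIO(text)):
    students = int(row["Students"])  # "2135" (a string) becomes 2135 (an int)
    print(row["Name"], students)
    total += students

print(total)  # 4465
```

Without the int() call, adding the values would fail, because the csv module always returns strings.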
Debug and add comments to the following code.
pandas
| | Name | Location | URL | Students |
|---|---|---|---|---|
| 0 | Westminster College | Salt Lake City, UT | westminstercollege.edu | 2135 |
| 1 | Muhlenberg College | Allentown, PA | muhlenberg.edu | 2330 |
| 2 | University of Maine | Orono, ME | umaine.edu | 8677 |
| 3 | James Madison University | Harrisonburg, VA | jmu.edu | 19019 |
| 4 | Michigan State University | East Lansing, MI | msu.edu | 38853 |
We’ll study this next week.