
# Python Data Wrangling and Classification
### K Arnold

---

## Example Dataset: Titanic Passengers

* <https://www.openml.org/d/40945>
* <http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html>
* <http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt>
* <https://www.encyclopedia-titanica.org/>

Download the data if we don't already have it:

```r
data_filename <- "data/titanic.csv"
if (!file.exists(data_filename)) {
  dir.create("data")
  download.file("https://www.openml.org/data/get_csv/16826755/phpMYEkMl", data_filename)
}
```
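
The same "download only if missing" step can be sketched in Python using only the standard library. The helper name `download_if_missing` is made up for this illustration; the URL and filename in the usage comment are the ones from the R chunk above.

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_if_missing(url, filename):
    """Download url to filename unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    path = Path(filename)
    if path.exists():
        return False
    # Create the parent directory (e.g. "data/") if needed.
    path.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(path))
    return True

# Usage (performs a network request the first time it runs):
# download_if_missing(
#     "https://www.openml.org/data/get_csv/16826755/phpMYEkMl",
#     "data/titanic.csv")
```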

---

## Python Setup

```r
library(reticulate)
py_config()
```

```
## python:         /Users/ka37/Library/r-miniconda/envs/r-reticulate/bin/python
## libpython:      /Users/ka37/Library/r-miniconda/envs/r-reticulate/lib/libpython3.6m.dylib
## pythonhome:     /Users/ka37/Library/r-miniconda/envs/r-reticulate:/Users/ka37/Library/r-miniconda/envs/r-reticulate
## version:        3.6.11 | packaged by conda-forge | (default, Aug  5 2020, 20:19:23)  [GCC Clang 10.0.1 ]
## numpy:          /Users/ka37/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version:  1.19.1
```

---

## The Python Data Science Toolbox

* **Pandas** (`pd`): the main library for wrangling tabular data in Python (analogous to the *tidyverse*).
* **NumPy** (`np`): the underlying math library. Gives us `array`s of numbers.
* **scikit-learn**: the main library for machine learning in Python.

```python
import pandas as pd
import numpy as np
```

---

## Pandas

---

### Loading data

Data frames in R are automatically converted into Pandas `DataFrame`s when Python code accesses them through reticulate's `r` object:

```r
library(readr)  # provides read_csv()
titanic <- read_csv("data/titanic.csv", na = "?")
```

```python
r.titanic.__class__
```

```
## <class 'pandas.core.frame.DataFrame'>
```

Pandas can also read CSV files itself. CSV is a quirky format, so check the docs
for the many parameters that `read_csv` accepts.

```python
titanic = pd.read_csv("data/titanic.csv", na_values="?")
```
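
To see what `na_values` changes, here is a small sketch on a made-up three-row CSV (not the Titanic file). Without `na_values="?"`, a single `?` is enough to make an otherwise numeric column load as text.

```python
import io
import pandas as pd

# A tiny made-up CSV in the same style: "?" marks a missing value.
csv_text = "name,age,fare\nA,29,211.3\nB,?,151.55\nC,2,?\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes)          # age and fare are read as strings
print(clean.dtypes)        # age and fare are numeric, with NaN where "?" was
print(clean.isna().sum())  # number of missing values per column
```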

---

### Exploring data structure

---

```python
titanic.shape
```

```
## (1309, 14)
```

```python
num_people, num_variables = titanic.shape
print(f"{num_people} people, {num_variables} variables about each")
```

```
## 1309 people, 14 variables about each
```

```python
titanic.head()
```

```
##    pclass  survived  ...   body                        home.dest
## 0       1         1  ...    NaN                     St Louis, MO
## 1       1         1  ...    NaN  Montreal, PQ / Chesterville, ON
## 2       1         0  ...    NaN  Montreal, PQ / Chesterville, ON
## 3       1         0  ...  135.0  Montreal, PQ / Chesterville, ON
## 4       1         0  ...    NaN  Montreal, PQ / Chesterville, ON
## 
## [5 rows x 14 columns]
```

---

```python
titanic.info()
```

```
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1309 entries, 0 to 1308
## Data columns (total 14 columns):
##  #   Column     Non-Null Count  Dtype  
## ---  ------     --------------  -----  
##  0   pclass     1309 non-null   int64  
##  1   survived   1309 non-null   int64  
##  2   name       1309 non-null   object 
##  3   sex        1309 non-null   object 
##  4   age        1046 non-null   float64
##  5   sibsp      1309 non-null   int64  
##  6   parch      1309 non-null   int64  
##  7   ticket     1309 non-null   object 
##  8   fare       1308 non-null   float64
##  9   cabin      295 non-null    object 
##  10  embarked   1307 non-null   object 
##  11  boat       486 non-null    object 
##  12  body       121 non-null    float64
##  13  home.dest  745 non-null    object 
## dtypes: float64(3), int64(4), object(7)
## memory usage: 143.3+ KB
```

---

```python
titanic.describe()
```

```
##             pclass     survived  ...         fare        body
## count  1309.000000  1309.000000  ...  1308.000000  121.000000
## mean      2.294882     0.381971  ...    33.295479  160.809917
## std       0.837836     0.486055  ...    51.758668   97.696922
## min       1.000000     0.000000  ...     0.000000    1.000000
## 25%       2.000000     0.000000  ...     7.895800   72.000000
## 50%       3.000000     0.000000  ...    14.454200  155.000000
## 75%       3.000000     1.000000  ...    31.275000  256.000000
## max       3.000000     1.000000  ...   512.329200  328.000000
## 
## [8 rows x 7 columns]
```

---

### Tidying data

---

#### Drop unneeded columns

```python
titanic2 = titanic.drop(['ticket', 'body'], axis = 1)
```

#### Rename columns

```python
titanic3 = titanic2.rename(columns={
  "pclass": "passenger_class",
  "sibsp": "num_siblings_or_spouses_aboard",
  "parch": "num_parents_or_children_aboard",
  "embarked": "embarked_from_port",
  "boat": "lifeboat",
})
```

Note that most (but not all!) Pandas methods make a *new* `DataFrame` (they don't modify the existing one).
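
A quick way to convince yourself of this, sketched on a made-up two-column frame: `rename` returns a new `DataFrame` and leaves the original alone. The same sketch shows a related gotcha: by default, `rename` silently ignores keys that don't match any column, unless you pass `errors="raise"`.

```python
import pandas as pd

df = pd.DataFrame({"pclass": [1, 3], "sibsp": [0, 1]})

renamed = df.rename(columns={"pclass": "passenger_class"})
print(list(df.columns))       # the original is unchanged
print(list(renamed.columns))

# Keys that don't match any column are silently ignored by default...
same = df.rename(columns={"no_such_column": "x"})
print(list(same.columns))

# ...unless you ask for an error:
try:
    df.rename(columns={"no_such_column": "x"}, errors="raise")
except KeyError as err:
    print("KeyError:", err)
```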

---

### Dropping missing data

This dataset has a lot of missing data in some columns. For demonstration
purposes, we'll drop passengers who are missing any of the values we need,
without investigating why. But in general:

**Be careful about dropping missing data if you don't know why it's missing**!

```python
titanic4 = titanic3.dropna(subset = ['age', 'fare', 'embarked_from_port'])
```
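
Before dropping anything, it helps to count what is missing. A minimal sketch on a made-up frame (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [29.0, np.nan, 30.0, 25.0],
    "fare":  [211.3, 151.6, np.nan, 7.9],
    "cabin": ["B5", None, None, None],
})

# How many values are missing in each column?
print(df.isna().sum())

# Drop rows missing age or fare; missing cabin values are tolerated.
kept = df.dropna(subset=["age", "fare"])
print(f"kept {len(kept)} of {len(df)} rows")
print(list(kept.index))   # surviving rows keep their original index labels
```

Note that `dropna` does not renumber the index, so the row labels of the result can have gaps.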

---

### Querying data

.pull-left[
Each column of a `pd.DataFrame` is a `pd.Series`, which is a NumPy `array` with (optional) labels.

```python
titanic4['passenger_class']
```

```
## 0       1
## 1       1
## 2       1
## 3       1
## 4       1
##        ..
## 1301    3
## 1304    3
## 1306    3
## 1307    3
## 1308    3
## Name: passenger_class, Length: 1043, dtype: int64
```
]

.pull-right[

```python
is_first_class = titanic4['passenger_class'] == 1
is_first_class
```

```
## 0        True
## 1        True
## 2        True
## 3        True
## 4        True
##         ...  
## 1301    False
## 1304    False
## 1306    False
## 1307    False
## 1308    False
## Name: passenger_class, Length: 1043, dtype: bool
```

How many rows does this Series have? How many columns?
]
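
Because `True` counts as 1 and `False` as 0, a Boolean `Series` can be summed or averaged directly. A small sketch with made-up class values:

```python
import pandas as pd

df = pd.DataFrame({"passenger_class": [1, 1, 2, 3, 3, 3]})
is_first_class = df["passenger_class"] == 1

num_first = is_first_class.sum()     # number of True values
frac_first = is_first_class.mean()   # fraction of True values
print(num_first, frac_first)
```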

---

### Filtering data

.pull-left[
You can use a Boolean series to query data. This syntax means: filter the 
data frame to include only the rows that correspond to a `True`:

```python
titanic4[is_first_class]
```

```
##      passenger_class  survived  ... lifeboat                         home.dest
## 0                  1         1  ...        2                      St Louis, MO
## 1                  1         1  ...       11   Montreal, PQ / Chesterville, ON
## 2                  1         0  ...      NaN   Montreal, PQ / Chesterville, ON
## 3                  1         0  ...      NaN   Montreal, PQ / Chesterville, ON
## 4                  1         0  ...      NaN   Montreal, PQ / Chesterville, ON
## ..               ...       ...  ...      ...                               ...
## 316                1         0  ...      NaN  Geneva, Switzerland / Radnor, PA
## 317                1         1  ...        A  Geneva, Switzerland / Radnor, PA
## 319                1         1  ...        3                               NaN
## 321                1         0  ...      NaN                       Halifax, NS
## 322                1         1  ...        8     New York, NY / Washington, DC
## 
## [282 rows x 12 columns]
```
]

.pull-right[
You can combine queries using Boolean operations (but they need to be the
element-wise versions: `&`, `|`, and `~` instead of `and`, `or`, and `not`).

```python
had_companions = titanic4['num_siblings_or_spouses_aboard'] > 0
titanic4[is_first_class & had_companions]
```

```
##      passenger_class  survived  ... lifeboat                        home.dest
## 1                  1         1  ...       11  Montreal, PQ / Chesterville, ON
## 2                  1         0  ...      NaN  Montreal, PQ / Chesterville, ON
## 3                  1         0  ...      NaN  Montreal, PQ / Chesterville, ON
## 4                  1         0  ...      NaN  Montreal, PQ / Chesterville, ON
## 6                  1         1  ...       10                       Hudson, NY
## ..               ...       ...  ...      ...                              ...
## 304                1         1  ...        5                     Portland, OR
## 310                1         0  ...      NaN                   Youngstown, OH
## 311                1         1  ...        8                   Youngstown, OH
## 312                1         0  ...      NaN                  Elkins Park, PA
## 314                1         1  ...        4                  Elkins Park, PA
## 
## [119 rows x 12 columns]
```
]
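
When the comparisons are written inline rather than stored in variables first, each one needs its own parentheses. A sketch on a made-up three-row frame, which also shows what goes wrong with `and`:

```python
import pandas as pd

df = pd.DataFrame({"pclass": [1, 1, 3], "sibsp": [0, 2, 1]})

# Parentheses are required: & binds more tightly than == and >.
both = df[(df["pclass"] == 1) & (df["sibsp"] > 0)]
either = df[(df["pclass"] == 1) | (df["sibsp"] > 0)]
not_first = df[~(df["pclass"] == 1)]

# Python's `and` asks for a single truth value, which a Series doesn't have.
try:
    (df["pclass"] == 1) and (df["sibsp"] > 0)
except ValueError as err:
    print("ValueError:", err)
```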

---

### Counting

You can get the counts of how many times each item occurs in a `Series`:

```python
titanic4['passenger_class'].value_counts()
```

```
## 3    500
## 1    282
## 2    261
## Name: passenger_class, dtype: int64
```
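
Two variations that are often useful, sketched on a made-up `Series`: `normalize=True` gives proportions instead of counts, and `sort_index()` orders the result by label rather than by frequency.

```python
import pandas as pd

classes = pd.Series([3, 1, 3, 2, 3, 1], name="passenger_class")

counts = classes.value_counts()                 # sorted by count, descending
shares = classes.value_counts(normalize=True)   # fractions that sum to 1
by_class = classes.value_counts().sort_index()  # sorted by class label

print(shares)
```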

---

### Separating data into features and outcomes

sklearn needs the features to be in a separate data frame from the outcomes,
so we need to split them apart ourselves. If we want to predict survival,
we can create `y` as:

```python
y = titanic4['survived'] == 1
```

To create `X`, we can either drop the columns we don't want (see "tidying data" above)
or directly ask for a list of columns we do want:

```python
#titanic4.columns # this can help you look at the column names
numeric_features = [
  'age', 'num_siblings_or_spouses_aboard', 'num_parents_or_children_aboard']
# We'll use these later
categorical_features = ['passenger_class', 'sex', 'embarked_from_port']

X = titanic4[numeric_features]
```

---

```python
X.info()
```

```
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 1043 entries, 0 to 1308
## Data columns (total 3 columns):
##  #   Column                          Non-Null Count  Dtype  
## ---  ------                          --------------  -----  
##  0   age                             1043 non-null   float64
##  1   num_siblings_or_spouses_aboard  1043 non-null   int64  
##  2   num_parents_or_children_aboard  1043 non-null   int64  
## dtypes: float64(1), int64(2)
## memory usage: 32.6 KB
```

---

## Scikit-Learn (`sklearn`)

---

### Documentation and imports

The documentation is very well structured:

* the [User Guide](https://scikit-learn.org/stable/user_guide.html) gives narrative
  documentation with background and examples (e.g., [logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression))
* the [API Reference](https://scikit-learn.org/stable/modules/classes.html) gives
  the nitty-gritty details about individual classes and functions
* the [Examples](https://scikit-learn.org/stable/auto_examples/index.html) show
  worked examples of using most components.

It's conventional to import only what you actually need from `sklearn`:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import accuracy_score, precision_score, recall_score
```

---

### Train-Test Split

First, we'll hold out a test set of 10% of the passengers. We'll set a random
seed so that this process is reproducible:

```python
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.1)
X_train.shape, y_train.shape
```

```
## ((938, 3), (938,))
```

```python
X_test.shape, y_test.shape
```

```
## ((105, 3), (105,))
```
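
An alternative to seeding NumPy's global generator is to pass `random_state` to `train_test_split` itself. The sketch below uses synthetic stand-in arrays with the same number of rows as our filtered data. The `stratify` argument is an optional extra, not something the slides use, that keeps class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the filtered data.
X_demo = np.arange(1043 * 3).reshape(1043, 3)
y_demo = np.arange(1043) % 2 == 0

# random_state makes this one call reproducible without touching global state;
# stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)
```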

---

### Classifier API

All classifiers have the same basic interface: construct, `fit`, and `predict`.

We'll create a `LogisticRegression` object called `clf`, with the
regularization parameter `C` set to 0.1.

```python
clf = LogisticRegression(C = 0.1, solver = "lbfgs")
clf.fit(X, y);
y_pred = clf.predict(X)
```
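
Here is the construct / `fit` / `predict` cycle once more on synthetic data, as a sketch, along with `predict_proba`, which most sklearn classifiers also provide. The data and the noise level are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: one feature, and larger values tend to be positive.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200)) > 0

demo_clf = LogisticRegression(C=0.1)        # construct (smaller C = stronger regularization)
demo_clf.fit(X_demo, y_demo)                # fit: learns coef_ and intercept_
labels = demo_clf.predict(X_demo[:5])       # predict: hard True/False labels
probs = demo_clf.predict_proba(X_demo[:5])  # one column of probabilities per class

print(demo_clf.classes_)                    # column order for predict_proba
print(labels, probs.round(2))
```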

---

### Metrics

The `sklearn.metrics` module implements a [variety of useful metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation).

```python
accuracy_score(y_true = y, y_pred = y_pred)
```

```
## 0.6184084372003835
```

```python
precision_score(y_true = y, y_pred = y_pred)
```

```
## 0.651685393258427
```

```python
recall_score(y_true = y, y_pred = y_pred)
```

```
## 0.13647058823529412
```

Remember that "recall" = true positive rate = *sensitivity*. sklearn doesn't
directly implement *specificity*, but it does give us "precision" = positive predictive value
(see [Wikipedia](https://en.wikipedia.org/wiki/Sensitivity_and_specificity#Confusion_matrix)).
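
If you do need specificity, one way to get it is from the confusion matrix, or equivalently as the recall of the negative class. A sketch on a small hand-made example (1 = survived, 0 = did not; the labels are made up):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0]

# For binary labels, confusion_matrix(...).ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)   # same as recall_score(y_true_demo, y_pred_demo)
specificity = tn / (tn + fp)   # recall of the negative class
precision = tp / (tp + fp)     # same as precision_score(y_true_demo, y_pred_demo)

# Specificity can also be computed by treating 0 as the "positive" label.
specificity_alt = recall_score(y_true_demo, y_pred_demo, pos_label=0)

print(sensitivity, specificity, precision)
```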

---

### Cross Validation

```python
cv_results = cross_validate(clf, X_train, y_train, cv=5,
  scoring=['accuracy', 'precision', 'recall'])

# Wrap the results in a DataFrame:
cv_results = pd.DataFrame(cv_results).reset_index()
```

We can now access this data in R.

```r
py$cv_results
```

```
##   index   fit_time  score_time test_accuracy test_precision test_recall
## 1     0 0.03214216 0.005925179     0.5957447      0.5333333   0.1038961
## 2     1 0.03038025 0.005378008     0.6329787      0.8333333   0.1298701
## 3     2 0.03310490 0.011760235     0.6276596      0.7058824   0.1558442
## 4     3 0.03105903 0.008498907     0.5882353      0.4782609   0.1447368
## 5     4 0.02601981 0.004664183     0.6203209      0.6666667   0.1315789
```
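
The per-fold table can also be summarized on the Python side. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; with the real data you would pass `clf`, `X_train`, and `y_train` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; the real X_train / y_train would be used instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

results = pd.DataFrame(cross_validate(
    LogisticRegression(), X_demo, y_demo, cv=5,
    scoring=["accuracy", "precision", "recall"]))

# One row per fold; summarize the score columns across folds.
score_cols = ["test_accuracy", "test_precision", "test_recall"]
summary = results[score_cols].agg(["mean", "std"])
print(summary)
```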

---

### Column Transformers

Column transformers let us apply preprocessing steps to subsets of columns.
For example, we'll scale the numeric features:

```python
numeric_feature_proc = StandardScaler()
```

and one-hot-encode the categorical features:

```python
categorical_feature_proc = OneHotEncoder()
```

And we'll apply each pre-processor to its corresponding columns:

```python
preprocessor = make_column_transformer(
  (numeric_feature_proc,     numeric_features),
  (categorical_feature_proc, categorical_features),
  remainder = 'drop')
```
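
To see what a column transformer produces, here is a sketch on a made-up four-row frame. The names `toy` and `toy_preprocessor` are hypothetical, and `handle_unknown="ignore"` is an optional setting that the slides do not use; it can help when a category shows up at prediction time that was never seen during fitting.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "sex": ["female", "male", "male", "female"],
    "name": ["a", "b", "c", "d"],   # not listed below, so it is dropped
})

toy_preprocessor = make_column_transformer(
    (StandardScaler(), ["age"]),
    # handle_unknown="ignore" encodes categories unseen during fit as all
    # zeros instead of raising an error.
    (OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    remainder="drop")

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)   # 1 scaled column + 2 one-hot columns
```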

---

### Pipelines

Pipelines put several steps in sequence. Like *workflows* in `tidymodels`, we
can use pipelines to say that the data should be preprocessed before running
the model:

```python
clf = make_pipeline(preprocessor, LogisticRegression())
```

Now we can use all of our features!

```python
X = titanic4.drop(["survived"], axis = 1)
```

Redo the train-test split:

```python
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.1)
```

---

### Pipelines have the same API as models (`fit`, `predict`)

```python
clf.fit(X, y);
```

Just as a demo, let's predict and score on the same full dataset we just fit on.
Remember that this overestimates the accuracy we'd get on truly unseen data.

```python
y_pred = clf.predict(X)
accuracy_score(y_true = y, y_pred = y_pred)
```

```
## 0.7909875359539789
```
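
To compare accuracy on seen and unseen data, fit on the training part only and score both parts. A sketch on synthetic data (not the Titanic features), using a simpler pipeline with just a scaler:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)                 # the scaler is fit on the training rows only

train_acc = pipe.score(X_tr, y_tr)   # accuracy on data the model has seen
test_acc = pipe.score(X_te, y_te)    # accuracy on held-out data
print(f"train: {train_acc:.3f}, held-out: {test_acc:.3f}")

# make_pipeline names each step after its lowercased class name.
print(pipe.named_steps["logisticregression"].coef_)
```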

---

### CV with pipelines

A pipeline behaves exactly like a classifier (it has `fit` and `predict`),
so we can use exactly the same code to validate it.

```python
cv_results = cross_validate(clf, X_train, y_train, cv=5,
  scoring=['accuracy', 'precision', 'recall'])

# Wrap the results in a DataFrame:
cv_results = pd.DataFrame(cv_results).reset_index()
```

We can now access this data in R.

```r
py$cv_results
```

```
##   index   fit_time  score_time test_accuracy test_precision test_recall
## 1     0 0.06636715 0.019825697     0.8138298      0.8088235   0.7142857
## 2     1 0.03014016 0.009572983     0.7925532      0.7567568   0.7272727
## 3     2 0.03230286 0.007397175     0.7553191      0.6781609   0.7662338
## 4     3 0.01353335 0.005396843     0.7433155      0.7121212   0.6184211
## 5     4 0.03007913 0.010706902     0.8074866      0.8030303   0.6973684
```