# Made With ML

Original by G. Mohandas, June 2023; adapted for DATA 562, Spring 2024

This notebook is a re-engineered version of G. Mohandas&rsquo;s [original Made with ML (MwML) course](https://web.archive.org/web/20230623102343/https://madewithml.com/) (cf. this [archival repo](https://github.com/GokuMohandas/follow/blob/main/notebooks/tagifai.ipynb)), circa June 2023. This original version differs from the [current MwML course](https://madewithml.com/) in that it develops and deploys simpler models, which will take less system memory to run and will be easier to work with in this engineering-oriented course. The course goal is to move the code in this notebook into a production-ready form, and to deploy it as a web service.

This notebook demonstrates the construction of models that classify paragraph-length texts based on whether they are or are not about topics in artificial intelligence (AI), and, if so, what AI topics they address, e.g., natural language processing, computer vision, MLOps, etc. Mohandas&rsquo;s original goal was to provide literature search tool for AI researchers.


## Data

Mohandas provides the following datasets:

- [dataset.csv](https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/projects.csv) &mdash; a dataset of sentence-length texts with titles
- [tags.csv](https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/tags.csv) &mdash; hand-labeled tags for the projects dataset

We provide modified copies of these datasets in the `../data/` directory of this repository in part to make them easier to access, but also to add the required IDs to the tagged dataset, which are missing in the original, linked above. For clarity, we&rsquo;ll name our versions of the datasets `projects.csv` and `tags.csv`.

In [1]:
import pandas as pd
from collections import Counter

In [2]:
PROJECTS_URL = "../data/projects.csv"
projects = pd.read_csv(PROJECTS_URL)
projects.head(5)

Unnamed: 0,id,created_on,title,description
0,6,2020-02-20 06:43:18,Comparison between YOLO and RCNN on real world...,Bringing theory to experiment is cool. We can ...
1,7,2020-02-20 06:47:21,"Show, Infer & Tell: Contextual Inference for C...",The beauty of the work lies in the way it arch...
2,9,2020-02-24 16:24:45,Awesome Graph Classification,"A collection of important graph embedding, cla..."
3,15,2020-02-28 23:55:26,Awesome Monte Carlo Tree Search,A curated list of Monte Carlo tree search pape...
4,25,2020-03-07 23:04:31,AttentionWalk,"A PyTorch Implementation of ""Watch Your Step: ..."


In [3]:
TAGS_URL = "../data/tags.csv"
tags = pd.read_csv(TAGS_URL)
tags.head(5)

Unnamed: 0,id,tag
0,6,computer-vision
1,7,computer-vision
2,9,other
3,15,other
4,25,other


### Wrangling

We&rsquo;d like to merge the project texts and the tags into a single dataset.

In [4]:
df = pd.merge(projects, tags, on="id")
df.head()

Unnamed: 0,id,created_on,title,description,tag
0,6,2020-02-20 06:43:18,Comparison between YOLO and RCNN on real world...,Bringing theory to experiment is cool. We can ...,computer-vision
1,7,2020-02-20 06:47:21,"Show, Infer & Tell: Contextual Inference for C...",The beauty of the work lies in the way it arch...,computer-vision
2,9,2020-02-24 16:24:45,Awesome Graph Classification,"A collection of important graph embedding, cla...",other
3,15,2020-02-28 23:55:26,Awesome Monte Carlo Tree Search,A curated list of Monte Carlo tree search pape...,other
4,25,2020-03-07 23:04:31,AttentionWalk,"A PyTorch Implementation of ""Watch Your Step: ...",other


We&rsquo;ll remove projects with no tags so that the training will focus on tagged projects.

In [5]:
df = df[df.tag.notnull()]

And we&rsquo;ll save a copy of the merged dataset.

In [6]:
df.to_csv("../data/labeled_projects.csv", index=False)

### Exploratory Analysis

There are many things we could explore, but for this base example, we&rsquo;ll just check the tag distribution. There are on the order of hundreds of project examples for each of four tags.

In [7]:
tags = Counter(df.tag.values)
print(
    f"rows: {tags.total()}"
    f"\ntags: {len(tags)}"
    f"\ndistribution: {tags.most_common(len(tags))}"
)

rows: 764
tags: 4
distribution: [('natural-language-processing', 310), ('computer-vision', 285), ('other', 106), ('mlops', 63)]


### Preprocessing

Raw text data generally needs to be preprocessed in order to train effective models. We saw, above, that both the title and the description texts provided useful signal, so we&rsquo;ll start by concatenating the title and the text for each project, which provides a single text for each project.

In [8]:
df["text"] = df.title + " " + df.description

We then use the well-known [Natural Language Toolkit (nltk)](https://www.nltk.org/) to perform a number of common preprocessing steps. See the comments in the `clean_text` function for details on each operation.

In [9]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re

nltk.download("stopwords")  # Once downloaded, this will load the stored copy.
STOPWORDS = stopwords.words("english")
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kvlinden/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [10]:
def clean_text(text, lower=True, stem=False, stopwords=STOPWORDS):
    """Clean raw text."""

    # Set all text to lowercase.
    if lower:
        text = text.lower()

    # Remove all stopwords (i.e., common words that will act as noise in the data.)
    if len(stopwords):
        pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
        text = pattern.sub("", text)

    # Systematize the whitespace, and filter out non-alphanumeric characters.
    text = re.sub(
        r"([!\"'#$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text
    )  # Add spacing between objects to be filtered.
    text = re.sub("[^A-Za-z0-9]+", " ", text)  # Remove non-alphanumeric characters.
    text = re.sub(" +", " ", text)  # Remove multiple spaces.
    text = text.strip()  # Strip white space at the ends.

    # Remove hyperlinks.
    text = re.sub(r"http\S+", "", text)

    # Reduce inflected words to their stem form.
    if stem:
        text = " ".join(
            [stemmer.stem(word, to_lowercase=lower) for word in text.split(" ")]
        )

    return text

We now can run the preprocessor on the dataset.

In [11]:
df.text = df.text.apply(clean_text, lower=True, stem=False)
df.text.head(5)

0    comparison yolo rcnn real world videos bringin...
1    show infer tell contextual inference creative ...
2    awesome graph classification collection import...
3    awesome monte carlo tree search curated list m...
4    attentionwalk pytorch implementation watch ste...
Name: text, dtype: object

We now have a clean, flat text for each example in the dataset.

This process has been rather simplified because Mohandas did some work on the original dataset for us, e.g., he combined the low-frequency, non-NLP/vision/MLOps examples under the tag &ldquo;other&rdquo;. Real, live text tends can be messier than this.


## Modeling

We&rsquo;ll now develop a couple potential models for this classification task. They will be simple models because our goal is not to develop cutting-edge models, it&rsquo;s to develop and deploy production systems.

First, we&rsquo;ll get the example document texts (`X`) and their associated tags (`y`).


In [12]:
import numpy as np
import random

In [13]:
X = df.text.to_numpy()
y = df.tag

### Splitting the Dataset

We now write a re-usable function that splits out the training and testing sets (see Mohandas&rsquo;s discussion of this process in [Splitting](https://madewithml.com/courses/mlops/splitting/)).

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
def get_data_splits(X, y, train_size=0.7):
    """Generate balanced data splits, with the given train/text split
    and an even validation/test split.
    """
    X_train, X_, y_train, y_ = train_test_split(X, y, train_size=train_size, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, stratify=y_)
    return X_train, X_val, X_test, y_train, y_val, y_test

In [16]:
X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(X, y)

print(
    f"train: {len(X_train)} ({len(X_train)/len(X):.2f})\n"
    f"  val: {len(X_val)} ({len(X_val)/len(X):.2f})\n"
    f" test: {len(X_test)} ({len(X_test)/len(X):.2f})"
)

train: 534 (0.70)
  val: 115 (0.15)
 test: 115 (0.15)


This gives us our 70-15-15 train-validate-test split.

In [17]:
unique_train_examples = np.unique(y_train, return_counts=True)
pd.DataFrame({
    "Train": unique_train_examples[1],
    "Validate": np.unique(y_val, return_counts=True)[1],
    "Test": np.unique(y_test, return_counts=True)[1],
}, index=unique_train_examples[0])

Unnamed: 0,Train,Validate,Test
computer-vision,199,43,43
mlops,44,9,10
natural-language-processing,217,47,46
other,74,16,16


And we can see here that the proportions between the labels are generally balanced across the three datasets. Because the different sizes of the datasets makes it hard to assess this, Mohandas goes on to balance the dataset sizes and then to compute the standard deviation of each split's class counts from the mean (ideal split). We&rsquo;ll skip these additional steps.

### Building the Models

We build two SciKit-Learn models:

- Random &mdash; a naive random-choice model
- SGD &mdash; a slightly more sophisticated stochastic gradient descent model

The random model will serve as the first of potentially many models of increasing sophistication. This approach helps us to decide whether more sophisticated models are even necessary, thus preventing us from wasting time building unnecessarily-complex models, and also whether our more sophisticated models are actually learning anything, thus helping us avoid deploying models that are trained on leaked data or are mis-configured.

We&rsquo;ve resurrected these simpler models from Mohandas&rsquo;s original materials (see this [archived link](https://web.archive.org/web/20230610050424/https://madewithml.com/courses/mlops/baselines/)). His current course builds a more sophisticated model (see [Distributed training](https://madewithml.com/courses/mlops/training/)), which uses more memory and training time. Feel free to experiment with that newer model as your system resources allow.

To help ensure reproducible results, we&rsquo;ll be fixing the random seeds for each model.

In [18]:
def set_seeds(seed=42):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)

#### Random Model

Mohandas&rsquo;s original random model was hand-built, which was interesting, but this version is built using SciKit-Learn&rsquo;s [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html).

We start over with the merged, labeled dataset that we saved earlier, rerun the preprocessing and the data splitting, build the model, and then test it.

In [19]:
from sklearn.metrics import precision_recall_fscore_support
from sklearn.dummy import DummyClassifier
import pickle

In [20]:
# Reload the data and preprocess it.
df = pd.read_csv("../data/labeled_projects.csv")
df = df.sample(frac=1).reset_index(drop=True)
df["text"] = df.title + " " + df.description
df.text = df.text.apply(clean_text, lower=True, stem=False)

In [21]:
# Split out the datasets.
set_seeds()
X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(
    X=df.text.to_numpy(), y=df.tag
)

In [22]:
# Build the model.
model_random = DummyClassifier(strategy="uniform", random_state=42)
# model_random = DummyClassifier(strategy="most_frequent")

# Random models ignore the data when making predictions,
# so this really isn't training or fitting anything.
y_pred = model_random.fit(X_train, y_train).predict(X_test)

In [23]:
# Evaluate the model
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
{"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}

{'precision': 0.31669578503338097,
 'recall': 0.21739130434782608,
 'f1': 0.23746448861110137}

It&rsquo;s not a good model, but it can serve as a useful baseline &mdash; we definitely need our models to do better than this one &mdash; so we&rsquo;ll save a pickled version of it anyway.

In [24]:
pickle.dump(model_random, open("../models/model_random.pkl", "wb"))

#### Vectorization

So far, we&rsquo;ve treated the words in our input document texts as isolated tokens and, thus, have not captured any meaningful relationships between those tokens. Mohandas uses [term frequency - inverse-document frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (via Scikit-learn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)) to capture the significance of a token to a particular document with respect to all the documents (see Scikit-Learn&rsquo;s presentation of [Tf-idf term weighting](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)).

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [26]:
# Reload the data and preprocess it (again).
df = pd.read_csv("../data/labeled_projects.csv")
df = df.sample(frac=1).reset_index(drop=True)
df["text"] = df.title + " " + df.description
df.text = df.text.apply(clean_text, lower=True, stem=False)

In [27]:
# Split out the datasets (again).
set_seeds()
X_train, X_val, X_test, y_train, y_val, y_test = get_data_splits(
    X=df.text.to_numpy(), y=df.tag
)

In [28]:
# Build a TF-IDF vectorization and use it to transform the datasets.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 7))  # char n-grams
X_train_transformed = vectorizer.fit_transform(X_train)
X_val_transformed = vectorizer.transform(X_val)
print(X_train_transformed.shape)

(534, 86130)


Here, we can see that the vectorizer has built a TF-IDF model on the training data (using `fit`) and used that model to vectorize the training/validation/testing datasets (using `transform`). The transformed datasets are (sparse) matrixes with one row for each of the example documents (e.g., here we see 534 training examples) and one column for each of the words (e.g., here we see 86130 words), in which the values represent the significance of each word to each example document.

#### Data Imbalance

As we&rsquo;ve seen, our dataset is imbalanced with respect to the labels, which can lead to models that are biased toward the majority label.

In [29]:
counts = np.unique(y_train, return_counts=True)
weights = [1.0 / count for i, count in enumerate(counts[1])]
pd.DataFrame({
    "Count": counts[1],
    "Weight": weights,
}, index=counts[0])

Unnamed: 0,Count,Weight
computer-vision,199,0.005025
mlops,44,0.022727
natural-language-processing,217,0.004608
other,74,0.013514


Clearly, the best solution to this problem is to collect more data for the minority classes, but when that&rsquo;s not possible, we can also use techniques like resampling, augmentation, etc.

We&rsquo;ll use the [Imbalanced-Learn](https://imbalanced-learn.org/stable/) to *oversample* the minority classes in the training set to create `X_over` and `y_over`, and then use this more balanced dataset to train the SGD model below. This is a common technique for dealing with imbalanced datasets.

In [30]:
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy="all")
X_train_over, y_train_over = oversample.fit_resample(X_train_transformed, y_train)

In [31]:
counts = np.unique(y_train_over, return_counts=True)
weights = [1.0 / count for i, count in enumerate(counts[1])]
pd.DataFrame({
    "Count": counts[1],
    "Weight": weights,
}, index=counts[0])

Unnamed: 0,Count,Weight
computer-vision,217,0.004608
mlops,217,0.004608
natural-language-processing,217,0.004608
other,217,0.004608


That&rsquo;s a better balance!

#### Stochastic Gradient Descent Model

Mohandas uses a stochastic gradient descent classifier ([SGDClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)) as a (somewhat) more sophisticated model. He uses log loss, which he characterizes as effectively logistic regression with SGD.

In [32]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
from sklearn.pipeline import make_pipeline

In [33]:
# Build the model.
model_sgd = SGDClassifier(
    loss="log_loss",
    penalty="l2",
    alpha=1e-4,
    max_iter=75,
    learning_rate="constant",
    eta0=1e-1,
    power_t=0.1,
    warm_start=True,
)

In [34]:
# Train the model.
num_epochs = 100
for epoch in range(num_epochs):
    # Training
    model_sgd.fit(X_train_over, y_train_over)

    # Evaluation
    train_loss = log_loss(y_train, model_sgd.predict_proba(X_train_transformed))
    val_loss = log_loss(y_val, model_sgd.predict_proba(X_val_transformed))

    if not epoch % 10:
        print(
            f"Epoch: {epoch:02d} | "
            f"train_loss: {train_loss:.5f}, "
            f"val_loss: {val_loss:.5f}"
        )

Epoch: 00 | train_loss: 0.20215, val_loss: 0.49884
Epoch: 10 | train_loss: 0.14933, val_loss: 0.46155
Epoch: 20 | train_loss: 0.13937, val_loss: 0.45524
Epoch: 30 | train_loss: 0.13590, val_loss: 0.45401
Epoch: 40 | train_loss: 0.13414, val_loss: 0.45393
Epoch: 50 | train_loss: 0.13290, val_loss: 0.45413
Epoch: 60 | train_loss: 0.13196, val_loss: 0.45448
Epoch: 70 | train_loss: 0.13104, val_loss: 0.45465
Epoch: 80 | train_loss: 0.13033, val_loss: 0.45491
Epoch: 90 | train_loss: 0.12960, val_loss: 0.45508


Now that we have all the artifacts required to run our model, we&rsquo;ll bundle them into a [Scikit Learn pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

In [35]:
model_sgd_pipeline = make_pipeline(vectorizer, model_sgd)

This pipeline will allow us to easily save and load the model, and to run the model on new data. Here, we&rsquo;ll run it on the test data.

In [36]:
y_pred = model_sgd_pipeline.predict(X_test)
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
{"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}

{'precision': 0.9089527449775897,
 'recall': 0.8956521739130435,
 'f1': 0.8979150604415378}

That&rsquo;s a much better performance than the baseline, random model.

Here, we use the new model to classify one made-up example. Note that input text has been manually cleaned so that it matches what the model was trained on.

In [37]:
# Classify an example.
model_sgd_pipeline.predict(["hello mlops"])[0]

'mlops'

And we&rsquo;ll save a pickled version of the model for later use.

In [38]:
pickle.dump(model_sgd_pipeline, open("../models/model_sgd_pipeline.pkl", "wb"))

At this point, the notebook is complete. We&rsquo;ve demonstrated that we can wrangle this data, create reasonable models, and use them to classify new texts.

This serves as the starting point for this course, in which we&rsquo;ll rebuild the wrangling and modeling demonstrated in this notebook into a more properly engineered ML system.