DataFlow Tools

Keith VanderLinden
Calvin University

Demos

Basic Tools
- Python Scripts
- Typer
- Make
SLO Example

We’ll do separate slides for the basic tools and then return here for the SLO example.
The SLO system is more complicated than TagIfAI and the team project, too complicated to run here in real time.
- The pre-processing stack requires tools that are hard to install, so I stick with the container, which has a Dockerfile carefully designed to install the tools properly.
- It’s slow and it takes lots of memory.
- It had a Kafka realtime dataflow, which is just patched here.
./data files - All of these files are DVC-controlled, even the ones that can be reproduced. The system is complicated enough that it was hard to remember how the dataflow worked.
src/dataset-preprocessor.py - This script uses Fire rather than Typer, and does considerably more text pre-processing than the class examples. We won’t go through all that here, but note the full documentation of the Fire functions.
Makefile - This has the same basic structure as the example, but has 7 processing steps, not counting the model.
- I standardized the dataset filenames to help keep things sane.
- Walk back through the prerequisite-target structure, starting at the top, with data/dataset.json.dvc.
- There are named targets in here, e.g., datasets, which makes targets easier to understand, but perhaps isn’t necessary for simpler examples. I PHONY’d them to make sure they always run.
Review the SLO data stack/flow.
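
The prerequisite-target walk described for the Makefile can be sketched in Python. This is only an illustration: the SLO targets aren't listed in these notes, so the rules below are copied from the small Makefile in the basic-tools example, and the names RULES and walk_back are made up for this sketch.

```python
# Each target mapped to its prerequisites, copied from the basic
# Makefile example in this deck (not from the SLO Makefile).
RULES = {
    "ALL": ["./data/dataset.csv"],
    "./data/dataset.csv": ["./data/test.csv"],
    "./data/test.csv": ["./data/test.csv.dvc"],
}


def walk_back(target, rules, depth=0):
    """Return indented lines tracing a target back to its leaf prerequisites."""
    lines = ["  " * depth + target]
    for prerequisite in rules.get(target, []):
        lines.extend(walk_back(prerequisite, rules, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(walk_back("ALL", RULES)))
```

Each level of indentation is one step back in the dataflow, ending at the DVC pointer file that anchors the chain.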

Python Scripts

We’re moving code chunks from Jupyter notebooks to Python scripts such as this one.

import pandas as pd

data_df = pd.read_csv("data/test.csv")

# Add a (pointless) column for the appropriate way to address the soldier.
data_df["address"] = data_df["rank"] + " " + data_df["name"]

data_df.to_csv("data/dataset.csv", index=False)

We can execute this on the CLI.

python src/process_script.py
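
To see what the script does without touching any files, the same transformation can be run on a small in-memory DataFrame. This is a sketch: the function name add_address and the sample rows are invented for illustration, and the real contents of data/test.csv are assumed to include string columns named rank and name.

```python
import pandas as pd


def add_address(data_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the data with an "address" column (rank + name)."""
    result_df = data_df.copy()
    result_df["address"] = result_df["rank"] + " " + result_df["name"]
    return result_df


# Hypothetical sample rows standing in for data/test.csv.
sample_df = pd.DataFrame(
    {"rank": ["Captain", "Sergeant"], "name": ["Miller", "Horvath"]}
)

if __name__ == "__main__":
    print(add_address(sample_df))
```

Wrapping the logic in a function like this is also what makes the move from notebook cell to script to CLI command straightforward.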

Typer

We can use this equivalent Typer script.

import pandas as pd
import typer

app = typer.Typer()

@app.command()
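def dataset_build(
    # typer.Option makes these --options; bare parameters would be positional arguments.
    raw_data_filename: str = typer.Option(...),
    dataset_filename: str = typer.Option(...),
):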
    data_df = pd.read_csv(raw_data_filename)
    data_df["address"] = data_df["rank"] + " " + data_df["name"]
    data_df.to_csv(dataset_filename, index=False)

if __name__ == "__main__":
    app()

And execute it on the CLI.

python src/process.py \
    --raw-data-filename "data/test.csv" \
    --dataset-filename "data/dataset.csv"
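
One way to check that the command accepts these flags is to invoke it in-process with Typer's test runner. The sketch below assumes the parameters are declared with typer.Option (so they are exposed as --options) and adds help text, in the spirit of documenting the CLI functions. The run_demo helper, the sample row, and the temporary file paths are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd
import typer
from typer.testing import CliRunner

app = typer.Typer()


@app.command()
def dataset_build(
    raw_data_filename: str = typer.Option(..., help="Path to the raw CSV file."),
    dataset_filename: str = typer.Option(..., help="Path for the processed CSV file."),
):
    """Add an address column to the raw data and save the result."""
    data_df = pd.read_csv(raw_data_filename)
    data_df["address"] = data_df["rank"] + " " + data_df["name"]
    data_df.to_csv(dataset_filename, index=False)


def run_demo() -> pd.DataFrame:
    """Run the command on a one-row sample file and return the output data."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        raw_path = Path(tmp_dir) / "test.csv"
        out_path = Path(tmp_dir) / "dataset.csv"
        # Hypothetical sample row standing in for the real raw data.
        pd.DataFrame({"rank": ["Captain"], "name": ["Miller"]}).to_csv(
            raw_path, index=False
        )
        result = CliRunner().invoke(
            app,
            ["--raw-data-filename", str(raw_path), "--dataset-filename", str(out_path)],
        )
        assert result.exit_code == 0, result.output
        return pd.read_csv(out_path)


if __name__ == "__main__":
    print(run_demo())
```

Because the app has a single command, Typer runs it directly and no subcommand name is needed on the command line.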

Make

We can now orchestrate the dataflow with a Makefile.

SHELL = /bin/bash
DATA_DIR  := ./data
SRC_DIR   := ./src

ALL: $(DATA_DIR)/dataset.csv

# Recipe lines must start with a tab character, not spaces.
$(DATA_DIR)/test.csv: $(DATA_DIR)/test.csv.dvc
	dvc pull

$(DATA_DIR)/dataset.csv: $(DATA_DIR)/test.csv
	python $(SRC_DIR)/process.py \
	    --raw-data-filename $(DATA_DIR)/test.csv \
	    --dataset-filename $(DATA_DIR)/dataset.csv

# ALL and clean name actions, not files, so mark both as phony.
.PHONY: ALL clean
clean:
	rm -f $(DATA_DIR)/test.csv
	rm -f $(DATA_DIR)/dataset.csv
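
Make decides whether to rerun a recipe by comparing file modification times: a target is rebuilt if it is missing or older than any of its prerequisites. The rule can be sketched roughly in Python. This is not Make's actual implementation; the names is_out_of_date and demo are invented here, and the sketch assumes the prerequisites already exist.

```python
import os
import tempfile
from typing import Iterable


def is_out_of_date(target: str, prerequisites: Iterable[str]) -> bool:
    """Return True if the target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def demo():
    """Exercise the rule on temporary files with explicit timestamps."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        raw = os.path.join(tmp_dir, "test.csv")
        dataset = os.path.join(tmp_dir, "dataset.csv")
        open(raw, "w").close()
        missing = is_out_of_date(dataset, [raw])  # target not built yet

        open(dataset, "w").close()
        os.utime(raw, (1000, 1000))
        os.utime(dataset, (2000, 2000))
        fresh = is_out_of_date(dataset, [raw])  # target newer than input

        os.utime(raw, (3000, 3000))
        stale = is_out_of_date(dataset, [raw])  # input changed after build
        return missing, fresh, stale


if __name__ == "__main__":
    print(demo())
```

This is why rerunning make after a successful build does nothing until an input file changes, and why phony targets, which never correspond to a file, always run.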

Snapshot

See the live demo.