Feature Engineering

Keith VanderLinden
Calvin University

Feature Types

Features, important inputs to the modeling process, can be:

  • Engineered
    • Handling missing values (cf. MCAR, MAR, MNAR)
    • Feature scaling
    • Feature crossing
  • Learned
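
The three engineered-feature operations listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended pipeline: the records, the field names, and the helpers `impute_mean`, `standardize`, and `cross` are all hypothetical. Mean imputation is used only because it is simple; it is easiest to justify when values are missing completely at random (MCAR).

```python
from statistics import mean, pstdev

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 40_000, "city": "GR", "device": "ios"},
    {"age": None, "income": 60_000, "city": "GR", "device": "android"},
    {"age": 35, "income": 80_000, "city": "CHI", "device": "ios"},
]

def impute_mean(values):
    """Handling missing values: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Feature scaling: shift to zero mean, divide by the population std. deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def cross(a, b):
    """Feature crossing: combine two categorical features into a single one."""
    return [f"{x}_x_{y}" for x, y in zip(a, b)]

ages = impute_mean([r["age"] for r in rows])
incomes = standardize([r["income"] for r in rows])
city_device = cross([r["city"] for r in rows], [r["device"] for r in rows])
```

Here the missing age is filled with the mean of the two observed ages, the incomes are centered on zero, and each row gains a crossed feature such as `GR_x_ios`.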

C. Huyen argues that we are not yet able to learn all features automatically, so engineered features still matter.

Data Leakage

Data leakage occurs when information from outside the training set is used to create the model.

  • Splitting randomly when the data contain correlated subsets
    • time-sequenced data
    • grouped data
  • Processing data before splitting it into training and testing sets
    • Scaling
    • Imputation
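
One way to avoid both kinds of leak is to split first, respecting time order or group membership, and only then compute scaling statistics from the training portion. The sketch below assumes a hypothetical record layout of `(timestamp, group_id, value)`; the helper names are illustrative and do not come from any particular library.

```python
from statistics import mean, pstdev

# Hypothetical time-ordered observations: (timestamp, group_id, value).
data = [(t, f"user{t % 3}", float(10 + t)) for t in range(10)]

def time_split(records, train_fraction=0.8):
    """Time-sequenced data: train on the past, test on the future."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def group_split(records, test_groups):
    """Grouped data: keep every record of a group on one side of the split."""
    train = [r for r in records if r[1] not in test_groups]
    test = [r for r in records if r[1] in test_groups]
    return train, test

def fit_scaler(values):
    """Compute scaling statistics from the training values only."""
    return mean(values), pstdev(values)

def apply_scaler(values, mu, sigma):
    """Apply previously fitted statistics; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]

# Split first, then fit the scaler on the training portion alone.
train, test = time_split(data)
mu, sigma = fit_scaler([v for _, _, v in train])
train_scaled = apply_scaler([v for _, _, v in train], mu, sigma)
test_scaled = apply_scaler([v for _, _, v in test], mu, sigma)

# Alternative split for grouped data: hold out all records of one user.
g_train, g_test = group_split(data, test_groups={"user2"})
```

The same pattern applies to imputation: the fill value would be estimated from `train` and then reused unchanged on `test`.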

Data leaks are notoriously hard to prevent.

Feature Stores

Feature stores connect data and model engineering workflows, providing these capabilities:

  • Feature management
  • Feature computation
  • Feature consistency
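
A toy in-memory sketch can make the three capabilities concrete. The `FeatureStore` class below is entirely hypothetical and far simpler than any production system; it only illustrates registering one shared definition per feature, computing a value once, and serving the same value to training and to prediction code.

```python
from typing import Callable, Dict, Tuple

class FeatureStore:
    """Toy in-memory feature store (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._definitions: Dict[str, Callable[[dict], float]] = {}
        self._cache: Dict[Tuple[str, str], float] = {}

    def register(self, name: str, fn: Callable[[dict], float]) -> None:
        # Feature management: one named, shared definition per feature.
        self._definitions[name] = fn

    def get(self, name: str, entity_id: str, raw: dict) -> float:
        # Feature computation: compute once, then reuse the stored value.
        key = (name, entity_id)
        if key not in self._cache:
            self._cache[key] = self._definitions[name](raw)
        return self._cache[key]

store = FeatureStore()
store.register("order_total", lambda raw: sum(raw["prices"]))

raw = {"prices": [3.0, 4.5]}
# Feature consistency: training and serving read the same definition and value.
training_value = store.get("order_total", "order-1", raw)
serving_value = store.get("order_total", "order-1", raw)
```

Because both callers go through `store.get`, neither can drift to its own reimplementation of `order_total`.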

They’re useful when multiple applications share feature definitions and construction logic.

DMLS Figure 7-8