Entry 38: Regularization

Regularization is used to help address overfitting.

Description

There are basically two strategies:

  • Reduce the magnitude/values of the theta array
    • This method retains all features
    • It works well when there are a lot of features and each contributes at least marginally to the model's predictive ability
  • Reduce the number of features
    • This method removes features
    • It can be done two ways:
      • Manually: select which features to keep by hand
      • Automatically: use mathematics to automate feature selection

L1 and L2 Regularization

L1 and L2 regularization are common regularization techniques. Each technique covers one of the regularization strategies above. They are based on the L1 and L2 norms. The thing that helped me understand why they're called L1 and L2 was in a Towards Data Science article:

  • $\text{L1 norm} = \lvert \lvert w \rvert \rvert_{1} = \lvert w_{1} \rvert + \lvert w_{2}\rvert + \dotsb + \lvert w_{n}\rvert$
  • $\text{L2 norm} = \lvert \lvert w\rvert \rvert_{2} = \sqrt{\lvert w_{1}\rvert ^2 + \lvert w_{2}\rvert ^2 + \dotsb + \lvert w_{n}\rvert^2}$
  • $\text{Lp norm} = \lvert \lvert w\rvert \rvert_{p} = \sqrt[p]{\lvert w_{1}\rvert ^p + \lvert w_{2}\rvert ^p + \dotsb + \lvert w_{n} \rvert ^p}$

They are basically all the same equation (the third one), just with different values of p: the L1 norm is the Lp norm with p = 1 and the L2 norm is the Lp norm with p = 2.
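These are easy to sanity check in numpy (a quick sketch of my own, not from the article; `w` is just a made-up weight vector):

```python
import numpy as np

w = np.array([3.0, -4.0, 1.0])

# L1 norm: sum of the absolute values of the entries
l1 = np.sum(np.abs(w))        # 8.0

# L2 norm: square root of the sum of the squared entries
l2 = np.sqrt(np.sum(w ** 2))  # ~5.1

# numpy's built-in norm agrees: ord is the p in the Lp norm
assert np.isclose(l1, np.linalg.norm(w, ord=1))
assert np.isclose(l2, np.linalg.norm(w, ord=2))
```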

Purpose

A regularization term is added to the cost function. This makes the feature look more incorrect than it actually is, which lowers the corresponding theta value and gives the feature less weight.

Cost functions from Hands-On Machine Learning with Scikit-Learn on pages 114, 137, and 135:

  • Base cost function:
    • $J(\theta) = MSE(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m (\theta^{T}x^{(i)} - y^{(i)})^{2}$
  • Cost function with L1 regularization:
    • $J(\theta) = MSE(\theta) + \alpha \displaystyle\sum_{i=1}^n \lvert \theta_{i}\rvert$
  • Cost function with L2 regularization:
    • $J(\theta) = MSE(\theta) + \alpha \frac{1}{2} \displaystyle\sum_{i=1}^n \theta_{i}^{2}$
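To make the notation concrete, here's a rough sketch of those three cost functions in numpy (my own code, not from the book; the penalty sums skip the bias term $\theta_{0}$ since they start at $i = 1$):

```python
import numpy as np

def mse(theta, X, y):
    # Base cost: mean squared error of the predictions.
    # X is assumed to include a leading column of ones, so theta[0] is the bias.
    return np.mean((X @ theta - y) ** 2)

def cost_l1(theta, X, y, alpha):
    # L1 (lasso) cost: MSE plus alpha times the sum of |theta_i|, i >= 1
    return mse(theta, X, y) + alpha * np.sum(np.abs(theta[1:]))

def cost_l2(theta, X, y, alpha):
    # L2 (ridge) cost: MSE plus alpha * 1/2 times the sum of theta_i^2, i >= 1
    return mse(theta, X, y) + alpha * 0.5 * np.sum(theta[1:] ** 2)
```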

Behavior

At first I was like “Why are you adding the penalty? Won’t that make the weight larger? Shouldn’t you be subtracting the penalty and making the weight smaller?”

Here’s how I think about it:

  • The weights (i.e. the theta array) show which features contribute most strongly to the prediction; the larger the weight, the more important that feature.
    • This concept is exemplified in Applied Predictive Modeling on page 101, during the discussion on interpretability:

      […] if the estimated coefficient of a predictor is 2.5, then a 1 unit increase in that predictor’s value would, on average, increase the response by 2.5 units.

      • Where the estimated coefficient is the same as what I’ve been referring to as a weight in the theta array.
  • Features are more important (i.e. have higher weights) when the predicted value is closer to the observed value, i.e. when the mean squared error $\frac{1}{n} \sum (y_{i} - \hat{y}_{i})^2$ is small.
  • Features are less important (i.e. have lower weights) when the predicted value is farther from the observed value, i.e. when the mean squared error $\frac{1}{n} \sum (y_{i} - \hat{y}_{i})^2$ is large.

“These aren’t the droids you’re looking for.”

The key here is that the penalty is added to the cost function (the mean squared error), not to the weight. This makes the feature look worse than it actually is, so it ends up being assigned a lower weight.
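One way to see that in action is the gradient: the L2 penalty adds an extra $\alpha \theta_{i}$ term to the gradient of the cost, so every gradient descent step nudges the penalized weights a little closer to zero. A minimal sketch of one step, assuming the same setup as the cost functions above (my own code, names made up):

```python
import numpy as np

def ridge_gradient_step(theta, X, y, alpha, lr=0.1):
    m = len(y)
    # Gradient of the MSE part of the cost
    mse_grad = (2 / m) * X.T @ (X @ theta - y)
    # Gradient of the L2 penalty: alpha * theta_i (the bias theta_0 is not penalized)
    penalty_grad = alpha * np.r_[0.0, theta[1:]]
    # The penalty gradient always points away from zero,
    # so subtracting it pulls the weights toward zero
    return theta - lr * (mse_grad + penalty_grad)
```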

Strengths and Weaknesses

Information from the Medium article L1 and L2 Regularization:

| L1 Regularization | L2 Regularization |
| --- | --- |
| Penalizes sum of absolute value of weights | Penalizes sum of square weights |
| Sparse solution | Non-sparse solution |
| Has multiple solutions | Has one solution |
| Automatic feature selection | Retains all features |
| Robust to outliers | Not robust to outliers |
| Generates models that are simple and interpretable | Gives better predictions when output variable is a function of all input features |
| Can't learn complex patterns | Can learn complex data patterns |
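The sparse vs. non-sparse row is easy to see with scikit-learn's Lasso (L1) and Ridge (L2). Here's a sketch of my own (made-up data where only the first two of five features matter):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # irrelevant coefficients driven to exactly 0
print(Ridge(alpha=0.1).fit(X, y).coef_)  # irrelevant coefficients small but nonzero
```

Lasso should zero out the three useless coefficients (automatic feature selection), while Ridge only shrinks them.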

Up Next

Lasso Regression

Resources