Entry 21: Scoring Regression Models - Theory

7 minute read

I need a way to measure a model’s performance. To do that, first I need to break it down by the type of prediction.

The Problem

Supervised learning, wherein predictions are made with the help of labelled data, tend to come in two flavors. The first is to predict a continuous number (like the price of a house - the predictions would be things like 104,684.54). The second is a class (like whether a customer is likely to purchase a product - yes or no). The prediction depends on the target values. When the target is continuous so is the prediction. When the target is a class, so is the prediction.

I’m going to look at measures for continuous targets first.

Concepts

There are quite a few terms and concepts that the solutions to this problem rely on. Since I appreciate when terms are clearly defined instead of left to the imagination, here is a list with definitions and equations.

Symbols

  • $y_{i}$ is an observed value
  • $\hat{y_{i}}$ is a predicted value
  • $\bar{y_{i}}$ is the mean value
  • $\mu$ is also the mean value
  • $n$ is the number of observations
  • $\sum$ means to sum (add) things together
  • $\lvert x\rvert$ is to take the absolute value (of x in this case)
  • The rest should be basic mathematical symbols

Bias and Variance

Hands-On Machine Learning has a very succinct sidebar with information on bias, variance, and irreducible error. These are the three components that make up a model’s generalization error.

Bias

This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

Variance

This part is due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error

This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

A visual reference for bias and variance from Scott Fortmann-Roe’s Understanding the Bias-Variance Tradeoff.

bias and variance

Bias

Bias is the difference between the expected value and the true value.

In the picture above, this means how close (low bias) or far (high bias) the values are to the bulls eye.

Variance

Variance measures how far a set of numbers are spread out from their average value. The formal definition is: the average of the squared differences from the mean.

$Var = \frac{\sum{(x - \mu)}^2}{n}$

In the picture above, this means how close (grouped together) or far apart (scattered around) the points are from each other.

Residuals

Also called error. Residuals quantify how far off a prediction was from the actual value. If the value is 0 the prediction matched the observed value perfectly. There is no minimum or maximum range. The larger the value, the further the prediction is from the actual value.

The formula is the observed value minus the predicted value:

$e = y_{i} - \hat{y_{i}}$

Keep in mind that the impact of how large this number is depends on the variance of the underlying data. For house prices that range from 100,000 to 10,000,000 a residual of 10,000 would be considered good. However, for a bottle of soap where the prices range from 0.99 to 20.49, a residual of 10,000 would be a really bad prediction.

Example of residuals from Wikimedia Commons Residuals for Linear Regression Fit. The black lines show the residual of the prediction (the red linear fit line) to the observed value (the dark blue dots)

Absolute Error

Technically, the distinction between a residual and an error is that residuals are the difference between the observed value and the predicted value, whereas errors are the difference between the observed value and the true value (ex: the true height of a person may be 5.68741878 feet, but the observed/recorded value would be 5’5”. See Errors and residuals on Wikipedia, the answers to this StackExchange question, and the definition of Mean Absolute Error on Springer Link for more information).

According to one of the answers on StackExchange: “Error term is a theoretical concept that can never be observed, but the residual is a real world value[…].”

The practical use of the absolute error in machine learning is:

$absolute\text{ }error = \lvert y_{i} - \hat{y_{i}}\rvert$

Squared Error

$squared\text{ }error = (y_{i} - \hat{y_{i}})^2$

The Options

While the three definitions above (residuals, absolute error, and squared error) applied to each individual prediction, there are other methods that measure a model’s overall performance. These methods include:

Mean Absolute Error (MAE)

This sums the absolute errors of all predictions and divides by the number of observations (or multiplies by the reciprocal of the number of observations). I.e., it gets the mean of the absolute errors.

$MAE = \frac{\sum \lvert y_{i} - \hat{y_{i}}\rvert}{n} = \frac{1}{n} \sum \lvert y_{i} - \hat{y_{i}}\rvert$

Sum of Squared Errors (SSE)

Also known as the residual sum of squares (RSS) and the sum of squared residuals (SSR). This is just adding up the squared errors for all the predictions.

$SSE = \sum (y_{i} - \hat{y_{i}})^2$

Mean Squared Error (MSE)

This basically takes the sum of squared errors and divides by the number of observations (or multiplies by the reciprocal of the number of observations) to give the mean.

$MSE = \frac{\sum (y_{i} - \hat{y_{i}})^2}{n} = \frac{1}{n} \sum (y_{i} - \hat{y_{i}})^2$

Root Mean Squared Error (RMSE)

By now I can guess that this is just the square root of the MSE, but I looked it up just to be safe (Hands-On Machine Learning pg 37). The purpose of taking the square root of the MSE is to put the result back into the same units as the original data (Applied Predictive Modeling pg 95).

$RMSE = \sqrt{\frac{1}{n} \sum (y_{i} - \hat{y_{i}})^2}$

$R^2$ or coefficient of determination

Per Applied Predictive Modeling page 95:

Proportion of the information in the data explained by the model

  • There are multiple ways to calculate $R^2$, but the one used by Scikit-Learn is:

$R^2 = 1 - \frac{\sum{(y_{i} - \hat{y_{i}})^2}}{\sum{(y_{i} - \bar{y_{i}})^2}}$

An easier to read version is:

$R^2 = \frac{MSE}{Var_{yactual}}$

  • Range of values is generally 0 - 1, but values outside this range can occur
    • A value of 1 means the predictions explain the data perfectly (predictions match the observations)
    • A value of 0 means the predictions explain none of the variability of the data
    • Negative values indicate the mean of the data provides a better fit to the outcomes than the predicted values
  • Caveats

Applied Predictive Modeling page 96:

[…] the practitioner must remember that $R^2$ is a measure of correlation, not accuracy</br> $R^2$ is dependent on the variation in the outcome

Explained variance

$explained\text{ }variance = 1 - \frac{Var(y - \hat{y} )}{Var(y)}$

Where $Var(y-\hat{y} ) = \frac{\sum{(error^2)} - mean(error)}{n}$

According to this StackExchange answer the only difference between $R^2$ and explained variance is the mean(error). So if the mean(error) is 0, explained variance and $R^2$ will be the same. A more thorough answer can also be found on StackOverflow.

The Proposed Solution

RMSE and $R^2$ are the most popular measures according to Applied Predictive Modeling and Hands-On Machine Learning. In the spirit of getting a better intuitive understanding of the metrics, I’m going to run all of them and see what they tell me.

The Fail

The number of books and websites I had to go through to cobble together all this information was ridiculous. I expected it all to be in one place. Part of the problem was that most of the books I own spend more time on classification metrics. Part was that I didn’t remember things that the authors took for granted (thus the Concepts section). Yet another part was that the notation varied in some books. For example, the authors of The Elements of Statistical Learning obviously love mathematics way more than I do.

Getting better at reading math equations is a hurdle I’m going to have to tackle at some point. Eventually I should be able to read a change in notation without having to dig through the internet. All of which sounds like a great, extended, broken out by subject, topic for another (advanced) series of entries.

Up Next

Regression metrics - implementation

Resources