Entry 31: Training Size


At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy problem to check for is the appropriateness of the training data.

The Problem

Algorithms require a minimum amount of data in order to train a model. The amount of data required can be a difficult thing to guess.

More is(n’t always) better

“More data is better” tends to be the catchphrase of machine learning. Aurélien Géron gives two good examples of this in Hands-On Machine Learning. Both papers discuss natural language processing (NLP) problems, but I’ve heard similar sentiments in other areas as well, such as computer vision.

The learning curves in the Entry 30 notebook demonstrated that more data doesn’t necessarily mean better models. The model trained on the scaled house_16H dataset stopped improving after seeing only about 25% of the data.
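The kind of check described above can be sketched with scikit-learn’s `learning_curve`, which scores a model at increasing training sizes. This is a minimal illustration on synthetic data standing in for house_16H (the dataset names and numbers here are placeholders, not the Entry 30 results):

```python
# Hedged sketch: score a model at increasing training-set sizes to see
# where additional data stops helping. Synthetic data is used here in
# place of the actual house_16H dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=2000, n_features=16, noise=10.0, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10% up to 100% of the training data
    cv=5, scoring="r2",
)

# Mean cross-validated score at each training size
mean_test = test_scores.mean(axis=1)
for size, score in zip(train_sizes, mean_test):
    print(f"{size:>5} samples: R^2 = {score:.4f}")
```

If the printed scores stop climbing well before the final training size, the model has likely learned all it can from the data.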

The Machine Learning Mastery material referenced in Entry 30 also pointed out that, in cases of overfitting, more data can actually make a model worse.

Size (probably) matters

Not only can overfitting be avoided by using only as much data as is needed, but using less data also means the algorithm trains faster.

However, there is no free lunch in data science. The flip side of using a subset of data is that the way the training data is split can affect the model’s predictive power. I described data splitting methods in Entry 17 and also discussed the potential dangers of selecting a poor subset in relation to classification problems.

Problems in data splitting are also present when predicting continuous values. Aurélien Géron has a great visualization of this on page 25 of Hands-On Machine Learning:

Training sample

Training on all the data in the chart produced the solid line. When the red points (slight outliers) were removed from the training data, the resulting model produced the dashed line. These two models produce quite different predictions.
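The effect can be reproduced in a few lines. This is an illustrative sketch (not Géron’s actual chart): fit a line with and without a few outlying points and compare the slopes.

```python
# Illustrative sketch: how a handful of outliers shifts a linear fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)  # true relationship: y = 2x + 1

# Inject three outliers at the high end of x
y_out = y.copy()
y_out[-3:] += 15.0

slope_all, intercept_all = np.polyfit(x, y_out, 1)                 # all points
slope_clean, intercept_clean = np.polyfit(x[:-3], y_out[:-3], 1)   # outliers removed

print(f"with outliers:    y = {slope_all:.2f}x + {intercept_all:.2f}")
print(f"without outliers: y = {slope_clean:.2f}x + {intercept_clean:.2f}")
```

Three points out of fifty are enough to noticeably tilt the fitted line, which is exactly why the subset chosen for training matters.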

The Options

This is where the learning curves from the last entry again come in handy.

The only place I saw learning curves explicitly called out as a method to assess the training data size was on page 201 of Machine Learning with Python Cookbook - 11.11 Visualizing the Effect of Training Set Size.

The Proposed Solution

The shape formed by the training and test learning curves provides information about the sufficiency of the training dataset. Once the algorithm has learned all it can from the data, the two curves flatten out and proceed in parallel. If the curves don’t flatten out, that generally means the model doesn’t have enough data.
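The flattening check can be made concrete with a simple heuristic: treat the curve as flat once the last few score improvements fall below a tolerance. This is a minimal sketch with made-up score values, not a standard library function:

```python
# Hedged sketch: decide whether a learning curve has plateaued by checking
# that the last few score-to-score improvements are all below a tolerance.
import numpy as np

def curve_has_plateaued(scores, tol=0.005, window=3):
    """Return True if the last `window` improvements are all below `tol`."""
    gains = np.diff(scores)
    return len(gains) >= window and bool(np.all(np.abs(gains[-window:]) < tol))

# Validation scores at increasing training sizes (illustrative numbers)
flattened = [0.62, 0.74, 0.80, 0.84, 0.842, 0.843, 0.844]
still_rising = [0.40, 0.52, 0.61, 0.68, 0.74, 0.79, 0.83]

print(curve_has_plateaued(flattened))     # True  -> likely enough data
print(curve_has_plateaued(still_rising))  # False -> more data may still help
```

The tolerance and window are judgment calls; the point is only that “flat” can be tested numerically rather than eyeballed.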

Training sample

If the decision is made to reduce the amount of training data, the thing to keep in mind is how the data is split. Some subsets of the training data may generate a different model than the one generated by the full dataset.
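One hedged way to check this worry is to train the same model on several random subsets and compare the scores they produce; a wide spread would signal that the reduced training set is split-sensitive. Again, synthetic data is used here as a stand-in:

```python
# Hedged sketch: train on several random 25% subsets and compare scores
# to gauge how sensitive the model is to the particular split.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"R^2 across 5 random 25% subsets: "
      f"mean={np.mean(scores):.3f}, spread={np.ptp(scores):.4f}")
```

A small spread suggests the 25% subsets are interchangeable; a large one suggests the full dataset (or a stratified split) is safer.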

The Fail

This still leaves the question of whether the learning curve can also show:

  • Whether there is sufficient test data
  • Whether the data split is appropriate
  • How sensitive the model is to data splits

These are perfect questions to address once I get into the modeling series of entries.

Up Next

Model algorithms

Resources