Entry 32: Modeling data
The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms: how to implement them, what the common strengths and weaknesses of each category are, and what parameters are available for tuning the models.
The notebook where I created the little example charts can be found on my GitHub page as the Entry 32 notebook.
The Problem
There are a lot of machine learning algorithms, and there are even more programmatic or mathematical implementations of them. All of my machine learning books group these algorithms either by learning style (supervised, unsupervised, and semi-supervised) or by functional similarity (regression, classification, clustering, SVMs, neural networks, trees, etc.).
One thing to keep in mind when exploring the strengths and weaknesses of these algorithms is that different algorithms are designed to find different kinds of patterns. Some find linear patterns; others find exponential, logarithmic, or polynomial (curve) patterns; still others find patterns such as clusters or circles (see the sketch below).
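As a rough illustration, here's a minimal sketch using scikit-learn's dataset generators to produce each of these pattern shapes (the generators are real scikit-learn functions; the sample sizes and noise levels are arbitrary choices of mine):

```python
# Three kinds of patterns an algorithm might be asked to find. A model
# built for one shape (e.g., a straight line) can fail badly on the others.
from sklearn.datasets import make_regression, make_blobs, make_circles
from sklearn.linear_model import LinearRegression

# Linear pattern: a straight-line fit captures it well
X_lin, y_lin = make_regression(n_samples=200, n_features=1, noise=10, random_state=0)
print("R^2 on linear data:", LinearRegression().fit(X_lin, y_lin).score(X_lin, y_lin))

# Cluster pattern: there is no line to fit; grouping by distance is the natural tool
X_clust, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Circular pattern: two concentric rings that no single straight line can separate
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)
```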
Of course, all of this becomes much more theoretical when there are over 400 features and some are numerical, some are categorical, some are numerical representations of categories, others are only 10% populated, etc. Introduction to Machine Learning with Python points out on page 34 that:
> Any intuition derived from datasets with few features (also called low-dimensional datasets) might not hold in datasets with many features (high-dimensional datasets). As long as you keep that in mind, inspecting algorithms on low-dimensional datasets can be very instructive.
The Options
In keeping with the spirit of prior entries, I want to devote one or more entries to each major category of algorithm. The number of entries dedicated to a particular category will depend on the number of subcategories, how different those subcategories are from each other, and whether there are any other fundamental concepts that need to be explored/explained.
As for organization of the categories, a combination of learning style and functional similarity seems to fit the bill. While most of my books break the algorithms into categories by functional similarity, each of these categories generally falls under either Supervised or Unsupervised learning.
- Supervised learning is when there is a correct answer that is known
  - This is the type of algorithm I've been running in my previous entries
  - For example, in the horse colic dataset, each horse either died of colic or didn't - yes or no
- Unsupervised learning is when the answer (known or not) isn't used
  - The MNIST digits dataset is often used this way. By clustering digits that the model thinks look alike, groups can be formed of similar-looking numbers - like all the 9s. Such groupings can also pull in other similar-looking numbers, like 4s (see the sketch after this list)
  - During the creation of the groupings, the algorithm doesn't take the labels (i.e. 9 or 4) into account
  - A real-world use case is when the true answer isn't known, or is only partially known, as in the case of fraud. By creating groups of similar claims, fraud that had previously gone unrecognized (and thus unlabeled) can be identified and acted upon
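To make the distinction concrete, here's a minimal sketch on scikit-learn's small built-in digits dataset (a scaled-down cousin of MNIST); the parameter values are illustrative assumptions, not tuned choices:

```python
# Supervised vs unsupervised on the same data: the classifier trains on the
# labels y, while KMeans never sees them and only groups similar images.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)

# Supervised: the known correct answers (y) drive the learning
clf = LogisticRegression(max_iter=5000).fit(X, y)

# Unsupervised: only the pixel values are used; the labels are ignored
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Checking one cluster against the withheld labels shows what got grouped
# together - a cluster of mostly 9s may also pick up some 4s
print(np.bincount(y[km.labels_ == 0], minlength=10))
```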
I reviewed all the algorithms discussed in my books and broke them into categories by functional similarity and then organized them by learning style. This process gave me the following list (Note: this list will be continually updated as I explore the algorithms and refine the categories):
Supervised learning
- Regression
  - Linear regression
    - Ordinary least squares regression (OLSR) (see the sketch after this list)
      - Normal Equation
      - Gradient Descent
  - Regularization algorithms
    - Least absolute shrinkage and selection operator (LASSO)
    - Ridge regression
    - Elastic net
  - Logistic regression (* judging by its output, this belongs with the classification algorithms; however, the algorithm itself is regression with a minor alteration to produce binary output, so I've grouped it with the regression algorithms)
  - Stepwise regression
  - Multivariate adaptive regression splines (MARS)
  - Locally estimated scatterplot smoothing (LOESS)
  - Polynomial regression
  - Polynomials and splines
  - Splines
- Classification
  - Decision trees
    - Classification and regression tree (CART)
    - Iterative dichotomiser 3 (ID3)
    - C4.5 and C5.0
    - Chi-squared automatic interaction detection (CHAID)
    - Decision stump
    - M5
    - Conditional decision trees
  - Ensemble models
    - Boosting
    - Bagging (bootstrapped aggregation)
    - AdaBoost
    - Weighted average (blending)
    - Stacked generalization (stacking)
    - Gradient boosting machines
      - Gradient boosted regression trees
    - Random forests
  - Support vector machines
  - Bayesian
    - Naïve Bayes
      - Gaussian naïve Bayes
      - Multinomial naïve Bayes
    - Averaged one-dependence estimators (AODE)
    - Bayesian belief network (BBN)
    - Bayesian network (BN)
  - Discriminant analysis
    - Linear discriminant analysis
    - Nonlinear discriminant analysis
    - Flexible discriminant analysis
    - Mixture discriminant analysis
    - Quadratic discriminant analysis
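Since the list above names the Normal Equation and Gradient Descent as the two ways to solve OLSR, here's a minimal NumPy sketch of both on synthetic data (the learning rate and iteration count are arbitrary assumptions of mine):

```python
# Fit the same ordinary least squares model two ways.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.normal(size=100)]          # column of 1s = intercept term
y = X @ np.array([2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# Normal equation: solve (X^T X) w = X^T y exactly, in one step
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: repeatedly step down the slope of the squared error
w = np.zeros(2)
learning_rate = 0.1
for _ in range(1000):
    gradient = (2 / len(y)) * X.T @ (X @ w - y)
    w -= learning_rate * gradient

print(w_closed, w)  # both land near the true coefficients [2, 3]
```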
Unsupervised learning
- Clustering
  - K-means clustering
  - K-medians
  - DBSCAN
  - K-nearest neighbors
  - Nearest shrunken centroids
  - Expectation maximization (EM)
  - Hierarchical clustering
- Association rule learning algorithms
  - Apriori algorithm
    - Market basket analysis
  - Eclat algorithm
- Gaussian mixture models
- Dimensionality reduction (see the sketch after this list)
  - Principal component analysis (PCA)
  - Principal component regression (PCR)
  - Partial least squares regression (PLSR)
  - Multidimensional scaling (MDS)
  - Projection pursuit
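As one concrete example of dimensionality reduction, here's a brief PCA sketch on scikit-learn's iris data; the choice of two components is illustrative, not a recommendation:

```python
# PCA projects the data onto the directions of greatest variance,
# trading 4 original features for 2 derived ones.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)        # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)    # how much variance each direction keeps
```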
Other
- Kernel methods
- Manifold learning
- Markov graphs
- Markov models
- Network graphs
- Undirected graphical models
- Outlier detection (see the sketch after this list)
- Partial least squares
- Penalized models
- Neural networks
- Deep learning
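As a quick taste of one of these grab-bag categories, here's a hedged outlier detection sketch using scikit-learn's IsolationForest on synthetic data (the contamination rate and data shapes are my own illustrative choices):

```python
# IsolationForest flags points that are unusually easy to isolate
# from the rest of the data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0, scale=1, size=(200, 2))      # dense central cloud
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))    # scattered anomalies
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X)          # 1 = inlier, -1 = flagged as outlier
print("points flagged as outliers:", (labels == -1).sum())
```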
The Proposed Solution
In order to make these entries easy to reference in the future, I'm going to need to change my typical Problem - Options - Solution - Fail layout. As a starting point, I used some of the advice in Jason Brownlee's "How to Study Machine Learning Algorithms" series of posts, mostly the post How to Learn a Machine Learning Algorithm (see "Machine Learning Mastery resources" in the Resources section below for the full list of the series). While reading those posts, I added other considerations as they occurred to me.
- Learning style
  - Supervised vs unsupervised
  - If supervised, what kind of target does it accept (numerical, categorical)
- Information processing strategy
- Description
  - The general description of the algorithm
  - Name and standard abbreviation(s)
  - Metaphors or analogies commonly used to describe the algorithm's behavior
- Purpose
  - Why the algorithm was created and what it is designed to find
  - General class of problem the algorithm is suited to address
- Behavior
  - Heuristics or rules of thumb for using the algorithm
- Parameters
  - What parameters can be used to fine-tune the model
- Strengths
  - What does it do well
  - In what ways is it better or easier to use than other algorithms
- Limitations
  - Can it accept null values
  - Do all features need to be numerical
  - Do features need to be normalized (features standardized to a 0-1 range can give very different results than raw features ranging anywhere from 0 to 1 million; see the sketch after this list)
  - In what ways is it worse or harder to use than other algorithms
- Resources
  - Where to go to learn more about the algorithm
  - Google Scholar and GitHub
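To ground the normalization question above, here's a sketch using k-nearest neighbors (a distance-based algorithm) on scikit-learn's wine data, where one feature ranges into the hundreds while most others stay near single digits; the dataset and scaler are my own illustrative choices:

```python
# Distance-based models weight features by their raw scale, so the largest
# feature can drown out all the others until everything is rescaled.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unscaled: the 'proline' feature (hundreds to thousands) dominates the distances
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
print("accuracy, raw features:   ", knn_raw.score(X_test, y_test))

# Scaled to 0-1: every feature contributes comparably to the distance
scaler = MinMaxScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
print("accuracy, scaled features:", knn_scaled.score(scaler.transform(X_test), y_test))
```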
The Fail
Alright, this is an anticipated fail: in order to develop an intuitive understanding and have examples of how each algorithm works (and what it works on), I'll need multiple well-understood datasets for each algorithm. Finding those datasets and laying out (or discovering) what kinds of patterns characterize them sounds like a huge time suck. Maybe I'll get lucky and find a paper that does all of that for me. Or maybe there's something in one of my many books.
Up Next
Resources
- Machine Learning Mastery resources:
  - How to Learn a Machine Learning Algorithm
- Introduction to Machine Learning with Python