Posts by Tag

machine learning

Entry 51: Ensemble Learning

9 minute read

According to page 192 of Applied Predictive Modeling, ensemble techniques began to appear in the 1990s. Most people think of Random Forests (basically a bunch...

Entry 48: Decision Tree Impurity Measures

4 minute read

Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straightforwa...

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Entry 42: Logistic Regression

5 minute read

Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...

Entry 41: Elastic Net

2 minute read

The notebook where I did my code for this entry can be found on my github page in the Entry 41 notebook.

Entry 39: Lasso Regression

2 minute read

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regularized regression that uses an L1 penalty.
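
Since the excerpt states the whole idea, here is a minimal sketch of it (my own toy example, not the entry’s notebook code; the synthetic data and alpha value are assumptions):

```python
# A minimal sketch of L1-regularized regression with scikit-learn's Lasso.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)                  # 10 features, most of them irrelevant
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

lasso = Lasso(alpha=0.1)                # alpha sets the strength of the L1 penalty
lasso.fit(X, y)
print(lasso.coef_)                      # uninformative coefficients shrink to exactly 0
```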

Entry 37b: Gradient Descent

13 minute read

The general idea of Gradient Descent is to iteratively minimize the error (usually the mean squared error) to arrive at better and better solutions. K...
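
A toy version of that idea (my own sketch, not the entry’s code; the learning rate and data are assumed):

```python
# Toy batch gradient descent on mean squared error.
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(50), rng.rand(50)]    # bias column plus one feature
y = 4 + 3 * X[:, 1] + 0.1 * rng.randn(50)

theta = np.zeros(2)                     # initial guess for the weights
eta = 0.1                               # learning rate (assumed value)
for _ in range(1000):
    gradients = 2 / len(X) * X.T @ (X @ theta - y)  # gradient of the MSE
    theta -= eta * gradients
print(theta)                            # approaches [4, 3]
```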

Entry 37a: Normal Equation

2 minute read

The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...
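
The closed-form calculation the excerpt describes, as a quick sketch (toy data assumed; in practice np.linalg.lstsq is numerically safer):

```python
# The Normal Equation in one line: theta = (X^T X)^(-1) X^T y.
import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(50), rng.rand(50)]    # bias column plus one feature
y = 4 + 3 * X[:, 1] + 0.1 * rng.randn(50)

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)                            # close to [4, 3]
```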

Entry 36: Ordinary Least Squares (OLS)

6 minute read

Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...

Entry 35: Regression

8 minute read

Regression is used to predict continuous values, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...

Entry 34: Supervised Learning Datasets

less than 1 minute read

My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...

Entry 32: Modeling data

6 minute read

The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms, how to implement them, determine the commo...

Entry 31: Training Size

2 minute read

At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...

Entry 27: Figuring out openml.org

5 minute read

Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...
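
For anyone hitting the same wall: recent scikit-learn versions can parse openml.org’s arff files directly (a sketch assuming network access and scikit-learn 0.22 or later):

```python
# fetch_openml downloads and parses an openml.org dataset for you.
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data, mnist.target
print(X.shape)  # (70000, 784)
```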

Entry 25: Baseline Models

1 minute read

As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.
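
The entry covers scikit-learn’s Pipeline; a minimal sketch of the idea (the steps and dataset here are illustrative choices, not the entry’s code):

```python
# Stringing preprocessing and a model together with a Pipeline.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('scale', StandardScaler()),        # center and scale the features
    ('model', LogisticRegression()),    # then fit the estimator
])
pipe.fit(X, y)                          # one object runs every step in order
print(pipe.score(X, y))
```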

Entry 18: Cross-validation

5 minute read

In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...
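
A sketch of that hybrid setup (my own minimal example, not the entry’s notebook):

```python
# Carve off a hold-out set first, then cross-validate on the training data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(scores.mean())  # the hold-out set stays untouched until the very end
```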

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Entry 16: Model Evaluation

6 minute read

Max Kuhn and Kjell Johnson make the case in Applied Predictive Modeling that models can easily overemphasize patterns that are not reproducible.

Entry 13: Categorical Preliminaries

6 minute read

Real-world data often includes categorical variables like: male or female; continent (North America, South America, Europe, Asia, Australia, Antarctica); ...
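
A common way to handle such columns is one-hot encoding; a tiny pandas sketch (the frame below is made up):

```python
# One-hot encoding turns each category into its own 0/1 column.
import pandas as pd

df = pd.DataFrame({'continent': ['Asia', 'Europe', 'Asia', 'Antarctica']})
print(pd.get_dummies(df, columns=['continent']))
```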

Entry 12: Missing Values

6 minute read

Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Entry 9: Train Model

3 minute read

At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.

Entry 8: Centering and Scaling

6 minute read

By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.
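
That last step is centering and scaling; a minimal sketch with scikit-learn’s StandardScaler (toy numbers assumed):

```python
# Subtract the mean and divide by the standard deviation, so every
# feature ends up with mean 0 and variance 1.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0, 0] and [1, 1]
```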

Entry 7: Collinearity

5 minute read

In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...

Entry 6: Correlations

3 minute read

I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...

Entry 5: Explore the Data

5 minute read

Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.

Entry 4: Get the Data

3 minute read

In Entry 3 I defined my problem as: Holding all other factors constant, what mass is needed to retain an atmosphere on Mars? I need data to solve it.

Entry 3: Frame the Problem

6 minute read

In Entry 2 I decided to follow the ML Project Checklist from Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...

Entry 2: Define the Process

5 minute read

In Entry 1, I decided to take the smallest problem I could find and solve just one aspect. Only after I’ve solved that single problem am I allowed to move on...

Entry 1: Impostor Syndrome

4 minute read

I am qualified. Quantitatively, I know this is true: I have a Master of Science degree in Data Science; I have a job on a machine learning team at a larg...

Back to Top ↑

graph

Entry G21: Diameter

3 minute read

Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...

Entry G20: Shortest Path

4 minute read

In an unweighted graph, the path with the fewest steps from one node (A) to another node (B) is called the shortest path.
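
Outside Neo4j, the same computation is a breadth-first search; a quick networkx sketch (the toy graph is my own example):

```python
# Shortest path in an unweighted graph via networkx.
import networkx as nx

G = nx.Graph([('A', 'B'), ('B', 'C'), ('A', 'D'), ('D', 'C')])
print(nx.shortest_path(G, 'A', 'C'))         # e.g. ['A', 'B', 'C']
print(nx.shortest_path_length(G, 'A', 'C'))  # 2 steps
```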

Entry G19: Neighborhood Node Counts

11 minute read

Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...

Entry G18: Egocentric Networks

4 minute read

Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.

Entry G17: Add Villains

1 minute read

The use case I’m interested in is locating fraud within a graph. This is difficult to do when all the people nodes in your data are undifferentiated.

Entry G16: Components Comparison

1 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...

Entry G15: Global Density Comparison

4 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...

Entry G14: Global Counts Comparison

3 minute read

The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...

Entry G13: Weighted Degree Comparison

2 minute read

Like Entry G12, this is a redo of part of Entry G10. This entry addresses weighted degrees. If you need a reminder as to what weighted relationships are s...

Entry G12: Degree Comparison

5 minute read

This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...

Entry G10: Local Metrics

6 minute read

Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...

Entry G9: Measuring Performance

4 minute read

As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors...

Entry G8: Components

3 minute read

Now that I have a general feel for the graph database with counts and density, I want to look at components.

Entry G7: Density and Diameter

4 minute read

Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...
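
For reference (my own addition, not from the entry), the density of an undirected simple graph with $n$ nodes and $m$ relationships is the fraction of possible connections that actually exist:

$$ d = \frac{2m}{n(n-1)} $$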

Entry G6: Global Graph Counts

5 minute read

The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...

Entry G5.5: Analysis Metrics

2 minute read

I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...

Entry G5: Projecting Bimodal to Unimodal

9 minute read

I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...

Entry G4: Modeling Relationships

3 minute read

Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...

Entry G2: Create a Neo4j Database

3 minute read

To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...

Entry G1: Connected Entities

6 minute read

The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...

Back to Top ↑

neo4j

Entry G21: Diameter

3 minute read

Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...

Entry G20: Shortest Path

4 minute read

In an unweighted graph, the path with the fewest steps from one node (A) to another node (B) is called the shortest path.

Entry G19: Neighborhood Node Counts

11 minute read

Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...

Entry G18: Egocentric Networks

4 minute read

Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.

Entry G17: Add Villains

1 minute read

The use case I’m interested in is locating fraud within a graph. This is difficult to do when all the people nodes in your data are undifferentiated.

Entry G16: Components Comparison

1 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...

Entry G15: Global Density Comparison

4 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...

Entry G14: Global Counts Comparison

3 minute read

The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...

Entry G13: Weighted Degree Comparison

2 minute read

Like Entry G12, this is a redo of part of Entry G10. This entry addresses weighted degrees. If you need a reminder as to what weighted relationships are s...

Entry G12: Degree Comparison

5 minute read

This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...

Entry G10: Local Metrics

6 minute read

Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...

Entry G9: Measuring Performance

4 minute read

As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors...

Entry G8: Components

3 minute read

Now that I have a general feel for the graph database with counts and density, I want to look at components.

Entry G7: Density and Diameter

4 minute read

Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...

Entry G6: Global Graph Counts

5 minute read

The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...

Entry G5.5: Analysis Metrics

2 minute read

I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...

Entry G5: Projecting Bimodal to Unimodal

9 minute read

I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...

Entry G4: Modeling Relationships

3 minute read

Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...

Entry G2: Create a Neo4j Database

3 minute read

To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...

Entry G1: Connected Entities

6 minute read

The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...

Back to Top ↑

supervised learning

Entry 51: Ensemble Learning

9 minute read

According to page 192 of Applied Predictive Modeling, ensemble techniques began to appear in the 1990s. Most people think of Random Forests (basically a bunch...

Entry 48: Decision Tree Impurity Measures

4 minute read

Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straightforwa...

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Entry 42: Logistic Regression

5 minute read

Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...

Entry 41: Elastic Net

2 minute read

The notebook where I did my code for this entry can be found on my github page in the Entry 41 notebook.

Entry 39: Lasso Regression

2 minute read

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regularized regression that uses an L1 penalty.

Entry 37b: Gradient Descent

13 minute read

The general idea of Gradient Descent is to iteratively minimize the error (usually the mean squared error) to arrive at better and better solutions. K...

Entry 37a: Normal Equation

2 minute read

The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...

Entry 36: Ordinary Least Squares (OLS)

6 minute read

Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...

Entry 35: Regression

8 minute read

Regression is used to predict continuous values, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...

Back to Top ↑

graph analytics

Entry G21: Diameter

3 minute read

Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...

Entry G20: Shortest Path

4 minute read

In an unweighted graph, the path with the fewest steps from one node (A) to another node (B) is called the shortest path.

Entry G19: Neighborhood Node Counts

11 minute read

Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...

Entry G18: Egocentric Networks

4 minute read

Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.

Entry G16: Components Comparison

1 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...

Entry G15: Global Density Comparison

4 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...

Entry G14: Global Counts Comparison

3 minute read

The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...

Entry G13: Weighted Degree Comparison

2 minute read

Like Entry G12, this is a redo of part of Entry G10. This entry addresses weighted degrees. If you need a reminder as to what weighted relationships are s...

Entry G12: Degree Comparison

5 minute read

This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...

Entry G10: Local Metrics

6 minute read

Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...

Entry G9: Measuring Performance

4 minute read

As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors...

Entry G8: Components

3 minute read

Now that I have a general feel for the graph database with counts and density, I want to look at components.

Entry G7: Density and Diameter

4 minute read

Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...

Entry G6: Global Graph Counts

5 minute read

The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...

Entry G5.5: Analysis Metrics

2 minute read

I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...

Back to Top ↑

model-eval

Entry 31: Training Size

2 minute read

At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...

Entry 25: Baseline Models

1 minute read

As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.

Entry 18: Cross-validation

5 minute read

In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Entry 16: Model Evaluation

6 minute read

Max Kuhn and Kjell Johnson make the case in Applied Predictive Modeling that models can easily overemphasize patterns that are not reproducible.

Back to Top ↑

pre-process

Entry 13: Categorical Preliminaries

6 minute read

Real-world data often includes categorical variables like: male or female; continent (North America, South America, Europe, Asia, Australia, Antarctica); ...

Entry 12: Missing Values

6 minute read

Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Entry 9: Train Model

3 minute read

At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.

Entry 8: Centering and Scaling

6 minute read

By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.

Entry 7: Collinearity

5 minute read

In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...

Entry 6: Correlations

3 minute read

I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...

Back to Top ↑

regression

Entry 42: Logistic Regression

5 minute read

Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...

Entry 41: Elastic Net

2 minute read

The notebook where I did my code for this entry can be found on my github page in the Entry 41 notebook.

Entry 39: Lasso Regression

2 minute read

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regularized regression that uses an L1 penalty.

Entry 37b: Gradient Descent

13 minute read

The general idea of Gradient Descent is to iteratively minimize the error (usually the mean squared error) to arrive at better and better solutions. K...

Entry 37a: Normal Equation

2 minute read

The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...

Entry 36: Ordinary Least Squares (OLS)

6 minute read

Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...

Entry 35: Regression

8 minute read

Regression is used to predict continuous values, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...

Back to Top ↑

trees

Entry 51: Ensemble Learning

9 minute read

According to page 192 of Applied Predictive Modeling, ensemble techniques began to appear in the 1990s. Most people think of Random Forests (basically a bunch...

Entry 48: Decision Tree Impurity Measures

4 minute read

Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straightforwa...

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Back to Top ↑

process

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Entry 7: Collinearity

5 minute read

In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...

Entry 6: Correlations

3 minute read

I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...

Entry 5: Explore the Data

5 minute read

Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.

Entry 4: Get the Data

3 minute read

In Entry 3 I defined my problem as: Holding all other factors constant, what mass is needed to retain an atmosphere on Mars? I need data to solve it.

Entry 3: Frame the Problem

6 minute read

In Entry 2 I decided to follow the ML Project Checklist from Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...

Entry 2: Define the Process

5 minute read

In Entry 1, I decided to take the smallest problem I could find and solve just one aspect. Only after I’ve solved that single problem am I allowed to move on...

Entry 1: Impostor Syndrome

4 minute read

I am qualified. Quantitatively, I know this is true: I have a Master of Science degree in Data Science; I have a job on a machine learning team at a larg...

Back to Top ↑

dataset titanic

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Entry 27: Figuring out openml.org

5 minute read

Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...

Entry 12: Missing Values

6 minute read

Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...

Back to Top ↑

dataset breast cancer

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Entry 25: Baseline Models

1 minute read

As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...

Back to Top ↑

dataset planets

Entry 9: Train Model

3 minute read

At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.

Entry 8: Centering and Scaling

6 minute read

By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.

Entry 7: Collinearity

5 minute read

In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...

Entry 6: Correlations

3 minute read

I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...

Entry 5: Explore the Data

5 minute read

Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.

Back to Top ↑

dataset auto mpg

Entry 25: Baseline Models

1 minute read

As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.

Entry 18: Cross-validation

5 minute read

In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...

Back to Top ↑

overfitting

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Back to Top ↑

create database

Entry G5: Projecting Bimodal to Unimodal

9 minute read

I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...

Entry G4: Modeling Relationships

3 minute read

Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...

Entry G2: Create a Neo4j Database

3 minute read

To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...

Back to Top ↑

nlp

Entry NLP4: Frequencies and Comparison

26 minute read

In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math a...

Entry NLP2: Load All Files in a Directory

5 minute read

In the previous entry, I figured out how to process individual files, removing many of the items on the “Remove lines/characters” list specified in the homew...

Entry NLP1: Corpus Cleaning with RegEx

14 minute read

Recently I’ve been working my way through one of the older versions of CSE 140 (now CSE 160) offered at the University of Washington. Homework 8 is a nice ex...

Back to Top ↑

underfitting

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Back to Top ↑

regularization

Entry 39: Lasso Regression

2 minute read

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regularized regression that uses an L1 penalty.

Back to Top ↑

dataset iris

Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Back to Top ↑

latex

Entry 8: Centering and Scaling

6 minute read

By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.

Back to Top ↑

cat encoding

Entry 13: Categorical Preliminaries

6 minute read

Real-world data often includes categorical variables like: male or female; continent (North America, South America, Europe, Asia, Australia, Antarctica); ...

Back to Top ↑

cross-validation

Entry 18: Cross-validation

5 minute read

In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Back to Top ↑

dataset horse colic

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.

Back to Top ↑

regression models

Back to Top ↑

classification models

Back to Top ↑

thresholds

Back to Top ↑

dataset mnist

Back to Top ↑

dataset boston housing

Back to Top ↑

algorithms

Entry 32: Modeling data

6 minute read

The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms, how to implement them, determine the commo...

Back to Top ↑

regression calculation

Entry 37b: Gradient Descent

13 minute read

The general idea of Gradient Descent is to iteratively minimize the error (usually the mean squared error) to arrive at better and better solutions. K...

Entry 37a: Normal Equation

2 minute read

The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...

Back to Top ↑

ensembles

Entry 51: Ensemble Learning

9 minute read

According to page 192 of Applied Predictive Modeling, ensemble techniques began to appear in the 1990s. Most people think of Random Forests (basically a bunch...

Back to Top ↑

aws

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

sagemaker

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

production pipeline

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

frame problem

Entry 3: Frame the Problem

6 minute read

In Entry 2 I decided to follow the ML Project Checklist from Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...

Back to Top ↑

dataset computer hardware

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Back to Top ↑

dataset qsar fish toxicity

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Back to Top ↑

dataset qsar aquatic toxicity

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Back to Top ↑

dataset online news popularity

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Back to Top ↑

dataset csm (conventional and social media movies)

Entry 12: Missing Values

6 minute read

Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...

Back to Top ↑

dataset mushrooms

Back to Top ↑

dataset mushroom

Back to Top ↑

dataset solar flare

Back to Top ↑

dataset nursery

Back to Top ↑

dataset chess

Back to Top ↑

pipeline

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.

Back to Top ↑

bias

Back to Top ↑

variance

Back to Top ↑

load data

Entry 27: Figuring out openml.org

5 minute read

Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...

Back to Top ↑

dataset MNIST

Back to Top ↑

dataset click prediction

Back to Top ↑

dataset house_16H

Back to Top ↑

data size

Entry 31: Training Size

2 minute read

At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...

Back to Top ↑

dataset forge

Back to Top ↑

dataset wave

Back to Top ↑

dataset california housing

Back to Top ↑

dataset bonanza

Entry 34: Supervised Learning Datasets

less than 1 minute read

My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...

Back to Top ↑

datasets

Entry 34: Supervised Learning Datasets

less than 1 minute read

My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...

Back to Top ↑

supervised learning datasets

Entry 34: Supervised Learning Datasets

less than 1 minute read

My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...

Back to Top ↑

logistic regression

Entry 42: Logistic Regression

5 minute read

Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...

Back to Top ↑

graph visualization

Back to Top ↑

talks

Back to Top ↑

visualization

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Back to Top ↑

data sensitivity

Back to Top ↑

n-grams

Entry 48: Decision Tree Impurity Measures

4 minute read

Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straightforwa...

Back to Top ↑

entity resolution

Entry G1: Connected Entities

6 minute read

The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...

Back to Top ↑

graph model

Back to Top ↑

measuring performance

Entry G9: Measuring Performance

4 minute read

As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors...

Back to Top ↑

regex

Entry NLP1: Corpus Cleaning with RegEx

14 minute read

Recently I’ve been working my way through one of the older versions of CSE 140 (now CSE 160) offered at the University of Washington. Homework 8 is a nice ex...

Back to Top ↑

reading files

Entry NLP2: Load All Files in a Directory

5 minute read

In the previous entry, I figured out how to process individual files, removing many of the items on the “Remove lines/characters” list specified in the homew...

Back to Top ↑

ngrams

Back to Top ↑

text analysis

Entry NLP4: Frequencies and Comparison

26 minute read

In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math a...

Back to Top ↑

boto3

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

s3

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

read from s3

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

write to s3

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Back to Top ↑

data cleaning

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Back to Top ↑

data wrangling

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Back to Top ↑