Posts by Year

2022

Entry SM02: Clean Data

20 minute read

Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...

Entry SM01: Using S3 from AWS’s SageMaker

10 minute read

There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...

Entry NLP4: Frequencies and Comparison

26 minute read

In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math a...

Entry NLP2: Load All Files in a Directory

5 minute read

In the previous entry, I figured out how to process individual files, removing many of the items on the “Remove lines/characters” list specified in the homew...

Entry NLP1: Corpus Cleaning with RegEx

14 minute read

Recently I’ve been working my way through one of the older versions of CSE 140 (now CSE 160) offered at the University of Washington. Homework 8 is a nice ex...


2021

Entry G21: Diameter

3 minute read

Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...

Entry G20: Shortest Path

4 minute read

In an unweighted graph, the fewest steps from any node (A) to any other node (B) is called the shortest path.
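The shortest-path idea from this entry can be sketched with a breadth-first search. A minimal stand-alone illustration (the toy graph, function name, and dict-of-lists representation are my own, not from the entry, which works against a Neo4j database):

```python
from collections import deque

def shortest_path_length(graph, start, end):
    """Return the fewest steps from start to end in an unweighted graph,
    or None if no path exists. graph: dict mapping node -> list of neighbors."""
    if start == end:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor == end:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Tiny toy graph: A-B, B-C, A-C, C-D
toy = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
print(shortest_path_length(toy, "A", "D"))  # → 2
```

Because the graph is unweighted, the first time BFS reaches the target is guaranteed to be along a fewest-steps path.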

Entry G19: Neighborhood Node Counts

11 minute read

Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...

Entry G18: Egocentric Networks

4 minute read

Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.

Entry G17: Add Villains

1 minute read

The use case I’m interested in is locating fraud within a graph. This is difficult to do when all the people nodes in your data are undifferentiated.

Entry G16: Components Comparison

1 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...

Entry G15: Global Density Comparison

4 minute read

The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...

Entry G14: Global Counts Comparison

3 minute read

The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...

Entry G13: Weighted Degree Comparison

2 minute read

Like Entry G12 this is a redo of part of Entry G10. This entry addresses the weighted degrees. If you need a reminder as to what weighted relationships are s...

Entry G12: Degree Comparison

5 minute read

This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...

Entry G10: Local Metrics

6 minute read

Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...

Entry G9: Measuring Performance

4 minute read

As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors…

Entry G8: Components

3 minute read

Now that I have a general feel for the graph database with counts and density, I want to look at components.

Entry G7: Density and Diameter

4 minute read

Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...
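The density half of that measure has a simple closed form for an undirected simple graph: actual edges over possible edges. A quick sketch (my own helper, not code from the entry):

```python
def graph_density(num_nodes, num_edges):
    """Density of an undirected simple graph: actual edges
    divided by the maximum possible edges, n*(n-1)/2."""
    possible = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible if possible else 0.0

print(graph_density(4, 3))  # 3 of 6 possible edges → 0.5
```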

Entry G6: Global Graph Counts

5 minute read

The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...

Entry G5.5: Analysis Metrics

2 minute read

I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...

Entry G5: Projecting Bimodal to Unimodal

9 minute read

I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...

Entry G4: Modeling Relationships

3 minute read

Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...

Entry G2: Create a Neo4j Database

3 minute read

To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...

Entry G1: Connected Entities

6 minute read

The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...

Entry 51: Ensemble Learning

9 minute read

Ensemble techniques began to appear in the 1990s according to page 192 of Applied Predictive Modeling. Most people think of Random Forests (basically a bunch...


2020

Entry 48: Decision Tree Impurity Measures

4 minute read

Impurity seems like it should be a simple calculation. However, depending on prevalence of classes and quirks in the data, it’s usually not as straight forwa...

Entry 47: Pruning Decision Trees

8 minute read

If allowed to continue, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:

Entry 45: Visualizing Decision Trees

2 minute read

A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.

Entry 44: Decision Trees

14 minute read

In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...

Entry 42: Logistic Regression

5 minute read

Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...

Entry 41: Elastic Net

2 minute read

The code for this entry can be found in the Entry 41 notebook on my github page.

Entry 39: Lasso Regression

2 minute read

Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regression regularization using L1 regularization.
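The L1 penalty is what lets LASSO shrink coefficients and zero out the small ones entirely; the mechanism behind that is the soft-thresholding operator used in coordinate descent. A minimal sketch (my own naming and values, not code from the entry):

```python
def soft_threshold(coef, alpha):
    """L1 shrinkage operator used in LASSO coordinate descent:
    pulls each coefficient toward zero by alpha, zeroing small ones."""
    if coef > alpha:
        return coef - alpha
    if coef < -alpha:
        return coef + alpha
    return 0.0

print([soft_threshold(c, 0.5) for c in [2.0, 0.3, -1.2]])  # → [1.5, 0.0, -0.7]
```

The middle coefficient lands exactly at zero, which is why LASSO performs feature selection while Ridge (L2) only shrinks.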

Entry 37b: Gradient Descent

13 minute read

The general idea of Gradient Descent is to iteratively minimize the errors (usually using the mean squared error) to arrive at better and better solutions. K...
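That iterative idea fits in a few lines: repeatedly step the weight against the gradient of the mean squared error. A hand-rolled 1-D example (my own toy data and learning rate, not code from the entry):

```python
# Fit y ≈ w * x by gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # true relationship: y = 2x

w, lr = 0.0, 0.01
for _ in range(1000):
    # gradient of (1/n) * sum((w*x - y)^2) with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(round(w, 3))  # → 2.0
```

Each iteration moves w a small step downhill on the error surface, converging on the slope that minimizes the MSE.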

Entry 37a: Normal Equation

2 minute read

The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...

Entry 36: Ordinary Least Squares (OLS)

6 minute read

Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...

Entry 35: Regression

8 minute read

Regression is used to predict continuous values, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...

Entry 34: Supervised Learning Datasets

less than 1 minute read

My anticipated fail from Entry 32 was correct, finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...

Entry 32: Modeling data

6 minute read

The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms, how to implement them, determine the commo...

Entry 31: Training Size

2 minute read

At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...

Entry 27: Figuring out openml.org

5 minute read

Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...

Entry 25: Baseline Models

1 minute read

As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...

Entry 20: Scikit-Learn Pipeline

4 minute read

There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.

Entry 18: Cross-validation

5 minute read

In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...

Entry 17: Resampling

9 minute read

I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...

Entry 16: Model Evaluation

6 minute read

Max Kuhn and Kjell Johnson make the case in Applied Predictive Modeling that models can easily overemphasize patterns that are not reproducible.

Entry 13: Categorical Preliminaries

6 minute read

Real world data often includes categorical variables like: male or female; continent: North America, South America, Europe, Asia, Australia, Antarctica; ...

Entry 12: Missing Values

6 minute read

Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...

Entry 11: Consolidate Pre-processing Steps

5 minute read

Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.

Entry 9: Train Model

3 minute read

At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.

Entry 8: Centering and Scaling

6 minute read

By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.

Entry 7: Collinearity

5 minute read

In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...

Entry 6: Correlations

3 minute read

I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...

Entry 5: Explore the Data

5 minute read

Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.

Entry 4: Get the Data

3 minute read

In Entry 3 I defined my problem as: Holding all other factors constant, what mass is needed to retain an atmosphere on Mars? I need data to solve it.

Entry 3: Frame the Problem

6 minute read

In Entry 2 I decided to follow the ML Project Checklist from Hands-On Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...

Entry 2: Define the Process

5 minute read

In Entry 1, I decided to take the smallest problem I could find and solve just one aspect. Only after I’ve solved that single problem am I allowed to move on...

Entry 1: Impostor Syndrome

4 minute read

I am qualified. Quantitatively, I know this is true: I have a Master of Science degree in Data Science; I have a job on a machine learning team at a larg...
