Entry SM02: Clean Data
Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...
There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...
In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math a...
In the first entry of this series, I figured out how to process the raw files. In the second entry, I figured out how to load all files in a directory (even ...
In the previous entry, I figured out how to process individual files, removing many of the items on the “Remove lines/characters” list specified in the homew...
Recently I’ve been working my way through one of the older versions of CSE 140 (now CSE 160) offered at the University of Washington. Homework 8 is a nice ex...
The apoc.path.expandingTree algorithm in Entry G19 revealed a gold mine of information. Once I had that table of results I knew that not only could I grab th...
Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...
In an unweighted graph, the fewest steps from any node (A) to another node (B) is called the shortest path.
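As a quick sketch of the idea (the toy graph and helper function here are invented for illustration, not taken from the entry), breadth-first search is one way to find that shortest path in plain Python:

```python
from collections import deque

def shortest_path(graph, start, end):
    """Breadth-first search: returns the fewest-step path from start to end."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path between the two nodes

# Toy adjacency list, just for illustration
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```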
Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...
Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.
The use case I’m interested in is locating fraud within a graph. This is difficult to do when all the people nodes in your data are undifferentiated.
The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...
The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...
The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...
Like Entry G12, this is a redo of part of Entry G10. This entry addresses the weighted degrees. If you need a reminder as to what weighted relationships are s...
This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...
Creating a multigraph database was actually way easier than I expected.
Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...
As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors …...
Now that I have a general feel for the graph database with counts and density, I want to look at components.
Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...
The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...
I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...
I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...
Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...
To harness the power of graph, data first has to be organized to fit the node/relationship format.
To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...
The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...
There are multiple ways to reduce the bias and variance of Ensemble Learning; the three most common are bagging, boosting, and stacking. For more on bias and...
Ensemble techniques began to appear in the 1990s according to page 192 of Applied Predictive Modeling. Most people think of Random Forests (basically a bunch...
The Complete Guide to Decision Trees sums up the similarities between Decision Tree Algorithms very succinctly:
One of the major benefits of using Decision Trees is their interpretability. To take advantage of this benefit, you need to know how to pull out the informat...
Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straight forwa...
If allowed to continue, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:
I wanted to point a co-worker to information about overfitting the other day. While I’ve discussed it in entries 17 and 30, I realized I haven’t covered it i...
A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.
In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...
This is completely off topic for the machine learning algorithms series I’ve been working through, but it took up a lot of my time the last couple of weeks, ...
Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...
The code for this entry can be found on my github page in the Entry 41 notebook.
Ridge Regression is a form of regression regularization using L2 regularization.
Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regression regularization using L1 regularization.
Regularization is used to help address overfitting.
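To make the L1/L2 distinction in the entries above concrete, here is a minimal scikit-learn sketch (the random data and alpha values are made up for illustration, not taken from the entries):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative random data; the real entries use their own datasets
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can drive some coefficients exactly to zero

print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))
```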
The general idea of Gradient Descent is to iteratively minimize the errors (usually using the mean squared error) to arrive at better and better solutions. K...
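A bare-bones sketch of that idea, batch gradient descent on a mean squared error loss (the toy data and learning rate are illustrative assumptions, not from the entry):

```python
import numpy as np

# Toy data: y is roughly 4 + 3x
rng = np.random.RandomState(0)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X + rng.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]   # add a bias column
theta = rng.randn(2, 1)             # random starting weights
eta = 0.1                           # learning rate (assumed value)

for _ in range(1000):
    gradients = 2 / 100 * X_b.T @ (X_b @ theta - y)  # gradient of the MSE
    theta -= eta * gradients                          # step downhill

print(theta.ravel())  # lands close to [4, 3]
```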
The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...
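In NumPy terms, that direct calculation is essentially a one-liner (toy data invented for illustration):

```python
import numpy as np

# Toy data: y is roughly 4 + 3x
rng = np.random.RandomState(0)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X + rng.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]               # add a bias column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # theta = (X^T X)^-1 X^T y

print(theta.ravel())  # close to [4, 3]
```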
Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...
Regression is used to predict on continuous data, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...
My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...
Description
The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms, how to implement them, determine the commo...
At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...
Learning curves can help determine what avenues to pursue if a model isn’t up to expectations, or worse, is completely unusable. They can also be used to det...
Sometimes considerations other than model performance need to be accounted for when choosing a threshold.
Remember back in Entry 16 when I said I wasn’t planning to cover lift? Well, plans change.
Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...
I don’t always want the default threshold for determining the classification (negative or positive) the way I did in Entry 24. As discussed in the precision ...
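A rough sketch of swapping in a custom threshold (the data, model, and 0.7 cutoff here are placeholders, not the ones from the entry):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data and model, just to show the mechanics
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]  # probability of the positive class
threshold = 0.7                       # stricter than the default 0.5
preds = (probs >= threshold).astype(int)

print("Positives at 0.5:", (probs >= 0.5).sum(), "| at 0.7:", preds.sum())
```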
As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...
Now that I’ve got a handle on the measurement options and equations for classification problems, it’s time to implement those measures on actual models.
Classification models present a different challenge than regression models. Because a numeric value isn’t returned, another way of measuring goodness of fit ...
Now that I’ve got a handle on the measurement options and equations, it’s time to implement those measures on actual models.
I need a way to measure a model’s performance. To do that, first I need to break it down by the type of prediction.
There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.
In Entry 18 I finalized the decision to use a hybrid approach to validating models. Now I have to implement it.
In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...
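The mechanics of that hybrid approach look roughly like this (the dataset and estimator are placeholders, not the ones used in the entries):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data and model, just to show the split-then-cross-validate pattern
X, y = load_breast_cancer(return_X_y=True)

# 1. Carve off a hold-out set that stays untouched until the very end
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Cross-validate on the training portion only
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV scores:", scores.round(3))

# 3. Final check against the hold-out set
model.fit(X_train, y_train)
print("Hold-out score:", model.score(X_holdout, y_holdout))
```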
I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...
Max Kuhn and Kjell Johnson make the case in Applied Predictive Modeling that models can easily overemphasize patterns that are not reproducible.
To ensure my process worked, I used it on multiple datasets. The code from notebook to notebook is mostly the same, just run on different data. Per usual, th...
The Problem
Real world data often includes categorical variables like: Male or female; Continent: North America, South America, Europe, Asia, Australia, Antarctica; ...
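One common way to handle variables like these is one-hot encoding; here is a minimal pandas sketch with a made-up DataFrame:

```python
import pandas as pd

# Made-up example rows with two categorical columns
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "continent": ["Europe", "Asia", "South America"],
})

# get_dummies expands each category into its own 0/1 column
encoded = pd.get_dummies(df, columns=["sex", "continent"])
print(encoded)
```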
Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...
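For a concrete picture of what a missing value looks like in pandas, plus two common ways of handling it (toy numbers chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
df = pd.DataFrame({"mass": [5.97, np.nan, 0.642, 1.90]})

print(df["mass"].isna().sum())          # count the missing values
dropped = df.dropna()                   # option 1: drop rows with missing data
filled = df.fillna(df["mass"].median()) # option 2: impute with the median
```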
Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.
Using the pre-processing steps I worked through in entries 6-8, I can now predict mass while only changing the surface pressure.
At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.
By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.
In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...
I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...
Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.
In Entry 3 I defined my problem as: Holding all other factors constant, what mass is needed to retain an atmosphere on Mars? I need data to solve it.
In Entry 2 I decided to follow the ML Project Checklist from Hands on Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...
In Entry 1, I decided to take the smallest problem I could find and solve just one aspect. Only after I’ve solved that single problem am I allowed to move on...
I am qualified. Quantitatively, I know this is true: I have a Master of Science degree in Data Science; I have a job on a machine learning team at a larg...