Entry SM02: Clean Data
Wrangling data into a usable form is a big part of any real world machine learning problem. When tackling these types of problems, several things are general...
There are a lot of considerations in moving from a local model used to train and predict on batch data to a production model. This series of posts explores h...
In the previous entries in this series, I loaded all the files in a directory, processed the data, and transformed it into ngrams. Now it’s time to do math a...
In the first entry of this series, I figured out how to process the raw files. In the second entry, I figured out how to load all files in a directory (even ...
In the previous entry, I figured out how to process individual files, removing many of the items on the “Remove lines/characters” list specified in the homew...
Recently I’ve been working my way through one of the older versions of CSE 140 (now CSE 160) offered at the University of Washington. Homework 8 is a nice ex...
The apoc.path.expandingTree algorithm in Entry G19 revealed a gold mine of information. Once I had that table of results I knew that not only could I grab th...
Having encountered the limit of connections between node pairs in Entry G19, I couldn’t resist taking a closer look. I’ve been trying to calculate the diamet...
In an unweighted graph, the fewest steps from any node (A) to another node (B) is called the shortest path.
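As a quick sketch of the idea (the toy graph and helper function here are invented for illustration, not taken from the entry), breadth-first search is one way to find that shortest path in plain Python:

```python
from collections import deque

def shortest_path(graph, start, end):
    """Breadth-first search: returns the fewest-step path from start to end."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path between the two nodes

# Toy adjacency list, just for illustration
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```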
Now that we’ve established how to define an egocentric neighborhood and differentiated our people nodes, it’s time to start calculating metrics for the diffe...
Now that we’ve gone through the global metrics and have a feel for the structure and composition of our data, it’s time to start running local metrics.
The use case I’m interested in is locating fraud within a graph. This is difficult to do when all the people nodes in your data are undifferentiated.
The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G8, addressing just the graph components...
The notebook that accompanies this entry is a cleaned up, concise version of the three notebooks I created for Entry G7, addressing only the number of possib...
The notebook that accompanies this entry is the cleaned up, concise version of the three notebooks that accompanied Entry G6, but limited to just the global ...
Like Entry G12, this is a redo of part of Entry G10. This entry addresses the weighted degrees. If you need a reminder as to what weighted relationships are s...
This is essentially a redo of the unweighted degree metrics from Entry G10. I ran the same metrics and queries from that entry, except I used the multigraph ...
Creating a multigraph database was actually way easier than I expected.
Now that I know what the larger graph looks like, I need metrics at the node level. The reason for these metrics is to be able to locate outlier nodes within...
As I may or may not have mentioned, the last time I tried running many of the calculations that I’ll be exploring in this series, I ran into timeout errors …...
Now that I have a general feel for the graph database with counts and density, I want to look at components.
Density and diameter give you a feel for how strongly connected the full graph is: whether it’s dense or sparse. To get these measures, I need to calculate t...
The first thing I want to do is get some global measures on the overall graph. I’m breaking these up into three categories: counts, density and diameter, and...
I am 100% adding this entry retroactively. And yes, I dated it wrong so that it would show up in the right order in the Post list. That’s the nice thing abou...
I need to understand graph structure better and the repercussions of using the different model types. Specifically, I’m interested in memory use, processing ...
Relationships are the lines that connect nodes. Neo4j usually describes nodes as nouns (a person, place, or thing - i.e. the subject) and relationships as ve...
To harness the power of graph, data first has to be organized to fit the node/relationship format.
To begin any data science project, I need data to play with. For the graph project I’m going to start with the Marvel Universe Social Network available on ...
The last post was the 52nd entry, meaning that I averaged one entry a week for an entire year. Now, just a little under a year into my chronicling journey, I...
There are multiple ways to reduce the bias and variance of Ensemble Learning; the three most common are bagging, boosting, and stacking. For more on bias and...
Ensemble techniques began to appear in the 1990s according to page 192 of Applied Predictive Modeling. Most people think of Random Forests (basically a bunch...
The Complete Guide to Decision Trees sums up the similarities between Decision Tree Algorithms very succinctly:
One of the major benefits of using Decision Trees is their interpretability. To take advantage of this benefit, you need to know how to pull out the informat...
Impurity seems like it should be a simple calculation. However, depending on the prevalence of classes and quirks in the data, it’s usually not as straight forwa...
If allowed to continue, a Decision Tree will continue to split the data until each leaf is pure. This causes two problems:
I wanted to point a co-worker to information about overfitting the other day. While I’ve discussed it in entries 17 and 30, I realized I haven’t covered it i...
A major benefit of tree-based models is how easy they are to visualize. This visualization aspect is also vital to discussing how trees work.
In entries 35 through 42 we learned how to find linear patterns using Regression algorithms. But what if the patterns in the data aren’t linear? This is wher...
This is completely off topic for the machine learning algorithms series I’ve been working through, but it took up a lot of my time the last couple of weeks, ...
Page 143 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow states that Logistic Regression is “just like a Linear Regression model” excep...
The code for this entry can be found on my github page in the Entry 41 notebook.
Ridge Regression is a form of regression regularization using L2 regularization.
Least Absolute Shrinkage and Selection Operator (LASSO) Regression is a form of regression regularization using L1 regularization.
Regularization is used to help address overfitting.
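To make the L1/L2 distinction in the entries above concrete, here is a minimal scikit-learn sketch (the random data and alpha values are made up for illustration, not taken from the entries):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Illustrative random data; the real entries use their own datasets
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty can drive some coefficients exactly to zero

print("Ridge coefficients:", ridge.coef_.round(2))
print("Lasso coefficients:", lasso.coef_.round(2))
```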
The general idea of Gradient Descent is to iteratively minimize the errors (usually using the mean squared error) to arrive at better and better solutions. K...
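A bare-bones sketch of that idea, batch gradient descent on a mean squared error loss (the toy data and learning rate are illustrative assumptions, not from the entry):

```python
import numpy as np

# Toy data: y is roughly 4 + 3x
rng = np.random.RandomState(0)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X + rng.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]   # add a bias column
theta = rng.randn(2, 1)             # random starting weights
eta = 0.1                           # learning rate (assumed value)

for _ in range(1000):
    gradients = 2 / 100 * X_b.T @ (X_b @ theta - y)  # gradient of the MSE
    theta -= eta * gradients                          # step downhill

print(theta.ravel())  # lands close to [4, 3]
```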
The Normal Equation generates the weights that are used in Linear Regression. It calculates the theta array directly without having to iterate through differ...
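In NumPy terms, that direct calculation is essentially a one-liner (toy data invented for illustration):

```python
import numpy as np

# Toy data: y is roughly 4 + 3x
rng = np.random.RandomState(0)
X = 2 * rng.rand(100, 1)
y = 4 + 3 * X + rng.randn(100, 1)

X_b = np.c_[np.ones((100, 1)), X]               # add a bias column
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # theta = (X^T X)^-1 X^T y

print(theta.ravel())  # close to [4, 3]
```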
Ordinary Least Squares is usually the default method of Linear Regression and is the method used in the sklearn.linear_model.LinearRegression function. On pag...
Regression is used to predict on continuous data, for example housing prices. Logistic Regression is the subcategory that is the exception, as it does classi...
My anticipated fail from Entry 32 was correct: finding datasets and figuring out what kinds of patterns they have was a time suck. As such, “datasets” gets i...
Description
The challenge in this, the third series of entries, is to become familiar with the major categories of algorithms, how to implement them, determine the commo...
At the very end of Entry 30 I mentioned that learning curves can be used to diagnose problems other than just high bias and high variance. Another easy probl...
Learning curves can help determine what avenues to pursue if a model isn’t up to expectations, or worse, is completely unusable. They can also be used to det...
Sometimes considerations other than model performance need to be accounted for when choosing a threshold.
Remember back in Entry 16 when I said I wasn’t planning to cover lift? Well, plans change.
Figuring out how to get the data from openml.org into the Entry 26e notebook (MNIST) was surprisingly difficult. All the datasets are saved as arff files, an ...
I don’t always want the default threshold for determining the classification (negative or positive) the way I did in Entry 24. As discussed in the precision ...
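A rough sketch of swapping in a custom threshold (the data, model, and 0.7 cutoff here are placeholders, not the ones from the entry):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical data and model, just to show the mechanics
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]  # probability of the positive class
threshold = 0.7                       # stricter than the default 0.5
preds = (probs >= threshold).astype(int)

print("Positives at 0.5:", (probs >= 0.5).sum(), "| at 0.7:", preds.sum())
```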
As discussed in Entry 16, certain characteristics in the data can make a model look like it’s performing better than it is. One of these characteristics is c...
Now that I’ve got a handle on the measurement options and equations for classification problems, it’s time to implement those measures on actual models.
Classification models present a different challenge than regression models. Because a numeric value isn’t returned, another way of measuring goodness of fit ...
Now that I’ve got a handle on the measurement options and equations, it’s time to implement those measures on actual models.
I need a way to measure a model’s performance. To do that, first I need to break it down by the type of prediction.
There are quite a few steps between pulling data and having a trained model. I need a good way to string it all together.
In Entry 18 I finalized the decision to use a hybrid approach to validating models. Now I have to implement it.
In Entry 17, I decided I want to use a hybrid cross-validation method. I intend to split out a hold-out set, then perform cross-validation on the training se...
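The mechanics of that hybrid approach look roughly like this (the dataset and estimator are placeholders, not the ones used in the entries):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data and model, just to show the split-then-cross-validate pattern
X, y = load_breast_cancer(return_X_y=True)

# 1. Carve off a hold-out set that stays untouched until the very end
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 2. Cross-validate on the training portion only
model = RandomForestClassifier(random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV scores:", scores.round(3))

# 3. Final check against the hold-out set
model.fit(X_train, y_train)
print("Hold-out score:", model.score(X_holdout, y_holdout))
```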
I have no intention of simply training a model and calling it quits. Once the model is trained, I need to know how well it did and what areas to concentrate o...
Max Kuhn and Kjell Johnson make the case in Applied Predictive Modeling that models can easily overemphasize patterns that are not reproducible.
To ensure my process worked, I used it on multiple datasets. The code from notebook to notebook is mostly the same, just run on different data. Per usual, th...
The Problem
Real world data often includes categorical variables like: Male or female; Continent: North America, South America, Europe, Asia, Australia, Antarctica; ...
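One common way to handle variables like these is one-hot encoding; here is a minimal pandas sketch with a made-up DataFrame:

```python
import pandas as pd

# Made-up example rows with two categorical columns
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "continent": ["Europe", "Asia", "South America"],
})

# get_dummies expands each category into its own 0/1 column
encoded = pd.get_dummies(df, columns=["sex", "continent"])
print(encoded)
```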
Wikipedia has a succinct definition of missing values: “missing data, or missing values, occur when no data value is stored for the variable in an observatio...
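For a concrete picture of what a missing value looks like in pandas, plus two common ways of handling it (toy numbers chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
df = pd.DataFrame({"mass": [5.97, np.nan, 0.642, 1.90]})

print(df["mass"].isna().sum())          # count the missing values
dropped = df.dropna()                   # option 1: drop rows with missing data
filled = df.fillna(df["mass"].median()) # option 2: impute with the median
```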
Working on the machine learning process over 10 entries left my code fragmented with different versions of key functions scattered across multiple notebooks.
Using the pre-processing steps I worked through in entries 6-8, I can now predict mass while only changing the surface pressure.
At the end of the Entry 8 notebook I had a standardized dataset. Now it’s time to try training a model.
By the end of Entry 7 I’d finalized the feature set to train a model and make predictions. One last step is needed before a prediction can be made.
In Entry 6 I looked at the correlation between the prediction features and the feature of interest (well, kinda. This problem question is a little weird in t...
I need to find a way to automate some of the process outlined in Entry 5. My coworker Sabber suggested using correlation to help sort out the wheat from the ...
Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.
In Entry 3 I defined my problem as: Holding all other factors constant, what mass is needed to retain an atmosphere on Mars? I need data to solve it.
In Entry 2 I decided to follow the ML Project Checklist from Hands on Machine Learning with Scikit-Learn & TensorFlow. The first step in that process is ...
In Entry 1, I decided to take the smallest problem I could find and solve just one aspect. Only after I’ve solved that single problem am I allowed to move on...
I am qualified. Quantitatively, I know this is true: I have a Master of Science degree in Data Science; I have a job on a machine learning team at a larg...