Entry 5: Explore the Data
Now that I have the dataset I put together in Entry 4, it’s time to see what’s in it.
The notebook with the code that accompanies the concepts discussed below can be found in the Entry 5 notebook.
The Problem
This is the time to get a feel for the data. It’s called Exploratory Data Analysis (EDA). The kinds of things to examine include:
- Are the variables structured, unstructured, numerical, categorical, etc.?
- Are there missing values?
- What are the distributions?
- Are there outliers or noise?
- Which variable is the target and which are the features?
- How do the features and target all relate to each other?
- Which features are possibly useful for predicting the target?
- Are there any transformations that would be useful? (See the sketch after this list.)
  - Square
  - Square root
  - Logarithmic
  - Normalization
  - Standardization
- Is supplementary data needed?
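For the transformations item above, here is a minimal sketch of what those operations look like with pandas, NumPy, and scikit-learn. The planetary_mass values are rough, illustrative numbers only; the real dataset is the one assembled in Entry 4.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative values only (approximate planetary masses in 10^24 kg);
# the real data comes from the Entry 4 notebook
df = pd.DataFrame({'planetary_mass': [0.33, 4.87, 5.97, 0.64, 1898.0, 568.0, 86.8, 102.0]})

# Simple mathematical transformations
df['mass_squared'] = df['planetary_mass'] ** 2
df['mass_sqrt'] = np.sqrt(df['planetary_mass'])
df['mass_log'] = np.log(df['planetary_mass'])

# Normalization (rescale to 0-1) and standardization (mean 0, standard deviation 1)
df['mass_normalized'] = MinMaxScaler().fit_transform(df[['planetary_mass']]).ravel()
df['mass_standardized'] = StandardScaler().fit_transform(df[['planetary_mass']]).ravel()

print(df.round(3))
```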
The Options
There are a lot of ways to explore data. Descriptive statistics and visualization are the two most common methods. These two methods include things like:
Descriptive statistics: measures like the count, mean, standard deviation, variance, minimum, maximum, quartiles, and correlation.
Visualization: plots like the pairplots (grids of scatter plots and histograms) used later in this entry.
Descriptive statistics are easy to calculate and provide definitive values that can be used programmatically. However, they can also hide nuances in the data.
Visualizing data can reveal differences that are invisible in the descriptive statistics alone. The classic demonstration is Anscombe's quartet: the mean of x, mean of y, variance of x, variance of y, and correlation between x and y are nearly identical for all four datasets, yet the four plots look completely different.
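Seaborn ships with a copy of Anscombe's quartet as an example dataset, so the effect is easy to reproduce. A minimal sketch:

```python
import seaborn as sns

# Anscombe's quartet is bundled with seaborn as an example dataset
anscombe = sns.load_dataset("anscombe")

# The summary statistics are nearly identical across the four datasets...
print(anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))
print(anscombe.groupby("dataset")[["x", "y"]].corr())

# ...but the plots tell four very different stories
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2)
```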
The Proposed Solution
Look at non-null entries
First I used `.info()` to look at the count of non-null entries. A non-null count lower than the number of entries indicates missing values. Missing values were one of those things I talked about in Entry 2. They cause some machine learning algorithms to choke.
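A minimal sketch of that check, assuming the Entry 4 dataset was saved to a hypothetical planets.csv:

```python
import pandas as pd

# 'planets.csv' is a hypothetical file name standing in for the Entry 4 dataset
planets = pd.read_csv('planets.csv')

# Non-null counts (and dtypes) per column
planets.info()

# Or count the missing values per column directly
print(planets.isnull().sum())
```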
The planets dataset has no missing values. Of course, I curated this very small dataset by hand and made sure there were no missing values. I'll deal with missing values and the different ways to handle them in Entry 12.
Review variable types
The `.info()` method also gives information about the data type. The reason I like looking at this is: just because I thought a column was a particular type of variable doesn't mean pandas thought the same.
I've had a numeric column load as a string because a value like 'None' was entered. When `NULL` values are present, integer columns will load as floats. Double checking small things like this can save a lot of time further down the line.
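A toy demonstration of both gotchas (the values here are made up):

```python
import numpy as np
import pandas as pd

# A column that should be numeric ends up as 'object' when a stray string sneaks in
moons = pd.Series([0, 'None', 1])
print(moons.dtype)            # object, not int64

# An integer column is upcast to float when missing values are present
moons_with_nan = pd.Series([0, np.nan, 1])
print(moons_with_nan.dtype)   # float64
```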
There are three non-numeric columns: `type`, `rings`, and `magnetic_field`. The rest are numeric (float or int).
Descriptive Statistics
Next I took a glance at the descriptive statistics. The fastest way to get basic descriptive statistics is to use `.describe()`. It shows the count, mean, standard deviation, min, max, and 25th/50th/75th percentiles for numeric data. This information is useful for determining what kind of data pre-processing will be needed. Pre-processing will be covered in multiple future entries, but some examples include:
- Scaling
- Normalization
- Selection
- Transformations
- Aggregation
- Cleaning
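Here is the `.describe()` call itself, a minimal sketch that again assumes the hypothetical planets.csv from above:

```python
import pandas as pd

planets = pd.read_csv('planets.csv')   # hypothetical file name, as above

# Count, mean, std, min, max, and 25th/50th/75th percentiles for the numeric columns
print(planets.describe())

# include='all' adds count/unique/top/freq for the non-numeric columns as well
print(planets.describe(include='all'))
```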
Visualizations
I started with a pairplot of all the numeric data (hint: `pairplot` generally only plots the numeric values). That took several minutes to run (the more variables there are, the longer it takes to generate the plot) and gave me a 23 x 23 grid (529 plots).
That’s a lot of plots. They were tiny and the text was so small it was unreadable.
I manually narrowed down the features to just what seemed unique and relevant. That left me with eight features, which gave me an 8x8 grid (64 plots). From these, I created iterations of different combinations of features, but mainly focused on how the features related to atmospheric mass.
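A sketch of the narrowed-down pairplot; the column names in the subset are hypothetical stand-ins for the eight features I actually kept:

```python
import pandas as pd
import seaborn as sns

planets = pd.read_csv('planets.csv')   # hypothetical file name, as above

# Hypothetical column names standing in for the hand-picked features
subset = ['planetary_mass', 'atmospheric_mass', 'gravity', 'density', 'mean_temperature']
sns.pairplot(planets[subset])
```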
Disclaimer
This problem is unusual because what I want to know about is atmospheric mass, but what I'm predicting is planetary mass. I.e., my variable of interest (atmospheric mass) is different from my target (planetary mass). Generally, the variable of interest is the one being predicted.
I set it up this way to allow me to predict on a range of planetary mass values based on a set of atmospheric mass values.
The Fail
Whittling down features by hand was useful in this case, but doesn’t necessarily (or at all) sound like a task I want to do for the nearly 600 features at work. Nor do I want to do this for every Kaggle competition or UCI dataset. On top of that, it’s easy to get carried away with visualizations and end up with tens of plots that may or may not say anything about pertinent relationships.
For this 11-observation dataset, I ended up with around 22 plots (counting each pairplot as a single plot rather than counting every subplot in its grid separately). And that's after narrowing my focus to 5-8 features following the initial look at the pairplots.
Exploring the data is important and can teach me things I didn’t know about the data, but what I need is a way to automatically determine feature importance, then concentrate on those features.
Side note
When working on a dataset with a large number of features, like the one I have at work, there is usually feature engineering that goes into creating the features. A lot of exploratory analysis goes into this process. That’s probably where I would expect EDA to occur.
Feature engineering is a topic for another series of entries, but as an introduction, some examples of feature engineering include (see the sketch after this list):
- Counts
- Ratios
- Similarities
- Mathematical transformations
  - Log
  - Square
  - Exponentiation
  - Etc.
- Aggregations
- NLP
  - Things as simple as character counts
  - Things as complicated as n-grams
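A few toy examples of the feature types above (the columns and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up toy data, just to show the mechanics
df = pd.DataFrame({
    'planet': ['Mercury', 'Venus', 'Earth', 'Mars'],
    'moons': [0, 0, 1, 2],
    'mass': [0.33, 4.87, 5.97, 0.64],
    'diameter': [4879, 12104, 12756, 6792],
})

# Ratio feature
df['mass_per_diameter'] = df['mass'] / df['diameter']

# Mathematical transformation
df['log_mass'] = np.log(df['mass'])

# NLP-style feature as simple as a character count
df['name_length'] = df['planet'].str.len()

print(df)
```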
At this stage in a production pipeline, I’m most interested in how to rank the features and concentrate on just the useful ones.
Enter my fantastic co-worker Sabber. He proposed using correlation to determine important features and narrow the focus.
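As a preview, here is a sketch of that kind of correlation-based ranking, again assuming the hypothetical planets.csv and an assumed name for the target column:

```python
import pandas as pd

planets = pd.read_csv('planets.csv')   # hypothetical file name, as above

# 'planetary_mass' is an assumed name for the target column
target = 'planetary_mass'

# Rank features by the absolute value of their correlation with the target
feature_rank = (
    planets.corr(numeric_only=True)[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(feature_rank)
```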