About

Data Science Diaries

I graduated from Penn State in 2006 with a bachelor’s degree in English. An English major writing a data science blog? Shouldn’t I have been a computer science or statistics major? Physics, even?

Well, a little over a decade after graduating, I finally figured out what I want to be when I grow up and went back to school. I graduated from Lipscomb University with a master’s degree in Data Science in 2018.

The master’s program was a whirlwind fifteen months, and I didn’t enter as well prepared as I probably should have. For example, I might have been able to spend more time on advanced topics if I hadn’t been learning computer programming in general, and R and Python in particular, as well as how databases work and how to query them with SQL.

However, I love a challenge and have never let a little under-preparedness stop me before. That may be an exaggeration, but nevertheless I finished on time and even had a job in machine learning before graduation.

Since then, I’ve been working to expand and improve my data science skills. I’ve worked my way through multiple courses on Coursera, Udemy, edX, and Udacity, and have a plethora of books on the topic. However, I’ve found all these published and polished sources to be missing one very important thing: the messy details.

Jose Portilla has a lot of really great courses on Udemy and repos on GitHub. I’ve learned a lot from him; he’s one of my favorite instructors. But when I struck out on my own and tried to tackle new datasets, I found myself a bit adrift.

For example, the visualization section of one of his courses had great examples of two features that predicted the target with relatively high accuracy. How did he find those two features? A correlation-matrix-style plot that charts every feature against every other feature? That’s all well and good on a dataset with ten features or fewer, but what do you do when you get a dataset with hundreds of features? I’d end up spending weeks just creating and analyzing plots.
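One rough workaround (a minimal sketch, not his method; the dataset and column names below are made up) is to rank features by their absolute correlation with the target first, and only plot the handful that survive:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in dataset: 200 numeric features, two of which actually matter.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 200))
y = 3 * X[:, 7] - 2 * X[:, 42] + rng.normal(size=500)
df = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(200)])
df["target"] = y

# Rank features by |correlation| with the target instead of eyeballing
# a 200 x 200 grid of plots.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
top_features = corr.head(5).index.tolist()
print(corr.head(5))

# A pairplot of five features plus the target is actually readable.
sns.pairplot(df[top_features + ["target"]])
```

Plain correlation only catches linear relationships, so this can miss features that interact or curve, but it turns weeks of plotting into minutes of triage.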

Another example is almost any tutorial built around a Jupyter notebook. The finished notebook ends up having around 30 cells. But when I’m working on my own, my cell run count ends up over 100. Regularly. Where did all the iterations go? How did the author get to the pretty product that got posted?

I found myself wanting to answer these questions and record the process - and so Data Science Diaries was born.

I start at the beginning and work a single problem at a time. The entries can be read in sequence or used individually as quick references. To avoid restating everything multiple times, each entry assumes knowledge of what came before it. These are at an intermediate level - I don’t cover basics like how to install Anaconda, run a Jupyter notebook, or write code, and little if anything will be worthy of an academic paper.

Hopefully, other fledgling data scientists will find my journey through the nitty-gritty details useful and I won’t just be whistling Dixie to myself.